Converting a string to an array of tokens

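class String
  # Returns the string split into an array of tokens and the text between them.
  def tokenize(*tokens)
    array = []
    each_token(*tokens) { |tk| array << tk }
    array
  end

  # Yields each token, and each run of text between tokens, in order.
  # Tokens may be plain strings (escaped here) or Regexp objects.
  def each_token(*tokens)
    regex = Regexp.new(tokens.map { |t|
      t.kind_of?(Regexp) ? t : Regexp::escape(t)
    }.join("|"))
    string = self

    while (match = regex.match(string))
      yield match.pre_match if match.pre_match.length > 0
      yield match[0] if match[0].length > 0
      string = match.post_match
    end

    yield string if string.length > 0
    self
  end
end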

Question 1:

Except for the pre_match, isn’t this just doing the same thing as scan? And
if that’s the case, why not just include the expected prematch patterns in
the list of regexps and use scan directly?

Question 2:

Ben Tilly pointed out (a long time ago) that the “string =
string.post_match” type of statement is enormously inefficient for large
strings because of the amount of string copying involved. In ruby-talk:89747
Nobu Nakada indicated that string tails could be shared and use
copy-on-write (COW). In current Ruby, are the strings shared with COW
semantics, or was Nobu just speculating on possible implementations?



– Jim Weirich / Compuware
– FWP Capture Services
– Phone: 859-386-8855

“Weirich, James” wrote:

Question 1:

Except for the pre_match, isn’t this just doing the same thing as scan? And
if that’s the case, why not just include the expected prematch patterns in
the list of regexps and use scan directly?

It’s the unexpected that I’m thinking about. How do you make it so that it
will match anything other than your token? This doesn’t seem to work:

regex = Regexp.new(
  "(?m:" << tokens.map { |t|
    Regexp::escape(t)
  }.join("|") << "|.*)"
)
scan(regex)
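
The likely reason is that ".*" is greedy: whenever the text does not start with a token, it swallows everything up to the end, tokens included. One possible way around that with scan is a negative lookahead, so the catch-all branch stops just before the next token. This is a sketch under assumptions, not tested against the attached TestCase; tokenize_with_scan is a hypothetical name.

```ruby
# Hypothetical helper (not from the thread): tokenize with a single scan.
# Assumes the token patterns contain no capture groups, since scan returns
# the groups instead of the whole match when the regex has any.
def tokenize_with_scan(string, *tokens)
  # Regexp.union escapes plain strings and keeps Regexp objects as they are.
  delims = Regexp.union(*tokens)
  # Match either a token, or the longest run of characters in which no
  # position starts a token. The /m flag lets "." match newlines too.
  string.scan(/#{delims}|(?:(?!#{delims}).)+/m)
end

parts = tokenize_with_scan("a+b*c", "+", "*")
```

The lookahead is tried at every character of the in-between text, so this is probably slower than the pre_match loop on long strings with many token patterns.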

Question 2:

Ben Tilly pointed out (a long time ago) that the “string =
string.post_match” type of statement is enormously inefficient for large
strings because of the amount of string copying involved. In ruby-talk:89747
Nobu Nakada indicated that string tails could be shared and use
copy-on-write (COW). In current Ruby, are the strings shared with COW
semantics, or was Nobu just speculating on possible implementations?

I’m open to other suggestions, but the one presented seems the most elegant
so far.

I’ve attached code and a TestCase for those who want to fool around with it.

tokens.rb (2.16 KB)



John Long
http://wiseheartdesign.com

Hi,


At Wed, 14 Jan 2004 00:52:02 +0900, Weirich, James wrote:

Question 2:

Ben Tilly pointed out (a long time ago) that the “string =
string.post_match” type of statement is enormously inefficient for large
strings because of the amount of string copying involved. In ruby-talk:89747
Nobu Nakada indicated that string tails could be shared and use
copy-on-write (COW). In current Ruby, are the strings shared with COW
semantics, or was Nobu just speculating on possible implementations?

It has been implemented.


Nobu Nakada