Converting a string to an array of tokens

Is there a fast way to convert a string into a list of tokens?

Something like:

"a= c+ a".tokenize(' ', '=', '+') #=> ['a', '=', ' ', 'c', '+', 'a']

···

John Long
www.wiseheartdesign.com

John W. Long wrote:

Is there a fast way to convert a string into a list of tokens?

Something like:

"a= c+ a".tokenize(' ', '=', '+') #=> ['a', '=', ' ', 'c', '+', 'a']

irb comes with its own lexer. I’ve used that before.
(If it’s Ruby you’re tokenizing.)

Hal

Hal Fulton wrote:

irb comes with its own lexer. I’ve used that before.
(If it’s Ruby you’re tokenizing.)

Actually no, I’m not using it to parse Ruby. I’m just looking for something
general that works like String#scan, but leaves the tokens in place in the
array. This would be useful for parsing almost any language.
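To make the goal concrete with plain String#scan (a made-up example):

"a= c+ a".scan(/\w+/)  # => ["a", "c", "a"]   (the separators are lost)
# what I want keeps them in place:
# ["a", "=", " ", "c", "+", " ", "a"]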

···

John Long
www.wiseheartdesign.com

This is quick and dirty, but it demonstrates what I am looking for:

tokens = %w{

}

string = <<-HERE

test

wow it worked

HERE

class String
  def tokenize(tokens)
    tokens.each { |t| self.gsub!(/#{t}/, "<>#{t}<>") }
    self.gsub!(/\A<>/, '')
    split('<>')
  end
end

p string.tokenize(tokens)

This should output:

["", "\n  ", "", "\n  ", "", "test", "", "\n  ", "", "\n  ", "", "\n\n", "wow it worked", "\n\n", "\n  ", "", "\n", "", "\n"]

Is there a better way?

···

John Long
www.wiseheartdesign.com

"John W. Long" ng@johnwlong.com wrote in message
news:00e401c3d804$93be7ff0$6601a8c0@jwldesktop…

Hal Fulton wrote:

irb comes with its own lexer. I’ve used that before.
(If it’s Ruby you’re tokenizing.)

Actually no, I’m not using it to parse Ruby. I’m just looking for
something general that works like String#scan, but leaves the tokens in
place in the array. This would be useful for parsing almost any language.

I’m not sure that I understand what you mean by “leaves the tokens in place
in the array”. String#scan is usually the method of choice:

irb(main):002:0> "a= c+ a".scan( /\w+|=|\+/ )
=> ["a", "=", "c", "+", "a"]
irb(main):003:0> "a= c+ a".scan( /\w+|=|\+|\s+/ )
=> ["a", "=", " ", "c", "+", " ", "a"]

What are you missing?

Regards

robert

I believe this also works:

class String
  def tokenize(*tokens)
    regex = Regexp.new(tokens.map do |t|
      Regexp.escape(t)
    end.join("|"))

    do_tokenize(regex).delete_if { |str| str == "" }
  end

  def do_tokenize(regex)
    match = regex.match self

    if match
      [match.pre_match, match[0]] + match.post_match.do_tokenize(regex)
    else
      [self] # keep the remainder after the last token
    end
  end
end

The recursion could cause problems if you have really long strings, in
which case it’d probably be wise to rewrite it as a loop (which is
arguably somewhat uglier). You might also want to make #do_tokenize
private. I don’t know if this is the best way, but it’s a way.
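Making #do_tokenize private is a one-liner inside the class body, for
example:

class String
  private :do_tokenize
end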

  • Dan

What about this? It produces what the original poster asked for in his
original message.

"a= c+ a".split //
=> ["a", "=", " ", "c", "+", " ", "a"]

···

On 1/11/2004 9:06 AM, Robert Klemme wrote:

irb(main):002:0> "a= c+ a".scan( /\w+|=|\+/ )
=> ["a", "=", "c", "+", "a"]
irb(main):003:0> "a= c+ a".scan( /\w+|=|\+|\s+/ )
=> ["a", "=", " ", "c", "+", " ", "a"]

What are you missing?


Never trust a girl with your mother’s cow,
never let your trousers go falling down in the green grass…

Hi,

···

At Sun, 11 Jan 2004 23:06:39 +0900, Robert Klemme wrote:

irb(main):002:0> "a= c+ a".scan( /\w+|=|\+/ )
=> ["a", "=", "c", "+", "a"]
irb(main):003:0> "a= c+ a".scan( /\w+|=|\+|\s+/ )
=> ["a", "=", " ", "c", "+", " ", "a"]

What about this?

"a= c+ a".split(/\b/) # => ["a", "= ", "c", "+ ", "a"]


Nobu Nakada

"Joey Gibson" joey@joeygibson.com wrote in message
news:4002A7D6.4090100@joeygibson.com

···

On 1/11/2004 9:06 AM, Robert Klemme wrote:

irb(main):002:0> "a= c+ a".scan( /\w+|=|\+/ )
=> ["a", "=", "c", "+", "a"]
irb(main):003:0> "a= c+ a".scan( /\w+|=|\+|\s+/ )
=> ["a", "=", " ", "c", "+", " ", "a"]

What are you missing?

What about this? It produces what the original poster asked for in his
original message.

"a= c+ a".split //
=> ["a", "=", " ", "c", "+", " ", "a"]

It’s bad because it splits at every character, not at tokens:

irb(main):001:0> "foo bar".split //
=> ["f", "o", "o", " ", "b", "a", "r"]

Definitely not a solution for the OP.

robert

nobu.nokada@softhome.net wrote in message
news:200401121711.i0CHBhuw004600@sharui.nakada.kanuma.tochigi.jp…

Hi,

irb(main):002:0> "a= c+ a".scan( /\w+|=|\+/ )
=> ["a", "=", "c", "+", "a"]
irb(main):003:0> "a= c+ a".scan( /\w+|=|\+|\s+/ )
=> ["a", "=", " ", "c", "+", " ", "a"]

What about this?

"a= c+ a".split(/\b/) # => ["a", "= ", "c", "+ ", "a"]

I guess the OP won’t like it because the whitespace is not separated
properly. And he wanted to be able to provide the tokens conveniently.
Another solution, which can of course be tweaked and integrated into String:

irb(main):089:0> def tokenizer(*tokens)
irb(main):090:1>   Regexp.new( tokens.map{|tk| tk.kind_of?( Regexp ) ? tk : Regexp.escape(tk)}.join('|') )
irb(main):091:1> end
=> nil
irb(main):092:0> def tokenize(str, *tokens)
irb(main):093:1>   str.scan tokenizer(*tokens)
irb(main):094:1> end
=> nil
irb(main):095:0> tokenize( "a= c+ a", "=", "+", /\w+/, /\s+/ )
=> ["a", "=", " ", "c", "+", " ", "a"]

That way one can reuse the tokenizer regexp for multiple passes.
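For example, building the regexp once and scanning several strings with
it:

re = tokenizer( "=", "+", /\w+/, /\s+/ )
"a= c+ a".scan( re )  # => ["a", "=", " ", "c", "+", " ", "a"]
"x +y".scan( re )     # => ["x", " ", "+", "y"]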

Regards

robert
···

At Sun, 11 Jan 2004 23:06:39 +0900, Robert Klemme wrote:

“Dan Doel” wrote:

I believe this also works:
…snip!..

This is almost exactly what I was looking for.

The recursion could cause problems if you have really
long strings, in which case it’d probably be wise to
rewrite it as a loop (which is arguably somewhat
uglier).

Depends on what you mean by ugly:

class String
  def tokenize(*tokens)
    regex = Regexp.new(tokens.map { |t| Regexp::escape(t) }.join("|"))
    string = self.dup
    array = []
    while match = regex.match(string)
      array += [match.pre_match, match[0]]
      string = match.post_match
    end
    array += [string]
    array.delete_if { |str| str == "" }
  end
  def each_token(*tokens, &b)
    tokenize(*tokens).each { |t| b.call(t) }
  end
end

Very nice. If only it would work with regular expressions as well.

I wonder what the odds are of getting this or something like this added
to the language. It seems like it would be nice to have on the String
class to begin with, written in C for speed.

···

John Long
www.wiseheartdesign.com

Robert Klemme wrote:

I guess the OP won’t like it because the whitespace is not separated
properly. And he wanted to be able to provide the tokens conveniently.
Another solution, which can of course be tweaked and integrated into
String:

Nice solution, but my idea of specifying the tokens is to have the method
split before and after each token. I don’t want to have to specify
everything I’m looking for, just the significant tokens. Sometimes
whitespace is significant:

‘a = " this is a string "’.tokenize(‘"’, “=”)

should produce:

[“a”, " ", “=”, " ", “"”, " this is a string ", “"”]

···

John Long
www.wiseheartdesign.com

"John W. Long" ng@johnwlong.com wrote in message
news:010501c3d995$9f435390$6601a8c0@jwldesktop…

“Dan Doel” wrote:

I believe this also works:
…snip!..

This is almost exactly what I was looking for.

The recursion could cause problems if you have really
long strings, in which case it’d probably be wise to
rewrite it as a loop (which is arguably somewhat
uglier).

Depends on what you mean by ugly:

class String
  def tokenize(*tokens)
    regex = Regexp.new(tokens.map { |t| Regexp::escape(t) }.join("|"))
    string = self.dup
    array = []
    while match = regex.match(string)
      array += [match.pre_match, match[0]]
      string = match.post_match
    end
    array += [string]
    array.delete_if { |str| str == "" }
  end
  def each_token(*tokens, &b)
    tokenize(*tokens).each { |t| b.call(t) }
  end
end

Very nice. If only it would work with regular expressions as well.

I wonder what the odds are of getting this or something like this added
to the language. It seems like it would be nice to have on the String
class to begin with, written in C for speed.

Oh, we can still tweak the solution provided:

  • Use Array#push or Array#<< instead of “+=”, which creates too many
    temporary objects.

  • Implement the iteration in each_token and make tokenize depend on it,
    so that tokenizing large strings via each_token is more efficient
    because no array is needed then.

  • Don’t add empty strings to the array.

  • No need to dup.

That’s what I’d do:

class String
  def tokenize(*tokens)
    array = []
    each_token(*tokens) {|tk| array << tk}
    array
  end

  def each_token(*tokens)
    regex = Regexp.new(tokens.map { |t| t.kind_of?( Regexp ) ? t : Regexp::escape(t) }.join("|"))
    string = self

    while ( match = regex.match(string) )
      yield match.pre_match if match.pre_match.length > 0
      yield match[0] if match[0].length > 0

      string = match.post_match
    end

    yield string if string.length > 0
    self
  end
end

Kind regards

robert

class String
  def tokenize(*tokens)
    tokens.map! {|t|
      case t
      when Regexp
        s = t.to_s
        s.gsub!(/\\./m, '')       # Remove escaped characters such as \(.
        s.gsub!(/\(\?/, '')       # Remove (? groupings.
        s =~ /\(/ ? t : /(#{t})/  # Add capturing () if none.
      when Symbol
        /(\s*)(#{t})(\s*)/        # Significant whitespace.
      else
        /(#{t})/
      end
    }
    split(/#{tokens * '|'}/).reject {|x| x.empty?}
  end
end

str = 'a = " this is a string "'

p str.tokenize('"', '=')              # Whitespace not a token.
p str.tokenize('"', /(\s*)(=)(\s*)/)  # Significant whitespace.
p str.tokenize('"', :'=')             # Same but less typing.

···

“John W. Long” ng@johnwlong.com wrote:

Nice solution, but my idea of specifying the tokens is to have the method
split before and after each token. I don’t want to have to specify
everything I’m looking for, just the significant tokens. Sometimes
whitespace is significant:

‘a = " this is a string "’.tokenize(‘"’, “=”)

should produce:

[“a”, " ", “=”, " ", “"”, " this is a string ", “"”]

“Robert Klemme” wrote:

That’s what I’d do:

class String
  def tokenize(*tokens)
    array = []
    each_token(*tokens) {|tk| array << tk}
    array
  end

  def each_token(*tokens)
    regex = Regexp.new(tokens.map { |t| t.kind_of?( Regexp ) ? t : Regexp::escape(t) }.join("|"))
    string = self

    while ( match = regex.match(string) )
      yield match.pre_match if match.pre_match.length > 0
      yield match[0] if match[0].length > 0

      string = match.post_match
    end

    yield string if string.length > 0
    self
  end
end

Good ideas. It would be fun to optimize this further for speed. Maybe
even do a little Ruby golf with it. If I have time this evening I may
work on creating a demanding timed test case for it.
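Something along these lines, using the standard Benchmark library, might
do as a first cut (the input size is only a guess at what counts as
demanding):

require 'benchmark'

str = 'a = " this is a string " ' * 10_000

Benchmark.bm(12) do |bm|
  bm.report('tokenize')   { str.tokenize('"', '=') }
  bm.report('each_token') { str.each_token('"', '=') { |tk| } }
end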

···

John Long
www.wiseheartdesign.com

In mail "Re: Converting a string to an array of tokens"

class String
  def tokenize(*tokens)
    array = []
    each_token(*tokens) {|tk| array << tk}
    array
  end

  def each_token(*tokens)
    regex = Regexp.new(tokens.map { |t| t.kind_of?( Regexp ) ? t : Regexp::escape(t) }.join("|"))
    string = self

    while ( match = regex.match(string) )
      yield match.pre_match if match.pre_match.length > 0
      yield match[0] if match[0].length > 0

      string = match.post_match
    end

    yield string if string.length > 0
    self
  end
end

Using ruby 1.8.1:

~ % cat t
require 'strscan'
require 'enumerator'

class String
  def tokenize(*patterns)
    enum_for(:each_token, *patterns).map
  end

  def each_token(*patterns)
    re = Regexp.union(*patterns)
    s = StringScanner.new(self)
    until s.eos?
      break unless s.skip_until(re)
      yield s[0]
    end
  end
end

p "def m(a) 1 + a end".tokenize('def', 'end', /[a-z_]\w*/i, /\d+/, /\S/)

~ % ruby -v t
ruby 1.9.0 (2004-01-12) [i686-linux]
["def", "m", "(", "a", ")", "1", "+", "a", "end"]

– Minero Aoki

···

“Robert Klemme” bob.news@gmx.net wrote:

"John W. Long" ng@johnwlong.com wrote in message
news:001f01c3d9da$28eeea70$6601a8c0@jwldesktop…

“Robert Klemme” wrote:

That’s what I’d do:

class String
  def tokenize(*tokens)
    array = []
    each_token(*tokens) {|tk| array << tk}
    array
  end

  def each_token(*tokens)
    regex = Regexp.new(tokens.map { |t| t.kind_of?( Regexp ) ? t : Regexp::escape(t) }.join("|"))
    string = self

    while ( match = regex.match(string) )
      yield match.pre_match if match.pre_match.length > 0
      yield match[0] if match[0].length > 0

      string = match.post_match
    end

    yield string if string.length > 0
    self
  end
end

Good ideas. It would be fun to optimize this further for speed. Maybe
even do a little Ruby golf with it. If I have time this evening I may
work on creating a demanding timed test case for it.

Btw, this would be easier if Regexp#match supported additional
arguments for start and end, like

def match(str, start = 0, end = str.length)

end

and if MatchData exposed the index of the start element and the index
of the first element after the match. The loop above could then be
written as:

regex = …
start = 0

while ( match = regex.match(string, start) )
  yield self[start, match.start_index - start] if match.start_index - start > 0
  yield match[0] if match[0].length > 0
  start = match.end_index # index of first element after the match
end

yield self[start, self.length] if self.length - start > 0
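For what it’s worth, MatchData does already expose those positions via
MatchData#begin and MatchData#end; what’s missing is matching from an
offset. A sketch that approximates the loop today (it re-slices the
string on every pass, so it does not save the copies, and it assumes the
regexp never matches the empty string):

class String
  def each_token2(regex)
    start = 0
    while match = regex.match(self[start..-1])
      from = start + match.begin(0)    # match start as an index into self
      yield self[start, from - start] if from > start
      yield match[0] if match[0].length > 0
      start = from + match[0].length   # first index after the match
    end
    yield self[start..-1] if start < length
    self
  end
end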

What do others think, is this a reasonable extension?

Kind regards

robert