Converting a string to an array of tokens

Is there a fast way to convert a string into a list of tokens?

Something like:

"a= c+ a".tokenize(' ', '=', '+') #=> ['a', '=', ' ', 'c', '+', 'a']

···

John Long
www.wiseheartdesign.com

John W. Long wrote:

Is there a fast way to convert a string into a list of tokens?

Something like:

"a= c+ a".tokenize(' ', '=', '+') #=> ['a', '=', ' ', 'c', '+', 'a']

irb comes with its own lexer. I’ve used that before.
(If it’s Ruby you’re tokenizing.)

Hal

Hal Fulton wrote:

irb comes with its own lexer. I’ve used that before.
(If it’s Ruby you’re tokenizing.)

Actually no, I’m not using it to parse Ruby. I’m just looking for something
general that works like String#scan, but leaves the tokens in place in the
array. This would be useful for parsing almost any language.
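To make the goal concrete with plain String#scan (a made-up example):

"a= c+ a".scan(/\w+/)  # => ["a", "c", "a"]   (the separators are lost)
# what I want keeps them in place:
# ["a", "=", " ", "c", "+", " ", "a"]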

···

John Long
www.wiseheartdesign.com

This is quick and dirty, but it demonstrates what I am looking for:

tokens = %w{

}

string = <<-HERE

test

wow it worked

HERE

class String
  def tokenize(tokens)
    tokens.each { |t| self.gsub!(/#{t}/, "<>#{t}<>") }
    self.gsub!(/\A<>/, '')
    split('<>')
  end
end

p string.tokenize(tokens)

This should output:

["", "\n  ", "", "\n  ", "", "test", "", "\n  ", "", "\n  ", "", "\n\n", "wow it worked", "\n\n", "\n  ", "", "\n", "", "\n"]

Is there a better way?

···

John Long
www.wiseheartdesign.com

"John W. Long" ng@johnwlong.com wrote in message
news:00e401c3d804$93be7ff0$6601a8c0@jwldesktop…

Hal Fulton wrote:

irb comes with its own lexer. I’ve used that before.
(If it’s Ruby you’re tokenizing.)

Actually no, I’m not using it to parse Ruby. I’m just looking for
something general that works like String#scan, but leaves the tokens in
place in the array. This would be useful for parsing almost any language.

I’m not sure that I understand what you mean by “leaves the tokens in place
in the array”. String#scan is usually the method of choice:

irb(main):002:0> "a= c+ a".scan( /\w+|=|\+/ )
=> ["a", "=", "c", "+", "a"]
irb(main):003:0> "a= c+ a".scan( /\w+|=|\+|\s+/ )
=> ["a", "=", " ", "c", "+", " ", "a"]

What are you missing?

Regards

robert

I believe this also works:

class String
  def tokenize(*tokens)
    regex = Regexp.new(tokens.map do |t|
      Regexp.escape(t)
    end.join("|"))

    do_tokenize(regex).delete_if { |str| str == "" }
  end

  def do_tokenize(regex)
    match = regex.match self

    if match
      [match.pre_match, match[0]] + match.post_match.do_tokenize(regex)
    else
      [self] # keep the remainder after the last token
    end
  end
end

The recursion could cause problems if you have really long strings, in
which case it’d probably be wise to rewrite it as a loop (which is
arguably somewhat uglier). You might also want to make #do_tokenize
private. I don’t know if this is the best way, but it’s a way.
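Making #do_tokenize private is a one-liner inside the class body, for
example:

class String
  private :do_tokenize
end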

  • Dan

What about this? It produces what the original poster asked for in his
original message.

"a= c+ a".split //
=> ["a", "=", " ", "c", "+", " ", "a"]

···

On 1/11/2004 9:06 AM, Robert Klemme wrote:

irb(main):002:0> "a= c+ a".scan( /\w+|=|\+/ )
=> ["a", "=", "c", "+", "a"]
irb(main):003:0> "a= c+ a".scan( /\w+|=|\+|\s+/ )
=> ["a", "=", " ", "c", "+", " ", "a"]

What are you missing?


Never trust a girl with your mother’s cow,
never let your trousers go falling down in the green grass…

Hi,

···

At Sun, 11 Jan 2004 23:06:39 +0900, Robert Klemme wrote:

irb(main):002:0> "a= c+ a".scan( /\w+|=|\+/ )
=> ["a", "=", "c", "+", "a"]
irb(main):003:0> "a= c+ a".scan( /\w+|=|\+|\s+/ )
=> ["a", "=", " ", "c", "+", " ", "a"]

What about this?

"a= c+ a".split(/\b/) # => ["a", "= ", "c", "+ ", "a"]


Nobu Nakada

"Joey Gibson" joey@joeygibson.com wrote in message
news:4002A7D6.4090100@joeygibson.com

···

On 1/11/2004 9:06 AM, Robert Klemme wrote:

irb(main):002:0> "a= c+ a".scan( /\w+|=|\+/ )
=> ["a", "=", "c", "+", "a"]
irb(main):003:0> "a= c+ a".scan( /\w+|=|\+|\s+/ )
=> ["a", "=", " ", "c", "+", " ", "a"]

What are you missing?

What about this? It produces what the original poster asked for in his
original message.

"a= c+ a".split //
=> ["a", "=", " ", "c", "+", " ", "a"]

It’s bad because it splits at every character, not at tokens:

irb(main):001:0> "foo bar".split //
=> ["f", "o", "o", " ", "b", "a", "r"]

Definitely not a solution for the OP.

robert

nobu.nokada@softhome.net wrote in message
news:200401121711.i0CHBhuw004600@sharui.nakada.kanuma.tochigi.jp…

Hi,

irb(main):002:0> "a= c+ a".scan( /\w+|=|\+/ )
=> ["a", "=", "c", "+", "a"]
irb(main):003:0> "a= c+ a".scan( /\w+|=|\+|\s+/ )
=> ["a", "=", " ", "c", "+", " ", "a"]

What about this?

"a= c+ a".split(/\b/) # => ["a", "= ", "c", "+ ", "a"]

I guess the OP won’t like it because the whitespace is not separated
properly. And he wanted to be able to provide the tokens conveniently.
Another solution, which can of course be tweaked and integrated into String:

irb(main):089:0> def tokenizer(*tokens)
irb(main):090:1>   Regexp.new( tokens.map{|tk| tk.kind_of?( Regexp ) ? tk : Regexp.escape(tk)}.join('|') )
irb(main):091:1> end
=> nil
irb(main):092:0> def tokenize(str, *tokens)
irb(main):093:1>   str.scan tokenizer(*tokens)
irb(main):094:1> end
=> nil
irb(main):095:0> tokenize( "a= c+ a", "=", "+", /\w+/, /\s+/ )
=> ["a", "=", " ", "c", "+", " ", "a"]

That way one can reuse the tokenizer regexp for multiple passes.
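For example, building the regexp once and scanning several strings with
it:

re = tokenizer( "=", "+", /\w+/, /\s+/ )
"a= c+ a".scan( re )  # => ["a", "=", " ", "c", "+", " ", "a"]
"x +y".scan( re )     # => ["x", " ", "+", "y"]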

Regards

robert
···

At Sun, 11 Jan 2004 23:06:39 +0900, Robert Klemme wrote:

“Dan Doel” wrote:

I believe this also works:
…snip!..

This is almost exactly what I was looking for.

The recursion could cause problems if you have really
long strings, in which case it’d probably be wise to
rewrite it as a loop (which is arguably somewhat
uglier).

Depends on what you mean by ugly:

class String
  def tokenize(*tokens)
    regex = Regexp.new(tokens.map { |t| Regexp::escape(t) }.join("|"))
    string = self.dup
    array = []
    while match = regex.match(string)
      array += [match.pre_match, match[0]]
      string = match.post_match
    end
    array += [string]
    array.delete_if { |str| str == "" }
  end
  def each_token(*tokens, &b)
    tokenize(*tokens).each { |t| b.call(t) }
  end
end

Very nice. If only it would work with regular expressions as well.

I wonder what the odds are of getting this or something like this added
to the language. It seems like it would be nice to have on the String
class to begin with, written in C for speed.

···

John Long
www.wiseheartdesign.com

Robert Klemme wrote:

I guess the OP won’t like it because the whitespace is not separated
properly. And he wanted to be able to provide the tokens conveniently.
Another solution, which can of course be tweaked and integrated into
String:

Nice solution, but my idea of specifying the tokens is to have the method
split before and after each token. I don’t want to have to specify
everything I’m looking for, just the significant tokens. Sometimes
whitespace is significant:

‘a = " this is a string "’.tokenize(‘"’, “=”)

should produce:

[“a”, " ", “=”, " ", “"”, " this is a string ", “"”]

···

John Long
www.wiseheartdesign.com

"John W. Long" ng@johnwlong.com wrote in message
news:010501c3d995$9f435390$6601a8c0@jwldesktop…

“Dan Doel” wrote:

I believe this also works:
…snip!..

This is almost exactly what I was looking for.

The recursion could cause problems if you have really
long strings, in which case it’d probably be wise to
rewrite it as a loop (which is arguably somewhat
uglier).

Depends on what you mean by ugly:

class String
  def tokenize(*tokens)
    regex = Regexp.new(tokens.map { |t| Regexp::escape(t) }.join("|"))
    string = self.dup
    array = []
    while match = regex.match(string)
      array += [match.pre_match, match[0]]
      string = match.post_match
    end
    array += [string]
    array.delete_if { |str| str == "" }
  end
  def each_token(*tokens, &b)
    tokenize(*tokens).each { |t| b.call(t) }
  end
end

Very nice. If only it would work with regular expressions as well.

I wonder what the odds are of getting this or something like this added
to the language. It seems like it would be nice to have on the String
class to begin with, written in C for speed.

Oh, we can still tweak the solution provided:

  • Use Array#push or Array#<< instead of “+=”, which creates too many
    temporary objects.

  • Implement the iteration in each_token and make tokenize depend on it,
    so that tokenizing large strings via each_token is more efficient
    because no array is needed then.

  • Don’t add empty strings to the array.

  • No need to dup.

That’s what I’d do:

class String
  def tokenize(*tokens)
    array = []
    each_token(*tokens) {|tk| array << tk}
    array
  end

  def each_token(*tokens)
    regex = Regexp.new(tokens.map { |t| t.kind_of?( Regexp ) ? t : Regexp::escape(t) }.join("|"))
    string = self

    while ( match = regex.match(string) )
      yield match.pre_match if match.pre_match.length > 0
      yield match[0] if match[0].length > 0

      string = match.post_match
    end

    yield string if string.length > 0
    self
  end
end

Kind regards

robert

class String
  def tokenize(*tokens)
    tokens.map! {|t|
      case t
      when Regexp
        s = t.to_s
        s.gsub!(/\\./m, '')       # Remove escaped characters such as \(.
        s.gsub!(/\(\?/, '')       # Remove (? groupings.
        s =~ /\(/ ? t : /(#{t})/  # Add capturing () if none.
      when Symbol
        /(\s*)(#{t})(\s*)/        # Significant whitespace.
      else
        /(#{t})/
      end
    }
    split(/#{tokens * '|'}/).reject {|x| x.empty?}
  end
end

str = 'a = " this is a string "'

p str.tokenize('"', '=')              # Whitespace not a token.
p str.tokenize('"', /(\s*)(=)(\s*)/)  # Significant whitespace.
p str.tokenize('"', :'=')             # Same but less typing.

···

“John W. Long” ng@johnwlong.com wrote:

Nice solution, but my idea of specifying the tokens is to have the method
split before and after each token. I don’t want to have to specify
everything I’m looking for, just the significant tokens. Sometimes
whitespace is significant:

‘a = " this is a string "’.tokenize(‘"’, “=”)

should produce:

[“a”, " ", “=”, " ", “"”, " this is a string ", “"”]

“Robert Klemme” wrote:

That’s what I’d do:

class String
  def tokenize(*tokens)
    array = []
    each_token(*tokens) {|tk| array << tk}
    array
  end

  def each_token(*tokens)
    regex = Regexp.new(tokens.map { |t| t.kind_of?( Regexp ) ? t : Regexp::escape(t) }.join("|"))
    string = self

    while ( match = regex.match(string) )
      yield match.pre_match if match.pre_match.length > 0
      yield match[0] if match[0].length > 0

      string = match.post_match
    end

    yield string if string.length > 0
    self
  end
end

Good ideas. It would be fun to optimize this further for speed. Maybe
even do a little Ruby golf with it. If I have time this evening I may
work on creating a demanding timed test case for it.
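Something along these lines, using the standard Benchmark library, might
do as a first cut (the input size is only a guess at what counts as
demanding):

require 'benchmark'

str = 'a = " this is a string " ' * 10_000

Benchmark.bm(12) do |bm|
  bm.report('tokenize')   { str.tokenize('"', '=') }
  bm.report('each_token') { str.each_token('"', '=') { |tk| } }
end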

···

John Long
www.wiseheartdesign.com

In mail "Re: Converting a string to an array of tokens"

class String
  def tokenize(*tokens)
    array = []
    each_token(*tokens) {|tk| array << tk}
    array
  end

  def each_token(*tokens)
    regex = Regexp.new(tokens.map { |t| t.kind_of?( Regexp ) ? t : Regexp::escape(t) }.join("|"))
    string = self

    while ( match = regex.match(string) )
      yield match.pre_match if match.pre_match.length > 0
      yield match[0] if match[0].length > 0

      string = match.post_match
    end

    yield string if string.length > 0
    self
  end
end

Using ruby 1.8.1:

~ % cat t
require 'strscan'
require 'enumerator'

class String
  def tokenize(*patterns)
    enum_for(:each_token, *patterns).map
  end

  def each_token(*patterns)
    re = Regexp.union(*patterns)
    s = StringScanner.new(self)
    until s.eos?
      break unless s.skip_until(re)
      yield s[0]
    end
  end
end

p "def m(a) 1 + a end".tokenize('def', 'end', /[a-z_]\w*/i, /\d+/, /\S/)

~ % ruby -v t
ruby 1.9.0 (2004-01-12) [i686-linux]
["def", "m", "(", "a", ")", "1", "+", "a", "end"]

– Minero Aoki

···

“Robert Klemme” bob.news@gmx.net wrote:

"John W. Long" ng@johnwlong.com wrote in message
news:001f01c3d9da$28eeea70$6601a8c0@jwldesktop…

“Robert Klemme” wrote:

That’s what I’d do:

class String
  def tokenize(*tokens)
    array = []
    each_token(*tokens) {|tk| array << tk}
    array
  end

  def each_token(*tokens)
    regex = Regexp.new(tokens.map { |t| t.kind_of?( Regexp ) ? t : Regexp::escape(t) }.join("|"))
    string = self

    while ( match = regex.match(string) )
      yield match.pre_match if match.pre_match.length > 0
      yield match[0] if match[0].length > 0

      string = match.post_match
    end

    yield string if string.length > 0
    self
  end
end

Good ideas. It would be fun to optimize this further for speed. Maybe
even do a little Ruby golf with it. If I have time this evening I may
work on creating a demanding timed test case for it.

Btw, this would be easier if Regexp#match supported additional
arguments for start and end, like

def match(str, start = 0, end = str.length)

end

and if MatchData exposed the index of the start element and the index
of the first element after the match. The loop above could then be
written as:

regex = …
start = 0

while ( match = regex.match(string, start) )
  yield self[start, match.start_index - start] if match.start_index - start > 0
  yield match[0] if match[0].length > 0
  start = match.end_index # index of first element after the match
end

yield self[start, self.length] if self.length - start > 0
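For what it’s worth, MatchData does already expose those positions via
MatchData#begin and MatchData#end; what’s missing is matching from an
offset. A sketch that approximates the loop today (it re-slices the
string on every pass, so it does not save the copies, and it assumes the
regexp never matches the empty string):

class String
  def each_token2(regex)
    start = 0
    while match = regex.match(self[start..-1])
      from = start + match.begin(0)    # match start as an index into self
      yield self[start, from - start] if from > start
      yield match[0] if match[0].length > 0
      start = from + match[0].length   # first index after the match
    end
    yield self[start..-1] if start < length
    self
  end
end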

What do others think, is this a reasonable extension?

Kind regards

robert