Is there a fast way to convert a string into a list of tokens?
Something like:
"a= c+ a".tokenize(' ', '=', '+') #=> ['a', '=', ' ', 'c', '+', 'a']
John W. Long wrote:
Is there a fast way to convert a string into a list of tokens?
Something like:
"a= c+ a".tokenize(' ', '=', '+') #=> ['a', '=', ' ', 'c', '+', 'a']
irb comes with its own lexer. I’ve used that before.
(If it’s Ruby you’re tokenizing.)
Hal
Hal Fulton wrote:
irb comes with its own lexer. I’ve used that before.
(If it’s Ruby you’re tokenizing.)
Actually no, I’m not using it to parse Ruby. I’m just looking for something
general that works like String#scan, but leaves the tokens in place in the
array. This would be useful for parsing almost any language.
This is quick and dirty, but it demonstrates what I am looking for:
tokens = %w{
}
string = <<-HERE
test
wow it worked
HERE
class String
  def tokenize(tokens)
    tokens.each { |t| self.gsub!(/#{t}/, "<>#{t}<>") }
    self.gsub!(/\A<>/, '')
    split('<>')
  end
end
p string.tokenize(tokens)
this should output:
["", "\n ", “”, "\n ", “”, “test”, “”, “\n “,
””, "\n ", “”, "\n ", “
”, “wow it worked”, “
”, “\n “,Is there a better way?
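For what it's worth, String#split keeps anything captured by a group in the
pattern, so a minimal sketch of the behaviour being asked for (an illustration,
not part of the original post) could be:

  "a= c+ a".split(/(=|\+| )/).reject { |s| s.empty? }
  #=> ["a", "=", " ", "c", "+", " ", "a"]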
"John W. Long" ng@johnwlong.com wrote in message
news:00e401c3d804$93be7ff0$6601a8c0@jwldesktop...
Hal Fulton wrote:
irb comes with its own lexer. I’ve used that before.
(If it's Ruby you're tokenizing.)

Actually no, I'm not using it to parse Ruby. I'm just looking for something
general that works like String#scan, but leaves the tokens in place in the
array. This would be useful for parsing almost any language.
I'm not sure that I understand what you mean by "leaves the tokens in place
in the array". String#scan is usually the method of choice:

irb(main):002:0> "a= c+ a".scan( /\w+|=|\+/ )
=> ["a", "=", "c", "+", "a"]
irb(main):003:0> "a= c+ a".scan( /\w+|=|\+|\s+/ )
=> ["a", "=", " ", "c", "+", " ", "a"]
What are you missing?
Regards
robert
I believe this also works:
class String
  def tokenize(*tokens)
    regex = Regexp.new(tokens.map do |t|
      Regexp.escape(t)
    end.join("|"))
    do_tokenize(regex).delete_if { |str| str == "" }
  end

  def do_tokenize(regex)
    match = regex.match self
    if match
      [match.pre_match, match[0]] +
        match.post_match.do_tokenize(regex)
    else
      [self]   # keep whatever follows the last token
    end
  end
end
The recursion could cause problems if you have really long strings, in
which case it’d probably
be wise to rewrite it as a loop (which is arguably somewhat uglier). You
might also want to
make #do_tokenize private. I don’t know if this is the best way, but
it’s a way.
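As a quick sanity check of what that version should return for the original
example (assuming the do_tokenize above, with the trailing remainder kept):

  "a= c+ a".tokenize("=", "+", " ")
  #=> ["a", "=", " ", "c", "+", " ", "a"]   # empty strings removed by delete_if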
What about this? It produces what the original poster asked for in his
original message.
"a= c+ a".split //
=> ["a", "=", " ", "c", "+", " ", "a"]
On 1/11/2004 9:06 AM, Robert Klemme wrote:
irb(main):002:0> "a= c+ a".scan( /\w+|=|\+/ )
=> ["a", "=", "c", "+", "a"]
irb(main):003:0> "a= c+ a".scan( /\w+|=|\+|\s+/ )
=> ["a", "=", " ", "c", "+", " ", "a"]

What are you missing?
–
Never trust a girl with your mother’s cow,
never let your trousers go falling down in the green grass…
Hi,
At Sun, 11 Jan 2004 23:06:39 +0900, Robert Klemme wrote:
irb(main):002:0> "a= c+ a".scan( /\w+|=|\+/ )
=> ["a", "=", "c", "+", "a"]
irb(main):003:0> "a= c+ a".scan( /\w+|=|\+|\s+/ )
=> ["a", "=", " ", "c", "+", " ", "a"]

What about this?

"a= c+ a".split(/\b/) # => ["a", "= ", "c", "+ ", "a"]
–
Nobu Nakada
"Joey Gibson" joey@joeygibson.com wrote in message
news:4002A7D6.4090100@joeygibson.com...
On 1/11/2004 9:06 AM, Robert Klemme wrote:
irb(main):002:0> "a= c+ a".scan( /\w+|=|\+/ )
=> ["a", "=", "c", "+", "a"]
irb(main):003:0> "a= c+ a".scan( /\w+|=|\+|\s+/ )
=> ["a", "=", " ", "c", "+", " ", "a"]

What are you missing?

What about this? It produces what the original poster asked for in his
original message.

"a= c+ a".split //
=> ["a", "=", " ", "c", "+", " ", "a"]
It's bad because it splits at every character, not at tokens:

irb(main):001:0> "foo bar".split //
=> ["f", "o", "o", " ", "b", "a", "r"]
Definitely not a solution for the OP.
robert
nobu.nokada@softhome.net wrote in message
news:200401121711.i0CHBhuw004600@sharui.nakada.kanuma.tochigi.jp...
Hi,
irb(main):002:0> "a= c+ a".scan( /\w+|=|\+/ )
=> ["a", "=", "c", "+", "a"]
irb(main):003:0> "a= c+ a".scan( /\w+|=|\+|\s+/ )
=> ["a", "=", " ", "c", "+", " ", "a"]

What about this?

"a= c+ a".split(/\b/) # => ["a", "= ", "c", "+ ", "a"]
I guess the OP won’t like it because the whitespace is not separated
properly. And he wanted to be able to provide the tokens conveniently.
Another solution, which can of course be tweaked and integrated into String:
irb(main):089:0> def tokenizer(*tokens)
irb(main):090:1>   Regexp.new( tokens.map{|tk| tk.kind_of?( Regexp ) ? tk :
Regexp.escape(tk)}.join('|') )
irb(main):091:1> end
=> nil
irb(main):092:0> def tokenize(str, *tokens)
irb(main):093:1>   str.scan tokenizer(*tokens)
irb(main):094:1> end
=> nil
irb(main):095:0> tokenize( "a= c+ a", "=", "+", /\w+/, /\s+/ )
=> ["a", "=", " ", "c", "+", " ", "a"]
That way one can reuse the tokenizer regexp for multiple passes.
Regards
robert
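A small illustration of the reuse Robert describes, assuming the tokenizer and
tokenize helpers above (the regexp is built once and then applied to several
strings):

  token_re = tokenizer( "=", "+", /\w+/, /\s+/ )
  "a= c+ a".scan(token_re)    #=> ["a", "=", " ", "c", "+", " ", "a"]
  "b =c + 1".scan(token_re)   #=> ["b", " ", "=", "c", " ", "+", " ", "1"]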
“Dan Doel” wrote:
I believe this also works:
…snip!..
This is almost exactly what I was looking for.
The recursion could cause problems if you have really
long strings, in which case it’d probably be wise to
rewrite it as a loop (which is arguably somewhat
uglier).
Depends on what you mean by ugly:
class String
  def tokenize(*tokens)
    regex = Regexp.new(tokens.map { |t| Regexp::escape(t) }.join("|"))
    string = self.dup
    array = []
    while match = regex.match(string)
      array += [match.pre_match, match[0]]
      string = match.post_match
    end
    array += [string]
    array.delete_if { |str| str == "" }
  end

  def each_token(*tokens, &b)
    tokenize(*tokens).each { |t| b.call(t) }
  end
end
Very nice. If only it would work with regular expressions as well.
I wonder what the odds are of getting this or something like this added to
the language. Seems like it would be nice to have on the String class to
begin with and written in C for speed.
Robert Klemme wrote:
I guess the OP won’t like it because the whitespace is not separated
properly. And he wanted to be able to provide the tokens conveniently.
Another solution, which can of course be tweaked and integrated into
String:
…
Nice solution, but my idea of specifying your tokens is to have it separate
before and after each token. I don’t want to have to specify everything I’m
looking for. Just the significant tokens. Sometimes whitespace is
significant:
‘a = " this is a string "’.tokenize(‘"’, “=”)
should produce:
[“a”, " ", “=”, " ", “"”, " this is a string ", “"”]
"John W. Long" ng@johnwlong.com wrote in message
news:010501c3d995$9f435390$6601a8c0@jwldesktop...
"Dan Doel" wrote:

I believe this also works:
...snip!...

This is almost exactly what I was looking for.

The recursion could cause problems if you have really
long strings, in which case it'd probably be wise to
rewrite it as a loop (which is arguably somewhat
uglier).

Depends on what you mean by ugly:

class String
  def tokenize(*tokens)
    regex = Regexp.new(tokens.map { |t| Regexp::escape(t) }.join("|"))
    string = self.dup
    array = []
    while match = regex.match(string)
      array += [match.pre_match, match[0]]
      string = match.post_match
    end
    array += [string]
    array.delete_if { |str| str == "" }
  end

  def each_token(*tokens, &b)
    tokenize(*tokens).each { |t| b.call(t) }
  end
end

Very nice. If only it would work with regular expressions as well.

I wonder what the odds are of getting this or something like this added
to the language. Seems like it would be nice to have on the String class
to begin with and written in C for speed.
Oh, we can still tweak the solution provided:

- Use Array#push or Array#<< instead of "+=", which creates too many
  temporary instances.
- Implement the iteration in each_token and make tokenize depend on that,
  so that tokenizing large strings via each_token is more efficient
  because no array is needed then.
- Don't add empty strings to the array.
- No need to dup.
That’s what I’d do:
class String
  def tokenize(*tokens)
    array = []
    each_token(*tokens) { |tk| array << tk }
    array
  end

  def each_token(*tokens)
    regex = Regexp.new(tokens.map { |t| t.kind_of?( Regexp ) ? t :
      Regexp::escape(t) }.join("|"))
    string = self
    while ( match = regex.match(string) )
      yield match.pre_match if match.pre_match.length > 0
      yield match[0] if match[0].length > 0
      string = match.post_match
    end
    yield string if string.length > 0
    self
  end
end
Kind regards
robert
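For the original example, the reworked version should come out like this (a
quick check based on the code above, not part of Robert's post):

  "a= c+ a".tokenize("=", "+", " ")
  #=> ["a", "=", " ", "c", "+", " ", "a"]

  "a= c+ a".each_token("=", "+", " ") { |tk| p tk }
  # prints "a", "=", " ", "c", "+", " ", "a", one token per line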
class String
  def tokenize(*tokens)
    tokens.map! { |t|
      case t
      when Regexp
        s = t.to_s
        s.gsub!(/\\./m, '')        # Remove escaped characters such as \(
        s.gsub!(/\(\?/, '')        # Remove (? group openers
        s =~ /\(/ ? t : /(#{t})/   # Add capturing () if none.
      when Symbol
        /(\s*)(#{t})(\s*)/         # Significant whitespace.
      else
        /(#{t})/
      end
    }
    split(/#{tokens * '|'}/).reject { |x| x.empty? }
  end
end
str = ‘a = " this is a string "’
p str.tokenize(‘"’, ‘=’) # Whitespace not a token.
p str.tokenize(‘"’, /(\s*)(=)(\s*)/) # Significant whitespace.
p str.tokenize(‘"’, :‘=’) # Same but less typing.
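If the case analysis above is read correctly, those three calls should print
roughly:

  ["a ", "=", " ", "\"", " this is a string ", "\""]
  ["a", " ", "=", " ", "\"", " this is a string ", "\""]
  ["a", " ", "=", " ", "\"", " this is a string ", "\""]

i.e. the first call leaves the spaces attached to the neighbouring fields,
while the other two split them out as requested earlier in the thread.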
“John W. Long” ng@johnwlong.com wrote:
Nice solution, but my idea of specifying your tokens is to have it separate
before and after each token. I don't want to have to specify everything I'm
looking for. Just the significant tokens. Sometimes whitespace is
significant:

'a = " this is a string "'.tokenize('"', "=")

should produce:

["a", " ", "=", " ", '"', " this is a string ", '"']
“Robert Klemme” wrote:
That's what I'd do:

class String
  def tokenize(*tokens)
    array = []
    each_token(*tokens) { |tk| array << tk }
    array
  end

  def each_token(*tokens)
    regex = Regexp.new(tokens.map { |t| t.kind_of?( Regexp ) ? t :
      Regexp::escape(t) }.join("|"))
    string = self
    while ( match = regex.match(string) )
      yield match.pre_match if match.pre_match.length > 0
      yield match[0] if match[0].length > 0
      string = match.post_match
    end
    yield string if string.length > 0
    self
  end
end
Good ideas. It would be fun to optimize this further for speed. Maybe even
do a little ruby golf with it. If I have time this evening I may work on
creating a demanding timed test case for it.
In mail “Re: Converting a string to an array of tokens”
class String
  def tokenize(*tokens)
    array = []
    each_token(*tokens) { |tk| array << tk }
    array
  end

  def each_token(*tokens)
    regex = Regexp.new(tokens.map { |t| t.kind_of?( Regexp ) ? t :
      Regexp::escape(t) }.join("|"))
    string = self
    while ( match = regex.match(string) )
      yield match.pre_match if match.pre_match.length > 0
      yield match[0] if match[0].length > 0
      string = match.post_match
    end
    yield string if string.length > 0
    self
  end
end
Using ruby 1.8.1:
~ % cat t
require 'strscan'
require 'enumerator'

class String
  def tokenize(*patterns)
    enum_for(:each_token, *patterns).map
  end

  def each_token(*patterns)
    re = Regexp.union(*patterns)
    s = StringScanner.new(self)
    until s.eos?
      break unless s.skip_until(re)
      yield s[0]
    end
  end
end

p "def m(a) 1 + a end".tokenize('def', 'end', /[a-z_]\w*/i, /\d+/, /\S/)
~ % ruby -v t
ruby 1.9.0 (2004-01-12) [i686-linux]
["def", "m", "(", "a", ")", "1", "+", "a", "end"]
– Minero Aoki
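Fed the original example plus a whitespace pattern, the StringScanner version
should give the tokens-in-place result as well (a quick check under the same
assumptions as the script above):

  "a= c+ a".tokenize('=', '+', /\w+/, /\s+/)
  #=> ["a", "=", " ", "c", "+", " ", "a"]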
“Robert Klemme” bob.news@gmx.net wrote:
"John W. Long" ng@johnwlong.com wrote in message
news:001f01c3d9da$28eeea70$6601a8c0@jwldesktop...
"Robert Klemme" wrote:

That's what I'd do:

class String
  def tokenize(*tokens)
    array = []
    each_token(*tokens) { |tk| array << tk }
    array
  end

  def each_token(*tokens)
    regex = Regexp.new(tokens.map { |t| t.kind_of?( Regexp ) ? t :
      Regexp::escape(t) }.join("|"))
    string = self
    while ( match = regex.match(string) )
      yield match.pre_match if match.pre_match.length > 0
      yield match[0] if match[0].length > 0
      string = match.post_match
    end
    yield string if string.length > 0
    self
  end
end

Good ideas. It would be fun to optimize this further for speed. Maybe even
do a little ruby golf with it. If I have time this evening I may work on
creating a demanding timed test case for it.
Btw, this would be easier if Regexp#match supported additional
arguments for start and end, like

def match(str, start = 0, stop = str.length)
  ...
end

and MatchData would expose the index of the start element and the
index of the first element after the match. The loop above could then be
written as:
regex = ...
start = 0
while ( match = regex.match(string, start) )
  yield self[start, match.start_index - start] if match.start_index - start > 0
  yield match[0] if match[0].length > 0
  start = match.end_index   # index of the first element after the match
end
yield self[start, self.length] if self.length - start > 0
What do others think, is this a reasonable extension?
Kind regards
robert
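For comparison, much the same no-slicing loop can already be approximated with
String#index, which accepts a start offset and sets $~ for the match (a rough
sketch along the lines of Robert's proposal, using only existing methods):

  class String
    def each_token(*tokens)
      regex = Regexp.new(tokens.map { |t|
        t.kind_of?(Regexp) ? t : Regexp.escape(t) }.join("|"))
      start = 0
      while (idx = index(regex, start))                  # sets $~
        match = $~
        yield self[start, idx - start] if idx > start    # text before the token
        yield match[0] unless match[0].empty?            # the token itself
        start = idx + match[0].length                    # continue after the match
      end
      yield self[start..-1] if start < length            # trailing remainder
    end
  end

  "a= c+ a".each_token("=", "+", " ") { |tk| p tk }
  # "a", "=", " ", "c", "+", " ", "a"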