thanks Sean and Park. it's actually been interesting comparing these two
very different pieces of code that do the same thing. here are my
results thus far with my modifications to both:

park, note the removal of the for-j loop. it really helped the speed!
def parks_tokenizer(string)
  s = ['<','[','{','"','*']
  e = ['>',']','}','"','*']
  items = []
  i = 0
  while i < string.length
    if not s.include?(string[i,1])
      # plain text: scan ahead to the next opening delimiter
      j = i+1
      j += 1 while j < string.length && !s.include?(string[j,1])
      items.concat string[i..j-1].strip.split(' ')
      i = j
    else
      j = s.index(string[i,1])
      if s[j] == '"' || s[j] == '*'
        # symmetric delimiter: just find its partner
        k = string.index(e[j],i+1)
      else
        # repeatable delimiter: skip extra openers, find the closer,
        # then absorb extra closers (handles {{g}} and [[h]])
        k = i
        k += 1 while string[k,1] == s[j]
        k = string.index(e[j],k)
        k += 1 while k+1 < string.length && string[k+1,1] == e[j]
      end
      items << string[i..k]
      i = k+1
    end
  end
  return items
end
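
as a quick sanity check (using the test string park posted below),
delimited groups should come back whole and bare words get split on
whitespace:

  p parks_tokenizer('a[c]{d}"e"f {{g}} [[h]]i**j"k"l')
  # expecting something like:
  # ["a", "[c]", "{d}", "\"e\"", "f", "{{g}}", "[[h]]", "i", "**", "j", "\"k\"", "l"]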
sean, i got rid of the perlish notation
and made the second part more like the first:
def seans_tokenizer(string)
  tokens = {'['=>']', '<'=>'>', '"'=>'"', '{'=>'}', '*'=>'*', "'"=>"'"}
  items = []
  while string.size > 0
    if tokens.keys.include?(string[0,1])
      # delimited token: take everything up to the matching closer
      end_index = string.index(tokens[string[0,1]], 1)
      item = string[0..end_index]
      items << item
      string = string[end_index+1..-1]
      # keep extending while openers outnumber closers (handles {{g}} and [[h]])
      while item.count(item[0,1]) > item.count(tokens[item[0,1]])
        end_index = string.index(tokens[item[0,1]])
        item << string[0..end_index]
        string = string[end_index+1..-1]
      end
    else
      # plain text: cut at the next delimiter, whitespace, or end of string
      end_index = string.index(/[[{<*"'\s]|\z/, 1)
      item = string[0..end_index-1].strip
      items << item if not item.empty?
      string = string[end_index..-1]
    end
  end
  items
end
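
and the same sanity check against this one; as far as i can tell it
should hand back the same list:

  p seans_tokenizer('a[c]{d}"e"f {{g}} [[h]]i**j"k"l')
  # should match parks_tokenizer's output for this input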
looping over 100 iterations of each results in park's version taking
~2.8 seconds and sean's ~2.3, but i think park's might have a little
more room for improvement. oddly, the more i work with them, the more i
am beginning to see that they are, in effect, the same. i'll let you
know how that progresses.
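
in case anyone wants to reproduce the timing, this is roughly the shape
of the loop i mean (just a sketch using benchmark from the stdlib; swap
in whatever input string and iteration count you like, so your numbers
will differ from mine):

  require 'benchmark'

  str = 'a[c]{d}"e"f {{g}} [[h]]i**j"k"l'   # any test input will do
  Benchmark.bm(6) do |x|
    x.report('park:') { 100.times { parks_tokenizer(str) } }
    x.report('sean:') { 100.times { seans_tokenizer(str) } }
  end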
by the way, one of the reasons i brought this up (and thank god i did, as
these pieces of code are so much better than mine) was to perhaps talk
about Regular Expressions and string parsing in general. seems to me
that parsing text is like THE fundamental programming task. why haven't
any really awesome technologies come about to deal with it? in my
personal opinion Regexps are powerful but limited, as indicated by my
parsing problem. i remember hearing that a language called SNOBOL had
great string processing capabilities. does anyone know about that?
finally, a Steven J. Hunter sent me this Icon version:
l_ans := []
str_in ? until pos(0) do   # Written by Steven J. Hunter
  if close_delim_cs := \open2close_t[open_delim := move(1)]
  then put(l_ans, open_delim || tab(1+bal(close_delim_cs, '<[{', '}]>')))
  else tab(many(' ')) | put(l_ans, tab(upto(start_of_nxt_token_cs) | 0))
a real mouthful, but quite compact. i haven't fully digested it yet.

thanks for participating! this has turned out to be much more
interesting and fruitful than i expected.
~transami (tom)
On Thu, 2002-07-04 at 09:15, Sean Russell wrote:
> Park Heesob wrote:
> > did you take a look at sean's version, by the way?
> > a tad more elegant although he does use regexps.
>
> Sean's version fails at
>
>   str = 'a[c]{d}"e"f {{g}} [[h]]i**j"k"l'
Adding two characters to the regexp fixes that. The regexp should be

  string =~ /(.*?)(?=[<[{*"']|$)/