>> I ended up refactoring, but earlier I was parsing some text by
>> associating an array of attributes (like [/a/, /b/, /c$/], which might
>> match the first 3 words) with a block that processed the matching
>> text, and then moved the position in the string forward by 3 words. I
>> wanted to be able to do this like:
> Did I understand that properly, you want to process three words at a
> time and then the next three words? Then you could do...
I have already solved the problem, but maybe someone will find this
useful in the future. Basically we start with position=0 (position is
the index of the words array). Each "match" is tried until one
succeeds, and the position is incremented by the number of words it
operated on. So if I called match.call(/the/, /quick/, /brown/, /.*/,
/e$/), it would read 5 words starting at "position" and if all the
arguments matched the words, it would process the 5 words in some way
and then increment the position by 5.
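A minimal sketch of that loop, using plain strings and regexes. The names
scan_words and RULES are made up for illustration, and each rule's block
here just joins or upcases the words it matched:

```ruby
# Hypothetical rules: each pairs a list of regexes with a handler.
RULES = [
  [[/the/, /quick/, /brown/], lambda { |ws| ws.join("_") }],
  [[/fox/],                   lambda { |ws| ws.first.upcase }],
]

def scan_words(words, rules)
  out = []
  position = 0
  while position < words.size
    # Find the first rule whose regexes match the words at `position`,
    # one regex per word, in order.
    matched = rules.find do |patterns, _handler|
      slice = words[position, patterns.size]
      slice.size == patterns.size &&
        slice.zip(patterns).all? { |word, rx| rx =~ word }
    end

    if matched
      patterns, handler = matched
      out << handler.call(words[position, patterns.size])
      position += patterns.size   # skip the words the rule consumed
    else
      out << words[position]      # nothing matched: keep the word as-is
      position += 1               # and move on, so the loop terminates
    end
  end
  out
end

p scan_words(%w[the quick brown fox jumps], RULES)
```

Trying the rules in order means the first one listed wins when several
could match at the same position.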
In my application I'm not really using regexes, though; my words are
tokens with various tags, and I'm matching based on the tags. This is
all being used to parse date strings: "Wed Aug 5th 2008" might be
matched by a rule like match.call(:weekday, :month, :ordinal, :year),
for example. Then there might be another rule like match.call(:num,
:num, :year) that would match "05 05 2005" and would decide how to
parse it.
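To make the tag-based matching concrete, here is a rough sketch. Tagged,
rule_matches? and DATE_TOKENS are hypothetical stand-ins, not the real
NLTime classes, and the tags assigned to each word are assumptions:

```ruby
# Hypothetical stand-in for a tagged word.
Tagged = Struct.new(:word, :tags)

# True if the tokens starting at `position` carry the wanted tags,
# one tag per token, in order.
def rule_matches?(tokens, position, wanted)
  slice = tokens[position, wanted.size] || []
  slice.size == wanted.size &&
    slice.zip(wanted).all? { |tok, tag| tok.tags.include?(tag) }
end

DATE_TOKENS = [
  Tagged.new("Wed",  [:weekday]),
  Tagged.new("Aug",  [:month]),
  Tagged.new("5th",  [:ordinal, :num]),
  Tagged.new("2008", [:year, :num]),
]

p rule_matches?(DATE_TOKENS, 0, [:weekday, :month, :ordinal, :year])
p rule_matches?(DATE_TOKENS, 0, [:num, :num, :year])
```

A token can carry several tags at once, which is what lets "2008" satisfy
either a :year or a :num slot.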
> It's still unclear to me how exactly you want the matching to work. Are
> all your "attributes" matched against all three words? Do you want
> positional matches? In the code all rx's are matched against words in
> the same position and if all match the block is invoked on the words.
Basically you have it right, the words have to match their
/respective/ attribute. But it's not a fixed number of words at a
time, because match.call(/the/) would only match one word (then
process it, then increment the position index by one).
Initially (it was late at night, mind you) I thought having a closure
would work nicely because I could access position, words, and some
other variables in the caller's scope and wouldn't have to pass those
along every time. But it was too tricky/messy because I also needed to
restart at the beginning of the loop after a success (to start trying
all the patterns again), and I needed to know whether anything had
matched (so that, if nothing had, I could increment position by 1
instead of looping forever).
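For what it's worth, the closure idea could look roughly like the sketch
below. It assumes a Ruby version where a lambda can take a block
parameter (1.9 or later), and scan_with_closure and the two sample rules
are invented for illustration:

```ruby
def scan_with_closure(words)
  position = 0
  output   = []

  # The lambda closes over words, position and output, so each rule
  # only has to pass its patterns and a handler block.
  match = lambda do |*patterns, &handler|
    slice = words[position, patterns.size]
    ok = slice.size == patterns.size &&
         slice.zip(patterns).all? { |word, rx| rx =~ word }
    if ok
      output << handler.call(slice)
      position += patterns.size
    end
    ok   # tell the caller whether anything was consumed
  end

  while position < words.size
    # `next` restarts the list of rules after any success.
    next if match.call(/the/, /quick/) { |ws| ws.join("+") }
    next if match.call(/fox/)          { |ws| ws.first.upcase }

    # No rule matched: keep the word and advance by one,
    # otherwise the loop would never end.
    output << words[position]
    position += 1
  end
  output
end

p scan_with_closure(%w[the quick brown fox])
```

The bookkeeping described above shows up as the boolean return value and
the `next if` on every rule, which is the part that gets messy as the
number of rules grows.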
What I ended up doing was having a function to store the list of
attributes and the block that should be called to "process" the
matching words, and then another function that began scanning the word
list from position=0, testing all the attributes (like match.call
would've), and taking care of incrementing the position index the
right amount. Here are parts of the code:
def self.date_scanner *tags, &block
  @@date_scanners << [tags, block]
end

def self.setup_date_scanners
  @@date_scanners = []

  date_scanner(NLTime::Day, :time, :tz) do |d, t|
    # two timezones were given, like 12:30:00 -0400 (EDT); ignore
    # rightmost one
    d.get_tag(NLTime::Day).time(t.get_tag(NLTime::Time))
  end

  date_scanner(NLTime::Day, :time) do |d, t|
    d.get_tag(NLTime::Day).time(t.get_tag(NLTime::Time))
  end

  date_scanner(:time, NLTime::Day) do |t, d|
    d.get_tag(NLTime::Day).time(t.get_tag(NLTime::Time))
  end

  date_scanner(:month, :num, :time, :year) do |m, a, t, y|
    # May 05 12:00:00 -0000 2005
    day = NLTime::Day.civil(a.word, m.word, y.get_tag(NLTime::Year))
    day.time(t.get_tag(NLTime::Time))
  end

  date_scanner(:year, :num, :num) do |y, a, b|
    # 2005 05 05
    NLTime::Day.civil(b.word, a.word, y.get_tag(NLTime::Year))
  end

  date_scanner(:year, :month, :num) do |y, m, a|
    # 2005 May 05
    NLTime::Day.civil(a.word, m.word, y.get_tag(NLTime::Year))
  end

  date_scanner(:month, :num, :year) do |m, a, y|
    # May 05 2005
    NLTime::Day.civil(a.word, m.word, y.get_tag(NLTime::Year))
  end

  # ...
end

def self.scan_dates tokens, order=:dm
  # TODO:
  #   order=:dm assume day/month, like European format
  #   order=:md assume month/day, like American format

  # processed tokens
  ptokens = []; k = 0

  while k < tokens.size
    found = false

    @@date_scanners.each do |tags, block|
      # take tags.size tokens, counting back from the k-th token from the end
      if s = tokens[-tags.size-k..-1-k]
        # assume success until one of the tags doesn't match
        found = true

        # match tags to tokens
        s.zip(tags).each do |token, tag|
          unless token.has_tag? tag
            # not a match... next scanner, please
            found = false
            break
          end
        end

        if found
          # this scanner matches, have the tokens processed
          if date = block.call(*s)
            token = NLTime::Token.new(date.to_s, :entity, date)
            ptokens.unshift token
            # increment the position by number of tokens processed
            # by the block
            k += tags.size
            # don't try to match any more scanners
            break
          else
            # the block failed, try the next scanner
            found = false
          end
        end
      end
    end

    unless found
      # none of the scanners matched
      ptokens.unshift tokens[-1-k]
      k += 1
    end
  end

  ptokens
end

The scan_dates method operates on an array of NLTime::Tokens, which have
various tags. The tags can be symbols, which basically categorize
words (like "Jan" would have the :month tag), or they can be objects (like
we might have tagged 2005 with an NLTime::Year object representing the
year 2005). It should "replace" each sequence of tokens that was
matched by a scanner with a new token, tagged with an instance of
NLTime::Day or Time.
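The real NLTime::Token class isn't shown here, so the following is only a
guess at the shape of has_tag? and get_tag, based on how the scanners
call them. SketchToken, SketchYear and SKETCH_TOKEN are hypothetical
names; a symbol tag is assumed to match by equality and a class tag to
match any tag object that is an instance of that class:

```ruby
# Hypothetical stand-in for NLTime::Token.
class SketchToken
  attr_reader :word, :tags

  # Same argument order as NLTime::Token.new(word, *tags) in the scanner.
  def initialize(word, *tags)
    @word = word
    @tags = tags
  end

  # Return the first tag equal to `tag`, or, when `tag` is a class,
  # the first tag object that is an instance of it. nil if none.
  def get_tag(tag)
    @tags.find { |t| tag.is_a?(Class) ? t.is_a?(tag) : t == tag }
  end

  def has_tag?(tag)
    !get_tag(tag).nil?
  end
end

# Hypothetical object tag, standing in for something like NLTime::Year.
SketchYear = Struct.new(:number)

SKETCH_TOKEN = SketchToken.new("2005", :year, :num, SketchYear.new(2005))

p SKETCH_TOKEN.has_tag?(:year)
p SKETCH_TOKEN.get_tag(SketchYear).number
```

With that shape, a scanner rule can mix symbols and classes freely, as in
date_scanner(NLTime::Day, :time), and a block can pull the object back
out with get_tag.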
> I still think you're not yet there.
Well, my code does what I want it to do... so I'm not sure what you mean?
On 6/10/07, Robert Klemme <shortcutter@googlemail.com> wrote:
On 10.06.2007 18:25, Erwin Abbott wrote:
Kind regards
robert