Scanning strings

I want to scan a string for breaks. I want to pick up both the breaks and the
text in between.
If I use String#each, I lose the space.
If I use String#scan, I get into a fairly involved expression with two
submatches, where the second submatch is almost exactly the negation of the
first.

Is it possible to break a string such that you get both the match and
the text before the last match, if any?

If not, wouldn’t that be a useful feature in Ruby?

One simple example is processing a line at a time, but you want to keep the
break style of the source text which could be , or .

Mikkel

You can use split with a regular expression. Or you can use a regular
expression with defined subexpressions. Either way you can get the results
you are looking for.

···

On Tuesday 11 March 2003 12:51 pm, MikkelFJ wrote:

I want to scan a string for breaks. I want to pick both the breaks and the
text in between.

Seth Kurtzberg
M. I. S. Corp.
480-661-1849
seth@cql.com

“MikkelFJ” mikkelfj-anti-spam@bigfoot.com wrote in message
news:3e6e3970$0$149$edfadb0f@dtext01.news.tele.dk…

If not, wouldn’t that be a useful feature in Ruby?

I imagined something like

string.scan2(/\s+/) { |m| process_text if m.pre_match; process_space if m[0] }

Of course, what I would really like to have is:

string.scan([:symbol1, Regexp1], [:symbol2, Regexp2], …) do |s, m|
  case s
  when :symbol1
    # … process m
  when :symbol2
    # …
  else # s == :default
    # …
  end
end

Initially I don’t care how inefficient it is, but long term, this would
allow the regex array to be compiled efficiently into a lexer. This would
seriously kick ass.
The match object would also be able to track line numbers, such that the else
branch could produce a sensible error message.

I noticed Ruby 1.8 has named subexpressions in the regexp source; perhaps some
of that could be used?

For my original problem I could write.

string.scan([:space, /\s+/]) do |s, m|
  case s
  when :space
    process_space m[0]
  else
    process_text m[0]
  end
end

but in this case it would be overkill.
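
The multi-pattern scan imagined above can be approximated in plain Ruby. What follows is only a rough sketch, not an existing String method: `scan_tokens` is a hypothetical name, it assumes a Ruby where `Regexp#match` accepts a start offset (1.9 or later), and it assumes no rule ever matches the empty string. Unlike the proposal, it yields the matched text as a second argument and the match object as a third, so the default branch can receive plain text with no match object.

```ruby
# Hypothetical helper, not a built-in. Yields (symbol, text, match) for each
# rule hit and (:default, text, nil) for stretches that no rule matched.
def scan_tokens(str, *rules)
  pos = 0
  while pos < str.length
    best_sym = best_match = nil
    rules.each do |sym, re|
      m = re.match(str, pos)
      next if m.nil? || m[0].empty? # ignore misses and zero-width hits
      # The earliest match wins; on a tie the rule listed first wins.
      if best_match.nil? || m.begin(0) < best_match.begin(0)
        best_sym, best_match = sym, m
      end
    end
    if best_match.nil?
      yield :default, str[pos..-1], nil # nothing left to match
      break
    end
    start = best_match.begin(0)
    yield :default, str[pos...start], nil if start > pos
    yield best_sym, best_match[0], best_match
    pos = best_match.end(0)
  end
end

tokens = []
scan_tokens("ab 12cd", [:space, /\s+/], [:number, /\d+/]) do |sym, text, _m|
  tokens << [sym, text]
end
# tokens is now
# [[:default, "ab"], [:space, " "], [:number, "12"], [:default, "cd"]]
```

This tries every rule at every step, so it is as inefficient as the post allows for; compiling the rules into a real lexer is left open.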

Mikkel

“Seth Kurtzberg” seth@cql.com wrote in message
news:200303111259.02823.seth@cql.com

You can use split with a regular expression. Or you can use a regular
expression with defined subexpressions. Either way you can get the
results you are looking for.

Nope, split will not give me the text where the split takes place and it
will also not give me a match object such that I can use subexpressions.

String#scan allows me to use subexpressions as I mentioned, but I end up
with a very hairy expression because you have to negate the second
submatch, plus I need to handle border conditions very carefully.

Of course I can manually iterate over a string using match, but that isn’t
really the Ruby way.

Mikkel

You can use split with a regular expression. Or you can use a
regular expression with defined subexpressions. Either way you
can get the results you are looking for.

Nope, split will not give me the text where the split takes place
and it will also not give me a match object such that I can use
subexpressions.

What did you test?

>ruby -ve 'p "abc def ghi".split(/(\s+)/)'
ruby 1.7.2 (2002-05-07) [i386-freebsd]
["abc", " ", "def", " ", "ghi"]

Hmm, by the way,

" abc def ghi ".split(/(\s+)/)

yields

["", " ", "abc", " ", "def", " ", "ghi", " "]

What criterion causes the asymmetry of a leading “” but no trailing one?
I forgot the thread on this issue…

···

In message 3e6e4260$0$132$edfadb0f@dtext01.news.tele.dk mikkelfj-anti-spam@bigfoot.com writes:


kjana@dm4lab.to March 12, 2003
Every body’s business is nobody’s business.

Hmm, by the way,
    " abc def ghi ".split(/(\s+)/)
yields
    ["", " ", "abc", " ", "def", " ", "ghi", " "]

pigeon% ruby -e 'p " abc def ghi ".split(/(\s+)/, -1)'
["", " ", "abc", " ", "def", " ", "ghi", " ", ""]
pigeon%

Guy Decoux

“YANAGAWA Kazuhisa” kjana@dm4lab.to wrote in message
news:20030312112248.DC4161EE12@milestones.dm4lab.to…

Nope, split will not give me the text where the split takes place
and it will also not give me a match object such that I can use
subexpressions.

What did you test?

>ruby -ve 'p "abc def ghi".split(/(\s+)/)'
ruby 1.7.2 (2002-05-07) [i386-freebsd]
["abc", " ", "def", " ", "ghi"]

I used 1.7.3, but mostly referred to the information provided by ri.
I assumed that the matched string was removed, but this is apparently not
always the case:
From ri String#split

" now's  the time".split(/ /) #=> ["", "now's", "", "the", "time"]
"1, 2.34,56, 7".split(/,\s*/) #=> ["1", "2.34", "56", "7"]

My own test in 1.7.3:

"abc def ghi".split(/\s+/) #=> ["abc", "def", "ghi"]

I think my problem is important because it is typical for scanning tags
embedded in text.

On a related note: In the end I solved my problem using a while loop.
But since match does not take an offset as a parameter, I get to reallocate
strings all the time.

Based on memory, my solution was something like

s = "my string to scan"
RE = /\s+/ # actually somewhat more complicated than this
until s.empty?
  m = RE.match(s)
  if m
    process_text m.pre_match
    process_space m[0]
    s = m.post_match # allocates a new string on every iteration
  else
    process_text s
    break
  end
end

This is lengthy and not very efficient. The s = m.post_match could result in
a lot of allocation on large strings, and you can’t operate directly on a
file.
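
For comparison, here is a sketch of the same loop that keeps an offset into the original string instead of replacing it with the post-match each time. It assumes a Ruby where `Regexp#match` accepts a start position as its second argument (1.9 or later, so newer than the versions discussed in this thread). `each_text_and_break` is a made-up name, and the pattern is assumed never to match the empty string.

```ruby
# Hypothetical helper: yields [:text, piece] and [:break, separator] in
# document order, scanning from an offset rather than re-slicing the rest.
def each_text_and_break(str, break_re = /\s+/)
  pos = 0
  while (m = break_re.match(str, pos))
    # A zero-width match would never advance pos, so refuse it outright.
    raise ArgumentError, "pattern matched an empty string" if m[0].empty?
    yield :text, str[pos...m.begin(0)] if m.begin(0) > pos
    yield :break, m[0]
    pos = m.end(0)
  end
  yield :text, str[pos..-1] if pos < str.length
end

pieces = []
each_text_and_break("a\r\nb\nc", /\r\n|\n|\r/) { |kind, s| pieces << [kind, s] }
# pieces is now
# [[:text, "a"], [:break, "\r\n"], [:text, "b"], [:break, "\n"], [:text, "c"]]
```

Each piece is still copied out as its own string, but the remainder of the input is not rebuilt on every step. It does not address reading directly from a file.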

Mikkel

“ts” decoux@moulon.inra.fr wrote in message
news:200303121125.h2CBPWd09137@moulon.inra.fr

Hmm, by the way,
" abc def ghi ".split(/(\s+)/)
yields
["", " ", "abc", " ", "def", " ", "ghi", " "]

pigeon% ruby -e 'p " abc def ghi ".split(/(\s+)/, -1)'
["", " ", "abc", " ", "def", " ", "ghi", " ", ""]
pigeon%

Arrgh - now I see. I need to create a submatch: /(\s+)/, not /\s+/
I did use the -1 to handle border conditions.
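
Putting the two pieces together (a capturing group so split keeps the separators, and a limit of -1 so trailing empty fields survive), the result can be walked as text/break pairs. This is only a sketch: `text_break_pairs` is a made-up name, and it assumes the pattern passed in has no capture groups of its own, since split would add those captures to the result as well.

```ruby
# Hypothetical helper: returns [text, break] pairs; the last pair has a
# nil break because no separator follows the final piece of text.
def text_break_pairs(str, break_re = /\s+/)
  # Interpolating a Regexp embeds it with its own options, and the
  # surrounding group makes split return the separators too.
  fields = str.split(/(#{break_re})/, -1)
  fields.each_slice(2).map { |text, brk| [text, brk] }
end

pairs = text_break_pairs(" abc def ")
# pairs is now [["", " "], ["abc", " "], ["def", " "], ["", nil]]
```

The empty strings at either end mark a break at the very start or end of the input, so joining all the pieces back together reproduces the original string.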

Mikkel

Yes, I searched and found the discussion of it in the ruby-list
archive. There the incompatibility between Ruby and Perl on split (at that
time) was discussed, and matz said “methods introduced from Perl like
String#split should behave like so”.

That thread is from October 1998; how far we’ve come…

The remaining question is why Perl by default discards empty fields
at the tail on split whereas it preserves them at the head, but that’s
another story :P

Wow, I’d completely forgotten the behavior of String#split given " "
(exactly one space)…

···

In message 200303121125.h2CBPWd09137@moulon.inra.fr decoux@moulon.inra.fr writes:


kjana@dm4lab.to March 13, 2003
Out of frying-pan, into the fire.