Some Regexp

I want to get array from file like this:

TEXT1
some text, tabs, CRLF etc 1
TEXT2
TEXT1
some text, tabs, CRLF etc 2
TEXT2
TEXT1
some text, tabs, CRLF etc 3
TEXT2

The array must consist of “some text, tabs, CRLF etc 1”, “some text,
tabs, CRLF etc 2”

Can you help me?

Something like this should help:

open = "TEXT1"
close = "TEXT2"
array = data.scan(/#{Regexp::quote(open)}(.*?)#{Regexp::quote(close)}/m).flatten

I get:
array == ["\nsome text, tabs, CRLF etc 1\n", "\nsome text, tabs, CRLF etc 2\n", "\nsome text, tabs, CRLF etc 3\n"]
with your data.
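
The captured strings above still carry the newlines that surround each block. If the goal is the bare text the original question lists, a strip can be chained on. The following is only a sketch: the helper name extract_blocks is hypothetical (not from the thread), and it assumes the whole file content is already in a string.

```ruby
# Hypothetical helper (not from the thread): returns the text between each
# opening/closing marker pair, with surrounding whitespace removed.
def extract_blocks(data, open_marker, close_marker)
  # Regexp.quote escapes the markers; /m lets "." match newlines as well.
  pattern = /#{Regexp.quote(open_marker)}(.*?)#{Regexp.quote(close_marker)}/m
  data.scan(pattern).flatten.map(&:strip)
end

data = <<~DATA
  TEXT1
  some text, tabs, CRLF etc 1
  TEXT2
  TEXT1
  some text, tabs, CRLF etc 2
  TEXT2
DATA

p extract_blocks(data, "TEXT1", "TEXT2")
```

Note that strip also removes leading tabs or spaces inside a block, which may or may not be wanted.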

···

On Wednesday 03 December 2003 17:32, Dmitry N Orlov wrote:

TEXT1
some text, tabs, CRLF etc 1
TEXT2
TEXT1
some text, tabs, CRLF etc 2
TEXT2
TEXT1
some text, tabs, CRLF etc 3
TEXT2

sdmitry -=- Dmitry V. Sabanin
MuraveyLabs.

Dmitry N Orlov wrote:

I want to get array from file like this:

TEXT1
some text, tabs, CRLF etc 1
TEXT2
TEXT1
some text, tabs, CRLF etc 2
TEXT2
TEXT1
some text, tabs, CRLF etc 3
TEXT2

The array must consist of “some text, tabs, CRLF etc 1”, “some text,
tabs, CRLF etc 2”

You could do something like this (until, as I understand it, this
feature is to be removed):

arr = Array.new()
while (line = gets)
  # flip-flop: true from a TEXT1 line through the next TEXT2 line
  arr << line if (line =~ /TEXT1/) .. (line =~ /TEXT2/)
end

(You can shorten that up even more, but I think this gets the point across.)

all fields are separated by either
TEXT2
TEXT1
or
TEXT1
as a special case

/tmp > cat foo.rb
txt = <<-txt
TEXT1
some text, tabs, CRLF etc 1
TEXT2
TEXT1
some text, tabs, CRLF etc 2
TEXT2
TEXT1
some text, tabs, CRLF etc 3
TEXT2
txt

p(txt.split(%r/(?:TEXT1)|(?:TEXT2$)?TEXT1/iom)[1..-1])

/tmp > ruby foo.rb
["\n some text, tabs, CRLF etc 1\n TEXT2\n ", "\n some text, tabs, CRLF etc 2\n TEXT2\n ", "\n some text, tabs, CRLF etc 3\n TEXT2\n"]

note that the first field is dropped, since it is empty.

-a

···

On 3 Dec 2003, Dmitry N Orlov wrote:

Date: 3 Dec 2003 02:28:24 -0800
From: Dmitry N Orlov orlovdn@rambler.ru
Newsgroups: comp.lang.ruby
Subject: Some Regexp

I want to get array from file like this:

TEXT1
some text, tabs, CRLF etc 1
TEXT2
TEXT1
some text, tabs, CRLF etc 2
TEXT2
TEXT1
some text, tabs, CRLF etc 3
TEXT2

The array must consist of “some text, tabs, CRLF etc 1”, “some text,
tabs, CRLF etc 2”

Can You help me?

ATTN: please update your address books with address below!

===============================================================================

EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
PHONE :: 303.497.6469
ADDRESS :: E/GC2 325 Broadway, Boulder, CO 80305-3328
STP :: Solar-Terrestrial Physics Data | NCEI
NGDC :: http://www.ngdc.noaa.gov/
NESDIS :: http://www.nesdis.noaa.gov/
NOAA :: http://www.noaa.gov/
US DOC :: http://www.commerce.gov/

The difference between art and science is that science is what we
understand well enough to explain to a computer.
Art is everything else.
– Donald Knuth, “Discover”

/bin/sh -c ‘for l in ruby perl;do $l -e “print "\x3a\x2d\x29\x0a"”;done’
===============================================================================

Interesting. This doesn’t cope with nesting though. I’ve just
tried that out. Is there a good way to do that with scan?

    Hugh
···

On Wed, 3 Dec 2003, Dmitry V. Sabanin wrote:

Something like this should help:

open = "TEXT1"
close = "TEXT2"
array = data.scan(/#{Regexp::quote(open)}(.*?)#{Regexp::quote(close)}/m).flatten

I get:
array == ["\nsome text, tabs, CRLF etc 1\n", "\nsome text, tabs, CRLF etc 2\n", "\nsome text, tabs, CRLF etc 3\n"]
with your data.

“Dmitry V. Sabanin” sdmitry@lrn.ru schrieb im Newsbeitrag
news:200312031852.11934.sdmitry@lrn.ru

TEXT1
some text, tabs, CRLF etc 1
TEXT2
TEXT1
some text, tabs, CRLF etc 2
TEXT2
TEXT1
some text, tabs, CRLF etc 3
TEXT2
Something like this should help:

open = "TEXT1"
close = "TEXT2"
array = data.scan(/#{Regexp::quote(open)}(.*?)#{Regexp::quote(close)}/m).flatten

or:

array = data.scan(/#{Regexp::quote(open)}(.*?)#{Regexp::quote(close)}/m).map{|x|x[0]}

Directly reading from a file:

IO.read("file.txt").scan(/#{Regexp::quote(open)}(.*?)#{Regexp::quote(close)}/m).map{|x|x[0]}

Unfortunately these solutions require reading the whole file in at once.
If it is guaranteed that TEXT1 and TEXT2 are always on a line by themselves,
you can apply more efficient but slightly more complex solutions.
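
One possible shape for such a line-oriented solution is sketched below. It is not from the thread: the helper name each_block is hypothetical, and it assumes each marker sits alone on its line (surrounding whitespace ignored). Only the current block is held in memory, so the input can be any IO object, for example an open file.

```ruby
require "stringio"

# Hypothetical helper: walks the input line by line and yields the text of
# each block as one string, without reading the whole input at once.
def each_block(io, open_marker = "TEXT1", close_marker = "TEXT2")
  current = nil                      # nil means "not inside a block"
  io.each_line do |line|
    case line.strip
    when open_marker
      current = String.new           # start collecting a new block
    when close_marker
      yield current.chomp if current # hand the finished block to the caller
      current = nil
    else
      current << line if current     # lines outside any block are ignored
    end
  end
end

io = StringIO.new("TEXT1\nsome text 1\nmore text\nTEXT2\nTEXT1\nsome text 2\nTEXT2\n")
blocks = []
each_block(io) { |block| blocks << block }
p blocks
```

With a real file, File.open("file.txt") { |f| each_block(f) { |b| ... } } would be used in place of the StringIO; the file name here is only a placeholder.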

Regards

robert
···

On Wednesday 03 December 2003 17:32, Dmitry N Orlov wrote:

“Michael campbell” michael_s_campbell@yahoo.com schrieb im Newsbeitrag
news:3FCDF61E.2010906@yahoo.com

Dmitry N Orlov wrote:

I want to get array from file like this:

TEXT1
some text, tabs, CRLF etc 1
TEXT2
TEXT1
some text, tabs, CRLF etc 2
TEXT2
TEXT1
some text, tabs, CRLF etc 3
TEXT2

The array must consist of “some text, tabs, CRLF etc 1”, “some text,
tabs, CRLF etc 2”

You could do something like this (until, as I understand it, this
feature is to be removed):

arr = Array.new()
while (line = gets)
  # flip-flop: true from a TEXT1 line through the next TEXT2 line
  arr << line if (line =~ /TEXT1/) .. (line =~ /TEXT2/)
end

No, that doesn’t work, since each line becomes a new entry in the
array. But the OP wanted the text of each block to be in one string.

robert

“Hugh Sasse Staff Elec Eng” hgs@dmu.ac.uk schrieb im Newsbeitrag
news:Pine.GSO.4.58.0312031220260.10895@neelix…

Something like this should help:

open = "TEXT1"
close = "TEXT2"
array = data.scan(/#{Regexp::quote(open)}(.*?)#{Regexp::quote(close)}/m).flatten

I get:
array == ["\nsome text, tabs, CRLF etc 1\n", "\nsome text, tabs, CRLF etc 2\n", "\nsome text, tabs, CRLF etc 3\n"]
with your data.

Interesting. This doesn’t cope with nesting though. I’ve just
tried that out. Is there a good way to do that with scan?

Do you mean nesting of TEXT1…TEXT2 sections within each other? That
can’t be done with regexps. You need a context free parser for that.
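
To make the point about needing more than a plain regexp concrete, here is a small hand-written sketch that handles nested sections with an explicit stack. It is an illustration only: the helper name parse_nested and the nested-array result format are assumptions, not something proposed in the thread.

```ruby
# Hypothetical helper: turns nested TEXT1 ... TEXT2 sections into nested
# arrays. The regexp is only used to tokenize; the stack tracks the nesting.
def parse_nested(data, open_marker = "TEXT1", close_marker = "TEXT2")
  # The capture group makes split keep the markers as tokens of their own.
  splitter = /(#{Regexp.quote(open_marker)}|#{Regexp.quote(close_marker)})/
  root = []
  stack = [root]
  data.split(splitter).each do |token|
    case token
    when open_marker
      child = []
      stack.last << child            # attach the new section to its parent
      stack.push(child)
    when close_marker
      raise ArgumentError, "unbalanced #{close_marker}" if stack.size == 1
      stack.pop
    else
      text = token.strip
      stack.last << text unless text.empty?
    end
  end
  raise ArgumentError, "missing #{close_marker}" unless stack.size == 1
  root
end

p parse_nested("TEXT1 outer TEXT1 inner TEXT2 tail TEXT2")
```

Each section becomes an array holding its text pieces and any child sections, in order of appearance.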

Cheers

robert
···

On Wed, 3 Dec 2003, Dmitry V. Sabanin wrote:

How about this ?

···

On Wed, 03 Dec 2003 15:19:54 +0100, Robert Klemme wrote:

open = "TEXT1"
close = "TEXT2"
array = data.scan(/#{Regexp::quote(open)}(.*?)#{Regexp::quote(close)}/m).flatten


Simon Strandgaard

ruby a.rb
["some text, tabs, CRLF etc 1"]
["some text, tabs, CRLF etc 2"]
["some text, tabs, CRLF etc 3"]
cat a.rb
text=<<EOT
TEXT1
some text, tabs, CRLF etc 1
TEXT2
TEXT1
some text, tabs, CRLF etc 2
TEXT2
TEXT1
some text, tabs, CRLF etc 3
TEXT2
EOT
re = /TEXT1$.+?(^.*?$).+?TEXT2/m
text.scan(re){|match| p match }

“Hugh Sasse Staff Elec Eng” hgs@dmu.ac.uk schrieb im Newsbeitrag
news:Pine.GSO.4.58.0312031220260.10895@neelix…

Interesting. This doesn’t cope with nesting though. I’ve just
tried that out. Is there a good way to do that with scan?

Do you mean nesting of TEXT1…TEXT2 sections within each other? That

Yes.

can’t be done with regexps. You need a context free parser for that.

Is this a limitation inherent in regexps, or just as they are
implemented now? I ask, because Lua (which uses % where we use \
for things like \w, \d etc) has %bxy which matches a balanced x y
pair and its contents. Eg %b() would match the whole of
“(an (example))”. The delimiters being one character only couldn’t
make a difference here, or could it?

Cheers

robert
    Hugh
···

On Wed, 3 Dec 2003, Robert Klemme wrote:

Is this a limitation inherent in regexps, or just as they are
implemented now? I ask, because Lua (which uses % where we use \
for things like \w, \d etc) has %bxy which matches a balanced x y
pair and its contents. Eg %b() would match the whole of
"(an (example))". The delimiters being one character only couldn't
make a difference here, or could it?

Yes, it can make this work precisely because it uses only *one* character
for the delimiter.

Guy Decoux

The %bxy feature seems nice, I had to look it up in Lua’s manual:
http://www.lua.org/manual/5.0/manual.html#5.3

Maybe I should add it to my regexp engine ?

···

On Wed, 03 Dec 2003 23:46:03 +0900, Hugh Sasse Staff Elec Eng wrote:

I ask, because Lua (which uses % where we use \
for things like \w, \d etc) has %bxy which matches a balanced x y
pair and its contents. Eg %b() would match the whole of
“(an (example))”. The delimiters being one character only couldn’t
make a difference here, or could it?


Simon Strandgaard

BTW: I have just released regexp-engine 0.6
http://raa.ruby-lang.org/list.rhtml?name=regexp

Is this a limitation inherent in regexps, or just as they are
implemented now? I ask, because Lua (which uses % where we use \
for things like \w, \d etc) has %bxy which matches a balanced x y
pair and its contents. Eg %b() would match the whole of
“(an (example))”. The delimiters being one character only couldn’t
make a difference here, or could it?

Yes, it can make this work precisely because it uses only one character
for the delimiter.

OK, I’m probably going to regret asking this (because of the
complexity of Deterministic Finite Automata theory) but:

If the delimiters were string constants, not regexps, and therefore
of constant length, how would a length greater than one cause this
to be impossible?

Guy Decoux

    Hugh
···

On Wed, 3 Dec 2003, ts wrote:

If the delimiters were string constants, not regexps, and therefore
of constant length, how would a length greater than one cause this
to be impossible?

You have found the problem: the delimiter can't be a regexp. You can have
string constants; this just makes the implementation a little more complex,
whereas it's really easy to do when you have only one character.

This is why generally you see it implemented like this (delimiter with
only one character)
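
Why the one-character case is easy can be seen in a few lines: a single depth counter is enough. The sketch below only imitates the idea behind Lua's %bxy in Ruby; the helper name balanced_span is hypothetical, and as a simplification it tries only the first occurrence of the opening character.

```ruby
# Hypothetical helper: returns the first balanced open_char ... close_char
# span in str, or nil if the first opening character is never closed.
def balanced_span(str, open_char, close_char)
  start = str.index(open_char)
  return nil unless start

  depth = 0
  (start...str.length).each do |i|
    case str[i]
    when open_char
      depth += 1                     # one level deeper
    when close_char
      depth -= 1                     # one level back out
      return str[start..i] if depth.zero?
    end
  end
  nil                                # ran out of input while still nested
end

p balanced_span("say (an (example)) now", "(", ")")
```

The counter is exactly the piece of state that a pure finite automaton lacks, which is what the rest of the discussion turns on.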

Guy Decoux

If the delimiters were string constants, not regexps, and therefore
of constant length, how would a length greater than one cause this
to be impossible?

You have found the problem: the delimiter can’t be a regexp. You can have
string constants; this just makes the implementation a little more complex,
whereas it’s really easy to do when you have only one character.

String constants are probably the most common case. They are the
frequently-asked-for case: for C comment blocks, bounded by /* and
*/, from faqs about regexps that I have seen. (I don’t think nesting
is actually respected in C comment blocks, but that’s another story.)

I think it would be really useful to have this, even restricted to
string constants. If I put this up as an RCR would there be
support, or have I overlooked something else? :-)

This is why generally you see it implemented like this (delimiter with
only one character)

“The simplest thing that could possibly work” :-)

Guy Decoux

    Thank you,
    Hugh
···

On Thu, 4 Dec 2003, ts wrote:

“ts” decoux@moulon.inra.fr schrieb im Newsbeitrag
news:200312031510.hB3FAds20117@moulon.inra.fr

If the delimiters were string constants, not regexps, and therefore
of constant length, how would a length greater than one cause this
to be impossible?

You have found the problem: the delimiter can’t be a regexp. You can have
string constants; this just makes the implementation a little more complex,
whereas it’s really easy to do when you have only one character.

IMHO this is not fully correct: the regexp engine of Lua must have a
special hack to support nesting (and apparently that for single chars
only). You can’t do that with regexp engines that stay on the grounds of
regular languages, because finite automata can’t count. (Ok, they can
count to a certain limit, but then you have to code the count into the
states which quite soon gets very messy.)

So, normally regexps can’t nest unless the regexp engine at hand has a
special hack for this implemented, which catapults the set of recognizable
languages out of the regular domain. :-)
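
As an aside for readers on later Ruby versions: from Ruby 1.9 onward the regexp engine offers subexpression calls (\g<name>), which are exactly this kind of extension beyond regular languages. The sketch below assumes such a Ruby; it was not available to the posters at the time, and the constant name NESTED is just an illustration.

```ruby
# A named group that calls itself recursively via \g<block>.
# (?!TEXT1|TEXT2). consumes one character that does not start a marker.
NESTED = /(?<block>TEXT1(?:(?!TEXT1|TEXT2).|\g<block>)*TEXT2)/m

data = "TEXT1 outer TEXT1 inner TEXT2 tail TEXT2 TEXT1 second TEXT2"

blocks = []
# Regexp.last_match(0) is the whole text of the current match.
data.scan(NESTED) { blocks << Regexp.last_match(0) }
p blocks
```

This matches each outermost section including its nested children; pulling the children apart still needs a second pass or a real parser.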

Regards

robert

String constants are probably the most common case. They are the
frequently-asked-for case: for C comment blocks, bounded by /* and
*/, from faqs about regexps that I have seen. (I don't think nesting
is actually respected in C comment blocks, but that's another story.)

I don't really agree with you: you generally want to parse (), , <> more
often than string constants.

I think it would be really useful to have this, even restricted to
string constants. If I put this up as an RCR would there be
support, or have I overlooked something else? :-)

I'm probably wrong, but I think their use would be very limited, and if
you introduce this feature someone will later try to parse HTML or XML
with it, and a regexp is not suited to that.

Guy Decoux

String constants are probably the most common case. They are the
frequently-asked-for case: for C comment blocks, bounded by /* and
*/, from faqs about regexps that I have seen. (I don’t think nesting
is actually respected in C comment blocks, but that’s another story.)

I don’t really agree with you: you generally want to parse (), , <> more
often than string constants.

I may be wrong about the frequency, but that one comes up in more
than one regexp faq, IIRC.

I think it would be really useful to have this, even restricted to
string constants. If I put this up as an RCR would there be
support, or have I overlooked something else? :-)

I’m probably wrong, but I think their use would be very limited, and if
you introduce this feature someone will later try to parse HTML or XML
with it, and a regexp is not suited to that.

Sometimes an imperfect solution is better than none. To borrow
(abuse?) Andy Hunt’s carpentry metaphor, sometimes something nailed
together will serve better now than a beautiful piece of joinery
will if it is later. This could be a case where “worse is better”
sometimes.

Guy Decoux

    Hugh
···

On Thu, 4 Dec 2003, ts wrote:

I may be wrong about the frequecy, but that one comes up in more
than one regexp faq, IIRC.

because many people use regexps even where they are not suited (HTML and XML
are good examples of this)

Sometimes an imperfect solution is better than none.

Sometimes regexps are not suited, and you must use another tool rather
than trying to add features which will only give you problems.

p.s.: a regexp engine is stupid, never forget it :-)

Guy Decoux

Hello,

Sometimes an imperfect solution is better than none. To borrow
(abuse?) Andy Hunt’s carpentry metaphor, sometimes something nailed
together will serve better now than a beautiful piece of joinery
will if it is later. This could be a case where “worse is better”
sometimes.

Excuse me for barging into your conversation, but why don’t you talk about
the correct solution to the [nested tags] problem, which in this case is
a simple, schoolbook-style solution using racc (or ryacc, last time I
looked).

file:
    # empty production
  | file textblock
  ;

textblock:
    TEXT1 othertext TEXT2
  ;

othertext:
    # empty production
  | othertext TEXTLINE
  ;

Of course that’s just for the parsing. And I realise that there is more
than one way to write this. What I am trying to say is that if the book
says you had best use this kind of tool, then why talk about imperfect,
half-baked solutions? The ‘perfect’ solution is not that far away!

kaspar