Some Regexp

Orlovdn · 3 December 2003 10:32

I want to build an array from a file like this:

TEXT1
some text, tabs, CRLF etc 1
TEXT2
TEXT1
some text, tabs, CRLF etc 2
TEXT2
TEXT1
some text, tabs, CRLF etc 3
TEXT2
…

The array must consist of “some text, tabs, CRLF etc 1”, “some text,
tabs, CRLF etc 2”, and so on.

Can you help me?

Dmitry_V_Sabanin1 · 3 December 2003 11:55

Something like this should help:

# data holds the whole file as one string, e.g. data = IO.read("file.txt")
open = "TEXT1"
close = "TEXT2"
array = data.scan(/#{Regexp::quote(open)}(.*?)#{Regexp::quote(close)}/m).flatten

I get:
array == ["\nsome text, tabs, CRLF etc 1\n", "\nsome text, tabs, CRLF etc 2\n", "\nsome text, tabs, CRLF etc 3\n"]
with your data.
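
For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same scan approach. The method name extract_blocks and the sample string are made up for illustration, and the strip call is an assumption about what the original poster wanted (no surrounding newlines in the results).

```ruby
# Hypothetical helper (not from the thread): returns the text found between
# each open_tag ... close_tag pair, with surrounding whitespace stripped.
def extract_blocks(data, open_tag, close_tag)
  pattern = /#{Regexp.quote(open_tag)}(.*?)#{Regexp.quote(close_tag)}/m
  # scan yields one single-element array per match, hence flatten.
  data.scan(pattern).flatten.map(&:strip)
end

sample = "TEXT1\nsome text 1\nTEXT2\nTEXT1\nsome text 2\nTEXT2\n"
p extract_blocks(sample, "TEXT1", "TEXT2")
# => ["some text 1", "some text 2"]
```

The lazy (.*?) is what keeps each match from running on to the last TEXT2 in the file.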

–
sdmitry -=- Dmitry V. Sabanin
MuraveyLabs.

Michael_Campbell1 · 3 December 2003 14:41

Dmitry N Orlov wrote:

I want to build an array from a file like this:

TEXT1
some text, tabs, CRLF etc 1
TEXT2
TEXT1
some text, tabs, CRLF etc 2
TEXT2
TEXT1
some text, tabs, CRLF etc 3
TEXT2
…

The array must consist of “some text, tabs, CRLF etc 1”, “some text,
tabs, CRLF etc 2”, and so on.

You could do something like this (until, as I understand it, this
feature is to be removed):

arr = []
while (line = gets)
  # flip-flop: true from a line matching TEXT1 through the next line matching TEXT2
  arr << line if (line =~ /TEXT1/)..(line =~ /TEXT2/)
end

(You can shorten that up even more, but I think this gets the point across.)

Ara.T.Howard2 · 3 December 2003 17:37

all fields are separated by either
TEXT2
TEXT1
or
TEXT1
as a special case

/tmp > cat foo.rb
txt = <<-txt
TEXT1
some text, tabs, CRLF etc 1
TEXT2
TEXT1
some text, tabs, CRLF etc 2
TEXT2
TEXT1
some text, tabs, CRLF etc 3
TEXT2
txt

p(txt.split(%r/(?:TEXT1)|(?:TEXT2$)?TEXT1/iom)[1..-1])

/tmp > ruby foo.rb
["\n some text, tabs, CRLF etc 1\n TEXT2\n ", "\n some text, tabs, CRLF etc 2\n TEXT2\n ", "\n some text, tabs, CRLF etc 3\n TEXT2\n"]

note that the first field is dropped, since it is empty.

-a
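
Note that the fields printed above still contain the TEXT2 lines. A possible variant of the split idea that drops both markers is sketched here. The method name split_blocks is made up, and the sketch assumes each marker sits on a line of its own. Any text lying between a TEXT2 and the next TEXT1 would also be returned, so it only suits input shaped like the sample.

```ruby
# Hypothetical helper (not from the thread): split on marker lines and keep
# only the non-blank pieces.
def split_blocks(text)
  pieces = text.split(/^\s*TEXT[12]\s*$/)   # cut at every TEXT1 or TEXT2 line
  pieces.map(&:strip).reject(&:empty?)      # trim, then drop the blank pieces
end

txt = <<~TXT
  TEXT1
  some text, tabs, CRLF etc 1
  TEXT2
  TEXT1
  some text, tabs, CRLF etc 2
  TEXT2
TXT
p split_blocks(txt)
# => ["some text, tabs, CRLF etc 1", "some text, tabs, CRLF etc 2"]
```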

···

On 3 Dec 2003, Dmitry N Orlov wrote:

Date: 3 Dec 2003 02:28:24 -0800
From: Dmitry N Orlov orlovdn@rambler.ru
Newsgroups: comp.lang.ruby
Subject: Some Regexp

I want to build an array from a file like this:

TEXT1
some text, tabs, CRLF etc 1
TEXT2
TEXT1
some text, tabs, CRLF etc 2
TEXT2
TEXT1
some text, tabs, CRLF etc 3
TEXT2
…

The array must consist of “some text, tabs, CRLF etc 1”, “some text,
tabs, CRLF etc 2”, and so on.

Can you help me?


Hugh_Sasse · 3 December 2003 12:43

Interesting. This doesn’t cope with nesting though. I’ve just
tried that out. Is there a good way to do that with scan?

    Hugh

···

On Wed, 3 Dec 2003, Dmitry V. Sabanin wrote:

Something like this should help:

open = "TEXT1"
close = "TEXT2"
array = data.scan(/#{Regexp::quote(open)}(.*?)#{Regexp::quote(close)}/m).flatten

I get:
array == ["\nsome text, tabs, CRLF etc 1\n", "\nsome text, tabs, CRLF etc 2\n", "\nsome text, tabs, CRLF etc 3\n"]
with your data.
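
To make the nesting problem concrete, here is a small sketch with a made-up input in which one block sits inside another. The method name lazy_scan is hypothetical. The pattern is the one quoted above with the marker strings written inline.

```ruby
# Hypothetical helper (not from the thread): the lazy scan with fixed markers.
def lazy_scan(data)
  data.scan(/TEXT1(.*?)TEXT2/m).flatten
end

# The lazy (.*?) stops at the first TEXT2 it meets, so the outer TEXT1 is
# paired with the inner TEXT2 and the rest of the outer block is lost.
p lazy_scan("TEXT1 outer TEXT1 inner TEXT2 tail TEXT2")
# => [" outer TEXT1 inner "]
```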

Robert · 3 December 2003 14:22

“Dmitry V. Sabanin” sdmitry@lrn.ru wrote in message
news:200312031852.11934.sdmitry@lrn.ru…

TEXT1
some text, tabs, CRLF etc 1
TEXT2
TEXT1
some text, tabs, CRLF etc 2
TEXT2
TEXT1
some text, tabs, CRLF etc 3
TEXT2
Something like this should help:

open = "TEXT1"
close = "TEXT2"
array = data.scan(/#{Regexp::quote(open)}(.*?)#{Regexp::quote(close)}/m).flatten

or:

array = data.scan(/#{Regexp::quote(open)}(.*?)#{Regexp::quote(close)}/m).map { |x| x[0] }

Directly reading from a file:

IO.read("file.txt").scan(/#{Regexp::quote(open)}(.*?)#{Regexp::quote(close)}/m).map { |x| x[0] }

Unfortunately, these solutions require reading the whole file into memory at once.
If TEXT1 and TEXT2 are guaranteed to always be on a line by themselves,
you can apply more efficient, though slightly more complex, solutions.

Regards

robert
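
A minimal sketch of what such a line-at-a-time solution could look like follows. The method name read_blocks is made up, and the sketch assumes that TEXT1 and TEXT2 each occupy a line of their own and that blocks are not nested.

```ruby
require "stringio"

# Hypothetical helper (not from the thread): reads io one line at a time and
# collects the text between marker lines, without loading the whole input.
def read_blocks(io, open_tag = "TEXT1", close_tag = "TEXT2")
  blocks = []
  current = nil                    # nil means "not inside a block"
  io.each_line do |line|
    marker = line.strip
    if marker == open_tag
      current = []                 # start collecting a new block
    elsif marker == close_tag && current
      blocks << current.join       # block finished: store it as one string
      current = nil
    elsif current
      current << line              # ordinary line inside a block
    end
  end
  blocks
end

input = StringIO.new("TEXT1\nline a\nline b\nTEXT2\nTEXT1\nline c\nTEXT2\n")
p read_blocks(input)
# => ["line a\nline b\n", "line c\n"]
```

With a real file the call would be something like File.open("file.txt") { |f| read_blocks(f) }, since any object that responds to each_line will do.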


Robert · 3 December 2003 17:22

“Michael campbell” michael_s_campbell@yahoo.com wrote in message
news:3FCDF61E.2010906@yahoo.com…

Dmitry N Orlov wrote:

I want to build an array from a file like this:

TEXT1
some text, tabs, CRLF etc 1
TEXT2
TEXT1
some text, tabs, CRLF etc 2
TEXT2
TEXT1
some text, tabs, CRLF etc 3
TEXT2
…

The array must consist of “some text, tabs, CRLF etc 1”, “some text,
tabs, CRLF etc 2”, and so on.

You could do something like this (until, as I understand it, this
feature is to be removed):

arr = []
while (line = gets)
  # flip-flop: true from a line matching TEXT1 through the next line matching TEXT2
  arr << line if (line =~ /TEXT1/)..(line =~ /TEXT2/)
end

No, that doesn’t work: it creates a new array entry for every line
(including the TEXT1 and TEXT2 lines themselves), but the OP wanted each
block of text in one string.

robert
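
One way to keep the flip-flop idea and still end up with one string per block is to start a new string at each TEXT1 line and append to it until TEXT2 arrives. This is only a sketch: the method name blocks_from_lines is made up, and it assumes the markers sit on lines of their own.

```ruby
# Hypothetical helper (not from the thread): groups the lines between
# TEXT1 and TEXT2 into one string per block, using the flip-flop operator.
def blocks_from_lines(lines)
  blocks = []
  lines.each do |line|
    # True from a TEXT1 line through the next TEXT2 line, inclusive.
    if (line =~ /TEXT1/)..(line =~ /TEXT2/)
      if line =~ /TEXT1/
        blocks << "".dup         # opening marker: start a new, empty block
      elsif line !~ /TEXT2/
        blocks.last << line      # ordinary line: append to the current block
      end
    end
  end
  blocks
end

p blocks_from_lines(["TEXT1\n", "a\n", "b\n", "TEXT2\n", "TEXT1\n", "c\n", "TEXT2\n"])
# => ["a\nb\n", "c\n"]
```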

Robert · 3 December 2003 14:17

“Hugh Sasse Staff Elec Eng” hgs@dmu.ac.uk wrote in message
news:Pine.GSO.4.58.0312031220260.10895@neelix…

Something like this should help:

open = "TEXT1"
close = "TEXT2"
array = data.scan(/#{Regexp::quote(open)}(.*?)#{Regexp::quote(close)}/m).flatten

I get:
array == ["\nsome text, tabs, CRLF etc 1\n", "\nsome text, tabs, CRLF etc 2\n", "\nsome text, tabs, CRLF etc 3\n"]
with your data.

Interesting. This doesn’t cope with nesting though. I’ve just
tried that out. Is there a good way to do that with scan?

Do you mean nesting of TEXT1…TEXT2 sections within each other? That
can’t be done with regexps. You need a context-free parser for that.

Cheers

robert
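
For the curious, a hand-written parser for nested sections does not have to be large. The following is a rough sketch rather than a recommendation: the method name parse_sections and the nested-array result format are made up for illustration, and the markers are hard-coded.

```ruby
require "strscan"

# Hypothetical sketch (not from the thread): a tiny recursive-descent parser
# for nested TEXT1 ... TEXT2 sections. Each section becomes an array whose
# items are plain strings or arrays for the sections nested inside it.
def parse_sections(scanner, depth = 0)
  items = []
  until scanner.eos?
    if scanner.scan(/TEXT1/)
      items << parse_sections(scanner, depth + 1)   # recurse into a nested section
    elsif scanner.scan(/TEXT2/)
      raise ArgumentError, "TEXT2 without TEXT1" if depth.zero?
      return items                                  # this section is closed
    else
      # Plain text up to the next marker (or to the end of the input).
      items << scanner.scan(/(?:(?!TEXT1|TEXT2).)+/m)
    end
  end
  raise ArgumentError, "TEXT1 without TEXT2" unless depth.zero?
  items
end

p parse_sections(StringScanner.new("TEXT1 a TEXT1 b TEXT2 c TEXT2"))
# => [[" a ", [" b "], " c "]]
```

The recursion depth plays the role of the counter that a plain finite automaton lacks.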


Simon_Strandgaard1 · 3 December 2003 15:42

How about this ?

···

On Wed, 03 Dec 2003 15:19:54 +0100, Robert Klemme wrote:

open = "TEXT1"
close = "TEXT2"
array = data.scan(/#{Regexp::quote(open)}(.*?)#{Regexp::quote(close)}/m).flatten

–
Simon Strandgaard

ruby a.rb
["some text, tabs, CRLF etc 1"]
["some text, tabs, CRLF etc 2"]
["some text, tabs, CRLF etc 3"]
cat a.rb
text=<<EOT
TEXT1
some text, tabs, CRLF etc 1
TEXT2
TEXT1
some text, tabs, CRLF etc 2
TEXT2
TEXT1
some text, tabs, CRLF etc 3
TEXT2
EOT
re = /TEXT1$.+?(^.*?$).+?TEXT2/m
text.scan(re){|match| p match }

Hugh_Sasse · 3 December 2003 14:46

“Hugh Sasse Staff Elec Eng” hgs@dmu.ac.uk wrote in message
news:Pine.GSO.4.58.0312031220260.10895@neelix…

Interesting. This doesn’t cope with nesting though. I’ve just
tried that out. Is there a good way to do that with scan?

Do you mean nesting of TEXT1…TEXT2 sections within each other? That

Yes.

can’t be done with regexps. You need a context free parser for that.

Is this a limitation inherent in regexps, or just as they are
implemented now? I ask, because Lua (which uses % where we use \
for things like \w, \d etc) has %bxy which matches a balanced x y
pair and its contents. Eg %b() would match the whole of
“(an (example))”. The delimiters being one character only couldn’t
make a difference here, or could it?

Cheers
robert

    Hugh


ts1 · 3 December 2003 14:54

Is this a limitation inherent in regexps, or just as they are
implemented now? I ask, because Lua (which uses % where we use \
for things like \w, \d etc) has %bxy which matches a balanced x y
pair and its contents. Eg %b() would match the whole of
"(an (example))". The delimiters being one character only couldn't
make a difference here, or could it?

Yes, it can make this work precisely because it uses only *one* character
for each delimiter.

Guy Decoux

Simon_Strandgaard1 · 3 December 2003 15:32

The %bxy feature seems nice, I had to look it up in Lua’s manual:
http://www.lua.org/manual/5.0/manual.html#5.3

Maybe I should add it to my regexp engine ?

···

On Wed, 03 Dec 2003 23:46:03 +0900, Hugh Sasse Staff Elec Eng wrote:

I ask, because Lua (which uses % where we use \
for things like \w, \d etc) has %bxy which matches a balanced x y
pair and its contents. Eg %b() would match the whole of
“(an (example))”. The delimiters being one character only couldn’t
make a difference here, or could it?

–
Simon Strandgaard

BTW: I have just released regexp-engine 0.6
http://raa.ruby-lang.org/list.rhtml?name=regexp

Hugh_Sasse · 3 December 2003 15:05

Is this a limitation inherent in regexps, or just as they are
implemented now? I ask, because Lua (which uses % where we use \
for things like \w, \d etc) has %bxy which matches a balanced x y
pair and its contents. Eg %b() would match the whole of
“(an (example))”. The delimiters being one character only couldn’t
make a difference here, or could it?

Yes, it can make this work precisely because it uses only one character
for each delimiter.

OK, I’m probably going to regret asking this (because of the
complexity of Deterministic Finite Automata theory) but:

If the delimiters were string constants, not regexps, and therefore
of constant length, how would a length greater than one cause this
to be impossible?

Guy Decoux

    Hugh


ts1 · 3 December 2003 15:10

If the delimiters were string constants, not regexps, and therefore
of constant length, how would a length greater than one cause this
to be impossible?

You have found the problem: the delimiter can't be a regexp. You can have
string constants; that just makes the implementation a little more complex,
whereas it's really easy to do when you have only one character.

This is why you generally see it implemented like this (delimiters of
only one character).

Guy Decoux
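
As a rough illustration of the string-constant case, here is a sketch of a balanced match written in plain Ruby rather than inside a regexp engine. The method name balanced_span is made up (it is not an existing Ruby or Lua API), and it assumes the opening and closing delimiters are two different, non-empty strings.

```ruby
# Hypothetical helper: returns the first substring of str that starts with
# open_s and runs to the close_s that balances it, or nil if there is none.
def balanced_span(str, open_s, close_s)
  start = str.index(open_s)
  return nil unless start
  delimiter = Regexp.union(open_s, close_s)   # both strings, safely escaped
  depth = 0
  pos = start
  while (m = delimiter.match(str, pos))
    depth += (m[0] == open_s ? 1 : -1)        # count up on open, down on close
    pos = m.end(0)
    return str[start...pos] if depth.zero?
  end
  nil   # ran out of input before the delimiters balanced
end

p balanced_span("see (an (example)) here", "(", ")")
# => "(an (example))"
p balanced_span("x /* a /* b */ c */ y", "/*", "*/")
# => "/* a /* b */ c */"
```

The only extra work for multi-character delimiters is finding the next delimiter, which Regexp.union takes care of here. The counting itself is unchanged.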

Hugh_Sasse · 3 December 2003 15:28

If the delimiters were string constants, not regexps, and therefore
of constant length, how would a length greater than one cause this
to be impossible?

You have found the problem: the delimiter can’t be a regexp. You can have
string constants; that just makes the implementation a little more complex,
whereas it’s really easy to do when you have only one character.

String constants are probably the most common case. They are the
frequently-asked-for case: for C comment blocks, bounded by /* and
*/, from faqs about regexps that I have seen. (I don’t think nesting
is actually respected in C comment blocks, but that’s another story.)

I think it would be really useful to have this, even restricted to
string constants. If I put this up as an RCR would there be
support, or have I overlooked something else?

This is why you generally see it implemented like this (delimiters of
only one character).

“The simplest thing that could possibly work”

Guy Decoux

    Thank you,
    Hugh


Robert · 3 December 2003 17:27

“ts” decoux@moulon.inra.fr wrote in message
news:200312031510.hB3FAds20117@moulon.inra.fr…

If the delimiters were string constants, not regexps, and therefore
of constant length, how would a length greater than one cause this
to be impossible?

You have found the problem: the delimiter can’t be a regexp. You can have
string constants; that just makes the implementation a little more complex,
whereas it’s really easy to do when you have only one character.

IMHO this is not fully correct: the regexp engine of Lua must have a
special hack to support nesting (and apparently that for single chars
only). You can’t do that with regexp engines that stay on the grounds of
regular languages, because finite automata can’t count. (Ok, they can
count to a certain limit, but then you have to code the count into the
states which quite soon gets very messy.)

So, normally regexps can’t nest unless the regexp engine at hand has a
special hack for this implemented, which catapults the set of recognizable
languages out of the regular domain.

Regards

robert
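
As an example of such an extension, the regexp engine in Ruby 1.9 and later supports named groups with \g<name> subexpression calls, which lets a pattern refer to itself. The sketch below assumes such a Ruby version. The method name outer_sections and the group name section are made up for illustration.

```ruby
# Hypothetical helper: returns each outermost TEXT1 ... TEXT2 section,
# including any sections nested inside it.
def outer_sections(data)
  nested = /
    (?<section>
      TEXT1
      (?: (?!TEXT1|TEXT2). | \g<section> )*   # plain text or a nested section
      TEXT2
    )
  /mx
  found = []
  # Inside the block, Regexp.last_match is the match currently being scanned.
  data.scan(nested) { found << Regexp.last_match(0) }
  found
end

p outer_sections("TEXT1 a TEXT1 b TEXT2 c TEXT2 mid TEXT1 d TEXT2")
# => ["TEXT1 a TEXT1 b TEXT2 c TEXT2", "TEXT1 d TEXT2"]
```

A pattern like this is no longer regular in the formal sense, which is exactly the distinction drawn above.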

ts1 · 3 December 2003 15:37

String constants are probably the most common case. They are the
frequently-asked-for case: for C comment blocks, bounded by /* and
*/, from faqs about regexps that I have seen. (I don't think nesting
is actually respected in C comment blocks, but that's another story.)

I don't really agree with you: you generally want to parse (), , <> more
often than string constants.

I think it would be really useful to have this, even restricted to
string constants. If I put this up as an RCR would there be
support, or have I overlooked something else?

I'm probably wrong, but I think their use would be very limited, and if
you introduce this feature, someone will later try to parse HTML or XML
with it, and a regexp is not suited to that.

Guy Decoux

Hugh_Sasse · 3 December 2003 16:23

String constants are probably the most common case. They are the
frequently-asked-for case: for C comment blocks, bounded by /* and
*/, from faqs about regexps that I have seen. (I don’t think nesting
is actually respected in C comment blocks, but that’s another story.)

I don’t really agree with you: you generally want to parse (), , <> more
often than string constants.

I may be wrong about the frequency, but that one comes up in more
than one regexp faq, IIRC.

I think it would be really useful to have this, even restricted to
string constants. If I put this up as an RCR would there be
support, or have I overlooked something else?

I’m probably wrong, but I think their use would be very limited, and if
you introduce this feature, someone will later try to parse HTML or XML
with it, and a regexp is not suited to that.

Sometimes an imperfect solution is better than none. To borrow
(abuse?) Andy Hunt’s carpentry metaphor, sometimes something nailed
together will serve better now than a beautiful piece of joinery
will if it is later. This could be a case where “worse is better”
sometimes.

Guy Decoux

    Hugh


ts1 · 3 December 2003 16:32

I may be wrong about the frequency, but that one comes up in more
than one regexp faq, IIRC.

Because many people use regexps even where they are not suited to the job
(HTML and XML are good examples of this).

Sometimes an imperfect solution is better than none.

Sometimes regexps are not suited to the job, and you must use another tool
rather than trying to add features that will only give you problems.

p.s. : a regexp engine is stupid, never forget it

Guy Decoux

Kaspar_Schiess · 3 December 2003 18:20

Hello,

Sometimes an imperfect solution is better than none. To borrow
(abuse?) Andy Hunt’s carpentry metaphor, sometimes something nailed
together will serve better now than a beautiful piece of joinery
will if it is later. This could be a case where “worse is better”
sometimes.

Excuse me for barging into your conversation, but why don’t you talk about
the correct solution to the [nested tags] problem, which in this case is
a simple, textbook solution using racc (or ryacc, last time I
looked)?

file:
    # empty production
  |
    file textblock
  ;

textblock:
    TEXT1 othertext TEXT2
  ;

othertext:
    # empty production
  |
    othertext TEXTLINE
  ;

Of course that’s just for the parsing, and I realise that there is more
than one way to write this. What I am trying to say is that if the book
says you are best off using this kind of tool, then why talk about imperfect,
half-baked solutions? The ‘perfect’ solution is not that far away!

kaspar
