The difference between art and science is that science is what we
understand well enough to explain to a computer.
Art is everything else.
– Donald Knuth, “Discover”
/bin/sh -c 'for l in ruby perl;do $l -e "print \"\x3a\x2d\x29\x0a\"";done'
===============================================================================
TEXT1
some text, tabs, CRLF etc 1
TEXT2
TEXT1
some text, tabs, CRLF etc 2
TEXT2
TEXT1
some text, tabs, CRLF etc 3
TEXT2
Something like this should help:
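open = "TEXT1"
close = "TEXT2"
# data holds the whole file contents as a String.
# /m lets . match newlines; the lazy .*? stops at the first closing marker.
array = data.scan(/#{Regexp.quote(open)}(.*?)#{Regexp.quote(close)}/m).flatten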
Unfortunately these solutions require reading the whole file in at once.
If it is guaranteed that TEXT1 and TEXT2 are always on a line by themselves,
you can apply more efficient, but slightly more complex, solutions.
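One possible shape for such a line-by-line solution is sketched below. It is only an illustration under the stated guarantee (each marker alone on its own line, no nesting); the method name `extract_blocks` and the sample input are made up for this example.

```ruby
require "stringio"

# Collect the text between marker lines while reading one line at a time.
# Assumes each marker sits alone on its own line and blocks do not nest.
def extract_blocks(io, open_marker = "TEXT1", close_marker = "TEXT2")
  blocks = []
  current = nil                        # nil means "outside a block"
  io.each_line do |line|
    stripped = line.chomp              # chomp drops "\n" as well as "\r\n"
    if current.nil?
      current = "".dup if stripped == open_marker
    elsif stripped == close_marker
      blocks << current                # block finished, keep its text
      current = nil
    else
      current << line                  # ordinary line inside a block
    end
  end
  blocks
end

sample = StringIO.new("TEXT1\nsome text 1\nTEXT2\nTEXT1\nsome text 2\nTEXT2\n")
p extract_blocks(sample)
```

Any object with each_line works as input, so the same method can be fed an open File instead of the StringIO used here.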
Regards
robert
On Wednesday 03 December 2003 17:32, Dmitry N Orlov wrote:
“Hugh Sasse Staff Elec Eng” hgs@dmu.ac.uk wrote in message
news:Pine.GSO.4.58.0312031220260.10895@neelix…
Interesting. This doesn’t cope with nesting though. I’ve just
tried that out. Is there a good way to do that with scan?
Do you mean nesting of TEXT1…TEXT2 sections within each other?
Yes.
That can’t be done with regexps. You need a context free parser for that.
Is this a limitation inherent in regexps, or just as they are
implemented now? I ask, because Lua (which uses % where we use \
for things like \w, \d etc) has %bxy which matches a balanced x y
pair and its contents. Eg %b() would match the whole of
“(an (example))”. The delimiters being one character only couldn’t
make a difference here, or could it?
Yes, it can make this work precisely because it uses only *one* character
for the delimiter.
On Wed, 03 Dec 2003 23:46:03 +0900, Hugh Sasse Staff Elec Eng wrote:
OK, I’m probably going to regret asking this (because of the
complexity of Deterministic Finite Automata theory) but:
If the delimiters were string constants, not regexps, and therefore
of constant length, how would a length greater than one cause this
to be impossible?
You have found the problem: the delimiter can't be a regexp. You can have
string constants; this just makes the implementation a little more complex,
whereas it's really easy to do when you have only one character.
This is why you generally see it implemented like this (with a delimiter of
only one character).
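To make the point about string constants concrete, here is a rough sketch of a %b-style balanced match with delimiters of any fixed length. The helper `match_balanced` is hypothetical (it is not part of Ruby or Lua), and it assumes the two delimiters are different strings and that neither is a prefix of the other.

```ruby
# Return the first balanced open...close span in str, or nil if there is none.
# Hypothetical helper: assumes open != close and neither is a prefix of the other.
def match_balanced(str, open, close)
  start = str.index(open)
  return nil unless start

  depth = 0
  pos = start
  while pos < str.length
    if str[pos, open.length] == open
      depth += 1                       # entering one more nesting level
      pos += open.length
    elsif str[pos, close.length] == close
      depth -= 1                       # leaving a nesting level
      pos += close.length
      return str[start...pos] if depth.zero?
    else
      pos += 1                         # ordinary character, skip it
    end
  end
  nil                                  # input ended with a block still open
end

puts match_balanced("x (an (example)) y", "(", ")")
puts match_balanced("a TEXT1 b TEXT1 c TEXT2 d TEXT2 e", "TEXT1", "TEXT2")
```

The only difference from the single-character case is that each comparison looks at a substring of the delimiter's length instead of one character; the counter that tracks nesting depth is the part a plain finite automaton lacks.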
String constants are probably the most common case. They are the
frequently-asked-for case: for C comment blocks, bounded by /* and
*/, from faqs about regexps that I have seen. (I don’t think nesting
is actually respected in C comment blocks, but that’s another story.)
I think it would be really useful to have this, even restricted to
string constants. If I put this up as an RCR would there be
support, or have I overlooked something else?
IMHO this is not fully correct: the regexp engine of Lua must have a
special hack to support nesting (and apparently that for single chars
only). You can’t do that with regexp engines that stay on the grounds of
regular languages, because finite automata can’t count. (Ok, they can
count to a certain limit, but then you have to code the count into the
states which quite soon gets very messy.)
So, normally regexps can’t nest unless the regexp engine at hand has a
special hack for this implemented, which catapults the set of recognizable
languages out of the regular domain.
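As an illustration of such an extension: the regexp engine in newer Ruby versions (1.9 and later) supports subexpression calls, written \g<name>, which let a named group call itself. The sketch below uses that feature; the constant names PAREN and BLOCK are made up for this example, and the patterns are a demonstration rather than a recommendation.

```ruby
# Balanced parentheses: the group "paren" calls itself through \g<paren>.
PAREN = /(?<paren>\((?:[^()]|\g<paren>)*\))/

# The same idea with multi-character string delimiters. The negative
# lookahead keeps "." from swallowing the start of either delimiter.
BLOCK = /(?<block>TEXT1(?:(?!TEXT1|TEXT2).|\g<block>)*TEXT2)/m

puts "see (an (example)) here"[PAREN]
puts "a TEXT1 b TEXT1 c TEXT2 d TEXT2 e"[BLOCK]
```

A pattern like this is no longer a regular expression in the formal sense: the recursion is what provides the counting that a finite automaton cannot do.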
I don't really agree with you that string constants are the most common
case: you generally want to parse (), , <> more often than string constants.
Probably I'm wrong, but I think the use of such a feature would be very
limited, and if you introduce it, someone will later try to parse HTML or XML
with it, and a regexp is not suited to that.
I may be wrong about the frequency, but that one comes up in more
than one regexp faq, IIRC.
Sometimes an imperfect solution is better than none. To borrow
(abuse?) Andy Hunt’s carpentry metaphor, sometimes something nailed
together will serve better now than a beautiful piece of joinery
will if it is later. This could be a case where “worse is better”
sometimes.
Excuse me for barging into your conversation, but why don’t you talk about
the correct solution to the [nested tags] problem, which in this case is
a simple, textbook solution using racc (or ryacc, last time I
looked)?
file:
    # empty production
  | file textblock
  ;
textblock:
    TEXT1 othertext TEXT2
  ;
othertext:
    # empty production
  | othertext TEXTLINE
  ;
Of course that’s just for the parsing, and I realise that there is more
than one way to write this. What I am trying to say is: if the book
says this kind of tool is the best fit, then why talk about imperfect,
half-baked solutions? The ‘perfect’ solution is not that far away!
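For readers without racc at hand, the grammar can also be followed by hand. The sketch below is a hand-written recursive-descent parser, not racc output; the class name `BlockParser` is made up, each input line is treated as one token, and, as an assumption beyond the grammar as written, othertext is allowed to contain a nested textblock so that nesting is actually handled.

```ruby
# Hand-written recursive-descent parser (not racc output) for the grammar
# above, with one labelled extension: othertext may hold a nested textblock.
class BlockParser
  def initialize(lines)
    @lines = lines.map(&:chomp)        # one token per line
    @pos = 0
  end

  # file: zero or more textblocks; returns an array with one entry per block
  def parse_file
    blocks = []
    blocks << parse_textblock while @pos < @lines.length
    blocks
  end

  private

  # textblock: TEXT1 othertext TEXT2
  def parse_textblock
    expect("TEXT1")
    body = parse_othertext
    expect("TEXT2")
    body
  end

  # othertext: any mix of TEXTLINE and (extension) nested textblock
  def parse_othertext
    items = []
    while @pos < @lines.length && @lines[@pos] != "TEXT2"
      if @lines[@pos] == "TEXT1"
        items << parse_textblock       # nested block becomes a nested array
      else
        items << @lines[@pos]          # plain TEXTLINE token
        @pos += 1
      end
    end
    items
  end

  # Consume the expected marker line or report a syntax error.
  def expect(token)
    found = @lines[@pos]
    raise ArgumentError, "expected #{token}, got #{found.inspect}" unless found == token
    @pos += 1
  end
end

p BlockParser.new(%w[TEXT1 a TEXT1 b TEXT2 c TEXT2]).parse_file
```

Each grammar rule maps to one method, which is roughly what a generated parser does with tables instead of method calls. Unlike the scan-based approach, input that is not well formed raises an error instead of being skipped silently.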