[Q] Reg. Expressions with "\n"

Hello,

I’m trying to match a regular expression on a string that contains
new-line characters.

e.g.

str = "hello something \n world"
=> "hello something \n world"

str =~ /(.*<\/html>)/
=> nil

I want an RE that matches this string, but ‘.’ doesn’t match the \n
character. I guess I could replace the \n’s by something else, but I’d
rather not alter the original string if I can avoid it.

Any ideas?

Thanks.

···


Daniel Carrera
Graduate Teaching Assistant. Math Dept.
University of Maryland. (301) 405-5137

IIRC you can use the multiline modifier “m” on
the RE. I haven’t tried this:

str =~ /(.*<\/html>)/m

HTH,
Hal
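
For reference, here is a minimal sketch of what the "m" modifier changes. The sample string and its tags are assumptions made up for illustration, not the string from the original post. In Ruby, "m" makes '.' match newline characters as well.

```ruby
# Hypothetical sample input; the tags are assumed for illustration only.
str = "<html> something \n </html>"

# Without /m, '.' stops at the newline, so the pattern cannot span both lines.
plain = str =~ /(<html>.*<\/html>)/

# With /m, '.' also matches "\n", so the match starts at index 0
# and the capture group covers the whole string.
multi = str =~ /(<html>.*<\/html>)/m
captured = $1

puts plain.inspect
puts multi.inspect
puts captured.inspect
```

The original string is left untouched in both cases; only the behaviour of '.' differs.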

···

----- Original Message -----
From: “Daniel Carrera” dcarrera@math.umd.edu
To: “ruby-talk ML” ruby-talk@ruby-lang.org
Sent: Wednesday, April 16, 2003 4:36 PM
Subject: [Q] Reg. Expressions with “\n”

str = "hello something \n world"
=> "hello something \n world"

str =~ /(.*<\/html>)/
=> nil

I want an RE that matches this string, but ‘.’ doesn’t match the \n
character. I guess I could replace the \n’s by something else, but I’d
rather not alter the original string if I can avoid it.

Multiline mode: /./m

···

On Thu, Apr 17, 2003 at 06:36:45AM +0900, Daniel Carrera wrote:

Hello,

I’m trying to match a regular expression on a string that contains
new-line characters.

e.g.

str = "hello something \n world"
=> "hello something \n world"

str =~ /(.*<\/html>)/
=> nil

I want an RE that matches this string, but ‘.’ doesn’t match the \n
character. I guess I could replace the \n’s by something else, but I’d
rather not alter the original string if I can avoid it.

Any ideas?


Running Debian GNU/Linux Sid (unstable)
batsman dot geo at yahoo dot com

Never trust an operating system you don’t have sources for. ;)
– Unknown source

I want an RE that matches this string, but ‘.’ doesn’t match the \n
character. I guess I could replace the \n’s by something else, but I’d
rather not alter the original string if I can avoid it.

IIRC you can use the multiline modifier “m” on
the RE. I haven’t tried this:

str =~ /(.*<\/html>)/m

Thanks! that worked.

···


Daniel Carrera
Graduate Teaching Assistant. Math Dept.
University of Maryland. (301) 405-5137

“Hal E. Fulton” hal9000@hypermetrics.com writes:

IIRC you can use the multiline modifier “m” on
the RE. I haven’t tried this:

str =~ /(.*<\/html>)/m

But when parsing HTML, you probably shouldn’t be using REs at all.
This perfectly legal HTML will confuse that RE:

If you need to parse HTML, use an HTML parser; there will always be a
(usually simple) way to defeat a regular expression. “But I can
control the HTML!”, I hear you (generic) say. Sure you can-- now.
But what happens a year down the road when someone else is in charge
of generating it? Best to be safe and do it the right way from the
start.

-=Eric

···


Come to think of it, there are already a million monkeys on a million
typewriters, and Usenet is NOTHING like Shakespeare.
– Blair Houghton.

Incidentally, if it’s readability you crave, (“you” in the general
sense), I find myself going to %r{} notation more and more, if for no
other reason than not having to backwhack /'s.
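
A small sketch of the difference, using a made-up sample string (an assumption, not taken from the thread). Both literals below build the same pattern; the %r{} form just avoids escaping the slash in the closing tag.

```ruby
# Hypothetical sample input, assumed for illustration only.
str = "<b>bold\ntext</b>"

# Slash-delimited literal: the '/' in the closing tag must be escaped.
with_slashes = str[/<b>(.*)<\/b>/m, 1]

# %r{} literal: same pattern and same /m modifier, no escaping needed.
with_percent = str[%r{<b>(.*)</b>}m, 1]

puts with_slashes.inspect
puts with_percent.inspect
```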

···

— Daniel Carrera dcarrera@math.umd.edu wrote:

I want an RE that matches this string, but ‘.’ doesn’t match the \n
character. I guess I could replace the \n’s by something else, but I’d
rather not alter the original string if I can avoid it.

IIRC you can use the multiline modifier “m” on
the RE. I haven’t tried this:

str =~ /(.*<\/html>)/m

Thanks! that worked.



I’m just parsing the emails that I get from my hotmail-using friends.
I can’t see any reason why auto-generated html emails would have something
like .

I figure that what I have is good enough for my purposes.

···

On Thu, Apr 17, 2003 at 07:15:23AM +0900, Eric Schwartz wrote:

If you need to parse HTML, use an HTML parser; there will always be a
(usually simple) way to defeat a regular expression. “But I can
control the HTML!”, I hear you (generic) say. Sure you can-- now.
But what happens a year down the road when someone else is in charge
of generating it? Best to be safe and do it the right way from the
start.


Daniel Carrera
Graduate Teaching Assistant. Math Dept.
University of Maryland. (301) 405-5137

I agree with the sentiment (that using REs to parse HTML, XML, etc.
isn’t a good idea), but the example above isn’t valid. Greedy matching
takes precedence, so it actually works like it should.

···

“Hal E. Fulton” hal9000@hypermetrics.com writes:

IIRC you can use the multiline modifier “m” on
the RE. I haven’t tried this:

str =~ /(.*<\/html>)/m

But when parsing HTML, you probably shouldn’t be using REs at all.
This perfectly legal HTML will confuse that RE:


#!/usr/bin/ruby

str = "

test html bleh "

puts $1 if str =~ /(.*)<\/html>/m

:!./re_test.rb

test html

bleh

Here’s a better example:


#!/usr/bin/ruby

str = "
here’s some bold text


here’s some more bold text"

puts "greedy: #$1" if str =~ /(.*)<\/b>/m
puts "non-greedy: #$1" if str =~ /(.*?)<\/b>/m

:!./re_test.rb
greedy: here’s some bold text


here’s some more bold text
non-greedy: here’s some bold text

<!-- with a

In this case, neither greedy nor non-greedy matching works
appropriately, and no sane amount of lookahead or lookbehind assertions
could possibly account for all the corner cases. Note that this is
neither pathological nor contrived; think about the number of existing
HTML documents with unterminated

and

tags.
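
Since the markup in the examples above did not survive, here is a self-contained sketch of the same point. The sample HTML, including the stray closing tag inside a comment, is invented for illustration and is only one way such input might look.

```ruby
# Hypothetical input: two bold spans, the first containing a comment
# that itself contains a closing </b>. Made up for illustration.
str = "<b>first <!-- a stray </b> in a comment --> part</b>\n<b>second</b>"

# Greedy: '.*' runs on to the LAST </b>, swallowing both spans.
greedy = str[%r{<b>(.*)</b>}m, 1]

# Non-greedy: '.*?' stops at the FIRST </b>, which is inside the comment.
lazy = str[%r{<b>(.*?)</b>}m, 1]

puts "greedy:     #{greedy.inspect}"
puts "non-greedy: #{lazy.inspect}"
```

Neither capture is the text of the first bold span as a browser would see it, which is the kind of corner case a real parser handles and a regex does not.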

If you need to parse HTML, use an HTML parser; there will always be a
(usually simple) way to defeat a regular expression. “But I can
control the HTML!”, I hear you (generic) say. Sure you can-- now.
But what happens a year down the road when someone else is in charge
of generating it? Best to be safe and do it the right way from the
start.

Agreed.

-=Eric


Paul Duncan pabs@pablotron.org pabs in #gah (OPN IRC)
http://www.pablotron.org/ OpenPGP Key ID: 0x82C29562

Thanks. I didn’t know about the %r{} notation. It really is nicer.

···

On Thu, Apr 17, 2003 at 06:56:48AM +0900, Michael Campbell wrote:

Incidentally, if it’s readability you crave, (“you” in the general
sense), I find myself going to %r{} notation more and more, if for no
other reason than not having to backwhack /'s.


Daniel Carrera
Graduate Teaching Assistant. Math Dept.
University of Maryland. (301) 405-5137

Daniel Carrera dcarrera@math.umd.edu writes:

I’m just parsing the emails that I get from my hotmail-using friends.
I can’t see any reason why auto-generated html emails would have something
like .

The point isn’t so much that (although it is a valid one), it’s that
next you decide to parse other bits further, and the deeper you go
into HTML, the likelihood of your hitting something that a regex can’t
parse increases exponentially, if not faster.

Sure, you can keep tweaking your regexes to account for each of the
cases where they fail, and that’ll keep you busy-- or you can just use
an HTML parser and quit worrying about it.

I figure that what I have is good enough for my purposes.

That’s entirely your call. I’m just trying to point out that to parse
HTML, you really should use an HTML parser-- other solutions may work
in the short term, but they invariably fail in the long run. I’m also
a big fan of using the right tool for the job-- just because you can
loosen a nut on your wheel with a pair of pliers doesn’t mean a tire
iron isn’t a better solution. :)

-=Eric

···



That’s entirely your call. I’m just trying to point out that to parse
HTML, you really should use an HTML parser-- other solutions may work
in the short term, but they invariably fail in the long run. I’m also
a big fan of using the right tool for the job-- just because you can
loosen a nut on your wheel with a pair of pliers doesn’t mean a tire
iron isn’t a better solution. :)

Definitely. I also believe in using the right tool for the job.

I didn’t fully explain what I was trying to do. You see, I actually am
using an HTML parser, in the form of an external program (lynx -dump).
All I need is to extract the HTML code so I can send it to lynx.

For this an HTML parser looked like using a sledgehammer to swat a fly,
and an RE looked like a flyswatter.

Then again, perhaps calling lynx just for an HTML dump is
sledgehammerish…
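
One way the hand-off to lynx could look from Ruby, as a rough sketch. The method name `html_to_text` is hypothetical, and the flags (`-dump`, `-stdin`, `-force_html`) are what I understand lynx to accept, so check them against your lynx man page before relying on this.

```ruby
require "open3"

# Pipe an HTML string into an external converter and return its output.
# The default command assumes lynx is installed and accepts these flags.
def html_to_text(html, command = ["lynx", "-dump", "-stdin", "-force_html"])
  output, status = Open3.capture2(*command, stdin_data: html)
  raise "converter exited with #{status.exitstatus}" unless status.success?
  output
end

# Usage (hypothetical file name):
#   puts html_to_text(File.read("mail.html"))
```

Passing the command as an array avoids going through a shell, so nothing in the HTML needs quoting.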

What I am trying to do is convert the HTML emails I get into ASCII.
Do you know how to do an html dump using an HTML parser?

Thanks.

···


Daniel Carrera
Graduate Teaching Assistant. Math Dept.
University of Maryland. (301) 405-5137

Daniel Carrera dcarrera@math.umd.edu writes:

What I am trying to do is convert the HTML emails I get into ASCII.
Do you know how to do an html dump using an HTML parser?

The htmltest.rb file included with html-parser2 on RAA
(libhtml-parser-ruby from debian) does exactly that. Take a look at
it; it appears to be a fairly straightforward port of the Python
library of the same name.

-=Eric

···



Using lynx to convert HTML to formatted ASCII is a good solution for the job
(I have mutt configured to use it for this purpose), although it doesn’t do
a good job of rendering tables.

It sounds to me that what you really want is a MIME parser, which will break
your mail into chunks - the chunk(s) that are declared to contain HTML (by
‘Content-Type: text/html’) can then be sent wholesale to lynx; you don’t
need to look at the content at all.

Cheers,

Brian.

···

On Thu, Apr 17, 2003 at 08:56:57AM +0900, Daniel Carrera wrote:

Definitely. I also believe in using the right tool for the job.

I didn’t fully explain what I was trying to do. You see, I actually am
using an HTML parser, in the form of an external program (lynx -dump).
All I need is to extract the HTML code so I can send it to lynx.

For this an HTML parser looked like using a sledgehammer to swat a fly,
and an RE looked like a flyswatter.

Then again, perhaps calling lynx just for an HTML dump is
sledgehammerish…

Saluton!

Using lynx to convert HTML to formatted ASCII is a good solution
for the job (I have mutt configured to use it for this purpose),
although it doesn’t do a good job of rendering tables.

I use two alternatives for that purpose. The default one is
vilistextum. Let me quote the author:

“vilistextum is a html to ascii converter specifically programmed to
get the best out of incorrect html.”

As a rule of thumb any ‘HTML’ in an e-mail can be assumed to be
invalid. The main advantage of vilistextum: it is very small and fast
(the i86 binary is around 30 K).

If vilistextum is not enough I use w3m (using mutt’s pipe feature).
w3m does frames and tables, etc.

For mailers that run in Emacs, w3m-emacs is a good choice.

Gis,

Josef ‘Jupp’ Schugt