[Q] Reg. Expressions with "\n"

Daniel_Carrera · 16 April 2003 21:36

Hello,

I’m trying to match a regular expression on a string that contains
new-line characters.

e.g.

str = "hello something \n world"
=> "hello something \n world"
str =~ /(.*<\/html>)/
=> nil

I want an RE that matches this string, but ‘.’ doesn’t match the \n
character. I guess I could replace the \n’s by something else, but I’d
rather not alter the original string if I can avoid it.

Any ideas?

Thanks.

–
Daniel Carrera
Graduate Teaching Assistant. Math Dept.
University of Maryland. (301) 405-5137

HAL_9000 · 16 April 2003 21:43

IIRC you can use the multiline modifier “m” on
the RE. I haven’t tried this:

str =~ /(.*<\/html>)/m

HTH,
Hal
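
A minimal sketch of what the "m" modifier changes, for anyone reading along. The sample string and its `<html>` tags are assumptions made up for illustration, not necessarily what the original post contained. In Ruby, "m" makes '.' match "\n" as well; the string itself is left untouched.

```ruby
# Hypothetical sample: an HTML fragment with a newline in the middle.
str = "<html> hello something \n world </html>"

# Without /m, '.' will not cross the newline, so the pattern cannot match.
single_line = str =~ /(<html>.*<\/html>)/

# With /m, '.' also matches "\n", so the whole fragment is captured.
multi_line = str[/(<html>.*<\/html>)/m, 1]

p single_line  # => nil
p multi_line == str
```

Note that `str` is only read here, never modified, which is what the original question asked for.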

Mauricio_Fernndez · 16 April 2003 21:49

Multiline mode: /./m

–
Running Debian GNU/Linux Sid (unstable)
batsman dot geo at yahoo dot com

Never trust an operating system you don’t have sources for.
– Unknown source

Daniel_Carrera · 16 April 2003 21:47

I want an RE that matches this string, but ‘.’ doesn’t match the \n
character. I guess I could replace the \n’s by something else, but I’d
rather not alter the original string if I can avoid it.

IIRC you can use the multiline modifier “m” on
the RE. I haven’t tried this:

str =~ /(.*<\/html>)/m

Thanks! That worked.

Eric_Schwartz6 · 16 April 2003 22:15

“Hal E. Fulton” hal9000@hypermetrics.com writes:

IIRC you can use the multiline modifier “m” on
the RE. I haven’t tried this:

str =~ /(.*<\/html>)/m

But when parsing HTML, you probably shouldn’t be using REs at all.
This perfectly legal HTML will confuse that RE:

If you need to parse HTML, use an HTML parser; there will always be a
(usually simple) way to defeat a regular expression. “But I can
control the HTML!”, I hear you (generic) say. Sure you can-- now.
But what happens a year down the road when someone else is in charge
of generating it? Best to be safe and do it the right way from the
start.

-=Eric

–
Come to think of it, there are already a million monkeys on a million
typewriters, and Usenet is NOTHING like Shakespeare.
– Blair Houghton.

Michael_Campbell1 · 16 April 2003 21:56

Incidentally, if it’s readability you crave, (“you” in the general
sense), I find myself going to %r{} notation more and more, if for no
other reason than not having to backwhack /'s.
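
A small sketch of the `%r{}` point, assuming the same kind of pattern discussed earlier in the thread. The sample string and variable names are hypothetical. Both literals build an equivalent regexp; the only difference is that '/' needs no backslash inside `%r{}`.

```ruby
# Hypothetical sample input with a newline inside the HTML.
str = "<html> hello something \n world </html>"

# Slash-delimited literal: every '/' in the pattern must be escaped.
with_slashes = /(<html>.*<\/html>)/m

# %r{} literal: '/' can be written as-is; modifiers still go after the brace.
with_percent = %r{(<html>.*</html>)}m

p str[with_slashes, 1] == str[with_percent, 1]  # => true
```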

Daniel_Carrera · 16 April 2003 22:19

I’m just parsing the emails that I get from my hotmail-using friends.
I can’t see any reason why auto-generated html emails would have something
like .

I figure that what I have is good enough for my purposes.

Paul_Duncan · 18 April 2003 17:26

I agree with the sentiment (that using REs to parse HTML, XML, etc.
isn’t a good idea), but the example above isn’t valid. Greedy matching
takes precedence, so it actually works like it should.

···

Eric Schwartz (emschwar@pobox.com) wrote:

“Hal E. Fulton” hal9000@hypermetrics.com writes:

IIRC you can use the multiline modifier “m” on
the RE. I haven’t tried this:

str =~ /(.*<\/html>)/m

But when parsing HTML, you probably shouldn’t be using REs at all.
This perfectly legal HTML will confuse that RE:

–
#!/usr/bin/ruby

str = "

test html bleh "

puts $1 if str =~ /(.*)<\/html>/m

:!./re_test.rb

test html bleh --

Here’s a better example:

–
#!/usr/bin/ruby

str = "
<b>here's some bold text</b>

<b>here's some more bold text</b>"

puts "greedy: #$1" if str =~ /<b>(.*)<\/b>/m
puts "non-greedy: #$1" if str =~ /<b>(.*?)<\/b>/m

:!./re_test.rb
greedy: here's some bold text</b>

<b>here's some more bold text
non-greedy: here's some bold text
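
A minimal sketch of how the greedy/non-greedy choice plays out on the kind of input being argued about. The input string is hypothetical (a closing tag hidden inside a comment) and is not Eric's original example. The greedy pattern runs to the last `</html>`, while the non-greedy one stops at the first, which here sits inside the comment.

```ruby
# Hypothetical message body: the comment contains a stray "</html>".
str = "<html><!-- </html> -->\nreal content\n</html>"

# Greedy: '.*' grabs as much as possible, then backs off to the LAST </html>.
greedy = str[/<html>(.*)<\/html>/m, 1]

# Non-greedy: '.*?' grabs as little as possible, stopping at the FIRST </html>.
non_greedy = str[/<html>(.*?)<\/html>/m, 1]

p greedy      # everything up to the final closing tag
p non_greedy  # cut short at the closing tag inside the comment
```

So the greedy form copes with this particular input, but a second HTML document later in the same string would be swallowed by it too, which is the sort of tweak-and-retry cycle a real parser avoids.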

Daniel_Carrera · 16 April 2003 22:02

Thanks. I didn’t know about the %r{} notation. It really is nicer.

Eric_Schwartz6 · 16 April 2003 23:36

Daniel Carrera dcarrera@math.umd.edu writes:

I’m just parsing the emails that I get from my hotmail-using friends.
I can’t see any reason why auto-generated html emails would have something
like .

The point isn’t so much that (although it is a valid one), it’s that
next you decide to parse other bits further, and the deeper you go
into HTML, the likelihood of your hitting something that a regex can’t
parse increases exponentially, if not faster.

Sure, you can keep tweaking your regexes to account for each of the
cases where they fail, and that’ll keep you busy-- or you can just use
an HTML parser and quit worrying about it.

I figure that what I have is good enough for my purposes.

That’s entirely your call. I’m just trying to point out that to parse
HTML, you really should use an HTML parser-- other solutions may work
in the short term, but they invariably fail in the long run. I’m also
a big fan of using the right tool for the job-- just because you can
loosen a nut on your wheel with a pair of pliers doesn’t mean a tire
iron isn’t a better solution.

-=Eric

Daniel_Carrera · 16 April 2003 23:56

That’s entirely your call. I’m just trying to point out that to parse
HTML, you really should use an HTML parser-- other solutions may work
in the short term, but they invariably fail in the long run. I’m also
a big fan of using the right tool for the job-- just because you can
loosen a nut on your wheel with a pair of pliers doesn’t mean a tire
iron isn’t a better solution.

Definitely. I also believe in using the right tool for the job.

I didn’t fully explain what I was trying to do. You see, I actually am
using an HTML parser, in the form of an external program (lynx -dump).
All I need is to extract the HTML code so I can send it to lynx.

For this an HTML parser looked like using a sledgehammer to swat a fly,
and an RE looked like a flyswatter.

Then again, perhaps calling lynx just for an HTML dump is
sledgehammerish…

What I am trying to do is convert the HTML emails I get into ASCII.
Do you know how to do an html dump using an HTML parser?

Thanks.

Eric_Schwartz6 · 17 April 2003 00:17

Daniel Carrera dcarrera@math.umd.edu writes:

What I am trying to do is convert the HTML emails I get into ASCII.
Do you know how to do an html dump using an HTML parser?

The htmltest.rb file included with html-parser2 on RAA
(libhtml-parser-ruby from debian) does exactly that. Take a look at
it; it appears to be a fairly straightforward port of the Python
library of the same name.

-=Eric

Brian_Candler · 17 April 2003 07:25

Using lynx to convert HTML to formatted ASCII is a good solution for the job
(I have mutt configured to use it for this purpose), although it doesn’t do
a good job of rendering tables.

It sounds to me that what you really want is a MIME parser, which will break
your mail into chunks. The chunk(s) that are declared to contain HTML (by
‘Content-Type: text/html’) can then be sent wholesale to lynx; you don’t
need to look at the content at all.
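
A rough sketch of the "send the chunk to lynx" step, assuming the HTML part has already been pulled out of the mail by whatever MIME parser is used (that part is not shown). The method name `html_to_text` is made up for illustration, and the default command assumes lynx is installed and accepts `-stdin`, `-dump` and `-force_html`; check your local lynx before relying on it.

```ruby
require "open3"

# Pipe an HTML chunk to an external converter and return what it prints.
# The default command is an assumption: lynx reading HTML from standard input.
def html_to_text(html, command = ["lynx", "-stdin", "-dump", "-force_html"])
  output, status = Open3.capture2(*command, stdin_data: html)
  raise "converter failed: #{command.first}" unless status.success?
  output
end

# Example call (needs lynx on the PATH, so it is left commented out):
# puts html_to_text("<html><body><p>hello</p></body></html>")
```

Passing the command as an argument keeps the Ruby side independent of which converter is used; the HTML is never inspected, only handed over.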

Cheers,

Brian.

Josef_Jupp_SCHUGT · 17 April 2003 17:22

Saluton!

Brian Candler B.Candler@pobox.com; 2003-04-17, 12:48 UTC:

Using lynx to convert HTML to formatted ASCII is a good solution
for the job (I have mutt configured to use it for this purpose),
although it doesn’t do a good job of rendering tables.

I use two alternatives for that purpose. The default one is
vilistextum. Let me quote the author:

“vilistextum is a html to ascii converter specifically programmed to
get the best out of incorrect html.”

As a rule of thumb, any ‘HTML’ in an e-mail can be assumed to be
invalid. Main advantage of vilistextum: it is very small and fast (the
i86 binary is around 30 K).

If vilistextum is not enough I use w3m (using mutt’s pipe feature).
w3m does frames and tables, etc.

For mailers that run in Emacs, w3m-emacs is a good choice.

Gis,

Josef ‘Jupp’ Schugt
