I want an RE that matches this string, but ‘.’ doesn’t match the \n
character. I guess I could replace the \n’s by something else, but I’d
rather not alter the original string if I can avoid it.
Any ideas?
Thanks.
–
Daniel Carrera
Graduate Teaching Assistant. Math Dept.
University of Maryland. (301) 405-5137
IIRC you can use the multiline modifier “m” on
the RE. I haven’t tried this:
str =~ /(.*<\/html>)/m
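A minimal sketch of what the “m” modifier changes, for anyone who wants to try it. The sample string is made up for illustration. In Ruby, /m makes ‘.’ match "\n" as well (Perl calls the equivalent modifier /s), and the ‘/’ in the closing tag has to be escaped inside a /.../ literal.

```ruby
# Hypothetical input: an HTML fragment spread over several lines.
str = "<html>\n<body>hello</body>\n</html>\ntrailing text\n"

# Without /m, '.' does not match "\n", so the match is confined to a
# single line. Only the line holding the closing tag can match.
without_m = str[/(.*<\/html>)/, 1]

# With /m, '.' also matches "\n", so .* can run across lines and the
# capture starts at the beginning of the string.
with_m = str[/(.*<\/html>)/m, 1]

puts without_m.inspect
puts with_m.inspect
```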
Thanks! that worked.
–
Daniel Carrera
But when parsing HTML, you probably shouldn’t be using REs at all.
This perfectly legal HTML will confuse that RE:
If you need to parse HTML, use an HTML parser; there will always be a
(usually simple) way to defeat a regular expression. “But I can
control the HTML!”, I hear you (generic) say. Sure you can-- now.
But what happens a year down the road when someone else is in charge
of generating it? Best to be safe and do it the right way from the
start.
-=Eric
–
Come to think of it, there are already a million monkeys on a million
typewriters, and Usenet is NOTHING like Shakespeare.
– Blair Houghton.
Incidentally, if it’s readability you crave, (“you” in the general
sense), I find myself going to %r{} notation more and more, if for no
other reason than not having to backwhack /'s.
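For comparison, a small sketch of the same pattern written both ways. The sample string is invented for the example. With %r{} the ‘/’ in the closing tag needs no backslash.

```ruby
# Hypothetical sample input.
str = "<html>\n<p>hi</p>\n</html>\n"

slash_form = /(.*<\/html>)/m    # '/' must be escaped inside /.../
brace_form = %r{(.*</html>)}m   # no escaping needed inside %r{...}

# Both patterns capture the same text.
same = slash_form.match(str)[1] == brace_form.match(str)[1]
puts same
```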
I’m just parsing the emails that I get from my hotmail-using friends.
I can’t see any reason why auto-generated html emails would have something
like .
I figure that what I have is good enough for my purposes.
On Thu, Apr 17, 2003 at 07:15:23AM +0900, Eric Schwartz wrote:
If you need to parse HTML, use an HTML parser; there will always be a
(usually simple) way to defeat a regular expression. “But I can
control the HTML!”, I hear you (generic) say. Sure you can-- now.
But what happens a year down the road when someone else is in charge
of generating it? Best to be safe and do it the right way from the
start.
–
Daniel Carrera
I agree with the sentiment (that using REs to parse HTML, XML, etc.
isn’t a good idea), but the example above isn’t valid. Greedy matching
takes precedence, so it actually works like it should.
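The HTML example under discussion did not survive in this copy of the thread, so the input below is a made-up stand-in: a document with a stray closing tag inside a comment. It is only a sketch of the greedy-matching point, and it also shows how a non-greedy pattern would get cut short.

```ruby
# Hypothetical input, not the example from the original post.
html = "<html>\n<!-- </html> -->\n<body>hi</body>\n</html>\n"

# Greedy .* grabs as much as it can, then backtracks to the LAST </html>.
greedy = html[%r{(.*</html>)}m, 1]

# Non-greedy .*? stops at the FIRST </html>, inside the comment.
lazy = html[%r{(.*?</html>)}m, 1]

puts greedy.inspect
puts lazy.inspect
```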
Thanks. I didn’t know about the %r{} notation. It really is nicer.
On Thu, Apr 17, 2003 at 06:56:48AM +0900, Michael Campbell wrote:
Incidentally, if it’s readability you crave, (“you” in the general
sense), I find myself going to %r{} notation more and more, if for no
other reason than not having to backwhack /'s.
–
Daniel Carrera
I’m just parsing the emails that I get from my hotmail-using friends.
I can’t see any reason why auto-generated html emails would have something
like .
The point isn’t so much that (although it is a valid one); it’s that
next you’ll decide to parse other bits further, and as you go deeper
into HTML, the likelihood of your hitting something that a regex can’t
parse increases exponentially, if not faster.
Sure, you can keep tweaking your regexes to account for each of the
cases where they fail, and that’ll keep you busy-- or you can just use
an HTML parser and quit worrying about it.
I figure that what I have is good enough for my purposes.
That’s entirely your call. I’m just trying to point out that to parse
HTML, you really should use an HTML parser-- other solutions may work
in the short term, but they invariably fail in the long run. I’m also
a big fan of using the right tool for the job-- just because you can
loosen a nut on your wheel with a pair of pliers doesn’t mean a tire
iron isn’t a better solution.
-=Eric
Definitely. I also believe in using the right tool for the job.
I didn’t fully explain what I was trying to do. You see, I actually am
using an HTML parser, in the form of an external program (lynx -dump).
All I need is to extract the HTML code so I can send it to lynx.
For this an HTML parser looked like using a sledgehammer to swat a fly,
and an RE looked like a flyswatter.
Then again, perhaps calling lynx just for an HTML dump is
sledgehammerish…
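A rough sketch of that approach, under some assumptions: the helper names extract_html and html_to_text are invented for illustration, lynx is assumed to be installed and on the PATH, and the message is assumed to contain a single <html>...</html> block.

```ruby
require "open3"

# Pull the HTML document out of a raw mail body (hypothetical helper).
# Returns nil when no <html ... </html> span is found.
def extract_html(mail_text)
  mail_text[%r{<html.*</html>}mi]
end

# Feed HTML to an external converter on stdin and return its stdout
# (hypothetical helper). The default command assumes lynx is available.
def html_to_text(html, command = ["lynx", "-dump", "-stdin", "-force_html"])
  output, status = Open3.capture2(*command, stdin_data: html)
  raise "#{command.first} failed" unless status.success?
  output
end

# Possible usage, with raw_mail holding the message text:
#   html = extract_html(raw_mail)
#   puts html_to_text(html) if html
```

The converter command is a parameter so that another program that reads HTML on stdin can be swapped in without touching the extraction step.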
What I am trying to do is convert the HTML emails I get into ASCII.
Do you know how to do an html dump using an HTML parser?
Thanks.
–
Daniel Carrera
The htmltest.rb file included with html-parser2 on RAA
(libhtml-parser-ruby from Debian) does exactly that. Take a look at
it; it appears to be a fairly straightforward port of the Python
library of the same name.
-=Eric
Using lynx to convert HTML to formatted ASCII is a good solution for the job
(I have mutt configured to use it for this purpose), although it doesn’t do
a good job of rendering tables.
It sounds to me that what you really want is a MIME parser, which will break
your mail into chunks. The chunk(s) that are declared to contain HTML (by
‘Content-Type: text/html’) can then be sent wholesale to lynx; you don’t
need to look at the content at all.
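To make the idea concrete, here is a toy sketch of splitting a message on its MIME boundary and keeping only the text/html chunks. It is hand-rolled for illustration only: the helper names are invented, it ignores charsets and many corner cases of the MIME format, and a maintained MIME library would be the better choice for real mail.

```ruby
# Split a raw message (or MIME part) into a header hash and a body string.
def split_headers(raw)
  head, body = raw.split(/\r?\n\r?\n/, 2)
  headers = {}
  # Unfold continuation lines, then read "Name: value" pairs.
  head.to_s.gsub(/\r?\n[ \t]+/, " ").each_line do |line|
    name, value = line.chomp.split(/:\s*/, 2)
    headers[name.downcase] = value if name && value
  end
  [headers, body.to_s]
end

# Undo the Content-Transfer-Encoding of a part body.
def decode_body(body, encoding)
  case encoding.to_s.downcase
  when "base64"           then body.unpack1("m")
  when "quoted-printable" then body.unpack1("M")
  else body
  end
end

# Return the decoded bodies of all text/html parts, descending into
# nested multipart containers.
def html_parts(raw)
  headers, body = split_headers(raw)
  type = headers["content-type"].to_s
  if type =~ %r{\Amultipart/}i && type =~ /boundary="?([^";]+)"?/i
    boundary = Regexp.last_match(1)
    # Keep what precedes the closing delimiter, split on the delimiter,
    # and drop the preamble that comes before the first part.
    inner = body.split("--#{boundary}--", 2).first.to_s
    chunks = inner.split("--#{boundary}").drop(1)
    chunks.flat_map { |chunk| html_parts(chunk.sub(/\A[ \t]*\r?\n/, "")) }
  elsif type =~ %r{\Atext/html}i
    [decode_body(body, headers["content-transfer-encoding"])]
  else
    []
  end
end
```

Each string returned by html_parts could then be handed to lynx as is, without searching the message text for tags.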
Cheers,
Brian.
On Thu, Apr 17, 2003 at 08:56:57AM +0900, Daniel Carrera wrote:
Definitely. I also believe in using the right tool for the job.
I didn’t fully explain what I was trying to do. You see, I actually am
using an HTML parser, in the form of an external program (lynx -dump).
All I need is to extract the HTML code so I can send it to lynx.
For this an HTML parser looked like using a sledgehammer to swat a fly,
and an RE looked like a flyswatter.
Then again, perhaps calling lynx just for an HTML dump is
sledgehammerish…
Using lynx to convert HTML to formatted ASCII is a good solution
for the job (I have mutt configured to use it for this purpose),
although it doesn’t do a good job of rendering tables.
I use two alternatives for that purpose. The default one is
vilistextum. Let me quote the author:
“vilistextum is a html to ascii converter specifically programmed to
get the best out of incorrect html.”
As a rule of thumb any ‘HTML’ in an e-mail can be assumed to be
invalid. Main advantage of vilistextum: it is very small and fast
(i86 binary is around 30 K).
If vilistextum is not enough I use w3m (using mutt’s pipe feature).
w3m does frames and tables, etc.
For mailers that run in Emacs, w3m-emacs is a good choice.