HTML StringScanner regexp

Tomas_Fischer · 3 May 2006 23:12

Hi,

I want to parse an HTML string like this:

...
<div align=left><a href=# class=title> title1 </a></div>
...
<div align=left><a href=# class=title> title2 </a></div>
...
<div align=left><a href=# class=title> title3 </a></div>
...
<div align=left><a href=# class=title> title4 </a></div>
...

"..." means other stuff.
I need to extract title1 to title4, so I tried

ScannerScan.scan(/.*class=title>(.*)<\/a><\/div>\n/)

But I get only the last title, title4. Why? Is the regex wrong, or am I
missing the point of the scan method?

Best regards
Tomas

--
Posted via http://www.ruby-forum.com/.

Mike_Fletcher · 3 May 2006 23:35

Tomas Fischer wrote:
[...]

ScannerScan.scan(/.*class=title>(.*)<\/a><\/div>\n/)

But I get only the last title, title4. Why? Is the regex wrong, or am I
missing the point of the scan method?

The leading .* is greedy: it gobbles up as much of the source as it can
(up to the end of the string), and then the regex engine backtracks just
enough to match the last occurrence. You might try this instead:

%r{ class=title>(.*?)</a></div>\n}
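
To see the difference, here is a minimal sketch using String#scan. The
sample string and the /m flag on the greedy pattern are assumptions made
for illustration: in Ruby, . does not match a newline unless /m is given,
so the flag is what lets the greedy .* run across lines and reproduce the
"only title4" result.

```ruby
# Hypothetical sample input: four title rows separated by other stuff.
rows = %w[title1 title2 title3 title4].map do |t|
  "<div align=left><a href=# class=title> #{t} </a></div>\n"
end
filler = "<p>other stuff</p>\n"
html = filler + rows.join(filler) + filler

# Greedy: the leading .* swallows everything up to the last "class=title>",
# so scan finds a single match and captures only the last title.
greedy = html.scan(/.*class=title>(.*)<\/a><\/div>\n/m).flatten.map(&:strip)
# => ["title4"]

# Non-greedy, with no leading .*: each match stops at the first closing
# "</a></div>", so scan yields one capture per row.
lazy = html.scan(%r{ class=title>(.*?)</a></div>\n}).flatten.map(&:strip)
# => ["title1", "title2", "title3", "title4"]

puts greedy.inspect
puts lazy.inspect
```

scan returns an array of capture-group arrays, hence the flatten; the
strip removes the spaces around each title.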

But remember that unless you can guarantee the formatting of your input
won't vary much, you're probably better off using a proper HTML parser to
handle HTML rather than regexen.
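
If you stay with regexes and the scanner in the question is a
StringScanner (an assumption; the post does not say), note that
StringScanner#scan only matches at the current scan pointer, while
scan_until searches ahead for the next match. A rough sketch, with a
made-up sample string:

```ruby
require 'strscan'

# Hypothetical sample input; "..." stands for other stuff.
html = "...\n" \
       "<div align=left><a href=# class=title> title1 </a></div>\n" \
       "...\n" \
       "<div align=left><a href=# class=title> title2 </a></div>\n" \
       "...\n" \
       "<div align=left><a href=# class=title> title3 </a></div>\n" \
       "...\n"

scanner = StringScanner.new(html)
titles = []

# scan_until moves the scan pointer to the end of the next match and
# returns nil once nothing further matches, which ends the loop.
while scanner.scan_until(%r{ class=title>(.*?)</a></div>})
  titles << scanner[1].strip # scanner[1] is the first capture group
end

puts titles.inspect
# => ["title1", "title2", "title3"]
```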

