Hi,
I want to parse an HTML string like this:
...
<div align=left><a href=# class=title> title1 </a></div>
...
<div align=left><a href=# class=title> title2 </a></div>
...
<div align=left><a href=# class=title> title3 </a></div>
...
<div align=left><a href=# class=title> title4 </a></div>
...
"..." means other stuff.
I need to extract title1 to title4, so I tried
ScannerScan.scan(/.*class=title>(.*)<\/a><\/div>\n/) But I get
only the last title -- title4. Why? Is the regex wrong, or am I missing
the point of the scan method?
Best regards
Tomas
--
Posted via http://www.ruby-forum.com/.
Tomas Fischer wrote:
[...]
ScannerScan.scan(/.*class=title>(.*)<\/a><\/div>\n/) But I get
only the last title -- title4. Why? Is the regex wrong, or do I miss the
point with the scan method?
The .* is greedy and gobbles up as much of the source as it can (up to
the end of the string when . is allowed to match newlines, e.g. with the
/m flag) and then the regex engine backtracks just enough
to match the last occurrence. You might try this instead:
%r{ class=title>(.*?)</a></div>\n}
But remember that unless you can guarantee the formatting of your input
won't vary much you're probably better off using a proper HTML parser to
handle HTML rather than regexen.
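Here is a minimal sketch of the greedy versus non-greedy difference, assuming the input looks like the sample in the question. The html string and its "other stuff" lines are made up for illustration, and the /m flag on the greedy pattern is my assumption, added so that . can cross newlines and reproduce the title4-only result:

```ruby
# Sample input modelled on the question; the <p> lines stand in for "...".
html = <<~HTML
  <p>other stuff</p>
  <div align=left><a href=# class=title> title1 </a></div>
  <p>other stuff</p>
  <div align=left><a href=# class=title> title2 </a></div>
  <p>other stuff</p>
  <div align=left><a href=# class=title> title3 </a></div>
  <p>other stuff</p>
  <div align=left><a href=# class=title> title4 </a></div>
HTML

# Greedy: with /m (an assumption), the leading .* swallows the whole string
# and backtracks only as far as the LAST "class=title>", so scan finds a
# single match.
greedy = html.scan(/.*class=title>(.*)<\/a><\/div>\n/m)
p greedy  # => [[" title4 "]]

# Non-greedy: no leading .*, and (.*?) stops at the first "</a></div>",
# so scan finds one match per title.
titles = html.scan(%r{ class=title>(.*?)</a></div>\n}).flatten.map(&:strip)
p titles  # => ["title1", "title2", "title3", "title4"]
```

scan returns an array of capture-group arrays when the pattern contains groups, hence the flatten before stripping the surrounding whitespace.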