HTML parsing using regular expressions

I'm new to Ruby and trying to use regular expressions to parse an html
file. The page is a large table with no spaces in the html code. I want
to count the number of times <tr> or <tr 'anything'> occurs. I'm stuck
on trying to match every variety of <tr>

I've tried

op_file = File.read(htmlfile)
if op_file =~ /(<tr(.*?)>)+/

but it catches the first <tr and matches all the way to the end of the
file. Anyone have any advice on matching and counting?

-Shinkaku

--
Posted via http://www.ruby-forum.com/.

Don't. Use Hpricot instead. Your brain will thank you for it.

I haven't used Hpricot, but I've heard great things about it. I've
tried to do HTML parsing with regexen, and it's a mug's game.
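
To illustrate why pattern matching on HTML tends to go wrong, here is a small sketch. The sample markup is made up for this example and is not the poster's file. It shows a commented-out row being counted by a plain pattern, and one possible workaround.

```ruby
# Hypothetical sample: one real row plus one row that is commented out.
html = "<table><tr><td>1</td></tr><!-- <tr><td>old</td></tr> --></table>"

# A plain pattern match counts both rows, because it knows nothing
# about HTML comments.
naive_count = html.scan(/<tr\b[^>]*>/i).size

# One workaround is to strip comments first (/m lets '.' span newlines).
without_comments = html.gsub(/<!--.*?-->/m, "")
visible_count = without_comments.scan(/<tr\b[^>]*>/i).size

puts naive_count    # 2
puts visible_count  # 1
```

Comments are only one such case. Each one needs its own patch when working with regexen, which is the usual argument for a real parser.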

-austin

On 10/24/06, Anthony Walsh <akakuda@excite.com> wrote:

I'm new to Ruby and trying to use regular expressions to parse an html
file.

--
Austin Ziegler * halostatue@gmail.com * http://www.halostatue.ca/
               * austin@halostatue.ca * You are in a maze of twisty little passages, all alike. // halo • statue
               * austin@zieglers.ca

Anthony Walsh wrote:

I'm new to Ruby and trying to use regular expressions to parse an html
file. The page is a large table with no spaces in the html code. I want
to count the number of times <tr> or <tr 'anything'> occurs. I'm stuck
on trying to match every variety of <tr>

I've tried

op_file = File.read(htmlfile)
if op_file =~ /(<tr(.*?)>)+/

but it catches the first <tr and matches all the way to the end of the
file. Anyone have any advice on matching and counting?

You need to tell us whether you have read the replies you received to this
same question when you asked it eight hours ago. I answered your question,
as did several others, but you have not given any indication that you saw
the replies.

Here is one answer:

#!/usr/bin/ruby -w

path = "path-to-HTML-page"

data = File.read(path)

# scan returns every non-overlapping match, not just the first one
array = data.scan(%r{<tr.*?>})

puts array.size # gives a count of occurrences

puts array # shows the matches

Please read replies before posting again.
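
A note on the two approaches in this thread: `=~` only reports the position of the first match, so it cannot give a count, whereas `scan` collects every match. The pattern in the answer above can also be tightened a little. The following is only a sketch under assumptions: the sample markup, the constant `ROW_TAG`, and the helper `count_rows` are invented for illustration and do not come from the original posts.

```ruby
# Hypothetical sample input; in practice this would come from File.read(path).
html = [
  '<table>',
  '<TR class="odd"><td>a</td></TR>',
  '<tr><td>b</td></tr>',
  "<tr\n id=\"x\"><td>c</td></tr>",
  '<track>',
  '</table>'
].join

# \b    keeps longer tag names such as <track> from matching,
# [^>]* stops at the first '>' and, unlike '.', also spans newlines,
# /i    accepts <TR> as well as <tr>.
ROW_TAG = /<tr\b[^>]*>/i

# Counts opening row tags in the given HTML string.
def count_rows(html)
  html.scan(ROW_TAG).size
end

puts count_rows(html)             # 3
puts html.scan(%r{<tr.*?>}).size  # 2 (misses <TR ...> and the multi-line tag, counts <track>)
```

For a page that really is one large table with lowercase tags on a single line, both patterns give the same count. The stricter one only matters when the markup is less regular.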

--
Paul Lutus
http://www.arachnoid.com