Parsing HTML code with regex

Anthony_Walsh · 24 October 2006 18:34

I'm trying to parse through some html code and count the number of times
a match happens. The file is a large table with a ton of <tr> and <tr
'something'>. There are no spaces in the file. I'm trying to count and
print each <tr> and <tr 'something'>.

I haven't even gotten to counting my matches. I'm still working on
matching with <tr> or <tr 'anything'>

I've done:

op_file = HTML_CODE
if op_file =~ /(<tr(.*?)>)+/

but it catches everything from the first <tr on to the end of the line.
Any ideas?

-Shinkaku

--
Posted via http://www.ruby-forum.com/.

Paul_Lutus · 24 October 2006 18:55

Anthony Walsh wrote:

I'm trying to parse through some html code and count the number of times
a match happens. The file is a large table with a ton of <tr> and <tr
'something'>. There are no spaces in the file. I'm trying to count and
print each <tr> and <tr 'something'>.

I haven't even gotten to counting my matches. I'm still working on
matching with <tr> or <tr 'anything'>

I've done:

op_file = HTML_CODE
if op_file =~ /(<tr(.*?)>)+/

You want if op_file =~ /<tr.*?>/

But see below.

but it catches everything from the first <tr on to the end of the line.

Also, try scanning for matches, like this:

#!/usr/bin/ruby -w

path="path-to-HTML-page"

data = File.read(path)

array = data.scan(%r{<tr.*?>})

puts array
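
Since the original goal was to count the matches as well as print them, here is a minimal sketch of that step built on the same scan call. The sample string is made up for illustration (the real file is not shown in the thread), and the variable names are only placeholders.

```ruby
# Made-up sample input; the thread's actual HTML file is not shown.
html = "<table><tr><td>a</td></tr><tr class='odd'><td>b</td></tr>" \
       "<tr id='x'><td>c</td></tr></table>"

# With no capture groups in the pattern, scan returns an array of
# strings, one per non-overlapping match.
rows = html.scan(%r{<tr.*?>})

puts rows        # each matched opening tag on its own line
puts rows.size   # how many opening <tr ...> tags were found
```

Closing tags start with "</tr", so this pattern does not count them.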

--
Paul Lutus
http://www.arachnoid.com

Phlip5 · 24 October 2006 19:05

Anthony Walsh wrote:

I'm trying to parse through some html code and count the number of times
a match happens.

If the code is not yet XHTML, use Tidy to upgrade it.

Then parse it with XPath, looking for your match.
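
As a rough sketch of the XPath route, assuming the markup is already well-formed XHTML (for example after a Tidy pass) and that REXML is available (it shipped with Ruby at the time and is a bundled gem in current versions). The fragment below is made up for illustration.

```ruby
require 'rexml/document'

# Hypothetical well-formed fragment; real input would be the tidied page.
xhtml = <<~XML
  <table>
    <tr><td>a</td></tr>
    <tr class="odd"><td>b</td></tr>
  </table>
XML

doc = REXML::Document.new(xhtml)

# The XPath expression //tr selects every tr element in the document.
rows = REXML::XPath.match(doc, '//tr')

rows.each { |row| puts row.attributes['class'].inspect }  # nil when absent
puts rows.size
```

Unlike a regex, this counts elements rather than text that merely looks like a tag, but it will raise a parse error on markup that is not well-formed.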

(Tip: All HTML that you control should be XHTML, of the highest quality.
Don't rely on sloppy HTML and "browser forgiveness"!)

--
Phlip
http://www.greencheese.us/ZeekLand <-- NOT a blog!!!

Michael_Perle · 25 October 2006 12:10

Anthony Walsh wrote:

I'm trying to parse through some html code and count the number of times a match happens. The file is a large table with a ton of <tr> and <tr 'something'>. There are no spaces in the file. I'm trying to count and print each <tr> and <tr 'something'>.

I haven't even gotten to counting my matches. I'm still working on matching with <tr> or <tr 'anything'>

I've done:

op_file = HTML_CODE
if op_file =~ /(<tr(.*?)>)+/

You are always parsing only one line.
Perhaps you mean a regular expression like

/(<tr([^>]*?)>)+/m

Anyway, I am not sure that the if ... is the right
construct. Don't you want the return value
of the match? That gives you a MatchData
object, from which you can get the results as
an array or the like.
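
To illustrate the MatchData point, here is a small sketch using String#match on a made-up sample string. The pattern is a simplified variant chosen for the example, not the one from the original post.

```ruby
html = "<table><tr class='odd'><td>b</td></tr></table>"  # made-up sample

# String#match returns a MatchData object, or nil when nothing matches.
m = html.match(/<tr([^>]*)>/)

if m
  puts m[0]                # the whole match
  puts m[1]                # first capture group: the text after "<tr"
  puts m.captures.inspect  # all capture groups as an array
  puts m.begin(0)          # offset at which the match starts
end
```

A MatchData object describes only the first match; to collect every match in the string, scan is the more direct tool.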

MP

David_A_Black3 · 25 October 2006 12:45

Hi --

On Wed, 25 Oct 2006, Michael Perle wrote:

Anthony Walsh wrote:

I'm trying to parse through some html code and count the number of times a match happens. The file is a large table with a ton of <tr> and <tr 'something'>. There are no spaces in the file. I'm trying to count and print each <tr> and <tr 'something'>.

I haven't even gotten to counting my matches. I'm still working on matching with <tr> or <tr 'anything'>

I've done:

op_file = HTML_CODE
if op_file =~ /(<tr(.*?)>)+/

You are always parsing only one line.
Perhaps you mean a regular expression like

/(<tr([^>]*?)>)+/m

The /m doesn't make any difference there, because you're not using the
wildcard dot. /m just adds \n to the dot class.
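
A quick way to check this for yourself, as a sketch on a made-up tag with a newline inside it: the dot version depends on /m, while the negated character class behaves the same either way.

```ruby
tag = "<tr\nclass='odd'>"  # made-up tag containing a newline

dot_plain   = tag =~ /<tr.*?>/     # dot does not match \n without /m
dot_multi   = tag =~ /<tr.*?>/m    # /m lets the dot match \n
class_plain = tag =~ /<tr[^>]*>/   # [^>] matches \n regardless of /m
class_multi = tag =~ /<tr[^>]*>/m  # so /m changes nothing here

p dot_plain, dot_multi, class_plain, class_multi
```

Each =~ returns the offset of the match, or nil when there is none.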

David

--
David A. Black | dblack@wobblini.net
Author of "Ruby for Rails" [1] | Ruby/Rails training & consultancy [3]
DABlog (DAB's Weblog) [2] | Co-director, Ruby Central, Inc. [4]
[1] Ruby for Rails | [3] http://www.rubypowerandlight.com
[2] http://dablog.rubypal.com | [4] http://www.rubycentral.org

Anthony_Walsh · 25 October 2006 15:14

Also, try scanning for matches, like this:

#!/usr/bin/ruby -w

path="path-to-HTML-page"

data = File.read(path)

array = data.scan(%r{<tr.*?>})

puts array

Thanks, this worked.

--
Posted via http://www.ruby-forum.com/.
