Extracting text from HTML

Robo · 10 May 2003 13:05

Given an HTML file, I’m looking for a regex that can give me the text that
matches the following:

<table width="132" border="0" cellpadding="0" cellspacing="0">ANYTHING</table>

So far I’ve got the following:

file = File.new('file.txt', 'r')
s = file.read
file.close

channelBlocks = s.scan(/<table width="132" border="0" cellpadding="0" cellspacing="0">(.*?)<\/table>/)

channelBlocks.each do |s|
  puts s
end

But it doesn’t work. Is it because the match spans multiple lines
in the HTML? If so, what’s the best way to solve it?

Robo

Kent_Dahl2 · 10 May 2003 13:26

Robo wrote:

channelBlocks = s.scan(/<table width="132" border="0" cellpadding="0" cellspacing="0">(.*?)<\/table>/)

channelBlocks.each do |s|
puts s
end

But it doesn’t work. Is it because the match spans multiple lines
in the HTML? If so, what’s the best way to solve it?

Yes, it is due to the multiple lines. Try

channelBlocks = Regexp.new(
'
<table width="132" border="0" cellpadding="0" cellspacing="0">
(.*)
</table>
',
Regexp::MULTILINE
)

Watch out for greed and nested tables, though.
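A minimal sketch of what the flag changes, using a made-up snippet (the `html` string and the shortened pattern are assumptions, not taken from the file in question). Without `Regexp::MULTILINE`, `.` stops at a newline, so a table that spans several lines never matches:

```ruby
# Hypothetical sample input; the real file is not shown in the thread.
html = "<table width=\"132\">\nrow one\nrow two\n</table>"

# Shortened pattern for illustration only.
pattern = '<table width="132">(.*)</table>'

single_line = Regexp.new(pattern)                     # "." does not match "\n"
multi_line  = Regexp.new(pattern, Regexp::MULTILINE)  # "." also matches "\n"

no_match = single_line.match(html)    # nil, because the table spans lines
captured = multi_line.match(html)[1]  # everything between the two tags

p no_match
p captured
```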

I gave up on trying to make the literal Regexp work, escaping things by
hand is a real chore.
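One possible way around the escaping chore, offered as a suggestion rather than something from the thread: a `%r{...}` literal lets `/` and `"` appear unescaped, and a trailing `m` is the literal form of `Regexp::MULTILINE`. The `sample` string below is invented for illustration.

```ruby
# %r{...} avoids escaping the "/" in "</table>"; the trailing "m" is the
# literal equivalent of passing Regexp::MULTILINE to Regexp.new.
table_re = %r{<table width="132" border="0" cellpadding="0" cellspacing="0">(.*)</table>}m

# Hypothetical sample input.
sample = %(<table width="132" border="0" cellpadding="0" cellspacing="0">\nBBC One\n</table>)

# String#[] with a regex and a group index returns that capture (or nil).
inner = sample[table_re, 1]
p inner
```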

HTH

–
([ Kent Dahl ]/)_ ~ [ http://www.stud.ntnu.no/~kentda/ ]/~
))_student/(( _d L b_/ NTNU - graduate engineering - 5. year )
( __õ|õ// ) )Industrial economics and technological management(
_/ö____/ (_engineering.discipline=Computer::Technology)

Kent_Dahl2 · 10 May 2003 13:26

Kent Dahl wrote:

channelBlocks = Regexp.new(
'
<table width="132" border="0" cellpadding="0" cellspacing="0">
(.*)
</table>
',
Regexp::MULTILINE
)

Arg, line wrapping was too aggressive there…

channelBlocks = Regexp.new(
  '<table width="132" border="0" cellpadding="0" ' +
  'cellspacing="0">(.*)</table>',
  Regexp::MULTILINE
)


Robo · 10 May 2003 23:49

Arg, line wrapping was too aggressive there…

channelBlocks = Regexp.new(
  '<table width="132" border="0" cellpadding="0" ' +
  'cellspacing="0">(.*)</table>',
  Regexp::MULTILINE
)

Thanks, just a slight problem: I want it to match from the start of the
<table> tag to the next </table> tag, but I think Ruby matched it with the
last </table> tag in the file.

In the file there are three of these tables with width 132, and I want to
extract all three. This is what I have so far:

file = File.new('your_tv_guide.txt', 'r')
s = file.read
file.close

#puts s
regex = Regexp.new(
  '<table width="132" border="0" cellpadding="0" ' +
  'cellspacing="0">(.*)</table>',
  Regexp::MULTILINE
)

channelBlocks = s.scan(regex)

channelBlocks.each do |s|
  puts s
end

David_A_Black2 · 11 May 2003 00:06

Hi –

Arg, line wrapping was too aggressive there…

channelBlocks = Regexp.new(
  '<table width="132" border="0" cellpadding="0" ' +
  'cellspacing="0">(.*)</table>',
  Regexp::MULTILINE
)

Thanks, just a slight problem: I want it to match from the start of the
<table> tag to the next </table> tag, but I think Ruby matched it with the
last </table> tag in the file.

That’s why Kent told you to watch out for regex ‘greed’.

If you change .* to .*? you should be in business.
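A small sketch of the difference, assuming a made-up input with three tables and a shortened pattern (neither is from the real file). The greedy `.*` runs to the last `</table>`, while `.*?` stops at the first one it can:

```ruby
# Hypothetical input with three small tables, standing in for the real file.
html = <<~HTML
  <table width="132">
  Channel 1
  </table>
  <p>filler</p>
  <table width="132">
  Channel 2
  </table>
  <table width="132">
  Channel 3
  </table>
HTML

greedy = %r{<table width="132">(.*)</table>}m
lazy   = %r{<table width="132">(.*?)</table>}m

greedy_blocks = html.scan(greedy)  # a single match, ending at the last </table>
lazy_blocks   = html.scan(lazy)    # one match per table

puts greedy_blocks.size
puts lazy_blocks.size
puts lazy_blocks.flatten.map(&:strip)
```

The nested-table caveat still applies: `.*?` stops at the first `</table>` it meets, so with a table inside a table it would end at the inner closing tag.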

channelBlocks.each do |s|
puts s
end

Much simpler way to print an array:

puts channelBlocks

or

puts channel_blocks

(if you want to do your bit to preserve classic Ruby style).
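As a quick illustration with made-up data: `scan` with one capture group returns an array of one-element arrays, and `puts` writes each element of a (nested) array on its own line, so the explicit `each` loop produces the same output. The `capture` helper is hypothetical and only exists to compare the two.

```ruby
require 'stringio'

# Shape of what scan returns for a regex with one capture group.
channel_blocks = [["Channel 1"], ["Channel 2"], ["Channel 3"]]

# Hypothetical helper: collect whatever the block writes into a string.
def capture
  out = StringIO.new
  yield out
  out.string
end

with_each = capture { |io| channel_blocks.each { |s| io.puts s } }
with_puts = capture { |io| io.puts channel_blocks }

puts with_puts
puts with_each == with_puts
```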

David

On Sun, 11 May 2003, Robo wrote:

–
David Alan Black
home: dblack@superlink.net
work: blackdav@shu.edu
Web: http://pirate.shu.edu/~blackdav

Kent_Dahl2 · 11 May 2003 07:11

dblack@superlink.net wrote:

If you change .* to .*? you should be in business.

Oh dear, looks like I removed that part from the original code posted by
Robo when testing. My regexp-nuby status is showing. Guess a good RE
book should be on my wishlist for my birthday then.


Brian_Candler · 11 May 2003 08:29

Beware that regexp syntax and features vary from language to language, so
don’t expect the more esoteric examples to work in Ruby without some
modification.
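One concrete instance of that variation, as a small sketch with invented strings: in Ruby the `m` flag makes `.` match newlines (Perl and PCRE spell that flag `s`), while `^` matches at the start of every line in Ruby with no flag at all.

```ruby
text = "first line\nsecond line"

# Ruby: "m" lets "." cross newlines. Perl/PCRE call this flag "s".
dot_default   = text[/first.*second/]   # no match without the flag
dot_multiline = text[/first.*second/m]  # matches across the newline

# Ruby: "^" already matches after any newline, no flag needed.
# \A anchors at the very start of the string instead.
caret_hit  = text[/^second/]
string_hit = text[/\Asecond/]

p dot_default, dot_multiline, caret_hit, string_hit
```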

You might want to try reading the documentation for “PCRE”, the
Perl-Compatible Regular Expressions library by Philip Hazel, author of
Exim. He writes excellent documents! Find it on freshmeat.net.

Regards,

Brian.

On Sun, May 11, 2003 at 04:11:35PM +0900, Kent Dahl wrote:

Oh dear, looks like I removed that part from the original code posted by
Robo when testing. My regexp-nuby status is showing. Guess a good RE
book should be on my wishlist for my birthday then.

David_A_Black2 · 11 May 2003 13:12

Hi –

On Sun, 11 May 2003, Brian Candler wrote:

On Sun, May 11, 2003 at 04:11:35PM +0900, Kent Dahl wrote:

Oh dear, looks like I removed that part from the original code posted by
Robo when testing. My regexp-nuby status is showing. Guess a good RE
book should be on my wishlist for my birthday then.

Beware that regexp syntax and features vary from language to language, so
don’t expect the more esoteric examples to work in Ruby without some
modification.

I’d recommend “Mastering Regular Expressions” by Jeff Friedl for the
wishlist; it takes those variations into account, and the 2nd edition
includes Ruby.

David

