Extracting text from HTML

Given an HTML file, I’m looking for a regex that can give me the text that
matches the following:

<table width="132" border="0" cellpadding="0" cellspacing="0">ANYTHING</table>

So far I’ve got the following:

file = File.new('file.txt', 'r')
s = file.read
file.close

channelBlocks = s.scan(/<table width="132" border="0" cellpadding="0" cellspacing="0">(.*?)<\/table>/)

channelBlocks.each do |s|
  puts s
end

But it doesn’t work. Is it because the match spans multiple lines in the
HTML? If so, what’s the best way to solve it?

Robo

Robo wrote:

channelBlocks = s.scan(/<table width="132" border="0" cellpadding="0" cellspacing="0">(.*?)<\/table>/)

channelBlocks.each do |s|
  puts s
end

But it doesn’t work. Is it because the match spans multiple lines in the
HTML? If so, what’s the best way to solve it?

Yes, it is due to the multiple lines. Try

channelBlocks = Regexp.new(
'<table width="132" border="0" cellpadding="0"
cellspacing="0">(.*)</table>',
Regexp::MULTILINE
)

Watch out for greed and nested tables, though.
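
A minimal sketch of what the flag changes, using a made-up three-line string (the `html` variable and its shortened table tag are assumptions for illustration, not the poster's file):

```ruby
# Hypothetical sample input; the real file is not shown in the thread.
html = "<table width=\"132\">\n<tr><td>News</td></tr>\n</table>"

# Without MULTILINE, "." does not match a newline, so nothing is found.
single_line = Regexp.new('<table width="132">(.*?)</table>')
no_flag_result = html.scan(single_line)

# With MULTILINE, "." also matches newlines, so the block is captured.
multi_line = Regexp.new('<table width="132">(.*?)</table>', Regexp::MULTILINE)
flag_result = html.scan(multi_line)

p no_flag_result  # empty array
p flag_result     # one capture holding the table's contents
```

Because the pattern has one capture group, scan returns an array of one-element arrays.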

I gave up on trying to make the literal Regexp work; escaping things by
hand is a real chore. :-P

HTH

([ Kent Dahl ]/)_ ~ [ http://www.stud.ntnu.no/~kentda/ ]/~
))_student
/(( _d L b_/ NTNU - graduate engineering - 5. year )
( __õ|õ// ) )Industrial economics and technological management(
_
/ö____/ (_engineering.discipline=Computer::Technology)

Kent Dahl wrote:

channelBlocks = Regexp.new(
'<table width="132" border="0" cellpadding="0"
cellspacing="0">(.*)</table>',
Regexp::MULTILINE
)

Arg, line wrapping was too aggressive there…

channelBlocks = Regexp.new(
  '<table width="132" border="0" cellpadding="0" ' +
  'cellspacing="0">(.*)</table>',
  Regexp::MULTILINE
)

Arg, line wrapping was too aggressive there…

channelBlocks = Regexp.new(
  '<table width="132" border="0" cellpadding="0" ' +
  'cellspacing="0">(.*)</table>',
  Regexp::MULTILINE
)

Thanks, just a slight problem: I want it to match from the start of the
<table> tag to the next </table> tag, but I think Ruby matched it with the
last </table> tag in the file.

In the file, there are three of these tables with width 132, and I want to
extract all three. This is what I have so far:

file = File.new('your_tv_guide.txt', 'r')
s = file.read
file.close

#puts s
regex = Regexp.new(
  '<table width="132" border="0" cellpadding="0" ' +
  'cellspacing="0">(.*)</table>',
  Regexp::MULTILINE
)

channelBlocks = s.scan(regex)

channelBlocks.each do |s|
  puts s
end

Hi –

Arg, line wrapping was too aggressive there…

channelBlocks = Regexp.new(
  '<table width="132" border="0" cellpadding="0" ' +
  'cellspacing="0">(.*)</table>',
  Regexp::MULTILINE
)

Thanks, just a slight problem: I want it to match from the start of the
<table> tag to the next </table> tag, but I think Ruby matched it with the
last </table> tag in the file.

That’s why Kent told you to watch out for regex ‘greed’ :-)

If you change .* to .*? you should be in business.
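
A rough sketch of the difference, assuming a made-up string with two tables standing in for the real TV guide file (the `html`, `open_tag`, `greedy` and `lazy` names are hypothetical, and the table tag is shortened for readability):

```ruby
# Hypothetical input with two tables; stands in for the real file.
html = "<table width=\"132\">\nA\n</table>\n<p>x</p>\n" \
       "<table width=\"132\">\nB\n</table>"

open_tag = '<table width="132">'

# Greedy: (.*) runs on to the LAST </table> in the string.
greedy = Regexp.new(open_tag + '(.*)</table>', Regexp::MULTILINE)

# Non-greedy: (.*?) stops at the NEXT </table>.
lazy = Regexp.new(open_tag + '(.*?)</table>', Regexp::MULTILINE)

greedy_blocks = html.scan(greedy).flatten
lazy_blocks = html.scan(lazy).flatten

puts greedy_blocks.length  # 1
puts lazy_blocks.length    # 2
```

The non-greedy form still breaks on nested tables, since it stops at the first closing tag it sees, whichever table that tag belongs to.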

channelBlocks.each do |s|
  puts s
end

Much simpler way to print an array:

puts channelBlocks

or

puts channel_blocks

(if you want to do your bit to preserve classic Ruby style :-)
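
A small sketch of why the two forms print the same thing, using a made-up nested array shaped like what scan returns for a pattern with one capture group (the channel names are invented):

```ruby
# Hypothetical data in the shape scan returns: an array of one-element arrays.
channel_blocks = [["BBC One"], ["ITV"], ["Channel 4"]]

# Explicit loop, as in the original script.
channel_blocks.each do |block|
  puts block
end

# puts flattens arrays itself and writes one element per line,
# so this prints the same three lines.
puts channel_blocks
```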

David

On Sun, 11 May 2003, Robo wrote:


David Alan Black
home: dblack@superlink.net
work: blackdav@shu.edu
Web: http://pirate.shu.edu/~blackdav

dblack@superlink.net wrote:

If you change .* to .*? you should be in business.

Oh dear, looks like I removed that part from the original code posted by
Robo when testing. My regexp-nuby status is showing. Guess a good RE
book should be on my wishlist for my birthday then. :-)

Beware that regexp syntax and features vary from language to language, so
don’t expect the more esoteric examples to work in Ruby without some
modification.
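
One concrete instance of such a difference, sketched with a made-up two-line string: in Ruby, ^ and $ always match at line boundaries (\A and \z anchor to the whole string), and the /m flag is what lets . match a newline, a job Perl gives to /s.

```ruby
# Hypothetical two-line input.
text = "first\nsecond"

# ^ and $ match at line boundaries in Ruby without any flag.
line_anchor = (text =~ /^second$/)

# \A and \z anchor to the start and end of the whole string instead.
string_anchor = (text =~ /\Asecond\z/)

# "." skips newlines by default; Ruby's /m flag makes it match them.
dot_plain = (text =~ /first.second/)
dot_multiline = (text =~ /first.second/m)

p line_anchor     # 6
p string_anchor   # nil
p dot_plain       # nil
p dot_multiline   # 0
```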

You might want to try reading the documentation for “PCRE”, the
Perl-Compatible Regular Expressions library by Philip Hazel, author of
Exim. He writes excellent documents! Find it on freshmeat.net.

Regards,

Brian.

On Sun, May 11, 2003 at 04:11:35PM +0900, Kent Dahl wrote:

Oh dear, looks like I removed that part from the original code posted by
Robo when testing. My regexp-nuby status is showing. Guess a good RE
book should be on my wishlist for my birthday then. :-)

Hi –

On Sun, 11 May 2003, Brian Candler wrote:

On Sun, May 11, 2003 at 04:11:35PM +0900, Kent Dahl wrote:

Oh dear, looks like I removed that part from the original code posted by
Robo when testing. My regexp-nuby status is showing. Guess a good RE
book should be on my wishlist for my birthday then. :-)

Beware that regexp syntax and features vary from language to language, so
don’t expect the more esoteric examples to work in Ruby without some
modification.

I’d recommend “Mastering Regular Expressions” by Jeff Friedl for the
wishlist; it takes those variations into account, and the 2nd edition
includes Ruby.

David

