cskilbeck
(cskilbeck)
19 November 2007 18:45
1
Hi,
I need to extract everything between <table> and </table> on a website
(there's only one table on the page). So far I have:
require 'open-uri'
page = open('http://xxx.html ').read
page.gsub!(/\n/,"")
page.gsub!(/\r/,"")
inner = page.scan(%r{.*<table.*>(.*)</table>.*}m)
print inner
but inner is empty - any ideas?
If I substitute line 2 with
page = '123<table>456</table>789'
I get inner = 456, which is correct.
A_LeDonne
(A LeDonne)
19 November 2007 18:55
2
> Hi,
> I need to extract everything between <table> and </table> on a website
> (there's only one table on the page). So far I have:
> require 'open-uri'
> page = open('http://xxx.html ').read
> page.gsub!(/\n/,"")
> page.gsub!(/\r/,"")
> inner = page.scan(%r{.*<table.*>(.*)</table>.*}m)
Untested, but try:
inner = page.scan(%r{.*<table[^>]*>(.*)</table>.*}m)
> print inner
> but inner is empty - any ideas?
> If I substitute line 2 with
> page = '123<table>456</table>789'
> I get inner = 456, which is correct.
If you try page = '123<table><tr><td>456</td></tr></table>789', it
will fail again.
You only want the opening tag to match up to the next closing angle
bracket. What's happening is that the second .* is greedy: it matches the
contents of the entire table, up to the closing angle bracket of the last
tag (probably </tr>) right before the </table>, and inner gets only the
leftover whitespace in between. So after <table, match only characters that
are NOT a closing angle bracket.
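A small sketch of the difference, using a made-up sample string in place of the real page (the border attribute and the cell contents are only illustrative):

```ruby
# Hypothetical sample input standing in for the downloaded page.
page = '123<table border="1"><tr><td>456</td></tr></table>789'

# Greedy: "<table.*>" runs on to the last ">" before "</table>",
# so nothing is left over for the capture group.
greedy = page.scan(%r{.*<table.*>(.*)</table>.*}m)

# Negated class: "[^>]*" cannot cross a ">", so it stops at the end
# of the opening tag and the group gets the table contents.
bounded = page.scan(%r{.*<table[^>]*>(.*)</table>.*}m)

p greedy   # [[""]]
p bounded  # [["<tr><td>456</td></tr>"]]
```

Note that scan returns an array of capture arrays, so the table contents are at bounded[0][0].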
-Alex
···
On Nov 19, 2007 1:45 PM, cskilbeck <charlieskilbeck@gmail.com> wrote:
Use the right tool for the job:
require 'hpricot'
require 'open-uri'
doc = Hpricot(open('http://xxx.html '))
table = doc.at('table')
puts table.inner_html
(not tested)
regards,
--
Rolando Abarca
Phone: +56-9 97851962
W_James
(W. James)
19 November 2007 19:15
4
inner = page[ %r{<table.*?>(.*?)</table>}mi, 1]
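To see what that one-liner does, here is a minimal sketch run against made-up strings rather than a real page (the sample markup is an assumption for illustration only):

```ruby
# Hypothetical sample input: upper-case tag, attribute, and newlines.
page = "intro<TABLE class=\"x\">\n<tr><td>456</td></tr>\n</table>outro"

# ".*?" matches as few characters as possible, /m lets "." match
# newlines, /i ignores tag case, and the trailing 1 asks String#[]
# for capture group 1 instead of the whole match.
inner = page[%r{<table.*?>(.*?)</table>}mi, 1]
p inner  # "\n<tr><td>456</td></tr>\n"

# With more than one table, the non-greedy group stops at the first </table>.
first = "a<table>1</table>b<table>2</table>"[%r{<table.*?>(.*?)</table>}mi, 1]
p first  # "1"

# When nothing matches, String#[] returns nil rather than raising.
missing = "no tables here"[%r{<table.*?>(.*?)</table>}mi, 1]
p missing  # nil
```

Because /m already lets "." cross line breaks, stripping \n and \r beforehand should not be needed with this pattern.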
cskilbeck
(cskilbeck)
19 November 2007 21:00
5
Thanks all for your help. Non-greedy matching is the key.
Thufir
(Thufir)
20 November 2007 07:14
6
Amazing -- I thought that the above would be a massive project, not what
appears to be pseudo-code! Not everything in Ruby is magically easy, but
the above is pretty good.
-Thufir
···
On Tue, 20 Nov 2007 04:00:35 +0900, Rolando Abarca wrote:
require 'hpricot'
require 'open-uri'
doc = Hpricot(open('http://xxx.html '))
table = doc.at('table')
puts table.inner_html