Scan HTML

Tom_Arra · 1 March 2008 03:22

I am new to Ruby scripting, so I am not sure whether this is possible.
I want to write a script that loads a webpage and then searches
through the HTML of that page until it hits a certain tag. Once it hits
that tag, it needs to grab all of the text between the tag and the
matching end tag. Is something like this possible?

Example
<html>
<body>
<h3>test</h3>
</body>
</html>

I want the script to return "test"

--
Posted via http://www.ruby-forum.com/.

Gregory_Seidman · 1 March 2008 03:34

You want the Hpricot gem.

require 'rubygems'
require 'hpricot'

html = <<EOF
<html>
<body>
<h3>test</h3>
</body>
</html>
EOF

doc = Hpricot(html)

puts((doc/'h3').first.inner_text)

--Greg


W_James · 1 March 2008 04:49

require 'net/http'
puts Net::HTTP.new('www.google.com').get('/').
body[ %r{<title\s*>(.*?)</title\s*>}mi, 1 ]

If the tag can contain attributes, e.g.,
<title foo="bar">:

require 'net/http'
puts Net::HTTP.new('www.google.com').get('/').
body[ %r{<title(?:\s*|\s+.*?)>(.*?)</title\s*>}mi, 1 ]
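
The same idea can be tried offline, without fetching a live page. This is only a minimal sketch: the helper name tag_text and the sample markup are made up for illustration, and the pattern is a slightly tightened variant of the one above, not a general HTML parser.

```ruby
# Hypothetical helper: returns the text inside the first <tag>...</tag>
# pair, allowing optional attributes on the opening tag.
def tag_text(html, tag)
  name = Regexp.escape(tag)
  html[%r{<#{name}(?:\s[^>]*)?>(.*?)</#{name}\s*>}mi, 1]
end

# Sample markup invented for this example.
sample = %(<html><head><title foo="bar">Hello</title></head>
<body><h3>test</h3></body></html>)

puts tag_text(sample, 'title')  # the attribute on <title> is skipped
puts tag_text(sample, 'h3')
p tag_text(sample, 'h1')        # no <h1> in the sample, so nil
```

String#[] with a regexp and a group index returns nil when nothing matches, so the caller can check for a missing tag before printing.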


Steve_Dame · 1 March 2008 04:53

Google Hpricot and Mechanize!


W_James · 1 March 2008 04:55

Gregory Seidman wrote:
> You want the Hpricot gem.

No, he doesn't.

Tom_Arra · 1 March 2008 12:44

William James wrote:

require 'net/http'
puts Net::HTTP.new('www.google.com').get('/').
body[ %r{<title\s*>(.*?)</title\s*>}mi, 1 ]

If the tag can contain attributes, e.g.,
<title foo="bar">:

require 'net/http'
puts Net::HTTP.new('www.google.com').get('/').
body[ %r{<title(?:\s*|\s+.*?)>(.*?)</title\s*>}mi, 1 ]

So far I think this is closest to what I am looking for. I need to go to
a website that has server information and pull that out of the HTML,
then take that info and show it to the user. If I am
understanding the code above, it at least does the first part, which I
had no clue how to do.

Marc_Heiler · 1 March 2008 06:52

You want the Hpricot gem.

Personally I agree with that, insofar as I think the simplest,
"default" Ruby solution is better than a specialized one. In this case I
think the better solution is Net::HTTP.


Todd_Benson · 1 March 2008 07:00

Same question, different people, same strict requirements. It sounds
a little like homework. In that case, I suppose some of the regexp
solutions provided will work (for this small use case).

I still think Florian said it best, though. Unless you can "stack",
you won't be able to correctly reveal the components inside a nested
language structure. I haven't looked into the theory, but I can
attest to the pain in the arse I've had trying to scrape with regular
expressions.
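
The nesting problem is easy to reproduce. A small sketch, with a helper name (div_body) and markup made up for illustration, showing that neither a non-greedy nor a greedy pattern recovers the contents of the outer element:

```ruby
# Hypothetical helper: extracts the body of the first <div> with either
# a greedy or a non-greedy pattern.
def div_body(html, greedy)
  pattern = greedy ? %r{<div>(.*)</div>}m : %r{<div>(.*?)</div>}m
  html[pattern, 1]
end

# Invented sample: one <div> nested inside another, then a sibling <div>.
html = '<div>outer <div>inner</div> tail</div><div>next</div>'

# Non-greedy: stops at the first </div>, cutting the outer element short.
puts div_body(html, false)

# Greedy: runs to the last </div>, swallowing the sibling element too.
puts div_body(html, true)
```

Matching the right closing tag means counting open tags, which is what a parser's stack does.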

Todd


Tom_Arra · 1 March 2008 13:56

Well I just tried it and it worked like a charm. My next thing is to
limit what it brings back.

Example
<h3>blah blah blah 7.0.0.3.4 blah blah blah</h3>

I want to pull just the 7.0.0.3.4 and none of the words. I am sure this
is going to have to deal with more regular expressions but I never
really understood how to use them well.


W_James · 1 March 2008 15:04

E:\>irb --prompt xmp
s = " <h3>blah blah 7.0.0.3.4 blah</h3>"
==>" <h3>blah blah 7.0.0.3.4 blah</h3>"
# Find a substring composed of numerals and dots that is
# at least 3 characters long.
s[ /[\d.]{3,}/ ]
==>"7.0.0.3.4"
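
Note that [\d.]{3,} also accepts runs such as "..." or "1234". A slightly stricter sketch, assuming the version always has at least two dot-separated numeric parts (the helper name extract_version is made up here):

```ruby
# Hypothetical helper: pulls the first dotted number (e.g. 7.0.0.3.4)
# out of a string, or returns nil if there is none.
def extract_version(text)
  text[/\d+(?:\.\d+)+/]
end

puts extract_version('<h3>blah blah 7.0.0.3.4 blah</h3>')
p extract_version('<h3>no version here... 2008</h3>')
```

The second call returns nil because neither the ellipsis nor the bare year has the digits-dot-digits shape.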


Tom_Arra · 1 March 2008 18:06

You're really good at this stuff! One thing I noticed is that it works
perfectly for a plain domain name, but as soon as I put a full URL into
the Net::HTTP.new call it starts to throw errors. Any ideas?

Tom_Arra · 1 March 2008 18:44

Here's what I have so far:

#! /usr/bin/ruby
require 'net/http'

text = Net::HTTP.new('www.tomarra.com').get('/').body[
%r{<title\s*>(.*?)</title\s*>}mi, 1 ]
print "TomArra.com Title Tag: "
print text
print "\n"
s = "<h3>blah blah 7.0.0.4.3 blah blah</h3>"[ /[\d.]{3,}/ ]
print s

puts Net::HTTP.new('www.tomarra.com/credits.html').get('/').body[
%r{<center\s*>(.*?)</center\s*>}mi, 1 ]

and here is my output
TomArra.com Title Tag: Welcome To TomArra.com
7.0.0.4.3
SocketError: getaddrinfo: nodename nor servname provided, or not known

method initialize in http.rb at line 564
method open in http.rb at line 564
method connect in http.rb at line 564
method timeout in timeout.rb at line 48
method timeout in timeout.rb at line 76
method connect in http.rb at line 564
method do_start in http.rb at line 557
method start in http.rb at line 546
method request in http.rb at line 1044
method get in http.rb at line 781
at top level in simple.rb at line 11
Program exited.


W_James · 1 March 2008 18:50

Use the rest of the URL as the argument for ".get()":

require 'net/http'
puts Net::HTTP.new('www.newlisp.org').get('/index.cgi?Documentation').
body[ %r{<title(?:\s*|\s+.*?)>(.*?)</title\s*>}mi, 1 ]
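
If the address arrives as one full URL, the standard uri library can split it into the pieces Net::HTTP expects. A minimal sketch; the helper name split_url is made up for illustration, and no network request is made:

```ruby
require 'uri'

# Hypothetical helper: splits a full URL into host, port, and the
# request path (including any query string) for Net::HTTP.
def split_url(url)
  uri = URI.parse(url)
  [uri.host, uri.port, uri.request_uri]
end

host, port, path = split_url('http://www.newlisp.org/index.cgi?Documentation')
puts host
puts port
puts path
# These could then be passed on as Net::HTTP.new(host, port).get(path).
```

Passing only the host name to Net::HTTP.new avoids the getaddrinfo error, which appears when the path is included in what should be a host name.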


Tom_Arra · 1 March 2008 19:23

Works like a charm. Thanks for all your help!

Tom_Arra · 3 March 2008 22:50

One more little problem. I noticed that this net/http method automatically
uses port 80. The problem is that I need to connect to a different port.
There has to be a way around this, right?


Tom_Arra · 3 March 2008 23:04

Never mind, I just figured it out:

puts Net::HTTP.new('<<Server Here>>',<<port # here>>)
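
Spelled out with concrete values, that looks roughly like the sketch below. The host example.com and port 8080 are placeholders, not values from this thread; creating the object does not open a connection until a request is sent.

```ruby
require 'net/http'

# Placeholder host and port, chosen only for illustration.
http = Net::HTTP.new('example.com', 8080)

puts http.address
puts http.port
# A later http.get('/some/path') would connect to port 8080 instead of 80.
```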
