How can I count the number of elements in an HTML page?

Hi there, I'm using net/http to retrieve some html pages and now I
want to count the number of items in a list on the page. The
response.body is stored as a string.

The HTML looks something like this:

----
<div class="first section">
   <h1>
      Section Heading I'm interested in:
   </h1>
   <ul>
      <li>
         <form action="foo" method="post" name="">
           <button type="submit">foo</button>
         </form>
      </li>
      <li>
         <form action="bar" method="post" name="">
            <button type="submit">bar</button>
         </form>
      </li>
   </ul>
</div>

<div class="next section">
----

So what I want to do is count the number of li's in a particular div
section. In this case the answer is 2. It might be more, it might be
0.

I can find the section I want with a regex but I don't know how to
iterate through the string looking for particular elements. I was
thinking about taking the section I'm interested in and saving it as
an array and then iterating through each array element (html line)
that way, but I thought there might be a quicker way to do it.

suggestions?

TIA

I'd use Nokogiri. Off the top of my head, it would be something like (untested):

require 'nokogiri'

html_string=<<END
#[your html]
END

doc = Nokogiri::HTML(html_string)
puts doc.search("/div/ul/li").size

Maybe you will need to adjust the xpath search, but I think it should
be something like that.

Hope this helps,

Jesus.

On Tue, Oct 5, 2010 at 10:45 PM, Paul <tester.paul@gmail.com> wrote:

[...snip...]

If you aren't concerned with the content of each row, perhaps you could just count the number of lines in the array? I guess you would have to read the page in line by line, but that isn't too hard.

---- Paul <tester.paul@gmail.com> wrote:

[...snip...]

Nokogiri allows direct access to HTML elements as data. I use it a lot
in my work. Try something like:

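require 'nokogiri'
page = Nokogiri::HTML(response.body)
count = 0
# Select the div whose class attribute is exactly "first section"
page.xpath("//div[@class='first section']").each do |element|
  # "./ul/li" is relative to the matched div; add the number of list items
  count += element.xpath("./ul/li").size
end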

Or something along those lines... (I didn't test this first).

On Tue, 2010-10-05 at 15:45 -0500, Paul wrote:

[...snip...]

> I can find the section I want with a regex but I don't know how to
> iterate through the string looking for particular elements. I was
> thinking about taking the section I'm interested in and saving it as
> an array and then iterating through each array element (html line)
> that way, but I thought there might be a quicker way to do it.

$html.scan(%r{<div.*first section.*</div>}m).to_s.scan(/<li>/).size

--
Posted via http://www.ruby-forum.com/.

I am now at my computer so I can test this. It seems that
Nokogiri::HTML yields a complete HTML document, adding <html> and
<body> tags around the fragment, so this works:

irb(main):002:0> require 'nokogiri'
=> true
irb(main):003:0> html_string =<<END
irb(main):004:0" <div class="first section">
irb(main):005:0" <h1>
irb(main):006:0" Section Heading I'm interested in:
irb(main):007:0" </h1>
irb(main):008:0" <ul>
irb(main):009:0" <li>
irb(main):010:0" <form action="foo" method="post" name="">
irb(main):011:0" <button type="submit">foo</button>
irb(main):012:0" </form>
irb(main):013:0" </li>
irb(main):014:0" <li>
irb(main):015:0" <form action="bar" method="post" name="">
irb(main):016:0" <button type="submit">bar</button>
irb(main):017:0" </form>
irb(main):018:0" </li>
irb(main):019:0" </ul>
irb(main):020:0" </div>
irb(main):021:0" END
[...snip...]
irb(main):034:0> doc.search("/html/body/div/ul/li").size
=> 2

Hope this helps,

Jesus.

2010/10/5 Jesús Gabriel y Galán <jgabrielygalan@gmail.com>:

[...snip...]

Thanks Steel. This worked fine. I just needed to make it a lazy
search with .*?

I've got nothing against Nokogiri or the other solutions but I was
hoping for a solution like this that just uses the core libraries for
portability.
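
A rough sketch of what the lazy variant might look like, using only core Ruby. This is an illustration only: the sample `html` string, the variable names, and the more tolerant `<li\b[^>]*>` pattern are assumptions, not the exact code from this thread.

```ruby
# `html` stands in for response.body; the markup below is a made-up sample.
html = <<~HTML
  <div class="first section">
    <h1>Section Heading I'm interested in:</h1>
    <ul>
      <li><form action="foo" method="post" name=""><button type="submit">foo</button></form></li>
      <li><form action="bar" method="post" name=""><button type="submit">bar</button></form></li>
    </ul>
  </div>

  <div class="next section">
    <ul>
      <li>not counted</li>
    </ul>
  </div>
HTML

# String#[] with a Regexp returns the matched text, or nil when nothing matches.
# The lazy .*? stops at the first </div> instead of running on to the last one.
section = html[%r{<div[^>]*first section.*?</div>}m] || ""

# Accept attributes or whitespace inside the opening tag (<li>, <li class="x">, <li >).
li_count = section.scan(/<li\b[^>]*>/).size

# For comparison: the greedy pattern swallows the following section as well.
greedy_count = html[%r{<div.*first section.*</div>}m].to_s.scan(/<li\b[^>]*>/).size

puts "lazy: #{li_count}, greedy: #{greedy_count}"
```

The lazy match still ends at the first </div> it sees, so a <div> nested inside the section would cut it short. It remains a quick-and-dirty approach.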

Cheers! Paul.

On Oct 5, 9:33 pm, Steel Steel <angel_st...@ymail.com> wrote:

[...snip...]

Hi,

The "jazzez" gem lists the count for each and every HTML tag.

jazzez uses the mechanize and Hpricot libraries.

For more details, see http://jazzez.wordpress.com

Thanks
Raveendran


> I can find the section I want with a regex but I don't know how to
> iterate through the string looking for particular elements. I was
> thinking about taking the section I'm interested in and saving it as
> an array and then iterating through each array element (html line)
> that way, but I thought there might be a quicker way to do it.

> $html.scan(%r{<div.*first section.*</div>}m).to_s.scan(/<li>/).size
>
> Thanks Steel. This worked fine. I just needed to make it a lazy
> search with .*?
>
> I've got nothing against Nokogiri or the other solutions but I was
> hoping for a solution like this that just uses the core libraries for
> portability.

You have to be careful, then, about all the possible combinations that
make a valid HTML but make the above regexp fail:

irb(main):025:0> string=<<EOS
irb(main):026:0" <html><body><div class="first section"><ul><li >something</li><li>something else</li></ul></div></body></html>
irb(main):027:0" EOS
=> "<html><body><div class=\"first section\"><ul><li >something</li><li>something else</li></ul></div></body></html>\n"

irb(main):029:0> string.scan(%r{<div.*first section.*</div>}m).to_s.scan(/<li>/).size
=> 1

irb(main):031:0> require 'nokogiri'
=> true
irb(main):032:0> doc = Nokogiri::HTML(string)
=> #<Nokogiri::HTML::Document:0x..fdb940e66 name="document"
children=[#<Nokogiri::XML::DTD:0x..fdb940ce0 name="html">,
#<Nokogiri::XML::Element:0x..fdb940cb8 name="html"
children=[#<Nokogiri::XML::Element:0x..fdb94043e name="body"
children=[#<Nokogiri::XML::Element:0x..fdb940268 name="div"
attributes=[#<Nokogiri::XML::Attr:0x..fdb9401c8 name="class"
value="first section">]
children=[#<Nokogiri::XML::Element:0x..fdb93fe8a name="ul"
children=[#<Nokogiri::XML::Element:0x..fdb93fcbe name="li"
children=[#<Nokogiri::XML::Text:0x..fdb93fb2e "something">]>,
#<Nokogiri::XML::Element:0x..fdb93fa70 name="li"
children=[#<Nokogiri::XML::Text:0x..fdb93f91c "something
else">]>]>]>]>]>]>
irb(main):033:0> doc.search("/html/body/div/ul/li").size
=> 2

In general: parsing HTML with regexp can get messy. Best leave the
work to a proper library that handles all the strange nuances.

Jesus.

On Fri, Oct 8, 2010 at 11:10 PM, Paul <tester.paul@gmail.com> wrote:

On Oct 5, 9:33 pm, Steel Steel <angel_st...@ymail.com> wrote:

I would try REXML, then. It's an XML parser in the standard library.
http://ruby-doc.org/stdlib/libdoc/rexml/rdoc/index.html

I'd reserve regex parsing of XML for very informal situations where I
just need a quick, non-rigorous solution (i.e. a one-time solution that
I plan to verify personally). I am pretty sure that it is not possible
to correctly parse XML with regex.
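
A minimal sketch of the REXML route, assuming the section of interest is well-formed XML. The sample markup and the variable names are made up for illustration; `REXML::Document` and `REXML::XPath.match` are the library calls used.

```ruby
require 'rexml/document'

# Assumption: the fragment is well-formed XML with a single root element.
xml = <<~XML
  <div class="first section">
    <h1>Section Heading I'm interested in:</h1>
    <ul>
      <li><form action="foo" method="post" name=""><button type="submit">foo</button></form></li>
      <li><form action="bar" method="post" name=""><button type="submit">bar</button></form></li>
    </ul>
  </div>
XML

doc = REXML::Document.new(xml)

# XPath.match returns an array of the matching nodes.
items = REXML::XPath.match(doc, "//div[@class='first section']/ul/li")
rexml_count = items.size

puts rexml_count
```

REXML is strict about well-formedness, so pages with unclosed or mismatched tags will raise a parse error instead of being repaired the way an HTML parser would. On newer Rubies (3.0 and later) REXML ships as a bundled gem rather than a default library, so it may need to be listed in a Gemfile.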

On Fri, Oct 8, 2010 at 4:10 PM, Paul <tester.paul@gmail.com> wrote:

[...snip...]

For more details :

3. Get the Html tags

Ex.

require 'jazzez'
output = Jazzez.new
output.tagdetails("google.com")

Output:

1<html tag(s)
1</html> tag(s)
1<head tag(s)
1</head> tag(s)
1<body tag(s)
1</body> tag(s)
2<table tag(s)
2</table> tag(s)
3<tr tag(s)
3</tr> tag(s)
9<td tag(s)
9</td> tag(s)
0<th tag(s)
0</th> tag(s)
0<l tag(s)
0</l> tag(s)
0<link tag(s)
1<p tag(s)
1</p> tag(s)
4<div tag(s)
4</div> tag(s)
0<span tag(s)
0</span> tag(s)
4<script tag(s)
4</script> tag(s)
0<ul tag(s)
0</ul> tag(s)
0<ol tag(s)
0</ol> tag(s)
16<a tag(s)
15</a> tag(s)
0<h1 tag(s)
0</h1> tag(s)
0<h2 tag(s)
0</h2> tag(s)
0<h3 tag(s)
0</h3> tag(s)
0<h4 tag(s)
0</h4> tag(s)
0<h5 tag(s)
0</h5> tag(s)
0<h6 tag(s)
0</h6> tag(s)
4<font tag(s)
4</font> tag(s)
0<select tag(s)
0</select> tag(s)
0<option tag(s)
0</option> tag(s)

Thanks
Raveendran
