Hello,
On several sites (probably malformed HTML/JavaScript/XML/general parsing
hell) I have the following problem.
For example:
moonwolf@trantor:~/ruby$ irb
irb(main):001:0> ['rubygems','nokogiri','hpricot','open-uri'].each { |r| require r }
=> ["rubygems", "nokogiri", "hpricot", "open-uri"]
irb(main):002:0> doc=Nokogiri(open("http://maps.google.com/"))
=> <?xml version="1.0"?>
<!DOCTYPE html>
<html/>
irb(main):003:0> doc/"a"
=>
Same with Nokogiri.Hpricot:
irb(main):004:0> doc=Nokogiri.Hpricot(open("http://maps.google.com/"))
=> <?xml version="1.0"?>
<!DOCTYPE html>
<html/>
However with regular Hpricot:
irb(main):009:0> (Hpricot(open("http://maps.google.com/"))/"a").size
=> 53
(the full post of course is too long, so I just showed something simpler)
Hpricot by itself of course works. I tried looking and there's not much by
way of documentation or blogs on something like this.
Any suggestions/explanations will be welcome as I like Nokogiri's speed very
much.
I am using:
moonwolf@trantor:~/ruby$ gem list --local | grep -i nokogiri
nokogiri (1.2.3)
moonwolf@trantor:~/ruby$ ruby --version
ruby 1.8.6 (2008-03-03 patchlevel 114) [i686-linux]
Jayanth
Nokogiri detects the XML header and parses it as XML. If you force it
to use the HTML parser, you may be more successful:
>> (Nokogiri::HTML(open("http://maps.google.com/"))/'a').length
=> 53
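To make the parser choice explicit rather than leaving it to auto-detection, one option is to read the response into a string first and inspect how it starts. The sketch below is only an illustration: `html_like?` is a hypothetical helper, not part of Nokogiri, and the 512-character window and the patterns it checks are assumptions.

```ruby
# Hypothetical helper (not part of Nokogiri): guess whether markup is HTML
# by looking only at the beginning of the text.
def html_like?(markup)
  head = markup[0, 512].to_s
  # Skip an optional XML declaration such as <?xml version="1.0"?>
  head = head.sub(/\A\s*<\?xml[^>]*\?>/i, '')
  # Treat an HTML doctype or a leading <html> tag as HTML
  !!(head =~ /\A\s*(<!DOCTYPE\s+html|<html[\s>])/i)
end

xhtml = %(<?xml version="1.0"?>\n<!DOCTYPE html>\n<html><body><a href="/">home</a></body></html>)
feed  = %(<?xml version="1.0"?>\n<rss version="2.0"><channel/></rss>)

puts html_like?(xhtml)  # true
puts html_like?(feed)   # false
```

With a check like this, a caller could hand the string to Nokogiri::HTML when it returns true and to Nokogiri::XML otherwise.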
(In reply to Srijayanth Sridhar's message of Thu, May 07, 2009 at 03:45:28PM +0900.)
--
Aaron Patterson
http://tenderlovemaking.com/
Whoops,
irb(main):015:0> (Nokogiri::HTML(open("http://maps.google.com/"))/'a').length
=> 0
Not sure what the deal is.
Jayanth
On Thu, May 7, 2009 at 12:35 PM, Srijayanth Sridhar <srijayanth@gmail.com> wrote:
Thanks Aaron.
Jayanth