How to use ReXML "in the wild"?

Kenneth_McDonald · 16 December 2008 01:23

I'd very much like to use REXML's XPath features to extract info from Google's financial info pages, but I find that REXML chokes on the JavaScript. Here's the result of trying to read in a page with this bit of code:

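require 'rexml/document'
require 'net/http'

# Fetch the page over HTTP and hand the raw HTML body to REXML.
Net::HTTP.start('finance.google.com') do |http|
  response = http.get('/finance?fstype=ii&q=NYSE:WAT')
  # This is the call that raises REXML::ParseException (see output).
  rdoc = REXML::Document.new(response.body)
end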


==========
Output:

/usr/local/lib/ruby/1.8/rexml/parsers/treeparser.rb:92:in `parse': #<RuntimeError: Illegal character '&' in raw string " (REXML::ParseException)
(function(){
var d=navigator.userAgent.toLowerCase().indexOf("msie")!=-1;function e(){var b=document.styleSheets;for(var a=b.length-1;a>=0;--a){var c=b[a].href;if(c)if(c.indexOf("styles/finance_")!=-1||c.indexOf("styles_")!=-1)return b[a]}return null}function f(){var b=e();if(b){var a=b.rules;return a.length>0&&a[a.length-1].selectorText==".lastFinanceRule"}return false}
function g(){if(document.scripts)for(var b=0;b">
/usr/local/lib/ruby/1.8/rexml/text.rb:91:in `initialize'
.

Is there a good way to get around this problem? If not, I guess it's back to regular expressions...

Thanks,
Ken

Peter_Szinek3 · 16 December 2008 01:40

Hi Kenneth,

I'd very much like to use ReXML's XPATH features to extract info from
Google's financial info pages, but find that Rexml chokes on the
Javascript, here's the result of trying to read in a page with this
bit of code:

Don't try that. REXML in the wild == epic FAIL. At this level, you might
want to try Hpricot or Nokogiri. At a bit higher level, there is scRUBYt!
You can read about web scraping in Ruby here (my most successful article
ever; it was even mentioned in Learning Ruby from O'Reilly):

http://www.rubyrailways.com/data-extraction-for-web-20-screen-scraping-in-rubyrails/

Is there a good way to get around this problem? If not, I guess it's
back to regular expressions...

Web scraping with regular expressions is almost never a good idea.

Try scRUBYt!:

require 'rubygems'
require 'scrubyt'

data = Scrubyt::Extractor.define do
  fetch 'http://finance.google.com/finance?fstype=ii&q=NYSE:WAT'

  body '/html/body' do
    revenue '/div[4]/div[2]/table/tr[2]' do
      ending_9_27 '/td[2]'
      ending_6_28 '/td[3]'
    end

    gross_profit '/div[4]/div[2]/table/tr[2]' do
      ending_9_27 '/td[2]'
    end
  end
end

puts data.to_xml

output:

HTH,
Peter


___
http://scrubyt.org
http://www.rubyrailways.com

Phlip1 · 16 December 2008 03:42

Kenneth McDonald wrote:

I'd very much like to use ReXML's XPATH features to extract info from Google's financial info pages, but find that Rexml chokes on the Javascript, here's the result of trying to read in a page with this bit of code:

I have studied REXML for many years, and I still can't figure out how to get it to recognize an &mdash; or similar advanced entity.

Like the other responder said, give up while you still can. libxml-ruby is also stable enough to give a shot - oh yeah, except it crashes on non-tiny inputs.

Aaaand...

/usr/local/lib/ruby/1.8/rexml/parsers/treeparser.rb:92:in `parse': #<RuntimeError: Illegal character '&' in raw string

That's because REXML and your web browser disagree on the definition of well-formed. Your browser accepts a naked & inside a JavaScript tag, but REXML does not. REXML is technically correct, and your browser would have accepted &amp;&amp; here, but...

a.length>0&&a[a.length-1].selectorText==".lastFinanceRule"}return false}

...browsers cannot correctly interpolate &amp; appearing inside JavaScript literal strings, because some lowlife coder using Notepad might have actually wanted "&amp;" when they wrote "&amp;" - such as with document.write().

So, because REXML cannot accept normal HTML, due to hits and misses of standards compliance on all sides - you are better off with a dedicated parser!
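
To make the disagreement concrete, here is a minimal sketch using only REXML. The script fragment and the helper name parse_ok? are made up for illustration and are not taken from the Google page. It assumes a Ruby where 'rexml/document' can be required. It feeds REXML the same one-line script three ways: with a naked &&, with each & escaped as &amp;, and wrapped in a CDATA section.

```ruby
require 'rexml/document'

# Hypothetical fragment: a script body with a bare "&&", as browsers accept it.
raw = '<script>if (a.length>0&&ok) { run(); }</script>'

# The same fragment with every "&" escaped, which well-formed XML requires.
escaped = raw.gsub('&', '&amp;')

# The same fragment again, left unescaped but wrapped in a CDATA section.
cdata = '<script><![CDATA[if (a.length>0&&ok) { run(); }]]></script>'

# Returns true if REXML accepts the string, false if it raises a parse error.
def parse_ok?(xml)
  REXML::Document.new(xml)
  true
rescue REXML::ParseException
  false
end

puts parse_ok?(raw)      # false: the naked "&" is rejected
puts parse_ok?(escaped)  # true
puts parse_ok?(cdata)    # true

# Reading the text back turns "&amp;" into "&" again.
puts REXML::Document.new(escaped).root.text
```

Real pages are not going to escape their scripts for you, which is why the HTML-aware parsers mentioned earlier in the thread are the practical route.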


--
Phlip
