Easy way for a Nuub to get link-element from a html-source

Marcus_Strube · 26 November 2007 10:23

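Hi all,

I'm very new to Ruby and I'm not sure of the easiest way to do this. I
want to read the content of a page, e.g. "www.spiegel.de", and pick out
just this line:

<link rel="alternate" type="application/rss+xml" title="SPIEGEL ONLINE
als RSS-Feed" href="http://www.spiegel.de/schlagzeilen/rss/index.xml" />

and from that line the "title" and the "href".

Since the order of the attributes in "link" is not guaranteed, a regexp
doesn't look like the first choice, and I couldn't find an HTML::Parse.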

--
Posted via http://www.ruby-forum.com/.

Lee_Jarvis2 · 26 November 2007 10:56

Marcus Strube wrote:

since the order in "link" is not sure, it doesnt look like regexp is the
first choice. and i couldn't find a HTML::Parse.

Check out hpricot.

http://code.whytheluckystiff.net/hpricot/

Regards,
Lee
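
For readers who cannot install a parser gem, here is a minimal sketch of how the attribute-order concern from the question can be sidestepped with the standard library alone: first grab each <link> tag, then collect its attributes into a Hash, so their order no longer matters. The helper names link_attributes and feed_links and the sample HTML string are assumptions made for illustration; they come from neither Hpricot nor this thread. The pattern is deliberately naive (it assumes quoted attribute values and no ">" inside a value), so a real parser such as Hpricot remains the safer choice for arbitrary pages.

```ruby
# Sketch only: link_attributes and feed_links are hypothetical helper names.

# Matches name="value" or name='value' pairs inside a tag.
ATTRIBUTE = /([\w:-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/

# Returns one Hash of attributes per <link ...> tag found in the HTML string.
def link_attributes(html)
  html.scan(/<link\b([^>]*)>/i).flatten.map do |raw_attributes|
    raw_attributes.scan(ATTRIBUTE).each_with_object({}) do |(name, double, single), attrs|
      attrs[name.downcase] = double || single
    end
  end
end

# Keeps only rel="alternate" links and returns their title and href.
def feed_links(html)
  link_attributes(html)
    .select { |attrs| attrs['rel'] == 'alternate' }
    .map { |attrs| { title: attrs['title'], href: attrs['href'] } }
end

# Sample input with the attributes deliberately in a different order
# than in the original post. A real script would download the page first.
html = <<~HTML
  <head>
    <link rel="stylesheet" href="/css/main.css" />
    <link href="http://www.spiegel.de/schlagzeilen/rss/index.xml" title="SPIEGEL ONLINE als RSS-Feed" type="application/rss+xml" rel="alternate" />
  </head>
HTML

feed_links(html).each do |link|
  puts link[:title]
  puts link[:href]
end
```

Because the attributes end up in a Hash keyed by name, the lookup works the same way whether "href" comes before or after "title".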

Kai_Brust · 26 November 2007 11:18

How about hpricot?

http://code.whytheluckystiff.net/hpricot/

- Kai Brust

On 26.11.2007, at 11:23, Marcus Strube wrote:

hi all.

im very new to ruby and im not sure how to do this the easiest way in
ruby. i want to read the content from e.g. "www.spiegel.de" and just
this line

<link rel="alternate" type="application/rss+xml" title="SPIEGEL ONLINE
als RSS-Feed" href="http://www.spiegel.de/schlagzeilen/rss/index.xml" />

and from this line the "title" and the "href"

since the order in "link" is not sure, it doesnt look like regexp is the
first choice. and i couldn't find a HTML::Parse.

Peter_Szinek3 · 26 November 2007 11:48

Marcus Strube wrote:

hi all.

im very new to ruby and im not sure how to do this the easiest way in
ruby. i want to read the content from e.g. "www.spiegel.de" and just
this line

<link rel="alternate" type="application/rss+xml" title="SPIEGEL ONLINE
als RSS-Feed" href="http://www.spiegel.de/schlagzeilen/rss/index.xml" />

and from this line the "title" and the "href"

since the order in "link" is not sure, it doesnt look like regexp is the
first choice. and i couldn't find a HTML::Parse.

Another possibility is scRUBYt!:

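==========================================
require 'rubygems'
require 'scrubyt'

feed_data = Scrubyt::Extractor.define do
  # Load the page to extract from
  fetch 'http://www.spiegel.de/'

  # Select the <link rel="alternate"> element via XPath ...
  link "//link[@rel='alternate']" do
    # ... and read its "title" and "href" attributes
    title "title", :type => :attribute
    href "href", :type => :attribute
  end
end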

puts feed_data.to_xml

output:

==========================================
<root>
   <link>
     <title>SPIEGEL ONLINE als RSS-Feed</title>
     <href>http://www.spiegel.de/schlagzeilen/rss/index.xml</href>
   </link>
</root>

or, to_hash:

==========================================
[{:title=>"SPIEGEL ONLINE als RSS-Feed", :href=>"http://www.spiegel.de/schlagzeilen/rss/index.xml"}]

Cheers,
Peter
___
http://www.rubyrailways.com
http://scrubyt.org

Marcus_Strube · 26 November 2007 11:38

How about hpricot?

http://code.whytheluckystiff.net/hpricot/

OK, Hpricot then.

Is it just

gem install hpricot

or do I need to install this "ragel" thing too? (And if so, what is
the best way to do that?)

Marcus_Strube · 26 November 2007 13:23

Another possibility is scRUBYt!:

That looks good. Thank you!

Peter_Szinek3 · 26 November 2007 13:57

Marcus Strube wrote:

Another possibility is scRUBYt!:

That looks good. Thank you!

Hm, yeah, but the downside (as of the current version; it'll be fixed in the next one) is that the installation process is somewhat... hmm... not that easy (mainly if you are on win32). If you still decide to go for scRUBYt!, we can talk on #scrubyt @ irc.freenode.net, or you can ask your questions in the forum (http://agora.scrubyt.org).

Cheers,
Peter

