[ANN] scRUBYt! 0.2.8

This is long overdue (0.2.8 has been out for about a week already), but
anyway, here we go:


============
What’s this?

scRUBYt! is a very easy to learn and use, yet powerful web scraping
framework based on Hpricot and mechanize. Its purpose is to free you
from the drudgery of web page crawling, looking up HTML tags,
attributes, XPaths, form names and other typical low-level web scraping
woes by figuring these out from your examples, copy'n'pasted from the
web page.
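For instance (a minimal sketch - the URL and the copied examples are
made up, but the shape is the real scRUBYt! DSL):

camera_data = Scrubyt::Extractor.define do
  fetch 'http://www.example.com/cameras'   # made-up URL

  # literal examples copied from the page; scRUBYt! generalizes
  # them into XPaths behind the scenes
  item 'Canon EOS 400D' do
    price '$799.99'
  end
end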

=========
CHANGELOG

[NEW] download pattern: download the file pointed to by the
parent pattern
[NEW] checking checkboxes
[NEW] basic authentication support
[NEW] default values for missing elements (basic version)
[NEW] possibility to resolve relative paths against a custom url
[NEW] first simple version of to_csv and to_hash (see the sketch after
this list)
[NEW] complete rewrite of the exporting system (Credit: Neelance)
[NEW] first version of smart regular expressions: they are constructed
from examples, just as the XPaths are (Credit: Neelance)
[NEW] possibility to click the n-th link
[FIX] clicking on links using scRUBYt's advanced example lookup (i.e.
you can use :begins_with etc.)
[NEW] forcing writing the text of non-leaf nodes with :write_text => true
[NEW] possibility to set a custom user agent; the default user agent is
now Konqueror
[FIX] Fixed crawling to detail pages in case of leaving the
original site (Credit: Michael Mazour)
[FIX] fixed the '//' problem: if the relative URL contained two
slashes, fetching failed
[FIX] scRUBYt! assumed that documents have a list of nested elements
(Credit: Rick Bradley)
[FIX] crawling to detail pages works also if the parent pattern is
a string pattern
[FIX] shortcut URL fixed again
[FIX] regexp pattern fixed in case its parent was a string
[FIX] refactoring the core classes, lots of bugfixes and stabilization
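To give a feel for some of the new features, here is a minimal sketch
(the to_csv/to_hash calls come straight from the changelog; treat the
rest as shorthand and see scrubyt.org for the exact API):

require 'rubygems'
require 'scrubyt'

data = Scrubyt::Extractor.define do
  fetch 'http://www.google.com/ncr'   # default user agent is now Konqueror
  fill_textfield 'q', 'ruby'
  submit

  link 'Ruby Programming Language' do
    url 'href', :type => :attribute
  end
end

puts data.to_hash.inspect   # new in 0.2.8
puts data.to_csv            # new in 0.2.8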

=============
Misc comments

As of 0.2.8, scRUBYt! depends on ParseTree and Ruby2Ruby - unfortunately,
it seems ParseTree is not that trivial to set up on Windows. However, we
are currently working on a new project to solve this problem, and we are
making quite good progress, so I believe this obstacle will be blown
away by the next release, 0.3.0. Until then, Windows users should either
install scRUBYt! on Cygwin, install ParseTree somehow, or stick with
0.2.6 until we are ready with the Ruby bridge to ParseTree, which will
make installation on Windows possible without the need to compile C.

Please continue to report problems, discuss things or give any kind of
feedback on the scRUBYt! forum at

http://agora.scrubyt.org

Cheers,
Peter - on behalf of the scRUBYt! devel team

__
http://www.rubyrailways.com :: Ruby and Web2.0 blog
http://scrubyt.org :: Ruby web scraping framework
http://rubykitchensink.ca/ :: The indexed archive of all things Ruby.


Peter Szinek wrote:

As of 0.2.8, scRUBYt! depends on ParseTree and Ruby2Ruby - unfortunately (...)

Peter, I'm curious: could you tell me (only in a few words) how you are using ParseTree and Ruby2Ruby?

Regards,
Pit

Pit Capitain wrote:

Peter Szinek wrote:

As of 0.2.8, scRUBYt! depends on ParseTree and Ruby2Ruby - unfortunately (...)

Peter, I'm curious: could you tell me (only in a few words) how you are using ParseTree and Ruby2Ruby?

Well, I guess it's best to illustrate with an example:

This is a learning extractor:


=========================================================================
google_data = Scrubyt::Extractor.define do
  # navigation: load Google and search for 'ruby'
  fetch 'http://www.google.com/ncr'
  fill_textfield 'q', 'ruby'
  submit

  # scraping: the example text below is what gets generalized on export
  link "Ruby Programming Language" do
    url "href", :type => :attribute
  end
  next_page "Next", :limit => 3
end

That is, it works only for the first page of Google results for the query 'ruby'. To create a generalized extractor which can be used on any Google result page, we have to export it after it has 'learned' how to do this on the given example. Since I would like this so-called production extractor to resemble the original as much as possible, I am using Ruby2Ruby and ParseTree. With them I can get this result:

=========================================================================
google_data = Scrubyt::Extractor.define do
  fetch("http://www.google.com/ncr")
  fill_textfield("q", "ruby")
  submit

  # the copy'n'pasted example has been replaced by a generated XPath
  link("/html/body/div/div/a") do
    url("href", { :type => :attribute })
  end

  next_page("Next", { :limit => 3 })
end
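For the curious, the round-trip looks roughly like this (a minimal
sketch, assuming the ParseTree/Ruby2Ruby API of the time - Sexp.from_array
ships with ParseTree's sexp library; the rewriting step in the middle is
where the actual scRUBYt! logic lives):

require 'rubygems'
require 'parse_tree'
require 'ruby2ruby'

class LearningExtractor
  def definition
    link('Ruby Programming Language')
  end
end

# 1. Ruby source -> S-expression (AST)
sexp = Sexp.from_array(ParseTree.translate(LearningExtractor, :definition))

# 2. rewrite the AST: this is where the copy'n'pasted examples get
#    swapped for the learned XPaths (omitted in this sketch)

# 3. S-expression -> Ruby source again
puts Ruby2Ruby.new.process(sexp)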

If you can tell me any other way to achieve this, I would be really thankful. (Originally I took the source code of the learning extractors and replaced the examples with XPaths, plus whatever other modifications were required - but it became a mess after some time.)

There is a disagreement about this in the development team, too - one viewpoint is that a dependency on ParseTree and Ruby2Ruby costs us too much (mainly on Windows), and the other is that this problem simply has to be solved so that we can depend on these packages...

What would you suggest?

Cheers,
Peter

Peter Szinek wrote:

Pit Capitain wrote:

Peter, I'm curious: could you tell me (only in a few words) how you are using ParseTree and Ruby2Ruby?

Well, I guess it's best to illustrate with an example:
(...)

Thank you for the example. Let's see if I understand what you're doing: you have a given piece of Ruby code (the learning extractor) and need to manipulate it according to certain rules to create a modified version of the code (the generalized extractor).

If you can tell me any other way to achieve this, I would be really thankful. (Originally I took the source code of the learning extractors and replaced the examples with XPaths, plus whatever other modifications were required - but it became a mess after some time.)

Well, in your example it looks like you use nothing but the Scrubyt DSL. In this case you could capture the calling sequence and the method arguments when you execute the DSL. But if you want to allow normal Ruby code in the #define block (loops, conditionals, etc.), then I can't think of any other solution.
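For example, a minimal sketch of such a call recorder (all names here
are hypothetical, this is not scRUBYt! code):

class DslRecorder
  def initialize
    @calls = []
  end

  # record every call (and nested block) instead of executing it
  def method_missing(name, *args, &block)
    nested = nil
    if block
      nested = DslRecorder.new
      nested.instance_eval(&block)
    end
    @calls << [name, args, nested]
    nil
  end

  # play the recorded calls back as Ruby source
  def to_source(indent = '')
    @calls.map do |name, args, nested|
      line = "#{indent}#{name}(#{args.map { |a| a.inspect }.join(', ')})"
      line += " do\n#{nested.to_source(indent + '  ')}\n#{indent}end" if nested
      line
    end.join("\n")
  end
end

recorder = DslRecorder.new
recorder.instance_eval do
  fetch 'http://www.google.com/ncr'
  link 'Ruby Programming Language' do
    url 'href', :type => :attribute
  end
end
puts recorder.to_source

If the block contains anything but straight DSL calls, though, this
falls apart - hence my question about ParseTree.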

There is a disagreement about this in the development team, too - one viewpoint is that a dependency on ParseTree and Ruby2Ruby costs us too much (mainly on Windows), and the other is that this problem simply has to be solved so that we can depend on these packages...

What would you suggest?

If I really wanted to support Windows users, I'd try to compile ParseTree with MinGW and/or convince the maintainers to provide a binary version of the gem :-)

But my question wasn't meant as a recommendation not to use ParseTree. I'm simply interested in use cases for working with the Ruby AST.

Regards,
Pit

Hello Pit,

Thank you for the example. Let's see if I understand what you're doing: you have a given piece of Ruby code (the learning extractor) and need to manipulate it according to certain rules to create a modified version of the code (the generalized extractor).

Exactly!

Well, in your example it looks like you use nothing but the Scrubyt DSL. In this case you could capture the calling sequence and the method arguments when you execute the DSL. But if you want to allow normal Ruby code in the #define block (loops, conditionals, etc.), then I can't think of any other solution.

Yes, we are just adding this possibility (so you can do branching etc. with native Ruby code - a DSL is fine, but it always has its limits, whereas Ruby doesn't :-) ), so in the long run we have to do this (or something equally powerful).
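A tiny illustration of the kind of thing this should enable (made-up
usage, not a released API):

google_data = Scrubyt::Extractor.define do
  fetch 'http://www.google.com/ncr'
  # plain Ruby branching inside the definition - exactly the kind of
  # thing a pure call-recorder could not serialize faithfully
  if ENV['QUERY']
    fill_textfield 'q', ENV['QUERY']
  else
    fill_textfield 'q', 'ruby'
  end
  submit
end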

If I really wanted to support Windows users,

Yeah, we surely do! A lot of them are using scRUBYt!, too...

I'd try to compile ParseTree with MinGW and/or convince the maintainers to provide a binary version of the gem :-)

Well, I don't really understand why they aren't doing this anyway, since
their Windows users have the same trouble as we do (actually, we have it only because of ParseTree). This is a viable alternative if the other options do not work out (mainly the one in the next paragraph).

At the moment one team member is working on a solution called ParseTreeReloaded, which will wrap pure Ruby code around ParseTree so that no C compiling will be needed. ParseTreeReloaded can already parse its own source code, so I guess he's making great progress... Let's see.

But my question wasn't meant as a recommendation not to use ParseTree. I'm simply interested in use cases for working with the Ruby AST.

Yeah, sure - I am also in the don't-drop-ParseTree camp... however, if we cannot solve this problem permanently under Windows (though I am 99% positive we can), I'll have to look for a different solution because of the win32 people...

Cheers,
Peter

