Hey all,
I am pleased to announce that the long-awaited new release of scRUBYt!, version 0.3.4, is available for download. A lot of bugs have been fixed and some cool features scrubbed in, so be sure to check it out!
==========
scrubWHAT?
scRUBYt! is an easy-to-learn, easy-to-use, yet powerful web scraping framework based on Hpricot and mechanize (and from the next version, on FireWatir!). Its purpose is to free you from the drudgery of web page crawling, looking up HTML tags, attributes, XPaths, form names and other typical low-level web scraping woes by figuring these out from your examples copy'n'pasted from the Web page.
=========
Changelog
- [NEW] Script pattern; evaluate a custom function on the input of the pattern
- [NEW] Constant pattern; Can add constant patterns with the syntax: pattern 'Hello world', :type => :constant
- [NEW] Text pattern; structure agnostic scraping based on labels and other textual clues
- [NEW] new output method: to_flat_xml for creating feed-like flat XMLs instead of hierarchical ones
- [NEW] to_flat_xml with spec delimiters splits up the concatenated hash results
- [MOD] Change in the semantics of the "div[stuff]" style examples
* divs that contain "stuff" are matched (rather than only those whose whole text is "stuff")
* generalization is false by default
- [NEW] Possibility to define an arbitrary delimiter for to_hash (used when the result
contains commas)
- [NEW/MOD] Changes in the logging module: (Credit: Tim Fletcher)
* Extract the logging into a class to allow for filtering
* Allow the logger to be set to nil (to disable logging), and have this as the default.
Logging now has to be explicitly enabled, as follows:
Scrubyt.logger = Scrubyt::Logger.new
* Allow loggers to point to streams other than STDERR.
- [NEW/MOD] Changes in the download pattern:
* possibility to specify an array of files that should be ignored during the downloading
(e.g. 'nopicture.gif')
* Handling timeout during downloads instead of crashing
* Fixed downloading in case the filename contains no '.'
* Fixed downloading for more URL types that were not working before
- [NEW] New option: example_type. Possibility to force example type
(instead of leaving it to scRUBYt! to guess)
- [NEW] Entirely new test suite using rcov; tests are added continuously;
the goal is to achieve full coverage
- [FIX] Fixed the infamous regexp bug which caused the pricegrabber
scenario (among other things) to fail
- [FIX] Do not evaluate the detail pattern twice
- [FIX] Fixed dependencies (namely parse_tree_reloaded) and corrected their versions
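Since the logging change means you now have to switch logging on yourself, here is a minimal sketch of the "nil logger by default" idea in plain Ruby. It is only an illustration: DemoScraper and its log method are hypothetical stand-ins, not scRUBYt!'s actual code, and it uses Ruby's standard Logger rather than Scrubyt::Logger.

```ruby
require 'logger'
require 'stringio'

# Hypothetical stand-in module for illustration; NOT scRUBYt!'s actual code.
module DemoScraper
  class << self
    # Starts out as nil, so nothing is logged until a logger is assigned.
    attr_accessor :logger
  end

  # Log only when a logger has been set; otherwise stay silent.
  def self.log(message)
    logger.info(message) if logger
  end
end

# Default state: logging is disabled, so this call is a no-op.
DemoScraper.log('fetching page')

# Explicitly enable logging and point it at a stream other than STDERR.
stream = StringIO.new
DemoScraper.logger = Logger.new(stream)
DemoScraper.log('fetching page')

puts stream.string
```

Only the second call ends up in the stream. In scRUBYt! itself, the equivalent switch is the Scrubyt.logger = Scrubyt::Logger.new line from the changelog.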
=========
Read more
Some additional explanation about the new release can be found here: http://scrubyt.org/a-hot-new-release-034-is-out-whats-new
============
In the works
Paul Nikitochkin created jscRUBYt!, which should solve the win32 problems by using the J-versions of the dependencies. I have been swamped recently, so I haven't had much time to look into his code, but I am sure it will be very helpful to a lot of you, so it's on the short-term TODO list.
Glenn Gillen has almost finished firescRUBYt! (scRUBYt! on FireWatir), which uses FireWatir as the agent (rather than mechanize) to navigate and extract data from the web page. I think this is the coolest addition in scRUBYt!'s history, since it enables scraping pages that contain AJAX/JavaScript and/or other tricks that were impossible to work around with mechanize, and makes it easy to parse pages that caused Hpricot to choke and gag...
=========================
Would you like to contribute?
* If you are a coder and would like to be part of the development team, contact us at scrubyt['maps-on'.reverse]@scrubyt.org
* If you'd like to contribute to the documentation/how-tos/tutorials, check out the wiki at http://wiki.scrubyt.org.
* If you found a bug, have suggestions or feature requests, please use scRUBYt!'s lighthouse tracker at http://scrubyt.lighthouseapp.com
* If you'd like to discuss or propose features, get some help or would like to check out and learn from the problems of others, visit the forum at http://agora.scrubyt.org
* If none of the above applies, but you would still like to tell us something, bring us champagne/chocolates, poke Glenn to finish FireWatir faster or whatever else, contact us at scrubyt['maps-on'.reverse]@scrubyt.org
H4ppy scrubbing,
Peter
__
http://www.rubyrailways.com
http://scrubyt.org