Ruby Sanity Check

Hi,

Just a quick check. I am starting into Ruby with little interest in
Ruby on Rails (it's good to know it's there, though). I want to use Ruby
in a more data-centric way. So in my first case I want to extract
data from websites and put it in a suitable format (XML, CSV, etc.) for
use in Excel, or later for use in a DB application such as MySQL or
PostgreSQL.

I am not nuts, am I?

I was looking at several comparisons and I thought that Ruby dealt with
lists and data in a logical fashion.

Ruby provides wonderful tools to do all of this. So no, you're not
completely nuts :slight_smile:

You could use Nokogiri (http://nokogiri.org/) to extract data from
websites. Ruby (1.9) also has a very nice CSV library that's fast and
easy to use. It's also pretty trivial to just go straight into the DB
with Ruby DBI or some of the ORMs like DataMapper and ActiveRecord.
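
As a rough sketch of the CSV half of this, assuming the scraped data has already been collected into an array of hashes (the sample `rows` and the `rows_to_csv` helper are made up for illustration), the standard library can produce the output:

```ruby
require "csv"

# Hypothetical scraped records; in practice these would come from the parser.
rows = [
  { "name" => "Widget", "price" => "9.99" },
  { "name" => "Gadget, large", "price" => "24.50" }
]

# Build a CSV string with a header row taken from the first record's keys.
# Assumes at least one row and that every row has the same keys.
def rows_to_csv(rows)
  headers = rows.first.keys
  CSV.generate do |csv|
    csv << headers
    rows.each { |row| csv << row.values_at(*headers) }
  end
end

csv_text = rows_to_csv(rows)
puts csv_text
```

Fields that contain a comma are quoted automatically, so writing `csv_text` to a file with `File.write` should give something Excel can open directly.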

Cheers,
Jason

···

On Mon, Oct 11, 2010 at 7:25 PM, flebber <flebber.crue@gmail.com> wrote:

Hi,

Just a quick check. I am starting into Ruby with little interest in
Ruby on Rails (it's good to know it's there, though). I want to use Ruby
in a more data-centric way. So in my first case I want to extract
data from websites and put it in a suitable format (XML, CSV, etc.) for
use in Excel, or later for use in a DB application such as MySQL or
PostgreSQL.

I am not nuts, am I?

I was looking at several comparisons and I thought that Ruby dealt with
lists and data in a logical fashion.

Nope, you're quite sane.

Sorry, I should have been a little more helpful:

http://mechanize.rubyforge.org/mechanize/

http://www.rubyinside.com/nokogiri-ruby-html-parser-and-xml-parser-1288.html

http://ar.rubyonrails.org/

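By "extract data from websites" I assume you mean screen scraping. Here are
two Railscasts about it
http://railscasts.com/episodes/173-screen-scraping-with-scrapi
http://railscasts.com/episodes/190-screen-scraping-with-nokogiri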

Some Nokogiri tutorials about it
http://nokogiri.org/tutorials

Some Mechanize tutorials about it (you will only need to use Mechanize if
you need to interact with the site; it uses Nokogiri under the covers. Note
that it can't handle JavaScript, and there are some alternatives if you need
that)
http://mechanize.rubyforge.org/mechanize/EXAMPLES_rdoc.html

Depending on what site you're trying to get info from, it might have an API,
and there might even be a gem for interacting with that API, and saving
yourself the headache and brittleness of screen scraping.

For outputting to XML, Nokogiri can do that. If you have difficulty getting
it installed, I've also enjoyed using Hpricot (aside from the API, the biggest
difference is that Nokogiri is built on libxml2, a very popular open source
C library, while Hpricot is built on a Ragel parser), and if you have
difficulty with that as well, the standard library provides one called
REXML.
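
If only REXML is available, a minimal sketch of writing XML with it might look like this (the element names `items` and `item`, the sample data, and the `rows_to_xml` helper are assumptions for illustration):

```ruby
require "rexml/document"

# Hypothetical scraped records.
rows = [
  { "name" => "Widget", "price" => "9.99" },
  { "name" => "Gadget & co", "price" => "24.50" }
]

# Turn each record into an <item> element with one child element per field.
def rows_to_xml(rows)
  doc = REXML::Document.new
  root = doc.add_element("items")
  rows.each do |row|
    item = root.add_element("item")
    row.each { |key, value| item.add_element(key).text = value }
  end
  doc.to_s
end

xml_text = rows_to_xml(rows)
puts xml_text
```

REXML escapes special characters such as `&` when the text is written out, so the values can be passed in as plain strings.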

Also consider YAML, which is built into the stdlib (but has difficulty, I
found, dealing with huge data sets).

There are a couple of gems for JSON, I can't remember which one I've used.
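
Both YAML and JSON round-trip plain arrays and hashes easily. A small sketch, assuming the data is made of simple strings and integers (the sample data is invented for illustration); the `json` library used here also ships with Ruby 1.9 and later, so a separate gem may not be needed:

```ruby
require "json"
require "yaml"

# Hypothetical scraped records made of plain strings and integers.
rows = [
  { "name" => "Widget", "stock" => 3 },
  { "name" => "Gadget", "stock" => 0 }
]

# Serialize to text in each format.
json_text = JSON.generate(rows)
yaml_text = YAML.dump(rows)

# Read the text back into Ruby arrays and hashes.
from_json = JSON.parse(json_text)
from_yaml = YAML.safe_load(yaml_text)

puts json_text
puts yaml_text
```

String keys are used on purpose: symbols do not survive a JSON round trip, and `YAML.safe_load` rejects them by default.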

For CSV, the fastercsv gem.

I am almost certain there are tools for interacting with Excel, but I'm on a
Mac, so not able to really help there.

Depending on what you're doing, you may not need the intermediate form to be
human readable (maybe you just need to persist an array of strings
between runs of your script, or something like that). If that is the
case, you can just marshal the data.
http://ruby-doc.org/core/classes/Marshal.html
Probably the easiest solution, and really fast, but it means your data is
tied to Ruby.
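
A minimal sketch of keeping an array of strings between runs with Marshal (the file name and the `save_state`/`load_state` helpers are hypothetical):

```ruby
require "tmpdir"

# Write any marshalable Ruby object to disk in Ruby's binary format.
def save_state(path, data)
  File.binwrite(path, Marshal.dump(data))
end

# Read it back. Only load files you wrote yourself, since Marshal.load
# on untrusted input is unsafe.
def load_state(path)
  Marshal.load(File.binread(path))
end

path = File.join(Dir.tmpdir, "scraper_state.bin")
save_state(path, ["http://example.com/a", "http://example.com/b"])
seen = load_state(path)
puts seen.inspect
```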

For dealing with databases, ActiveRecord, DataMapper, and Sequel should be
able to help you out.

ActiveRecord is extremely mature as it's the de facto Rails M in its MVC,
but it requires a little bit of infrastructure to get going outside of
Rails. If you want to use it, http://guides.rubyonrails.org/ is, IMO, the
best resource. There are also lots of Railscasts that deal with it (note
that AR3 was just released, so the interface is a little different).

DataMapper is another nice project. I like it because you can do it all in
one file without migrations (easy to get up and going): you literally define
your schema in your code. It has some other nice features, such as
guaranteeing that there will only ever be one instance of each DB row in
memory at a time (you can find yourself in some wonky situations with AR,
where it has cached results, or you load the same data twice and the one is
unaware of the other). It also has a cool solution to the n+1 problem, where
it will preload data as soon as it recognizes you're going to query for it
in a loop. Unfortunately, it's nowhere near as mature as ActiveRecord. I
finally ended up switching my last project off of DataMapper and onto
ActiveRecord after too many headaches dealing with polymorphism, immature
libraries for it (I needed tagging), and dissatisfaction with the IRC
channel. If you don't need external libraries like that, you probably won't
experience such frustrations. If you're interested in it, it has some good
tutorials on its site: http://datamapper.org/docs/
I also really liked the $9 Peepcode about Sinatra, which uses DataMapper to
talk to its database.

I've not used Sequel, but I've seen its creator present at Ruby Midwest. He
_really_ knows his stuff. I've also only heard good things about the
project, such as that it is actively developed and easy to get support for.
But my understanding is that its main strength is in connecting to "non
opinionated" (what AR would call "legacy") databases. If you have the
ability to design yours from the beginning, some of its strengths might be
unnecessary.

···

If you need a headless browser to scrape websites that do a lot of Ajax
stuff, use Celerity: http://celerity.rubyforge.org/

For Excel I would recommend win32ole. It is available natively on Windows,
or you can install it on Linux through Wine. I have no clue about Mac.
http://www.perlmonks.org/?node_id=430194

Roo also looks pretty cool, but I haven't done anything with it yet since
it can't write to cells in Excel.
http://roo.rubyforge.org

But if you are putting the data into a database anyway, it can read the
data and may be what you are looking for.

···

--
Posted via http://www.ruby-forum.com/.

Thanks for the links. I was hoping I was not nuts :-; I didn't think
so.

I was looking at nokogiri, would it be more feature complete than
hpricot?

···

On Oct 12, 11:11 am, Steve Klabnik <st...@steveklabnik.com> wrote:

Sorry, I should have been a little more helpful:

http://mechanize.rubyforge.org/mechanize/

http://www.rubyinside.com/nokogiri-ruby-html-parser-and-xml-parser-12...

http://ar.rubyonrails.org/

Wow thank you!!! Totally beyond expectation. Lets me spend more time
learning than searching, very much appreciated. I will update when I
understand better the tools and what flow I am going to use.

···

On Oct 12, 4:41 pm, Josh Cheek <josh.ch...@gmail.com> wrote:

By "extract data from websites" I assume you mean screen scraping. Here are
two Railscasts about it
http://railscasts.com/episodes/173-screen-scraping-with-scrapi
http://railscasts.com/episodes/190-screen-scraping-with-nokogiri

Possibly. It would also likely be faster and more stable, and I would guess
it's more actively maintained.

···

I was looking at nokogiri, would it be more feature complete than
hpricot?

This question is a bit more complicated than you'd imagine... there's
a lot of history there.

Either is fine. But you should probably go with nokogiri these days.