I'm looking for a screen scraper that will extract the text contents of HTML pages and save the text into files. I have looked at Mechanize and Rubyful_Soup, but they are a bit over my head to modify to save just the text contents to a file. (I'm a researcher trying to use Ruby for real-world text analysis tasks, and trying to learn Ruby at the same time.) The levels of usage I'd love to have (choosy beggar):
- the program prompts me for the URL to scrape and the file name to save the text into, or
- I edit the program to enter the URL and file name.
Of course a program that, given a URL, would walk down the links, open the pages, and save the text contents to a file would be ... that would be a commercial product. Is there one?
Thanks!
basi
I think open-uri and Rubyful_Soup are pretty straightforward. I like how open-uri compares to Net::HTTP.
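A small sketch of the difference (the URL here is a placeholder; both snippets fetch the same page, open-uri just saves the ceremony):

  require 'net/http'
  require 'uri'
  require 'open-uri'

  # Net::HTTP: build the URI yourself and ask for the body
  uri  = URI.parse("http://example.com/")   # placeholder URL
  body = Net::HTTP.get(uri)

  # open-uri: the same fetch reads like opening a local file
  body = open("http://example.com/") { |f| f.read }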
There are commercial website downloaders that will follow every link in every page, hit the server hundreds of times in a few seconds, and get your IP blacklisted pretty quickly (so run them from Starbucks wireless). Look in O'Reilly's Spidering Hacks for the right way to do it. The (Perl) examples are straightforward.
basi wrote:
> Of course a program that, given a url, would walk down the links, open
> the pages, and save the text contents to a file would be ... that would
> be a commercial product. Is there one?
No need for a commercial product. wget does all that.
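For example, a recursive fetch two levels deep (the URL and depth are placeholders):

  wget --recursive --level=2 --no-parent http://example.com/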
Hi,
Thanks for the info on wget and curl. Both are powerful page
downloaders. The downloaded pages are still "tagged". I need to find a
way to "run" the pages and capture only the text display.
Thanks again.
basi
Hello,
Can't find Windows XP binaries of w3m or snarf; also tried cURL and wget, but lynx does look like it renders the page close to what I'm looking for.
Thanks to all who responded!
basi
I will throw something like this together in Ruby over the next days when I get some time and post it on RubyForge. I have already done this sort of stuff in Java and the concepts just really need a port. All we are looking at, for Basi's initial level of requirements, is to send an HTTP GET and pipe the response to a file. Link following is a little more tricky, since you need to parse the HTML, issue a GET, pipe the file, rinse and repeat, but again not exactly rocket science. A sketch of both levels is below.
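A minimal sketch of both steps with open-uri (the URL, file names, and the naive href regex are placeholders, not a real parser):

  require 'open-uri'

  url  = "http://example.com/"                    # placeholder URL
  html = open(url) { |f| f.read }                 # the HTTP GET
  File.open("page.html", "w") { |f| f << html }   # pipe response to a file

  # naive link following: scan for absolute hrefs, fetch, save, repeat
  html.scan(/href="(http[^"]+)"/i).flatten.each_with_index do |link, i|
    open(link) { |f| File.open("page#{i}.html", "w") { |out| out << f.read } }
  end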
rgds
Steve
Nope, according to the OP's requirements, you also need to render the HTML and spit out the rendered version as text, which makes lynx --dump the right tool for the job. It'd be quite a big task to duplicate this in Ruby, I think.
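For instance (the URL and output file are placeholders):

  lynx -dump http://example.com/ > page.txt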
martin
By "rendering the HTML", my interpretation was that it is merely a question of stripping tags etc., which can quickly be accomplished with gsub. Or am I missing something?
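The gsub I have in mind is just a one-liner that deletes anything between angle brackets (assuming html holds the page source):

  text = html.gsub(/<[^>]*>/, '')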
rgds
Steve
E.g. tables and frames. So better to use links2 or w3m for the task.
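w3m's dump mode handles tables, for example (the URL and output file are placeholders):

  w3m -dump http://example.com/ > page.txt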
cheers,
Brian
Even without things like tables, the significance of the various whitespace elements (space, tab, newline) in HTML is very different from their significance in the rendered page. E.g. the following can't be handled by just stripping tags:
<ul><li>one
two</li><li>three<li>four</ul>
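A quick check of what plain tag-stripping does to that fragment:

  html = "<ul><li>one\ntwo</li><li>three<li>four</ul>"
  puts html.gsub(/<[^>]*>/, '')
  # prints "one" followed by "twothreefour" -- the three list
  # items run together, where a renderer would emit three bullets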
martin
Ah, yeah, forgot all about those nasty little things.
Not insuperable, but it would certainly add overhead to handle them effectively.
Steve
WWW::Mechanize can do most of what is needed, except for dumping the HTML as text. As others have said, what we really need is some kind of HTML-to-text renderer. There has got to be gobs of C or C++ code out there that does this... how hard would it be to make a Ruby C extension for it? Has anyone ever thought about making a nice Ruby extension for Gecko, or even for the HTML renderers in lynx or w3m?
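In the meantime, a stopgap sketch: let WWW::Mechanize do the fetching and shell out to lynx for the rendering step (this assumes lynx is on the PATH and supports -stdin; the URL and file name are placeholders):

  require 'rubygems'
  require 'mechanize'

  agent = WWW::Mechanize.new
  page  = agent.get("http://example.com/")   # placeholder URL

  # hand the fetched HTML to lynx, which renders it as plain text
  text = IO.popen("lynx -dump -stdin", "r+") do |io|
    io.write(page.body)
    io.close_write
    io.read
  end
  File.open("page.txt", "w") { |f| f << text }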