How to extract texts from html source?

Sam_Kong · 9 May 2005 19:04

Hi, all!

Quite often, when I need to read a list of web pages, I download the
HTML sources and save them in a single file like a.html.
If they are mostly text, I open the HTML in a web browser, select all,
copy it into an editor and save it.
I want to make the process shorter.
How can I extract the text from HTML source?
I'm sure there are many parsers for it.
What is the most convenient one?

Thanks.
Sam

James_Britt4 · 9 May 2005 19:22

Take a look at Michael Neumann's WWW::Mechanize

http://www.ntecs.de/blog/Blog/WWW-Mechanize.rdoc
http://rubyforge.org/frs/?group_id=427&release_id=2014

Or install the gem

James

--

http://catapult.rubyforge.com
http://orbjson.rubyforge.com
http://ooo4r.rubyforge.com
http://www.jamesbritt.com

Ben_Giddings1 · 10 May 2005 17:14

You may find my HTMLTokenizer library convenient for this. To do what you
need, all you'd do is keep calling "tokenizer.getText()"

http://rubyforge.org/projects/htmltokenizer/

Ben
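
The tokenizing idea can be sketched with Ruby's standard StringScanner. This is not HTMLTokenizer's API (its exact calls are not shown in this thread); the helper name text_tokens is hypothetical and only illustrates walking the input and keeping what sits between tags.

```ruby
require 'strscan'

# Hypothetical helper, NOT part of HTMLTokenizer: scans the input and
# collects the pieces of text found between tags.
def text_tokens(html)
  scanner = StringScanner.new(html)
  tokens = []
  until scanner.eos?
    if scanner.scan(/<[^>]*>/)
      # A complete tag: consumed and ignored.
    elsif (text = scanner.scan(/[^<]+/))
      text = text.strip
      tokens << text unless text.empty?
    else
      scanner.getch # a stray "<" that never closes: skip one character
    end
  end
  tokens
end

puts text_tokens("<table><tr><td>TEST</td></tr></table>").join(" ")
```

Script and style contents, comments and entities such as &amp; are not handled here; a real tokenizer has to deal with those too.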


Brian_Schroder1 · 9 May 2005 19:37

You don't need ruby for this:

$ apt-cache show w3m
Package: w3m
[snip]
Description: WWW browsable pager with excellent tables/frames support
w3m is a text-based World Wide Web browser with IPv6 support.
It features excellent support for tables and frames. It can be used
as a standalone file pager, too.
.
  * You can follow links and/or view images in HTML.
  * Internet message preview mode, you can browse HTML mail.
  * You can follow links in plain text if it includes URL forms.
  * With w3m-img, you can view image inline.
.
For more information,
see w3m download | SourceForge.net

$ w3m -dump http://ruby.brian-schroeder.de/quiz/mazes/ | head
A ruby a day!

Ruby Quiz Solutions (Amazing Mazes)

Amazing Mazes

For a full description see: (Amazing Mazes on Ruby Quiz Homepage)[http://
Ruby Quiz - Amazing Mazes (#31)]

Another graph algorithm. Create a maze that is fully connected and has only one
$

regards,

Brian
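
If the conversion should still be driven from a Ruby program, the w3m call can be wrapped with the standard Open3 library. A minimal sketch, assuming w3m is installed and on the PATH; the helper name html_to_text is hypothetical, and `-T text/html` tells w3m to treat its standard input as HTML.

```ruby
require 'open3'

# Hypothetical helper: pipes an HTML string through `w3m -dump` and
# returns the rendered plain text, or nil if w3m is missing or fails.
def html_to_text(html)
  out, status = Open3.capture2('w3m', '-dump', '-T', 'text/html', stdin_data: html)
  status.success? ? out : nil
rescue Errno::ENOENT
  nil # w3m is not installed
end

text = html_to_text("<table><tr><td>TEST</td></tr></table>")
puts(text || "w3m not available")
```

The rendering, including table layout, is left entirely to w3m, so the output can contain extra whitespace around the text.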

--
http://ruby.brian-schroeder.de/

multilingual _non rails_ ruby based vocabulary trainer:
http://www.vocabulaire.org/ | http://www.gloser.org/ | http://www.vokabeln.net/

Sam_Kong · 9 May 2005 19:54

James Britt wrote:

Take a look at Michael Neumann's WWW::Mechanize

http://www.ntecs.de/blog/Blog/WWW-Mechanize.rdoc
http://rubyforge.org/frs/?group_id=427&release_id=2014

Or install the gem

Thanks, James.
That looks cool.
However, it doesn't seem to have a function to extract texts from html.
(Or did I miss it?)
What I want is...

<table><tr><td>TEST</td></tr></table> => TEST

Is there a module that does this?

Regards,
Sam


James_Britt4 · 10 May 2005 17:20

Ben Giddings wrote:

You may find my HTMLTokenizer library convenient for this. To do what you need, all you'd do is keep calling "tokenizer.getText()"

http://rubyforge.org/projects/htmltokenizer/

WWW::Mechanize sits atop such a process, but makes it easier to define what to do for selected elements and such.

Just sayin' ...

James

Sam_Kong · 9 May 2005 20:05

Brian Schröder wrote:

You don't need ruby for this:

$ w3m -dump http://ruby.brian-schroeder.de/quiz/mazes/ | head
A ruby a day!

Oh, thanks.
I just realized that even lynx can do that.

Regards,
Sam


James_Britt4 · 10 May 2005 02:49

Sam Kong wrote:

Thanks, James.
That looks cool.
However, it doesn't seem to have a function to extract texts from html.
(Or did I miss it?)

No, it is a library for the (fairly) easy creation of HTML munging code.

Some coding is required, but it allows complete control (so you get just the text of interest).

James

William_Park · 2 June 2005 23:35

I guess you could run it through an XML parser like Expat, which is everywhere
these days. Even Bash and Gawk have interfaces to it.

···

Sam Kong <sam.s.kong@gmail.com> wrote:

What I want is...

<table><tr><td>TEST</td></tr></table> => TEST

Is there a module that does this?

--
William Park <opengeometry@yahoo.ca>, Toronto, Canada
ThinFlash: Linux thin-client on USB key (flash) drive
http://home.eol.ca/~parkw/thinflash.html
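
The XML-parser route can be sketched in Ruby as well. The example below swaps in REXML, the XML parser distributed with Ruby, rather than Expat, and the helper name xml_text is hypothetical. It only works when the markup is well-formed XML, which much real-world HTML is not; REXML raises a parse error otherwise.

```ruby
require 'rexml/document'

# Hypothetical helper: parses well-formed markup and joins all of its
# text nodes. Uses REXML instead of Expat, as an assumption for this sketch.
def xml_text(markup)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, '//text()').map(&:value).join(' ')
end

puts xml_text('<table><tr><td>TEST</td></tr></table>')
```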

Tom_Reilly · 10 May 2005 02:07

Several years ago, one of the members of the group offered me this routine, which does a pretty good job of
extracting the text from an HTML page.

···

#--------------------------------------------------------------------
# Strip HTML Tags from Line
#--------------------------------------------------------------------

def striphtml(line)
  line.gsub(/\n/, ' ').gsub(/<.*?>/, '')
end

daz · 10 May 2005 12:04

Sam Kong wrote:

[...] If they are mostly texts, I open the html using
web browser, select all and copy it to an editor and save it.

Save As ... [text file].txt

- Removes all tags.
(Verified with Opera, Firefox & IE6, so I guess most browsers do this)
( e.g. test page: http://www.qurl.net/ )

daz

Sam_Kong · 10 May 2005 16:00

Yes, that's right...
I just want to do it all with my ruby program...hehe
Thanks anyway.

Sam

Sam_Kong · 10 May 2005 16:00

Tom Reilly wrote:

Several years ago, one of the members of the group offered me this
routine which does a pretty good job of
extracting the text from a html page.

#--------------------------------------------------------------------
# Strip HTML Tags from Line
#--------------------------------------------------------------------

def striphtml(line)
line.gsub(/\n/, ' ').gsub(/<.*?>/, '')
end

Thank you for sharing the code.
However, this code works only for a single line, right?
When I tested it on a full page of HTML by looping line by line, the
result was not what I expected.
I probably need a DOM parser...

Sam
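
One likely cause of the line-by-line problem (an assumption, since the test page is not shown) is a tag that is split across two lines: neither line then contains a complete `<...>` for the regexp to match. A minimal sketch that applies the same regexp to the whole page at once; the name strip_html_page and the sample page are hypothetical.

```ruby
# Same regexp as the posted striphtml, but applied to the whole page so a
# tag that is split across lines is still seen as one tag.
def strip_html_page(html)
  html.gsub(/\n/, ' ').gsub(/<.*?>/, '').squeeze(' ').strip
end

page = <<~HTML
  <table>
    <tr><td
        class="cell">TEST</td></tr>
  </table>
HTML

puts strip_html_page(page)

# Line by line, the split <td ...> tag is never matched as a whole,
# so fragments of it are left in the result.
per_line = page.each_line.map { |l| l.gsub(/\n/, ' ').gsub(/<.*?>/, '') }.join
puts per_line
```

This still leaves script and style contents and character entities in place, so a real parser remains the more robust choice for arbitrary pages.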

Julius_Plenz · 30 May 2005 20:45

* Tom Reilly <w3gat@nwlagardener.org> [2005-05-10]:

#--------------------------------------------------------------------
# Strip HTML Tags from Line
#--------------------------------------------------------------------

def striphtml(line)
  line.gsub(/\n/, ' ').gsub(/<.*?>/, '')
end

I'd rather recommend using

  line.gsub(/\n/, ' ').gsub(/<[^>]+>/, '')

instead of

  line.gsub(/\n/, ' ').gsub(/<.*?>/, '')

Julius
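
The reason for preferring the second regexp is not stated in the post. One difference that can be shown: in Ruby, `.` does not match a newline unless the /m option is given, while `[^>]` matches any character except `>`, newlines included. The sketch below applies both regexps directly, without the newline replacement, to a made-up sample string.

```ruby
html = "<p\n class=\"x\">one</p>\n<p>two</p>"

lazy    = html.gsub(/<.*?>/, '')   # "." stops at "\n", so the split tag survives
negated = html.gsub(/<[^>]+>/, '') # "[^>]" also matches "\n"

p lazy
p negated
```

When the newlines are replaced by spaces first, as in both versions of striphtml, the two regexps remove the same tags; they differ only for an empty `<>`, which `<[^>]+>` leaves alone.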
