Quite often, when I need to read a list of web pages, I download the
HTML sources and save them in a single file such as a.html.
If they are mostly text, I open the file in a web browser, select all,
copy it into an editor, and save it.
I want to make the process shorter.
How can I extract the text from the HTML source?
I'm sure there are many parsers for it.
Which is the most convenient one?
$ apt-cache show w3m
Package: w3m
[snip]
Description: WWW browsable pager with excellent tables/frames support
w3m is a text-based World Wide Web browser with IPv6 support.
It features excellent support for tables and frames. It can be used
as a standalone file pager, too.
.
* You can follow links and/or view images in HTML.
* Internet message preview mode, you can browse HTML mail.
* You can follow links in plain text if it includes URL forms.
* With w3m-img, you can view image inline.
.
For more information,
see w3m download | SourceForge.net
$
regards,
Brian
On 09/05/05, James Britt <james_b@neurogami.com> wrote:
Sam Kong wrote:
> Hi, all!
>
> Quite often, when I need to read a list of web pages, I download the
> html sources and save them in a single file like a.html.
> If they are mostly texts, I open the html using web browser, select all
> and copy it to an editor and save it.
> I want to make the process shorter.
> How can I extract the text from html source?
> I'm sure there're many parsers for it.
> What is the most convenient one?
You may find my HTMLTokenizer library convenient for this. To do what you need, all you'd do is keep calling "tokenizer.getText()".
Several years ago, one of the members of the group offered me this
routine, which does a pretty good job of extracting the text from an
HTML page.
#--------------------------------------------------------------------
# Strip HTML Tags from Line
#--------------------------------------------------------------------
def striphtml(line)
  line.gsub(/\n/, ' ').gsub(/<.*?>/, '')
end
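
For illustration, here is how the routine above behaves on a small string. This is only a usage sketch: the sample input is made up, and the routine is copied unchanged. Because newlines are turned into spaces before the tags are removed, passing one string that contains a tag split across two lines still works.

```ruby
# The routine from the post, unchanged.
def striphtml(line)
  line.gsub(/\n/, ' ').gsub(/<.*?>/, '')
end

# Made-up sample input; the second string has a tag split over two lines.
one_line   = "<p>Hello <b>world</b></p>\n"
split_tag  = "<a\nhref=\"x.html\">link</a>"

p striphtml(one_line)
p striphtml(split_tag)
```

Note that entities such as `&amp;` are left as they are, and the contents of script or style elements are not removed.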
Thank you for sharing the code.
However, this code works only on a single line, right?
When I tested it on a page of HTML by looping line by line, the
result was not what I expected.
I probably need a DOM parser...
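
One likely reason a line-by-line loop gives odd results is that a tag, a comment, or a script block can span several lines, so no single line contains both the `<` and the matching `>`. Below is a minimal sketch that works on the whole document as one string instead. The name `html_to_text` is hypothetical, the entity list is deliberately tiny, and this is still a regex-based approach rather than a real parser, so malformed markup may defeat it.

```ruby
# Hypothetical helper: strip markup from a whole HTML document at once.
# Regex-based sketch only, not a real HTML parser.
def html_to_text(html)
  # Drop script and style elements together with their contents.
  text = html.gsub(/<(script|style)\b[^>]*>.*?<\/\1>/mi, ' ')
  # Drop comments; /m lets "." match newlines.
  text = text.gsub(/<!--.*?-->/m, ' ')
  # Drop remaining tags; [^>] also matches newlines, so multi-line tags go too.
  text = text.gsub(/<[^>]+>/, ' ')
  # Decode a few common entities (&amp; last, to avoid double decoding).
  text = text.gsub('&nbsp;', ' ').gsub('&lt;', '<').gsub('&gt;', '>').gsub('&amp;', '&')
  # Collapse runs of whitespace into single spaces.
  text.gsub(/\s+/, ' ').strip
end

# Made-up sample with a tag split across lines and a script block.
sample = "<p>Hello,\n<a\n href=\"x.html\">world</a></p>\n<script>\nvar x = 1;\n</script>"
puts html_to_text(sample)
```

For a saved file, something like `html_to_text(File.read('a.html'))` would apply it to the whole page. Replacing each tag with a space keeps adjacent paragraphs from running together, at the cost of an occasional extra space before punctuation.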
* Tom Reilly <w3gat@nwlagardener.org> [2005-05-10]:
#--------------------------------------------------------------------
# Strip HTML Tags from Line
#--------------------------------------------------------------------
def striphtml(line)

I'd recommend using

  line.gsub(/\n/, ' ').gsub(/<[^>]+>/, '')

instead of