Separate body content from HTML

I am currently working on a script that will parse lyrics from online lyric pages. To get at the actual lyrics, I need to take the HTML source and somehow separate out all the <BODY> content. I would then put all of the body content into a string and parse it to remove all image tags, style tags and non-visible characters, leaving me with just text.

I am new to Ruby and regular expressions, so I am having some trouble getting the data between the <BODY> and </BODY> tags into a string. Right now I have the entire page loaded into a string called data, and I run #scan with a regular expression and a block that prints out the matches from #scan.

I guess I am just asking for a good regular expression (or other means) of separating out the body content of an HTML document from the rest of the source.

Thanks,
Matthew Margolis
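
For reference, here is a minimal sketch of the kind of regular expression the question asks for. It assumes the page is already in a string called data, as described above. The method name extract_body and the sample page are made up for illustration. A regular expression like this copes with simple pages but can be fooled by malformed markup or by "</body>" appearing inside a script or comment.

```ruby
# Return the text between <body ...> and </body>, or nil if no body is found.
# extract_body is a hypothetical helper name, not part of any library.
def extract_body(data)
  # i : ignore tag capitalization (<BODY>, <body>, <Body>)
  # m : let . match newlines, since the body spans many lines
  # [^>]* skips any attributes on the opening tag
  # .*?   is non-greedy, so the match stops at the first </body>
  match = data.match(/<body\b[^>]*>(.*?)<\/body\s*>/im)
  match && match[1]
end

# Made-up sample page standing in for the downloaded HTML.
data = <<~HTML
  <html><head><title>Song</title></head>
  <BODY bgcolor="white">
  <p>First line<br>Second line</p>
  </BODY></html>
HTML

body = extract_body(data)
puts body
```

Using match rather than #scan fits here because a page normally has one body, so there is a single capture to pull out rather than a list of matches to iterate over.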

Hi,
Take a look at the HTMLTokenizer module at RAA.
http://raa.ruby-lang.org/project/htmltokenizer/

Cheers,
Joao

Matthew,

I wrote some code that does exactly the same thing, and I did it with
some regular expressions. It works, but it can get a little messy. You
might have better luck with an HTML tokenizer, as someone else said.
Usually the hardest part is finding out all the variations in the HTML
returned. With sites that serve dynamic content, you often have to fetch
many different pages just to see what the HTML will look like.
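
To give an idea of what the regular-expression route can look like, here is a rough sketch that reduces a fragment of HTML to plain text. It is not the code referred to in this message. The method name strip_html and the choice of which tags and entities to handle are assumptions made for illustration, and real pages will need more cases than these.

```ruby
# Reduce a fragment of HTML to plain text with a few regular expressions.
# strip_html is a hypothetical helper name, not part of any library.
def strip_html(html)
  text = html.dup

  # Drop <script> and <style> elements together with their contents.
  %w[script style].each do |tag|
    text.gsub!(/<#{tag}\b[^>]*>.*?<\/#{tag}\s*>/im, "")
  end

  # Drop HTML comments, which may span several lines.
  text.gsub!(/<!--.*?-->/m, "")

  # Treat <br> and </p> as line breaks so verses keep their shape.
  text.gsub!(/<br\s*\/?>|<\/p\s*>/i, "\n")

  # Remove every remaining tag, including <img ...>.
  text.gsub!(/<[^>]+>/, "")

  # Decode only two common entities; a real script would need more.
  text.gsub!(/&nbsp;/i, " ")
  text.gsub!(/&amp;/i, "&")

  # Trim each line and drop the empty ones.
  text.lines.map(&:strip).reject(&:empty?).join("\n")
end

sample = "<STYLE>p { color: red }</STYLE><p>La <b>la</b><BR>la&nbsp;la</p><img src='x.gif'>"
puts strip_html(sample)
```

The i flag on each pattern is what absorbs differences in tag capitalization. Every new variation in the markup tends to add another substitution, which is where the messiness comes from.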

While writing lyrics plugins, one very difficult thing I ran into was
pages returning different content depending on my User-Agent string. For
example, sometimes the capitalization of the tags would differ between
browsers. In one case the content was completely different.
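
One way to investigate this is to send the request with an explicit User-Agent header and compare what comes back. The sketch below uses Ruby's standard Net::HTTP library. The method names build_request and fetch_page, the example URL and the User-Agent strings are all placeholders chosen for illustration.

```ruby
require "net/http"
require "uri"

# Build a GET request that announces the given User-Agent string.
# build_request is a hypothetical helper name.
def build_request(url, user_agent)
  uri = URI.parse(url)
  request = Net::HTTP::Get.new(uri.request_uri)
  request["User-Agent"] = user_agent
  request
end

# Send the request and return the response body as a string.
# fetch_page is a hypothetical helper name; it needs network access.
def fetch_page(url, user_agent)
  uri = URI.parse(url)
  request = build_request(url, user_agent)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.request(request).body
  end
end

# Example use with placeholder values (not run here):
#   as_browser = fetch_page("http://example.com/lyrics?id=1", "Mozilla/5.0")
#   as_script  = fetch_page("http://example.com/lyrics?id=1", "Ruby")
#   puts "responses differ" unless as_browser == as_script
```

Fetching the same page under two User-Agent strings and comparing the results shows quickly whether a site varies its markup, before any parsing code is written against it.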

If you want to use some of my code to help your project along, you can
find it at http://kapheine.hypa.net/musicextras under the API docs (or
download it).

On Mon, Jul 05, 2004 at 11:54:53PM +0900, Matthew Margolis wrote:

--
Zachary P. Landau <kapheine@hypa.net>
GPG: gpg --recv-key 0x24E5AD99 | http://kapheine.hypa.net/kapheine.asc

Joao Pedrosa wrote:

Hi,
Take a look at the HTMLTokenizer module at RAA.
http://raa.ruby-lang.org/project/htmltokenizer/

Cheers,
Joao

Excellent. Thank you very much.

-Matthew Margolis

Thank you, Zachary. I am checking out the API docs right now.

-Matthew Margolis