Separate body content from HTML

MATTHEW_REUBEN_MARGO · 5 July 2004 14:54

I am currently working on a script that will parse lyrics from online lyric pages. To get at the actual lyrics I need to take the HTML source and somehow separate out all the <BODY> content. I would then put all of the body content into a string and parse it to remove all image tags, style tags and non-visible characters, leaving me with just text.

I am new to Ruby and regular expressions so I am having some trouble getting the data between the <BODY> and </BODY> tags into a string. Right now I have the entire page loaded into a string called data and I run #scan with a regular expression and a block that prints out the matches from #scan.

I guess I am just asking for a good regular expression (or other means) of separating out the body content of an HTML document from the rest of the source.

Thanks,
Matthew Margolis

Joao_Pedrosa2 · 5 July 2004 15:20

Hi,
Take a look at the HTMLTokenizer module at RAA.
http://raa.ruby-lang.org/project/htmltokenizer/

Cheers,
Joao


Zachary_P_Landau · 6 July 2004 16:39

Matthew,

I wrote some code that does exactly the same thing, and I did it with
some regular expressions. It works, but it can get a little messy. You
might have better luck with an HTML tokenizer as someone else said.
Usually the hardest part is finding out all the variations on the HTML
returned. A lot of sites with dynamic content require trying to fetch
all kinds of information so you can see what the HTML will look like.

While writing lyrics plugins, one very difficult thing I ran into was
pages having different content depending on my User Agent string. For
example, sometimes the capitalization of the tags would be different in
different browsers. Once, the content was completely different.
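Because tag capitalization can vary like this, any patterns used for the clean-up step need to be case-insensitive as well. Below is a rough sketch of reducing a body fragment to plain text. The helper name strip_markup is hypothetical, only the &nbsp; entity is handled, and "non-visible characters" is assumed here to mean runs of spaces, tabs and blank lines.

```ruby
# Hypothetical helper: reduces an HTML fragment to its visible text.
def strip_markup(fragment)
  text = fragment.dup

  # Drop whole <script> and <style> blocks, contents included, in any case.
  %w[script style].each do |tag|
    text = text.gsub(/<#{tag}\b[^>]*>.*?<\/#{tag}\s*>/mi, "")
  end

  text = text.gsub(/<br\s*\/?>/i, "\n") # keep the line breaks lyrics rely on
  text = text.gsub(/<[^>]*>/, "")       # drop every remaining tag, including <img>
  text = text.gsub(/&nbsp;/i, " ")      # the only entity handled in this sketch

  # Collapse runs of spaces/tabs and limit consecutive newlines to two.
  text.gsub(/[ \t]+/, " ").gsub(/\n{3,}/, "\n\n").strip
end

# Made-up fragment mixing upper- and lower-case tags.
sample = "<STYLE>p { color: red }</STYLE><p>Hello<BR>world</p><IMG SRC=\"a.gif\">"
puts strip_markup(sample)
```

The script and style blocks are removed first so that their contents do not survive as text once the generic tag pattern runs.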

If you want to use some of my code to help your project along, you can
find it at http://kapheine.hypa.net/musicextras under the API docs (or
download it).


--
Zachary P. Landau <kapheine@hypa.net>
GPG: gpg --recv-key 0x24E5AD99 | http://kapheine.hypa.net/kapheine.asc

MATTHEW_REUBEN_MARGO · 5 July 2004 15:45

Joao Pedrosa wrote:

Hi,
Take a look at the HTMLTokenizer module at RAA.
http://raa.ruby-lang.org/project/htmltokenizer/

Cheers,
Joao


Excellent. Thank you very much.

-Matthew Margolis


MATTHEW_REUBEN_MARGO · 6 July 2004 23:59


Thank you Zachary. I am checking out the API docs right now.

-Matthew Margolis
