Need a regex searching html code

Chirantan · 28 February 2008 06:40

I have some HTML code in a string. I want to retrieve the content (which can
be any HTML code with any number of tags) inside the div, from just after
the heading to the end of the div.

Example,

<div class="info">
<h5>Tagline:</h5>
Yippee Ki Yay Mo - John 6:27
</div>

<div class="info">
<h5>Plot Outline:</h5>
John McClane takes on an Internet-based terrorist organization who is
systematically shutting down the United States. <a class="tn15more
inline" href="http://www.imdb.com/title/tt0337978/plotsummary"
onclick="(new Image()).src='/rg/title-tease/plotsummary/images/b.gif?
link=/title/tt0337978/plotsummary';">more</a>
</div>

In the above example, if "Plot Outline:" is the header that I am looking for,
then the regex should give me -

John McClane takes on an Internet-based terrorist organization who is
systematically shutting down the United States. <a class="tn15more
inline" href="http://www.imdb.com/title/tt0337978/plotsummary"
onclick="(new Image()).src='/rg/title-tease/plotsummary/images/b.gif?
link=/title/tt0337978/plotsummary';">more</a>

And if "Tagline:" is what I am looking for, then the regex should give me -

Yippee Ki Yay Mo - John 6:27

I hope the problem statement is clear.

Todd_Benson · 28 February 2008 12:07

Scraping html is not the easiest thing in the world. I would
recommend the hpricot library.

Todd

W_James · 28 February 2008 15:55

Note that this will give spurious results if an html comment happens
to contain what you are looking for.

def find_header header, html
  # Put all of the DIVs in an array.
  divs = html.scan( %r{<div.*?>(.*?)</div>}im ).flatten
  divs.each{|s|
    if s =~ %r{<h(\d)>#{header}</h\1>(.*)}im
      return $2.strip
    end
  }
  return nil
end

html = DATA.read

puts find_header( "Plot Outline:", html )

__END__
<div class="info">
<h5>Tagline:</h5>
Yippee Ki Yay Mo - John 6:27
</div>

<div class="info">
<h5>Plot Outline:</h5>
John McClane takes on an Internet-based terrorist organization who is
systematically shutting down the United States. <a class="tn15more
inline" href="http://www.imdb.com/title/tt0337978/plotsummary"
onclick="(new Image()).src='/rg/title-tease/plotsummary/images/b.gif?
link=/title/tt0337978/plotsummary';">more</a>
</div>

Mark_Thomas1 · 28 February 2008 18:54

A regex will break too easily when parsing HTML. A real parser will do
a much better job, and often be more concise and readable, too.

This does what you want:

#-------
require 'rubygems'
require 'hpricot'
@doc = Hpricot(html) # or Hpricot(open("filename"))

def find(term)
  @doc.search("//div[@class='info']").each do |info|
    header = info.search("h5").remove
    if header.inner_text == term
      puts info.inner_html
    end
  end
end
#-------

find("Plot Outline:")

John McClane takes on an Internet-based terrorist organization who is
systematically shutting down the United States. <a
href="http://www.imdb.com/title/tt0337978/plotsummary" class="tn15more
inline" onclick="(new Image()).src='/rg/title-tease/plotsummary/images/b.gif?
link=/title/tt0337978/plotsummary';">more</a>

Mark

W_James · 28 February 2008 20:14

More concise:

def find_header header, html
  html.scan( %r{<div.*?>(.*?)</div>}im ).flatten.each{|s|
    return $1.strip if s =~ %r{<h5>#{header}</h5>(.*)}im }
  return nil
end

Chirantan · 29 February 2008 04:00

Thank you William and Mark,

The code worked. Thanks a lot.

Mark_Thomas1 · 29 February 2008 13:54

All the regex solutions provided will break with the following
perfectly valid HTML:

<div class="info">
<h5 >Tagline:</h5>
Yippee Ki Yay Mo - John 6:27
</div>

This is one of many reasons it is a BAD idea to use regexes to parse
HTML. Regular expressions are simply not the right tool for the job.

Florian_Gilcher · 29 February 2008 16:52

What's quite interesting is that I am not able to find a nice article on _why_
this doesn't work. So, in short:

Regexps can only parse languages that are regular (hence the name) or,
in other words, Type 3 languages in the Chomsky hierarchy [1]. This is a
rule of thumb, because many regexp libraries nowadays implement
features that let you do more than formal regular expressions can.
But for typical use, it is true.

Regular languages have no way to "look behind"; they only look forward,
with a fixed, finite amount of state. This is why you cannot define a regular
language (and thus no regular expression) that describes and parses an
arbitrarily deeply nested structure: you have no way to determine which
closing tag matches a given opening tag.

A more abstract example:
There is no (formal) regular expression that matches a word that consists
of n times "a" and then n times "b":

ab
aabb
aaabbb
aaaabbbb
etc.

What you can do is extract a tag, push it on a stack, extract the
next one, etc., and pop them when you encounter the matching closing tags.
Tags by themselves can be described with regexps (afaik, this is how TextMate
does its markup).
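
The stack idea described above can be sketched in a few lines of Ruby. This is
only an illustration, not a real HTML parser: match_tags and TAG are
hypothetical names, and the sketch assumes well-formed input with no void
elements (such as <br>), no comments, and no ">" inside attribute values.

```ruby
# Hypothetical sketch: a regex is used only to recognise single tags;
# a stack keeps track of which tags are still open.
TAG = %r{<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>}

def match_tags(html)
  stack = []   # [name, offset] of every tag that is still open
  pairs = []   # [name, open_offset, close_offset] for each matched pair
  html.scan(TAG) do |closing, name|
    offset = Regexp.last_match.begin(0)
    if closing.empty?
      stack.push([name.downcase, offset])
    else
      open_name, open_offset = stack.pop
      raise "unbalanced </#{name}>" unless open_name == name.downcase
      pairs << [open_name, open_offset, offset]
    end
  end
  raise "unclosed <#{stack.last.first}>" unless stack.empty?
  pairs
end

html = '<div class="info"><div class="nextinfo"><h5 >Tagline:</h5>Yippee</div></div>'
match_tags(html).each do |name, from, to|
  puts "<#{name}> opened at #{from}, closed at #{to}"
end
```

The pairs come out innermost first, because a tag is only recorded once its
closing tag has popped it off the stack.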

Greetings
Skade

[1] Chomsky hierarchy - Wikipedia

W_James · 29 February 2008 19:05

> All the regex solutions provided will break with the following
> perfectly valid HTML:

> <div class="info">
> <h5 >Tagline:</h5>
> Yippee Ki Yay Mo - John 6:27
> </div>

Easily fixed.

def find_header header, html
  html.scan( %r{<div.*?>(.*?)</div\s*>}im ).flatten.
  each{|s|
    return $1.strip if s =~ %r{<h5\s*>#{header}</h5\s*>(.*)}im }
  return nil
end

> This is one of many reasons it is a BAD idea to use regexes to parse
> HTML. Regular expressions are simply not the right tool for the job.

Who told you that they are not? And why did you take his word for it?
Does hpricot use regular expressions?

Jari_Williamsson · 29 February 2008 19:19

Mark Thomas wrote:

> All the regex solutions provided will break with the following
> perfectly valid HTML:

> <div class="info">
> <h5 >Tagline:</h5>
> Yippee Ki Yay Mo - John 6:27
> </div>

> This is one of many reasons it is a BAD idea to use regexes to parse
> HTML. Regular expressions are simply not the right tool for the job.

Sorry if I'm missing the point:

---
the_text = %q{
<div class="info">
<h5 >Tagline:</h5>
Yippee Ki Yay Mo - John 6:27
</div>
}

the_text.each_line do |line|
  puts "Within DIV tags: #{line}" if (line=~/<div/)..(line=~/<\/div/)
  puts "Within H5 tags: #{line}" if (line=~/<h5/)..(line=~/<\/h5/)
end
---

Result:
Within DIV tags: <div class="info">
Within DIV tags: <h5 >Tagline:</h5>
Within H5 tags: <h5 >Tagline:</h5>
Within DIV tags: Yippee Ki Yay Mo - John 6:27
Within DIV tags: </div>

Best regards,

Jari Williamsson

Todd_Benson · 29 February 2008 17:35

Thank you for that great explanation! I was waiting for someone to
bring up formal grammar, but I was afraid to, because I wasn't sure it
applied (not that familiar with how regexps actually work).

Todd

W_James · 29 February 2008 19:14

> A more abstract example:
> There is no (formal) regular expression that matches a word that
> consists of n times "a" and then n times "b":

And that doesn't matter much. One can use as many regular expressions
as he wishes.

> ab
> aabb
> aaabbb
> aaaabbbb
> etc.

"ab
xx
aabb
aaabbb
aaabb
aaaabbbb".split.each{|s|
  if s.match(/^(a+)/) and s.match(/^a+b{#{$1.size}}$/)
    puts s
  else
    puts '-'
  end
}

Or one can use regular expression + code:

"ab
xx
aabb
aaabbb
aaabb
aaaabbbb".split.each{|s|
  if s.match(/^(a+)(b+)$/) and $1.size == $2.size
    puts s
  else
    puts '-'
  end
}

What makes anyone think that a single regular expression
has to do all the work?

Todd_Benson · 29 February 2008 19:28

What if you have a div inside a div? Although, the OP said "any"
legitimate html inside a div, there's part of me that begs the
question: which div?

Todd

Florian_Gilcher · 29 February 2008 19:33

This may work on this short snippet. Consider this:

the_text = %q{
<div class="info">
   <div class="nextinfo">
   <h5 >Tagline:</h5>
   Yippee Ki Yay Mo - John 6:27
   </div>
</div>
}

the_text.each_line do |line|
  puts "Within DIV tags: #{line}" if (line=~/<div/)..(line=~/<\/div/)
  puts "Within H5 tags: #{line}" if (line=~/<h5/)..(line=~/<\/h5/)
end

It doesn't see the second </div>, because after the first one it considers _both_ divs closed (which it cannot even determine, as we did not save any state). Second question: which <div> am I in at a certain point? Or, in other words: what's the #innerText of .info, and what's the #innerText of .nextinfo? You won't get far without a stack, and that can be proven [1].

If this is of interest to you, consider reading a book about the theory of computation. It may be hard stuff, but it pays off :). [2]
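
As a rough illustration of the "which <div> am I in?" question, the sketch
below keeps a stack of the currently open divs while walking the same snippet
line by line. The names div_context and DIV_TAG are hypothetical, and the code
assumes well-formed input in which a ">" never appears inside an attribute
value; it is a sketch of the idea, not a replacement for a parser.

```ruby
the_text = %q{
<div class="info">
   <div class="nextinfo">
   <h5 >Tagline:</h5>
   Yippee Ki Yay Mo - John 6:27
   </div>
</div>
}

# Matches an opening or closing div tag; group 1 is "/" for a closing tag,
# group 2 holds the raw attribute text.
DIV_TAG = %r{<(/?)div\b([^>]*)>}i

# Returns [stripped_line, classes_of_open_divs] for every line, where the
# list reflects the state after the tags on that line have been processed.
def div_context(text)
  open_divs = []
  text.each_line.map do |line|
    line.scan(DIV_TAG) do |closing, attrs|
      if closing.empty?
        open_divs.push(attrs[/class="([^"]*)"/, 1] || "(no class)")
      else
        open_divs.pop
      end
    end
    [line.strip, open_divs.dup]
  end
end

div_context(the_text).each do |line, divs|
  puts "[#{divs.join(' > ')}] #{line}" unless line.empty?
end
```

Unlike the flip-flop version, the stack still knows that .info is open after
.nextinfo has been closed, so the final </div> is accounted for.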

Greetings
Florian Gilcher

[1] Up to the reader ;).
[2] Don't feel bad if you didn't and don't consider this as an offense. I know many good programmers that never read any theory. But it certainly isn't bad to know about it.

Mark_Thomas1 · 29 February 2008 21:19

> > All the regex solutions provided will break with the following
> > perfectly valid HTML:

> > <div class="info">
> > <h5 >Tagline:</h5>
> > Yippee Ki Yay Mo - John 6:27
> > </div>

> Easily fixed.

> def find_header header, html
>   html.scan( %r{<div.*?>(.*?)</div\s*>}im ).flatten.
>   each{|s|
>     return $1.strip if s =~ %r{<h5\s*>#{header}</h5\s*>(.*)}im }
>   return nil
> end

Easily broken again.

<div class="info">
<h5 class="header">Tagline:</h5>
Yippee Ki Yay Mo - John 6:27
</div>

The point is, regex-based parsing is fragile, and is provably
incomplete for parsing arbitrarily nested structures like HTML. A real
parser (such as a recursive descent parser) is needed. I use regular
expressions often, but when parsing HTML, XML, or other nested data, I
reach for other tools.
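
To make the fragility concrete, here is a sketch of how the next patch might
look and where it still breaks. It is a hypothetical variant of the
find_header method quoted above, not code from any poster: it accepts
attributes on the heading and escapes the header text with Regexp.escape, and
it assumes the same flat layout as the original example.

```ruby
# Hypothetical patched variant: tolerates attributes on <div> and <h5>
# and escapes the header so that it is matched literally.
def find_header(header, html)
  html.scan(%r{<div\b[^>]*>(.*?)</div\s*>}im).flatten.each do |s|
    return $1.strip if s =~ %r{<h5\b[^>]*>\s*#{Regexp.escape(header)}\s*</h5\s*>(.*)}im
  end
  nil
end

with_attrs = <<~HTML
  <div class="info">
  <h5 class="header">Tagline:</h5>
  Yippee Ki Yay Mo - John 6:27
  </div>
HTML

nested = <<~HTML
  <div class="info">
  <h5>Tagline:</h5>
  Yippee <div class="inner">Ki Yay</div> Mo - John 6:27
  </div>
HTML

puts find_header("Tagline:", with_attrs)  # the attribute is now tolerated
puts find_header("Tagline:", nested)      # cut off at the first </div>:
                                          # Yippee <div class="inner">Ki Yay
```

The second call returns truncated content because the lazy match stops at the
first </div> it meets; no amount of attribute handling fixes that without
tracking nesting.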

> > This is one of many reasons it is a BAD idea to use regexes to parse
> > HTML. Regular expressions are simply not the right tool for the job.

> Who told you that they are not? And why did you take his word for it?

Experience, for one. Until I really understood parsers, I tended to
use regular expressions for everything. I've been using regular
expressions for a LONG time, and I am very comfortable with them. But
parsing HTML was always troublesome.

This has been discussed for years e.g. in Perl circles (PerlMonks,
etc) where it is well known that regexes do not fit nested data.
People with questions asking how to parse HTML with a regex will get
chided, especially with so many good parsers available in Perl. There
are good parsers available in Ruby now too, so people should be
encouraged to use them.

> Does hpricot use regular expressions?

Of course not.

Jari_Williamsson · 29 February 2008 19:36

Todd Benson wrote:

> What if you have a div inside a div? Although, the OP said "any"
> legitimate html inside a div, there's part of me that begs the
> question: which div?

Sure, for real-life HTML with nested tags it'll break. I just wanted to point out that for simple parsing needs (as the example that I replied to) regexps can find both beginning and end tags.

Best regards,

Jari Williamsson

Jari_Williamsson · 29 February 2008 19:41

Florian Gilcher wrote:

> It doesn't see the second </div> as it considers _both_ divs closed.

It considers the first div closed. It never sees the other one.

Best regards,

Jari Williamsson

Florian_Gilcher · 29 February 2008 19:44

> What makes anyone think that a single regular expression
> has to do all the work?

I don't know. But many think one fits. That's why I wrote this explanation: it is something I see almost every day, and I wanted to give some insight to those who are wondering why this is so.
So: your solution does not fit the problem, but thanks for showing that another problem (parsing "a^n b^n" with a Turing-complete language) can indeed be solved.

I also stated this in my last paragraph: you can solve the problem by using regular expressions. But the language of regular expressions by itself is not powerful enough to solve it alone.

Greetings
Florian
