Regex find everything between

So here's the problem:

I have an HTML document that is being spit out to me as a string.

example: "<!doctype html>\n<html lang=\"en\">\n <head>\n</head>\n
<body>\n \t<header>\n \t <hgroup>\n \t <h1 class=\"my-class\">My
page Testing</h1>\n<p class=\"my-class icon\">some text here</p>
\t<footer>\n \t <p class=\"fred my-class\">This is my footer
info</p>\n \t</footer>\n </body>\n</html>"

I'm using a regular expression to find all the opening tags of the DOM
elements: <html lang=\"en\">, <head>, <body>, <h1 class=\"my-class\">,
etc., and it's working. This is done via the scan() method.

==============================
elements = []
opening_tags = file.scan(/<\w+\s+[^>]*>/)
opening_tags.each do |tag|
  # tries to match anything with a class="editor"
  if tag.match(/class=\\"(.*?)editor(.*?)\\"/)
    # finds which DOM element it is and returns close tag
    # example if '<p class="my-class">' returns '</p>'
    close = get_closing_tag(tag)
    # pushes all matches to elements array
    file.match(/#{tag}(.+)#{close}]/) { |m| elements << m }
  end
end

=======================================

So I get the opening tags as I should:
  <h1 class=\"my-class\"> and <p class=\"fred my-class\">
and I get a proper closing tag for each:
  </h1> and </p>
but /#{tag}(.+)#{close}]/ returns nothing.

Output from Rails.logger.info
+++++++++++++++++++++++++++++++++++++++
==== tag ====
"<h1 class=\"my-class\">"
==== close ====
"</p>"
==== /#{tag}(.+)#{close}]/ ====
/<p class="my-class">(.+)<\/p>]/
==== tag ====
"<p class=\"my-class icon\">"
==== close ====
"</p>"
==== /#{tag}(.+)#{close}]/ ====
/<p class="my-class icon">(.+)<\/p>]/
==== tag ====
"<p class=\"fred my-class\">"
==== close ====
"</p>"
==== /#{tag}(.+)#{close}]/ ====
/<p class="fred my-class">(.+)<\/p>]/
======= elements ========
[]

+++++++++++++++++++++++++++++++++++++++

Any help would be appreciated. I'm at my wits' end here. If there is a
better way to do this altogether, I'm all ears as well.

Thank you in advance.

--
Posted via http://www.ruby-forum.com/.

Try out nokogiri: https://github.com/tenderlove/nokogiri

After you've let it parse your document you can use css3 or xpath selectors
to find what you are looking for.

Letting someone else do all the dirty work is a good idea for potentially
dirty html.

-- John-John Tedro

On Mon, Aug 22, 2011 at 4:53 PM, Keith Raymond <raymondke99@gmail.com> wrote:

So here's the problem:

I have a html document that is being spit out to me as a string.


There is: nokogiri -- it's made for exactly this. Trying to parse XML or
HTML via regex is a path to tears and insanity. :)

On Mon, Aug 22, 2011 at 7:53 AM, Keith Raymond <raymondke99@gmail.com> wrote:

I have a html document that is being spit out to me as a string.

I'm using regular expression to find ...

If there is a completely better way to do this, I'm all ears as well.

--
Hassan Schroeder ------------------------ hassan.schroeder@gmail.com

twitter: @hassan

Keith Raymond wrote in post #1017873:

So here's the problem:

I have a html document that is being spit out to me as a string.

I'm using regular expression to find all the opening tags of the dom
elements.

What is your ultimate goal?

--
Posted via http://www.ruby-forum.com/.

if tag.match(/class=\\"(.*?)editor(.*?)\\"/) # tries to match anything with a class="editor"

tag = '<div class="myeditor">'

if tag.match(/class=\\"(.*?)editor(.*?)\\"/)
  puts 'yes'
else
  puts 'no'
end

--output:--
no

tag = '<div class="myeditor">'

if tag.match(/class="(.*?)editor(.*?)"/)
  puts 'yes'
else
  puts 'no'
end

--output:--
yes
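
The likely reason for the difference, sketched below: the backslashes in the logged tags come from how Ruby displays a string (inspect), not from the string itself, so a pattern that demands a literal backslash before each quote never matches. The variable names here are only for illustration.

```ruby
tag = '<div class="myeditor">'

# inspect (what p and many loggers print) escapes the quotes for display...
shown = tag.inspect
# ...but the string itself contains no backslash characters.
has_backslash = tag.include?('\\')

# \\" in a regex literal means: a literal backslash followed by a quote.
escaped_pattern = /class=\\"(.*?)editor(.*?)\\"/
plain_pattern   = /class="(.*?)editor(.*?)"/

escaped_hit = tag.match?(escaped_pattern)
plain_hit   = tag.match?(plain_pattern)

puts shown
p has_backslash
p escaped_hit
p plain_hit
```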

close = get_closing_tag(tag)

but /#{tag}(.+)#{close}]/ returns nothing

Do you really expect anyone to be able to tell you what's wrong there?
How would anyone know what get_closing_tag() returns?
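
That said, one thing is visible in the logged patterns regardless of what get_closing_tag() does: the stray "]" after the closing tag has to match a literal "]" in the document. The sketch below assumes the tag and close values shown in the original log and uses a shortened, hypothetical input string. Regexp.escape and the lazy quantifier are suggested hardening, not something from the original code.

```ruby
# Hypothetical input, shortened from the string in the original post.
html = "<h1 class=\"my-class\">My page Testing</h1>\n" \
       "<p class=\"my-class icon\">some text here</p>"

tag   = '<p class="my-class icon">'
close = '</p>'

# As posted: the trailing "]" must match a literal "]" in the input.
with_bracket = Regexp.new("#{tag}(.+)#{close}]")

# Without the stray "]". Regexp.escape guards against regex metacharacters
# in the tag, ".+?" stops at the first closing tag, and /m lets "." cross
# newlines.
fixed = /#{Regexp.escape(tag)}(.+?)#{Regexp.escape(close)}/m

bracket_match = html.match(with_bracket)
fixed_match   = html.match(fixed)

p bracket_match
p fixed_match[1]
```

Even with the pattern repaired, nested or repeated tags will still trip up this approach, which is why a parser is the safer route.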

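require 'nokogiri'

# File.read avoids leaving an open file handle behind.
doc = Nokogiri::HTML(File.read('html.htm'))

# Select every element whose class attribute contains the substring "editor".
doc.xpath('//*[contains(@class,"editor")]').each do |el|
  p [
      el['class'],          # value of the class attribute
      el.children[0].text   # text of the element's first child node
  ]
end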

--output:--
["editor_greeting", "Hello world"]
["myeditor_fruit", "Apple"]
["editor_name", "Papillon"]

==== html.htm:

<!DOCTYPE html>
<html>
  <head>
    <title>Test</title>
  </head>

  <body>
    <h1 class='editor_greeting'>Hello world</h1>

    <div class="myeditor_fruit">Apple</div>

    <div class="article">
      <div>Not this node.</div>
      <div class="editor_name">Papillon</div>
    </div>

  </body>

</html>

===

See the following for the basics of xpath:

http://www.w3schools.com/xpath/default.asp

--
Posted via http://www.ruby-forum.com/.

You guys have just made my week!!! Thank you so much!

Nokogiri works like a charm. Soooo amazing!

I will definitely add this to my list of "must have" gems.

Thank you again.

--
Posted via http://www.ruby-forum.com/.