Regex find everything between

Keith_Raymond · 22 August 2011 14:53

So here's the problem:

I have an HTML document that is being handed to me as a string.

example: "<!doctype html>\n<html lang=\"en\">\n <head>\n</head>\n
<body>\n \t<header>\n \t <hgroup>\n \t <h1 class=\"my-class\">My
page Testing</h1>\n<p class=\"my-class icon\">some text here</p>
\t<footer>\n \t <p class=\"fred my-class\">This is my footer
info</p>\n \t</footer>\n </body>\n</html>"

I'm using a regular expression to find all the opening tags of the DOM
elements: <html lang=\"en\">, <head>, <body>, <h1 class=\"my-class\">,
etc. That part works. It is done via the scan() method.

==============================
elements = []
opening_tags = file.scan(/<\w+\s+[^>]*>/)
opening_tags.each do |tag|
  if tag.match(/class=\\"(.*?)editor(.*?)\\"/) # tries to match anything with a class="editor"
    close = get_closing_tag(tag)
      # finds which DOM element it is and returns close tag
      # example if '<p class="my-class">' returns '</p>'
    file.match(/#{tag}(.+)#{close}]/) { |m| elements << m }
      # pushes all matches to elements array
  end
end

=======================================

So I get the opening tags as I should:
<h1 class=\"my-class\"> and <p class=\"fred my-class\">
and I get a proper closing tag for each:
</h1> and </p>
but /#{tag}(.+)#{close}]/ returns nothing.

Output from Rails.logger.info
+++++++++++++++++++++++++++++++++++++++
==== tag ====
"<h1 class=\"my-class\">"
==== close ====
"</p>"
==== /#{tag}(.+)#{close}]/ ====
/<p class="my-class">(.+)<\/p>]/
==== tag ====
"<p class=\"my-class icon\">"
==== close ====
"</p>"
==== /#{tag}(.+)#{close}]/ ====
/<p class="my-class icon">(.+)<\/p>]/
==== tag ====
"<p class=\"fred my-class\">"
==== close ====
"</p>"
==== /#{tag}(.+)#{close}]/ ====
/<p class="fred my-class">(.+)<\/p>]/
======= elements ========
[]

+++++++++++++++++++++++++++++++++++++++

Any help would be appreciated. I'm at my wits' end here. If there is a
completely different, better way to do this, I'm all ears as well.

Thank you in advance.

--
Posted via http://www.ruby-forum.com/.

John-John_Tedro · 22 August 2011 15:03

Try out nokogiri: https://github.com/tenderlove/nokogiri

After you've let it parse your document you can use CSS3 or XPath selectors
to find what you are looking for.

Letting someone else do all the dirty work is a good idea for potentially
dirty HTML.

-- John-John Tedro


Hassan_Schroeder · 22 August 2011 15:11

There is: nokogiri -- it's made for exactly this. Trying to parse XML or
HTML via regex is a path to tears and insanity.

On Mon, Aug 22, 2011 at 7:53 AM, Keith Raymond <raymondke99@gmail.com> wrote:

I have a html document that is being spit out to me as a string.

I'm using regular expression to find ...

If there is a completely better way to do this, I'm all ears as well.

--
Hassan Schroeder ------------------------ hassan.schroeder@gmail.com

twitter: @hassan

7stud · 22 August 2011 22:52

Keith Raymond wrote in post #1017873:

So here's the problem:

I have a html document that is being spit out to me as a string.

I'm using regular expression to find all the opening tags of the dom
elements.

What is your ultimate goal?


7stud · 22 August 2011 23:56

if tag.match(/class=\\"(.*?)editor(.*?)\\"/) # tries to match anything
with a class="editor"

tag = '<div class="myeditor">'

if tag.match(/class=\\"(.*?)editor(.*?)\\"/)
  puts 'yes'
else
  puts 'no'
end

--output:--
no

tag = '<div class="myeditor">'

if tag.match(/class="(.*?)editor(.*?)"/)
  puts 'yes'
else
  puts 'no'
end

--output:--
yes
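
The difference comes from what the string contains versus how it is displayed. A minimal sketch, assuming the tag in the original post was logged with inspect (the escaped quotes in the logger output suggest that); the variable names here are only for illustration:

```ruby
# The \" in the logged tag is an inspect-style escape, not a backslash in the string.
tag = "<p class=\"my-class editor\">"

puts tag               # shows the tag with plain double quotes
p tag.include?("\\")   # false: the string holds no backslash at all

# In a regex literal, \\ matches one literal backslash, so this pattern
# needs a backslash in front of each quote.
escaped_pattern = /class=\\"(.*?)editor(.*?)\\"/
plain_pattern   = /class="(.*?)editor(.*?)"/

p tag.match?(escaped_pattern)  # false
p tag.match?(plain_pattern)    # true
```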

close = get_closing_tag(tag)

but /#{tag}(.+)#{close}]/ returns nothing

Do you really expect anyone to be able to tell you what's wrong there?
How would anyone know what get_closing_tag() returns?
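
That said, the logged patterns hint at two problems that do not depend on get_closing_tag(): each pattern ends in a stray ], so it can only match when a literal ] follows the closing tag, and the tag is interpolated without Regexp.escape. A minimal sketch, assuming the closing tag is '</p>' as in the logger output; the html string and variable names are made up for illustration:

```ruby
html  = "<body>\n<p class=\"fred my-class\">This is my footer\ninfo</p>\n</body>"
tag   = '<p class="fred my-class">'
close = '</p>'   # assumed return value of get_closing_tag(tag)

# Pattern as posted: the trailing ] has to match a literal "]" after </p>.
with_bracket = /#{tag}(.+)#{close}]/
p html.match(with_bracket)   # nil

# Escape the interpolated strings, drop the ], use /m so that . also matches
# newlines, and make the quantifier lazy so it stops at the first closing tag.
fixed = /#{Regexp.escape(tag)}(.+?)#{Regexp.escape(close)}/m
m = html.match(fixed)
p m[1]   # "This is my footer\ninfo"
```

Even the fixed pattern goes wrong as soon as elements of the same type are nested, which is one more reason to hand the job to a parser.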

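require 'nokogiri'

# Parse the file; the block form closes the file handle afterwards.
doc = File.open('html.htm') { |f| Nokogiri::HTML(f) }

# Select every element whose class attribute contains the substring "editor".
doc.xpath('//*[contains(@class,"editor")]').each do |el|
  p [
    el['class'],           # value of the class attribute
    el.children[0].text    # text of the first child node
  ]
end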

--output:--
["editor_greeting", "Hello world"]
["myeditor_fruit", "Apple"]
["editor_name", "Papillon"]

==== html.htm:

<!DOCTYPE html>
<html>
  <head>
    <title>Test</title>
  </head>

<body>
<h1 class='editor_greeting'>Hello world</h1>

<div class="myeditor_fruit">Apple</div>

    <div class="article">
      <div>Not this node.</div>
      <div class="editor_name">Papillon</div>
    </div>

</body>

</html>

===

See the following for the basics of XPath:

http://www.w3schools.com/xpath/default.asp


Keith_Raymond · 23 August 2011 14:23

You guys have just made my week!!! Thank you so much!

Nokogiri works like a charm. Soooo amazing!

I will definitely add this to my list of "must have" gems.

Thank you again.

