Regexp problem

Joao_Silva · 9 February 2009 11:39

How can I extract:

<td>Traffic left:</td><td
align=right><b><script>document.write(setzeTT(""+Math.ceil(-123313/1000)));</script>
MB</b></td>

I need this number: 123313. I tried to match this in many ways, but I still
have problems with escape characters.

···

--
Posted via http://www.ruby-forum.com/.

Mike_IMAP · 9 February 2009 11:51

Of course that depends upon how general this needs to be. If it will always be the first part of the first parameter to a call to Math.ceil and negated, then:

···

======================================================================
text = <<EOS
<td>Traffic left:</td><td
align=right><b><script>document.write(setzeTT(""+Math.ceil(-123313/1000)));</script>
MB</b></td>
EOS

m = text.match(/Math\.ceil\(\-(\d+)/)
puts m[1] if m

Of course, it seems suspicious that you don't want to pick up the minus, and this seems to take a lot of consistency for granted. For a good answer, you'll need to specify what conditions will always be the same.
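
On the question of the minus sign, here is a small sketch, assuming the page's script really computes Math.ceil(n/1000) as in the posted snippet. It captures the sign together with the digits, so the caller can choose between the signed value, the bare magnitude the original poster asked for, and the figure the page would display. The variable names are illustrative only.

```ruby
text = 'document.write(setzeTT(""+Math.ceil(-123313/1000)));'

# Capture an optional minus sign together with the digits.
m = text.match(/Math\.ceil\((-?\d+)/)

raw       = m ? m[1].to_i : nil        # signed value from the script
magnitude = raw && raw.abs             # the bare number asked for
# Float division followed by ceil mirrors the JavaScript expression.
displayed = raw && (raw / 1000.0).ceil

p raw        # → -123313
p magnitude  # → 123313
p displayed  # → -123
```

Keeping the sign in the capture costs nothing and leaves the decision about dropping it to the calling code.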


W_James · 10 February 2009 08:19

Joao Silva wrote:

How can I extract:

<td>Traffic left:</td><td
align=right><b><script>document.write(setzeTT(""+Math.ceil(-123313/1000)));</script>
MB</b></td>

I need this number: 123313. I tried to match this in many ways, but I
still have problems with escape characters.

list = DATA.read.scan( %r{<td.*?>\s*(.*?)\s*</td>}im ).flatten

list.each_cons(2){|a,b|
  if "Traffic left:" == a and b =~ /Math.ceil\((-?\d+)/
    p $1
  end
}

__END__

<td>NOT TRAFFIC LEFT:</td><td
align=right><b>
<script>document.write(setzeTT(""+Math.ceil(-9999999/1000)));
</script>
MB</b></td>

<td> Traffic left:
</td><td
align=right><b><script>
document.write(setzeTT(""+Math.ceil(-123313/1000)));
</script>
MB</b></td>
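
The same approach as a self-contained sketch, with the sample HTML in a heredoc instead of DATA so it can be pasted into irb. The names `cells` and `found` are mine, not from the post. The idea is to reduce the page to a flat list of cell contents and then look at each cell together with the one that follows it.

```ruby
html = <<~HTML
  <td>NOT TRAFFIC LEFT:</td><td
  align=right><b>
  <script>document.write(setzeTT(""+Math.ceil(-9999999/1000)));
  </script>
  MB</b></td>

  <td> Traffic left:
  </td><td
  align=right><b><script>
  document.write(setzeTT(""+Math.ceil(-123313/1000)));
  </script>
  MB</b></td>
HTML

# Step 1: collect the whitespace-trimmed contents of every <td>,
# in document order (the m flag lets . cross newlines).
cells = html.scan(%r{<td.*?>\s*(.*?)\s*</td>}im).flatten

# Step 2: walk overlapping pairs (cell, next cell) and keep the number
# only when the first cell of the pair is exactly the label.
found = cells.each_cons(2).map { |label, value|
  value[/Math\.ceil\((-?\d+)/, 1] if label == "Traffic left:"
}.compact

p found  # → ["-123313"]
```

Because the label is compared with ==, the "NOT TRAFFIC LEFT:" cell is skipped. The capture keeps the minus sign, as the original pattern does.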

Mark_Thomas · 11 February 2009 17:03

As long as we're being pedagogical, I prefer XPath to all the previous
posted solutions.

* More accommodating to minor changes in the HTML
* Very short (one-liner) and easy to read (IMHO)

require 'nokogiri'
doc = Nokogiri::HTML(html)

puts doc.xpath('//td[contains(.,"Traffic left")]/following-sibling::td//script').to_s.scan(/Math.ceil.-(\d*)/)

Joao_Silva · 9 February 2009 12:49

m = text.match(/Math\.ceil\(\-(\d+)/)

I cannot use this regexp; I need a regexp on this whole phrase
(<td>Traffic left:</td>.....), because the document is full of strings like
this.
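
A minimal sketch of the difference, assuming the page contains several cells of the same shape (the second cell below is made up for illustration). The unscoped pattern picks up every Math.ceil call, while a pattern that first requires the exact label cell picks up only the number that follows it.

```ruby
html = <<~HTML
  <td>NOT TRAFFIC LEFT:</td><td
  align=right><b><script>document.write(setzeTT(""+Math.ceil(-9999999/1000)));</script>
  MB</b></td>
  <td>Traffic left:</td><td
  align=right><b><script>document.write(setzeTT(""+Math.ceil(-123313/1000)));</script>
  MB</b></td>
HTML

# Unscoped: every Math.ceil call on the page matches.
all_numbers = html.scan(/Math\.ceil\(-(\d+)/).flatten

# Scoped: require the exact label cell first, then skip lazily
# (the m flag lets . cross newlines) to the nearest Math.ceil call.
scoped = html[%r{<td>\s*Traffic left:\s*</td>.*?Math\.ceil\(-(\d+)}m, 1]

p all_numbers  # → ["9999999", "123313"]
p scoped       # → "123313"
```

One caveat: the lazy `.*?` will run on into later cells if the cell right after the label has no Math.ceil call, so this is only as reliable as the page layout is regular.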


Rick_DeNatale1 · 10 February 2009 14:19

As 7Stud pointed out, a toolbox with only regular expressions inside is
often a poor choice for dealing with xml/html

Here's a rather verbose and commented program using a combination of hpricot
and a regular expression to do something like what I think you are looking
for:

require 'rubygems'
require 'hpricot'

def get_traffic_left_numbers(string)
  doc = Hpricot(string)
  results = []
  # iterate over all of the td elements in the document
  traffic_lefts = doc.search("td").each do |td1|
    # check to see if the td contents is "Traffic left:"
    if td1.inner_text == "Traffic left:"
      # if yes, get the next sibling
      td2 = td1.next_sibling
      # and then for each script tag inside
      td2.search("script") do | script |
        # get the script_tag text
        script_text = script.inner_text
        # Use a regexp to capture the number
        number = /Math\.ceil\(-?(\d+)/.match(script_text)
        # add the number we found, if any, to the results array
        results << number[1] if number
      end
    end
  end
  results
end

p get_traffic_left_numbers("<td>Traffic left:</td><td
align=right><b><script>document.write(setzeTT(""+Math.ceil(-123313/1000)));</script>
MB</b></td>
<td>NOT TRAFFIC LEFT:</td><td
align=right><b><script>document.write(setzeTT(""+Math.ceil(-9999999/1000)));</script>
MB</b></td>")

When run this outputs:

["123313"]

In other words it produces an array of strings representing the target
numbers in a script tag within a td tag which follows another td tag whose
inner text is "Traffic left:"

HTH


--
Rick DeNatale

Blog: http://talklikeaduck.denhaven2.com/
Twitter: http://twitter.com/RickDeNatale

W_James · 11 February 2009 20:38

What if the cell contains "No Traffic left"?


W_James · 11 February 2009 22:28

As long as we're being pedagogical, I prefer XPath to all the previous
posted solutions.

* More accommodating to minor changes in the HTML
* Very short (one-liner) and easy to read (IMHO)

require 'nokogiri'
doc = Nokogiri::HTML(html)

puts doc.xpath('//

// is quite cryptic.

td[contains(.,

.?

"Traffic left")]/following-

sibling::td//script'

script?

).to_s.scan(/Math.ceil.-(\d*)/)

I'd rather use Ruby.

···

On Feb 11, 11:01 am, Mark Thomas <m...@thomaszone.com> wrote:

Mike_IMAP · 10 February 2009 00:11

If you're only trying to pull out the single number, this REGEX will work for the whole phrase you provided.

One of the things you want to do with a REGEX is to avoid any more detail than is necessary to find what you're looking for. The REGEX does not need to "match" the whole string.


Igor_Pirnovar · 10 February 2009 18:15

Rick Denatale wrote:

As 7Stud pointed out, a toolbox with only regular expressions
inside is often a poor choice for dealing with xml/html

Here's a rather verbose and commented program using a
combination of hpricot and a regular expression to do
something like what I think you are looking for:

require 'rubygems'
require 'hpricot'
. . .

When run this outputs: ["123313"]

In other words it produces an array of strings representing
the target numbers in a script tag within a td tag which
follows another td tag whose inner text is "Traffic left:"

Rick, your solution is swell, and it is probably worth while considering
by someone whose day job is parsing html/xml documents. However, purely
from a language and/or from a programmer's perspective William's
solution is far more appealing, much shorter, easier to understand and
requires virtually no additional learning effort. It nullifies or
"flattens" the comment started out by 7Stud that you also elevated to an
undeserving height.


Mark_Thomas · 11 February 2009 21:04

Then you can use the XPath function starts-with() instead of contains().

···

On Feb 11, 3:36 pm, w_a_x_...@yahoo.com wrote:

On Feb 11, 11:01 am, Mark Thomas <m...@thomaszone.com> wrote:

> As long as we're being pedagogical, I prefer XPath to all the previous
> posted solutions.

> * More accommodating to minor changes in the HTML
> * Very short (one-liner) and easy to read (IMHO)

> require 'nokogiri'
> doc = Nokogiri::HTML(html)

> puts doc.xpath('//td[contains(.,"Traffic left")]/following-sibling::td//script').to_s.scan(/Math.ceil.-(\d*)/)

What if the cell contains "No Traffic left"?

David_A_Black1 · 11 February 2009 22:48

Hi --

···

On Thu, 12 Feb 2009, w_a_x_man@yahoo.com wrote:

On Feb 11, 11:01 am, Mark Thomas <m...@thomaszone.com> wrote:

As long as we're being pedagogical, I prefer XPath to all the previous
posted solutions.

* More accommodating to minor changes in the HTML
* Very short (one-liner) and easy to read (IMHO)

require 'nokogiri'
doc = Nokogiri::HTML(html)

puts doc.xpath('//

// is quite cryptic.

It's standard XPath notation.

David

--
David A. Black / Ruby Power and Light, LLC
Ruby/Rails consulting & training: http://www.rubypal.com
Coming in 2009: The Well-Grounded Rubyist (http://manning.com/black2)

http://www.wishsight.com => Independent, social wishlist management!

Mark_Thomas · 12 February 2009 03:23

I'd rather use Ruby.

Would you use Ruby string functions instead of the regular expression?
You could, but you probably wouldn't want to. XPath is like regular
expressions for XML and HTML. It has a particular syntax but once you
learn it, it's very powerful.

// is quite cryptic.

It's the wildcard in XPath. So '//td' just means the td can be
anywhere in the tree, as opposed to '/td' which would be at the root.
It's no more cryptic than the .* wildcard in regexps.

td[contains(.,"Traffic left")]

The square brackets constrain the td with an expression that tests
whether the text of the current td node (that's what the . means) contains
the string "Traffic left". So this phrase says select the <td> tag(s)
which contain the string.

following-sibling::td//script

This says find the <script> tag(s) under any later sibling <td> (in
document order). Writing td[1] restricts it to the very next <td>.

XPath isn't hard to learn. And it's well worth the investment.

···

On Feb 11, 5:29 pm, w_a_x_...@yahoo.com wrote:

7stud · 10 February 2009 05:12

Mike Cargal wrote:

If you're only trying to pull out the single number, this REGEX will
work for the whole phrase you provided.

The problem is that your regex will also retrieve 9999999 in this html:

<td>NOT TRAFFIC LEFT:</td><td
align=right><b><script>document.write(setzeTT(""+Math.ceil(-9999999/1000)));</script>
MB</b></td>

and the op is trying to tell you that he doesn't want that number.

Parsing html with regex's is a bad strategy.


Rick_DeNatale1 · 10 February 2009 21:46

Rick Denatale wrote:
>
> As 7Stud pointed out, a toolbox with only regular expressions
> inside is often a poor choice for dealing with xml/html
>
> Here's a rather verbose and commented program using a
> combination of hpricot and a regular expression to do
> something like what I think you are looking for:
>
> require 'rubygems'
> require 'hpricot'
> . . .
>
> When run this outputs: ["123313"]
>
> In other words it produces an array of strings representing
> the target numbers in a script tag within a td tag which
> follows another td tag whose inner text is "Traffic left:"

Rick, your solution is swell, and it is probably worth while considering
by someone whose day job is parsing html/xml documents. However, purely
from a language and/or from a programmer's perspective William's
solution is far more appealing,

subjective.

much shorter,

certainly, particularly with my pedagogical comments,

easier to understand and

I'd be quite willing to argue that.

requires virtually no additional learning effort.

Yes, we wouldn't want to expend any unnecessary effort on learning would we.

And by the way to get that to work (in Ruby 1.8) a nuby rubyist would have
to learn that you'd need to include 'enumerable' to get the cons method.

It nullifies or
"flattens" the comment started out by 7Stud that you also elevated to an
undeserving height.

You can treat regular expressions as a Maslovian hammer, but I've had enough
experiences with xml to realize that that hammer is often a very poor tool
for parsing html. I'd rather expend my learning budget in learning how to
apply a tool like Hpricot than to debug my own low-level attempts.

But, as they say, to each his own.


Igor_Pirnovar · 11 February 2009 04:34

Rick Denatale wrote:

Rick, your solution is swell, and it is probably worth while considering
by someone whose day job is parsing html/xml documents. However, purely
from a language and/or from a programmer's perspective William's
solution is far more appealing,

subjective.

much shorter,

certainly, particularly with my pedagogical comments,

and much nicer as well as more elegant, I should add. But more
importantly William's solution is inherently packed with its own
semantics that needs no pedagogue to explain its purpose or meaning!
True, beauty is in the eyes of the beholder, but if you think of all
those engineering accomplishments that defy ageing you will certainly
notice none of them need any pedagogic, aesthetic or any other comments.

Yes, we wouldn't want to expend any unnecessary effort on learning
would we.

No, we most certainly would not, especially when there's absolutely no
need for it! This is why Java is such a drag. The large number of
classes that appear to be relevant to the Java environment itself has
been growing prolifically, to the point that programmers are suffocated
in "alpha.beta.gamma..." notations, never mind the unnecessary clutter
they have to memorize in order to be able to assign semantic value to
each token. You may as well write tons of pedagogic comments for every
line. In the end you cannot see the forest for the trees.
Besides, since when is a long learning curve a desirable attribute?

... work (in Ruby 1.8) a nuby rubyist would have to learn that
you'd need to include 'enumerable' to get the cons method.

What can I say, any language is a constantly evolving thing, but at least
in the case of Ruby's "enumerable" the change represents a shift towards
better quality, which for the user means less unnecessary overhead and a
smaller learning curve. I seriously doubt that nowadays any astute Ruby
newbie seeks to learn Ruby 1.8 while ignoring Ruby 1.9. I'd much rather
say it's just the opposite, precisely because one would try to avoid
learning too much clutter.

I've had enough experiences with xml to realize that that
hammer is often a very poor tool for parsing html. I'd rather
expend my learning budget in learning how to apply a tool like
Hpricot than to debug my own low-level attempts.

Precisely, if your life revolves around xml and html, Hpricot may be the
better way. However, for an occasional brush with a Markup Language my
old Perl book and core Ruby should do just fine.

Cheers,
igor


W_James · 11 February 2009 07:58

Rick DeNatale wrote:

And by the way to get that to work (in Ruby 1.8) a nuby rubyist would
have to learn that you'd need to include 'enumerable' to get the cons
method.

I didn't need to, and I'm using

ruby 1.8.7 (2008-05-31 patchlevel 0) [i386-mswin32]

Rick_DeNatale1 · 11 February 2009 15:45

Yes, I guess I should have said Ruby < 1.8.7

But personally, I don't use or recommend 1.8.7, since it's really neither
fish nor fowl. The backporting of some things from 1.9 feels like it has
caused more problems than it is worth.


Pit_Capitain · 11 February 2009 16:12

Which problems? As I've written in ruby-core, all (but one) of my
1.8.6 code works flawlessly with 1.8.7. I'm interested in why you seem to
have had a different experience.

Regards,
Pit


Rick_DeNatale1 · 11 February 2009 21:18

I'm not alone. I'll refer you to the thread which Gregory Brown just opened
to discuss the problems caused by having 1.8.7 be incompatible with 1.8.6.

