Regex question: this should be easy but doesn't work as I expect

Hi all,

-----Code------

re = [
  /(one).+?(three).+?(five)/,
  /(one).+?(three)?.+?(five)/,
  /(one).+?(three|).+?(five)/,
  /(one).+(three|).+?(five)/
]

re.each_with_index do |r, idx|
  puts idx
  p "one two three four five".scan(r)
  p "one two four five".scan(r)
end

-----Result---------

0
[["one", "three", "five"]]
[]
1
[["one", nil, "five"]]
[["one", nil, "five"]]
2
[["one", "", "five"]]
[["one", "", "five"]]
3
[["one", "", "five"]]
[["one", "", "five"]]

-----------------

All regexes failed my expectation.

What I want is

"one two three four five" #=> [["one", "three", "five"]]
"one two four five" #=> [["one", nil, "five"]]

In short, in the string, "three" might or might not exist.
What regex can match for both?

Thanks.

Sam

Sam Kong wrote:

What I want is

"one two three four five" #=> [["one", "three", "five"]]
"one two four five" #=> [["one", nil, "five"]]

In short, in the string, "three" might or might not exist.
What regex can match for both?

Thanks.

Sam

  /(one) two (?:(three) )?four (five)/

Sam Kong wrote:

What I want is

"one two three four five" #=> [["one", "three", "five"]]
"one two four five" #=> [["one", nil, "five"]]

In short, in the string, "three" might or might not exist.
What regex can match for both?

/one|three|five/ can. Although, its result is not exactly in the form
you want:

"one two three four five".scan /one|three|five/

=> ["one", "three", "five"]

"one two four five".scan /one|three|five/

=> ["one", "five"]

[Sam Kong <sam.s.kong@gmail.com>, 2006-12-20 20.10 CET]

What I want is

"one two three four five" #=> [["one", "three", "five"]]
"one two four five" #=> [["one", nil, "five"]]

In short, in the string, "three" might or might not exist.
What regex can match for both?

Hi. The problem is that you can very easily NOT match "three" even if it's
there. I mean, if you have
   /1.*3?.*5/ =~ '12345'

the engine can succeed matching the 1 at the beginning, the 5 at the end,
and trying to match the 3 where the 4 is... and failing, but since it's
optional, the overall match succeeds.

I think you should try it in two steps: first, try to match with the 3; if
that fails, without the "3". Something like:

/(?:(1).*(3)|(1)).*(5)/

(The '1' will come either on the first or third array position, you'll have
to take care of that.)
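
A minimal sketch of both points above, using the same toy strings. The constant names and the `normalize` helper are hypothetical, added only for illustration; the helper folds the two possible positions of "1" into one slot so the result always has three elements.

```ruby
# The pattern from the explanation above, with capture groups added:
# the first .* is greedy, so the optional (3)? ends up matching nothing.
NAIVE = /(1).*(3)?.*(5)/

# The two-branch suggestion: try the branch containing 3 first, then fall back.
TWO_BRANCH = /(?:(1).*(3)|(1)).*(5)/

# Hypothetical helper: "1" is captured in group 1 or group 3, never both.
def normalize(match)
  return nil unless match
  a, b, a_alt, c = match.captures
  [a || a_alt, b, c]
end

p NAIVE.match('12345').captures          # => ["1", nil, "5"]  (3 is there, but not captured)
p normalize(TWO_BRANCH.match('12345'))   # => ["1", "3", "5"]
p normalize(TWO_BRANCH.match('1245'))    # => ["1", nil, "5"]
```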

Maybe there is a simpler solution, but it doesn't come to my mind.

Good luck.

[snip]

"one two three four five" #=> [["one", "three", "five"]]
"one two four five" #=> [["one", nil, "five"]]

[snip]

how about?

irb(main):001:0> r = /(one) (?: (.*?three) | ((?:.(?!>three))*) ) *? (five)/x
=> /(one) (?: (.*?three) | ((?:.(?!>three))*) ) *? (five)/x
irb(main):002:0> "one two three four five".scan(r)
=> [["one", " two three", " four ", "five"]]
irb(main):003:0> "one two four five".scan(r)
=> [["one", nil, " two four ", "five"]]

On 12/20/06, Sam Kong <sam.s.kong@gmail.com> wrote:

--
Simon Strandgaard
http://opcoders.com/

Hi William,

William James wrote:

  /(one) two (?:(three) )?four (five)/

I simplified the actual problem, and I guess the simplification
did not represent it well.

I was parsing html source into price, image, description, etc.
The image is sometimes missing.

In the example, let's assume that "two" and "four" are arbitrary text.
So the text might be "...one...three...five" where "..." means some
arbitrary text.
If "three" is missing, it will be "...one.....five...".

Can you reconsider the problem please?

Sam

Hi Carlos,

Carlos wrote:

Hi. The problem is that you can very easily NOT match "three" even if it's
there. I mean, if you have
   /1.*3?.*5/ =~ '12345'

the engine can succeed matching the 1 at the beginning, the 5 at the end,
and trying to match the 3 where the 4 is... and failing, but since it's
optional, the overall match succeeds.

I think you should try it in two steps: first, try to match with the 3; if
that fails, without the "3". Something like:

/(?:(1).*(3)|(1)).*(5)/

(The '1' will come either on the first or third array position, you'll have
to take care of that.)

Yes, you understand exactly what my problem is.
Actually, I had guessed as much, even though I couldn't explain it as
well as you did.
The solution I found was to use two regexes.
First, I try to find a match assuming "three" is there.
If that fails, I try to find a match without "three".
This solved my problem.
But I wanted to know whether there's a one-shot solution.
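
One possible one-shot pattern, offered as a sketch rather than a definitive answer: put the lazy skip inside the optional group, so the engine has to look for "three" before it is allowed to give up on it. The constant name `ONE_SHOT` is hypothetical, and the pattern assumes "five" never appears between "one" and "three" within a record; the `(?!five)` guard is what stops the search from running on into the next record.

```ruby
# Assumption: within one record, "five" never occurs between "one" and "three".
# (?:(?!five).)+? skips text lazily but refuses to step over a "five",
# so a "three" belonging to a later record is never picked up.
ONE_SHOT = /(one)(?:(?:(?!five).)+?(three))?.+?(five)/m

p "one two three four five".scan(ONE_SHOT)  # => [["one", "three", "five"]]
p "one two four five".scan(ONE_SHOT)        # => [["one", nil, "five"]]

# Two records in one string: only the second has a "three".
p "one two five one three five".scan(ONE_SHOT)
# => [["one", nil, "five"], ["one", "three", "five"]]
```

The group positions stay fixed, so no post-processing is needed; the cost is that the pattern is harder to read than the two-pass version.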

This is the actual problem, just in case someone wants to know.

html = <<END
<tr id="row2_210819526">
    <td class="year">
        <h5>2004 Used</h5></td>
    <td class="carlink"><h5>

            <a name="210819526" href="210819526.html">BMW 325Ci
            Coupe</a><br />

        </h5></td>

    <td class="mileage">

        <span class="body20">38,604<br /></span><span class="body30">Mileage</span></td>
    <td class="price">
        <span class="body20">

                    $24,995

            <br />
        </span>

            <span class="body30">Price</span>
        </td>

    <td class="distanceFromZip">
        <div class="zip">
            <span class="body20">0 mi<br /></span><span
class="body30">from ZIP</span>
        </div></td>

        <td class="productTileCell" rowspan="2" valign="top">

            <div class="srlProductContainer">

            </div>

        </td>

</tr>
<tr id="row3_210819526">
    <td class="left">

                <a href=210819526.html><img
src="http://images.autotrader.com/images/2006/10/16/210/819/1092478286.210819526.IM1.MAIN.60x45_A.60x45.jpg"
border="0" bordercolor="#000000" width="60" height="45"></a>&nbsp;

                <div class="body40" style="padding-bottom:3px">
                        <img
src="Autotrader - page unavailable;
alt="Actual Photo Available" width="17" height="17" border="0"
/>&nbsp;9 Photos
                    <br />
                </div>

        <img src="Autotrader - page unavailable;
width="60" height="1" /></td>
    <td class="center" colspan="2">
        <div class="centerinfo">

                    <p class="color body20">Color - Mystic Blue
Metallic</p>

                <p class="description">Dark Blue/Beige, Premium Pkg,
Xenon Light, Single Compact Disc, Dual Power Seats, Memory Seat, Still
under Free BMW Maintenance and 4yr/50k Factory...</p>

                    <p class="vin">VIN WBABV13454JT20104</p>

                <div class="body40" style="padding-top:5px;"><a
name="210819526" href="210819526.html">View Car Details</a><br /></div>

        </div></td>
    <td>&nbsp;</td>
    <td valign="top" class="right body30">

        <p class="dealername">

                <a name="210819526" href="210819526.html">null</a>
                <br />

        </p>

        <br />

    </td>
</tr>
END

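def parse_row(row)
  # Each regex literal must stay on one line: without the /x flag, a line
  # break inside the literal is matched as a literal newline.
  # First pass: require the image URL (sixth group).
  m = row.scan(/.+?<h5>(\d{4}) Used<\/h5>.+?<h5>.+?<a name=\"\d+\" href=\"(\d+\.html)\">(.+?)<\/a><br \/>.+?<\/h5>.+?<span class=\"body20\">([0-9,]+)<br \/><\/span><span class=\"body30\">Mileage<\/span>.+?(\$[0-9,]+).+?(http:\/\/[^\"]+?\.jpg).+?Color - (.+?)<\/p>/m)
  if m[0].nil?
    # Second pass: the image group is optional, so it comes back as nil.
    m = row.scan(/.+?<h5>(\d{4}) Used<\/h5>.+?<h5>.+?<a name=\"\d+\" href=\"(\d+\.html)\">(.+?)<\/a><br \/>.+?<\/h5>.+?<span class=\"body20\">([0-9,]+)<br \/><\/span><span class=\"body30\">Mileage<\/span>.+?(\$[0-9,]+).+?(http:\/\/[^\"]+?\.jpg)?.+?Color - (.+?)<\/p>/m)
  end
  # First match as an array of captures, or nil if neither pass matched.
  m[0]
end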

p parse_row(html)

Sorry about the messy code.

Thanks.

Sam

Sam Kong wrote:

Hi William,

William James wrote:

> /(one) two (?:(three) )?four (five)/

I simplified the actual problem, and I guess the simplification
did not represent it well.

I was parsing html source into price, image, description, etc.
The image is sometimes missing.

In the example, let's assume that "two" and "four" are arbitrary text.
So the text might be "...one...three...five" where "..." means some
arbitrary text.
If "three" is missing, it will be "...one.....five...".

Can you reconsider the problem please?

Sam

This is prolix, but it works:

a,b,c = 'one', 'three', 'five'

[
  "one two three four five",
  "one two four five"
].each{|s|
  if s =~ /#{a} (.+ )?#{c}/
    if s =~ / #{b} /
      p [a,b,c]
    else
      p [a,nil,c]
    end
  end
}