RegExp problem

Jf_Rejza · 19 November 2008 09:55

Hi,

I would like to know how to extract a substring delimited by two identical
characters from another string (of any length).

For example, given string="rzerze@foo@rezrzgrtez", how do I get (or match) the "foo" string?

Thanks for your help.

--
Posted via http://www.ruby-forum.com/.

Jesus_Gabriel_y_Gala · 19 November 2008 10:17

irb(main):016:0> string="rzerze@foo@rezrzgrtez"
=> "rzerze@foo@rezrzgrtez"
irb(main):017:0> (string.match /@(.*?)@/)[1]
=> "foo"

I don't quite understand the rest of your requirement. Do you mean delimited
by a given string, i.e. the @ above is a variable, or by two occurrences of any
one of the characters present in a string? If it's the former:

irb(main):018:0> delimiter = "@"
=> "@"
irb(main):019:0> (string.match /#{delimiter}(.*?)#{delimiter}/)[1]
=> "foo"
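
As a side note, match returns nil when the delimiters are absent, so indexing its result with [1] raises NoMethodError. A minimal sketch of a nil-safe variant, using String#[] with a regexp and a capture index (the second input is made up for illustration):

```ruby
string = "rzerze@foo@rezrzgrtez"

# String#[] with a regexp and a group index returns that capture,
# or nil when the pattern does not match at all.
found = string[/@(.*?)@/, 1]

# Hypothetical input without any "@" delimiters.
missing = "no delimiters here"[/@(.*?)@/, 1]

p found    # prints "foo"
p missing  # prints nil
```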

If it's the latter, something like this might help:

irb(main):024:0> re = Regexp.new("([#{delimiter}])(.*)?\\1")
=> /([@abcde])(.*)?\1/
irb(main):025:0> string.match(re)[2]
=> "rze@foo@rezrzgrt"

It found the first 'e' as the delimiter, but I don't know why it took the
last 'e' as the closing delimiter, since I used a non-greedy
group for the middle part. Any ideas, anyone?

Hope this helps,

Jesus.

Ken_Bloom · 19 November 2008 14:36

string.split(/@/)[1]

--Ken
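
Note that split does not check for a closing delimiter. A small sketch of where it agrees with the regexp answers and where it differs; the "only@one" and "none" inputs are hypothetical:

```ruby
delimited = "rzerze@foo@rezrzgrtez".split(/@/)[1]

# Only one "@": split still returns the text after it, although nothing
# is enclosed between two delimiters here.
one_delimiter = "only@one".split(/@/)[1]

# No "@" at all: the array has a single element, so index 1 is nil.
no_delimiter = "none".split(/@/)[1]

# For comparison, a pattern that requires both delimiters.
strict = "only@one"[/@([^@]*)@/, 1]

p delimited      # prints "foo"
p one_delimiter  # prints "one"
p no_delimiter   # prints nil
p strict         # prints nil
```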

--
Chanoch (Ken) Bloom. PhD candidate. Linguistic Cognition Laboratory.
Department of Computer Science. Illinois Institute of Technology.
http://www.iit.edu/~kbloom1/

Florian_Gilcher · 19 November 2008 10:58

Actually the correct regexp is:

delimiter = Regexp.escape(delimiter)
/#{delimiter}[^#{delimiter}]*#{delimiter}/

Read: The delimiter - an unspecified number of non-delimiter-characters - the delimiter.

Otherwise, you rely too heavily on the behaviour of the library when it comes to the dot.
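
The Regexp.escape call matters as soon as the delimiter is a regexp metacharacter. A minimal sketch, assuming a hypothetical "|" delimiter and a capture group added around the middle part so the content can be read back:

```ruby
string = "left|foo|right"   # hypothetical input
raw = "|"

# Without escaping, "|" is read as alternation, so the pattern means
# "empty OR ([^|]*) OR empty" and matches the empty string at position 0.
unescaped = /#{raw}([^#{raw}]*)#{raw}/
without_escape = string[unescaped, 1]

# With escaping, "|" is a literal character outside and inside the class.
esc = Regexp.escape(raw)
escaped = /#{esc}([^#{esc}]*)#{esc}/
with_escape = string[escaped, 1]

p without_escape  # prints nil
p with_escape     # prints "foo"
```

The negated character class only makes sense for a single-character delimiter; a multi-character delimiter would need a different pattern.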

Regards,
Florian Gilcher

Jesus_Gabriel_y_Gala · 19 November 2008 11:28

Hi Florian,

I don't get what you mean by "too heavily rely". You are
always relying on the behaviour of the regexp library for
everything, so if you know how the dot, the * and the ? work in Ruby,
why is your regexp better than the proposed one?

Anyway, I have realized I had a mistake in my regexp (I had the ?
outside of the group, and not next to the *). So fixing that:

irb(main):034:0> string="rzerze@foo@rezrzgrtez"
=> "rzerze@foo@rezrzgrtez"
irb(main):035:0> re = Regexp.new("([#{delimiter}])(.*?)\\1")
=> /([abcde])(.*?)\1/
irb(main):036:0> string.match(re)[2]
=> "rz"

Now it works as I intended. With your version:

irb(main):037:0> re = /([#{delimiter}])([^#{delimiter}]*)\1/
=> /([abcde])([^abcde]*)\1/
irb(main):038:0> string.match(re)[2]
=> "rz"

That works too, so I would like to know why one would be preferred over the other.
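
For reference, the two placements of the ? can be compared side by side. This is a small sketch that assumes the "@abcde" character class shown in the first transcript:

```ruby
string = "rzerze@foo@rezrzgrtez"

# (.*)? is a greedy group made optional: .* still grabs as much as it can,
# so the backreference ends up matching the LAST possible delimiter.
greedy_optional = /([@abcde])(.*)?\1/

# (.*?) makes the star itself lazy, so the backreference matches the
# NEXT occurrence of the same delimiter character.
lazy = /([@abcde])(.*?)\1/

greedy_result = string.match(greedy_optional)[2]
lazy_result   = string.match(lazy)[2]

p greedy_result  # prints "rze@foo@rezrzgrt"
p lazy_result    # prints "rz"
```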

Jesus.

Florian_Gilcher · 19 November 2008 12:01

Hi Jesus,

Because you can run into problems when you have input like this:

"foo@fooo@bar@batz@foo"

You rely on the behaviour of the dot. Meaning: yours matches

fooo@bar@batz

Thus, it ignores all inner delimiters.
In any case, mine matches:

[["fooo"],["batz"]] #String#scan output

Depending on how the dot is treated (greediness etc.), yours
could also have other interpretations (like mine) while mine is clearer in
that respect.

In a simple case with only 2 delimiters, it doesn't matter.

Actually, mine wasn't correct either; I also forgot the capturing parentheses:

/#{delimiter}([^#{delimiter}]*)#{delimiter}/

For better readability, here is the version with delimiter = @:

/@([^@]*)@/
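
The variants discussed so far can be compared directly with String#scan. This is only a sketch: the first input is the example string above, while the input containing a newline is a hypothetical case showing where the dot and the negated class differ:

```ruby
text = "foo@fooo@bar@batz@foo"

greedy  = text.scan(/@(.*)@/)     # greedy dot runs to the last "@"
lazy    = text.scan(/@(.*?)@/)    # lazy dot stops at the next "@"
negated = text.scan(/@([^@]*)@/)  # the class can never cross an "@"

p greedy   # prints [["fooo@bar@batz"]]
p lazy     # prints [["fooo"], ["batz"]]
p negated  # prints [["fooo"], ["batz"]]

# Hypothetical input with a newline between the delimiters. Without the
# /m option the dot does not match "\n", the negated class does.
multiline = "x@a\nb@y"
dot_result   = multiline.scan(/@(.*?)@/)
class_result = multiline.scan(/@([^@]*)@/)

p dot_result    # prints []
p class_result  # prints [["a\nb"]]
```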

Regards,
Florian Gilcher

Jesus_Gabriel_y_Gala · 19 November 2008 13:18

Because you could get problems when you have things like this:

"foo@fooo@bar@batz@foo"

You rely on the behaviour of the dot. Meaning: yours matches

fooo@bar@batz

Thus, it ignores all inner delimiters.

Well, with the correction, it doesn't anymore:

irb(main):071:0> "foo@fooo@bar@batz@foo".match(/([#{delimiter}])(.*?)\1/)[2]
=> "fooo"

In any case, mine matches:

[["fooo"],["batz"]] #String#scan output

irb(main):072:0> "foo@fooo@bar@batz@foo".scan(/([#{delimiter}])(.*?)\1/)
=> [["@", "fooo"], ["@", "batz"]]

Depending on how the dot is treated (greediness etc.), yours
could also have other interpretations (like mine) while mine is clearer in
that respect.

Well, obviously this only works as expected if you use the non-greedy version.
But I understand what you mean about being clearer, more explicit.

In a simple case with only 2 delimiters, it doesn't matter.

With more delimiters present in the string, mine works too.

Thanks,

Jesus.

Einar_Boson · 19 November 2008 15:19

Now that we are on the topic of regular expressions, I have a question about the Ruby implementation. As I posted earlier, I needed to parse something that looks like this:

- activity
+ name
+ picture

- area
+ name
+ picture

- activities_in_area
+ activity_id
+ area_id

etc...

The last time I wrote complicated regexps seems to have been in C# or possibly Java. So I tried to match the whole thing with
/(\- (\w*)\s*?\n([\t ]+\+ (\w+)\s*(\:\s*(\w*))?\s*?\n)+\s*)+/
and then I was going to extract the captured data, but it isn't available: all nested groups have captured only their latest match. Is there no regexp library for Ruby that can handle nested groups and save all the captures? I solved it with nested scans instead, and I have to admit that it is more readable, so I'm not sure what exactly I want with this message, except to ask about the design choices involved. Why don't we want proper captures?

    table_name = /\- (\w*)\s*?\n/
    field_name = /(\s+\+ (\w+)\s*(\:\s*(\w*))?\n)/
    doc.scan /#{table_name}(#{field_name}+)/ do |tablename, fields|
      fields.scan field_name do |junk, fieldname, junk2, type|
        # here I can do what I want
      end
    end
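
A runnable sketch of the same nested-scan idea, for anyone who wants to try it. The sample document, its indentation before each "+" (which the field_name pattern requires), the ": image" type and the schema hash are all assumptions made for illustration:

```ruby
# Hypothetical input; the indentation and the ": image" type are assumed.
doc = <<~TEXT
  - activity
    + name
    + picture : image

  - area
    + name
TEXT

table_name = /\- (\w*)\s*?\n/
field_name = /(\s+\+ (\w+)\s*(\:\s*(\w*))?\n)/

# One big match: group 4 (the field name) matched twice for the first
# table, but only the last repetition is kept.
last_only = doc.match(/#{table_name}(#{field_name}+)/)[4]

# Nested scans: the outer scan yields each table with its raw field block,
# the inner scan walks that block field by field.
schema = {}
doc.scan(/#{table_name}(#{field_name}+)/) do |tablename, fields|
  schema[tablename] = fields.scan(field_name).map do |_whole, fieldname, _sep, type|
    [fieldname, type]
  end
end

p last_only  # prints "picture"
p schema
```

The resulting hash maps each table name to a list of [field, type] pairs, with nil for fields that have no type.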

And on the topic of pattern matching, can you recommend any good library for parser generation in Ruby? I want to write a grammar and get an AST.

einarmagnus
