A regex

Alexandru_Popescu · 27 October 2006 13:23

Hi!

I have a string in the following form: 2006%2F10%2Fasdfasdf (or more
generic: any_characters+%2Fany_characters+%2Frest_of_it). I am
wondering if I can retrieve the groups 2006, 10, asdfasdf thru a regex
(I confess I couldn't figure it out so far ).

many thanks,

./alex

···

--
.w( the_mindstorm )p.

David_A_Black3 · 27 October 2006 13:27

Hi --

···

On Fri, 27 Oct 2006, Alexandru Popescu wrote:

Hi!

I have a string in the following form: 2006%2F10%2Fasdfasdf (or more
generic: any_characters+%2Fany_characters+%2Frest_of_it). I am
wondering if I can retrieve the groups 2006, 10, asdfasdf thru a regex
(I confess I couldn't figure it out so far ).

Try this:

str.split("%2F")

to get all the other stuff in an array.

David

--
David A. Black | dblack@wobblini.net
Author of "Ruby for Rails" [1] | Ruby/Rails training & consultancy [3]
DABlog (DAB's Weblog) [2] | Co-director, Ruby Central, Inc. [4]
[1] Ruby for Rails | [3] http://www.rubypowerandlight.com
[2] http://dablog.rubypal.com | [4] http://www.rubycentral.org

Thomas_Adam · 27 October 2006 13:28

[thomas@debian ~]% irb
irb(main):001:0> a="2006%2F10%2Fasdfasdf".split('%2F')
=> ["2006", "10", "asdfasdf"]
irb(main):002:0>

-- Thomas Adam

···

On Fri, 27 Oct 2006 22:23:42 +0900 "Alexandru Popescu" <the.mindstorm.mailinglist@gmail.com> wrote:

Hi!

I have a string in the following form: 2006%2F10%2Fasdfasdf (or more
generic: any_characters+%2Fany_characters+%2Frest_of_it). I am
wondering if I can retrieve the groups 2006, 10, asdfasdf thru a regex
(I confess I couldn't figure it out so far ).

many thanks,

./alex
--
.w( the_mindstorm )p.

Chris_Gernon · 27 October 2006 14:12

Alexandru Popescu wrote:

I have a string in the following form: 2006%2F10%2Fasdfasdf (or more
generic: any_characters+%2Fany_characters+%2Frest_of_it). I am
wondering if I can retrieve the groups 2006, 10, asdfasdf thru a regex

String#split would be easier than using a regex in this case:

irb(main):001:0> '2006%2F10%2Fasdfasdf'.split('%2F')
=> ["2006", "10", "asdfasdf"]

···

--
Posted via http://www.ruby-forum.com/.

Jon_Lim · 27 October 2006 14:39

require 'cgi'
s = "2006%2F10%2Fasdfasdf"
# URL-decode first ("%2F" becomes "/"), then split on the slash
year, month, other = CGI.unescape(s).split(/\//)

···

On 27/10/06, Alexandru Popescu <the.mindstorm.mailinglist@gmail.com> wrote:

Hi!

I have a string in the following form: 2006%2F10%2Fasdfasdf (or more
generic: any_characters+%2Fany_characters+%2Frest_of_it). I am
wondering if I can retrieve the groups 2006, 10, asdfasdf thru a regex
(I confess I couldn't figure it out so far ).

many thanks,

./alex
--
.w( the_mindstorm )p.

--

Patrick_Hurley1 · 27 October 2006 15:44

I am pretty sure I am missing something from your requirements...

"some character%2Fmore stuff%2Fthe rest...".match(/^(.*?)%2F(.*?)%2F(.*)$/)
puts $1
puts $2
puts $3

pth

···

On 10/27/06, Alexandru Popescu <the.mindstorm.mailinglist@gmail.com> wrote:

Hi!

I have a string in the following form: 2006%2F10%2Fasdfasdf (or more
generic: any_characters+%2Fany_characters+%2Frest_of_it). I am
wondering if I can retrieve the groups 2006, 10, asdfasdf thru a regex
(I confess I couldn't figure it out so far ).

many thanks,

./alex
--
.w( the_mindstorm )p.

Alexandru_Popescu · 27 October 2006 14:24

Thanks for all suggestions, but the requirement is to be done thru
regex only :-). I knew how to do it with split, but I need to do it
with regexps only.

./alex

···

On 10/27/06, Chris Gernon <kabigon@gmail.com> wrote:

Alexandru Popescu wrote:
> I have a string in the following form: 2006%2F10%2Fasdfasdf (or more
> generic: any_characters+%2Fany_characters+%2Frest_of_it). I am
> wondering if I can retrieve the groups 2006, 10, asdfasdf thru a regex

String#split would be easier than using a regex in this case:

irb(main):001:0> '2006%2F10%2Fasdfasdf'.split('%2F')
=> ["2006", "10", "asdfasdf"]

--
.w( the_mindstorm )p.

--
Posted via http://www.ruby-forum.com/.

Chris_Gernon · 27 October 2006 14:39

Alexandru Popescu wrote:

Thanks for all suggestions, but the requirement is to be done thru
regex only :-). I knew how to do it with split, but I need to do it
with regexps only.

How about just:

irb(main):001:0> match = '2006%2F10%2Fasdfasdf'.match /^(.*)%2F(.*)%2F(.*)$/
=> #<MatchData:0x1cf404>
irb(main):002:0> match[1]
=> "2006"
irb(main):003:0> match[2]
=> "10"
irb(main):004:0> match[3]
=> "asdfasdf"

···

--
Posted via http://www.ruby-forum.com/.

David_A_Black3 · 27 October 2006 14:39

Hi --

···

On Fri, 27 Oct 2006, Alexandru Popescu wrote:

On 10/27/06, Chris Gernon <kabigon@gmail.com> wrote:

Alexandru Popescu wrote:
> I have a string in the following form: 2006%2F10%2Fasdfasdf (or more
> generic: any_characters+%2Fany_characters+%2Frest_of_it). I am
> wondering if I can retrieve the groups 2006, 10, asdfasdf thru a regex

String#split would be easier than using a regex in this case:

irb(main):001:0> '2006%2F10%2Fasdfasdf'.split('%2F')
=> ["2006", "10", "asdfasdf"]

Thanks for all suggestions, but the requirement is to be done thru
regex only :-). I knew how to do it with split, but I need to do it
with regexps only.

Regexes alone don't do anything other than specify a pattern. You
need to *use* a regular expression in some operation (like split) to
get a result.
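
As a rough sketch of that point (the variable names are illustrative and not from the thread), the same kind of pattern can be handed to several different operations, each of which produces the three pieces:

```ruby
s = "2006%2F10%2Fasdfasdf"
sep = /%2F/   # the pattern by itself does nothing until an operation uses it

# String#split accepts a Regexp as well as a String separator.
parts_split = s.split(sep)

# String#scan collects each run of text up to the next separator or the end.
parts_scan = s.scan(/(.+?)(?:%2F|\z)/).flatten

# String#match returns a MatchData; #captures holds the groups.
parts_match = s.match(/\A(.*?)%2F(.*?)%2F(.*)\z/).captures

p parts_split  #=> ["2006", "10", "asdfasdf"]
p parts_scan   #=> ["2006", "10", "asdfasdf"]
p parts_match  #=> ["2006", "10", "asdfasdf"]
```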

David


Gavin_Kistner2 · 27 October 2006 15:30

Alexandru Popescu wrote:

Thanks for all suggestions, but the requirement is to be done thru
regex only :-). I knew how to do it with split, but I need to do it
with regexps only.

I'm unclear on what your real requirements are, but here are some
possible alternatives:

s = "2006%2F10%2Fasdfasdf"
p s.scan( /.+?(?=%2F|$)/ ).map{ |v| v.gsub( '%2F', '' ) }
#=> ["2006", "10", "asdfasdf"]

p s.gsub( '%2F', "\n" ).scan( /[^\n]+/ )
#=> ["2006", "10", "asdfasdf"]

p s.match( /^(.+?)%2F(.+?)%2F(.+?)$/ ).to_a
#=> ["2006%2F10%2Fasdfasdf", "2006", "10", "asdfasdf"]

Paul_Lutus · 27 October 2006 17:25

Alexandru Popescu wrote:

Thanks for all suggestions, but the requirement is to be done thru
regex only :-). I knew how to do it with split, but I need to do it
with regexps only.

That seems to be splitting hairs (no pun intended), since "split" uses
regular expressions to split with.

···

--
Paul Lutus
http://www.arachnoid.com

Robert_K1 · 27 October 2006 15:00

Actually the code you presented did not even use a RX. You used the string form of split, didn't you?

Kind regards

robert

···

On 27.10.2006 16:39, dblack@wobblini.net wrote:

Hi --

On Fri, 27 Oct 2006, Alexandru Popescu wrote:

On 10/27/06, Chris Gernon <kabigon@gmail.com> wrote:

Alexandru Popescu wrote:
> I have a string in the following form: 2006%2F10%2Fasdfasdf (or more
> generic: any_characters+%2Fany_characters+%2Frest_of_it). I am
> wondering if I can retrieve the groups 2006, 10, asdfasdf thru a regex

String#split would be easier than using a regex in this case:

irb(main):001:0> '2006%2F10%2Fasdfasdf'.split('%2F')
=> ["2006", "10", "asdfasdf"]

Thanks for all suggestions, but the requirement is to be done thru
regex only :-). I knew how to do it with split, but I need to do it
with regexps only.

Regexes alone don't do anything other than specify a pattern. You
need to *use* a regular expression in some operation (like split) to
get a result.

Alexandru_Popescu · 27 October 2006 15:01

Yes... use groupings, but what I wanted is not done through
string.split or something similar, but through string =~ /pattern/ and
then, like in Perl, to have access to the groups through $1, $2,
etc.
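
A minimal sketch of that Perl-style usage in Ruby, assuming the sample string from the original question (the names year, month and rest are illustrative):

```ruby
s = "2006%2F10%2Fasdfasdf"

if s =~ /\A(.*?)%2F(.*?)%2F(.*)\z/
  # After a successful =~, the groups are available as $1, $2, $3,
  # and the full MatchData as $~ (also Regexp.last_match).
  year, month, rest = $1, $2, $3
  puts year, month, rest
else
  puts "no match"
end
```

The same globals are also set by String#match, so the choice between =~ and match is mostly a matter of whether a MatchData object is wanted.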

./alex

···

On 10/27/06, dblack@wobblini.net <dblack@wobblini.net> wrote:

Hi --

On Fri, 27 Oct 2006, Alexandru Popescu wrote:

> On 10/27/06, Chris Gernon <kabigon@gmail.com> wrote:
>> Alexandru Popescu wrote:
>> > I have a string in the following form: 2006%2F10%2Fasdfasdf (or more
>> > generic: any_characters+%2Fany_characters+%2Frest_of_it). I am
>> > wondering if I can retrieve the groups 2006, 10, asdfasdf thru a regex
>>
>> String#split would be easier than using a regex in this case:
>>
>> irb(main):001:0> '2006%2F10%2Fasdfasdf'.split('%2F')
>> => ["2006", "10", "asdfasdf"]
>>
>
> Thanks for all suggestions, but the requirement is to be done thru
> regex only :-). I knew how to do it with split, but I need to do it
> with regexps only.

Regexes alone don't do anything other than specify a pattern. You
need to *use* a regular expression in some operation (like split) to
get a result.

--
.w( the_mindstorm )p.

David


Robert_K1 · 27 October 2006 15:05

Alexandru Popescu wrote:

Thanks for all suggestions, but the requirement is to be done thru
regex only :-). I knew how to do it with split, but I need to do it
with regexps only.

How about just:

irb(main):001:0> match = '2006%2F10%2Fasdfasdf'.match /^(.*)%2F(.*)%2F(.*)$/
=> #<MatchData:0x1cf404>
irb(main):002:0> match[1]
=> "2006"
irb(main):003:0> match[2]
=> "10"
irb(main):004:0> match[3]
=> "asdfasdf"

This has a potential for disastrous backtracking with large strings. This one is better - if you can guarantee that there is no "%" besides the one preceding the "2F":

>> s = "2006%2F10%2Fasdfasdf"
=> "2006%2F10%2Fasdfasdf"
>> s.match(/^([^%]*)%2F([^%]*)%2F(.*)$/).to_a
=> ["2006%2F10%2Fasdfasdf", "2006", "10", "asdfasdf"]

Or maybe even

>> s.match(/^((?>[^%]*))%2F((?>[^%]*))%2F((?>.*))$/).to_a
=> ["2006%2F10%2Fasdfasdf", "2006", "10", "asdfasdf"]

Kind regards

robert

···

On 27.10.2006 16:39, Chris Gernon wrote:

Alexandru_Popescu · 27 October 2006 15:11

Yep... this is the closest I got too :-).

./alex

···

On 10/27/06, Robert Klemme <shortcutter@googlemail.com> wrote:

On 27.10.2006 16:39, Chris Gernon wrote:
> Alexandru Popescu wrote:
>> Thanks for all suggestions, but the requirement is to be done thru
>> regex only :-). I knew how to do it with split, but I need to do it
>> with regexps only.
>
> How about just:
>
> irb(main):001:0> match = '2006%2F10%2Fasdfasdf'.match /^(.*)%2F(.*)%2F(.*)$/
> => #<MatchData:0x1cf404>
> irb(main):002:0> match[1]
> => "2006"
> irb(main):003:0> match[2]
> => "10"
> irb(main):004:0> match[3]
> => "asdfasdf"

This has a potential for disastrous backtracking with large strings.
This one is better - if you can guarantee that there is no "%" besides
the one preceding the "2F":

>> s = "2006%2F10%2Fasdfasdf"
=> "2006%2F10%2Fasdfasdf"
>> s.match(/^([^%]*)%2F([^%]*)%2F(.*)$/).to_a
=> ["2006%2F10%2Fasdfasdf", "2006", "10", "asdfasdf"]

Or maybe even

>> s.match(/^((?>[^%]*))%2F((?>[^%]*))%2F((?>.*))$/).to_a
=> ["2006%2F10%2Fasdfasdf", "2006", "10", "asdfasdf"]

--
.w( the_mindstorm )p.

Kind regards

robert

Chris_Gernon · 27 October 2006 15:15

Robert Klemme wrote:

This has a potential for disastrous backtracking with large strings.
This one is better - if you can guarantee that there is no "%" besides
the one preceding the "2F":

>> s = "2006%2F10%2Fasdfasdf"
=> "2006%2F10%2Fasdfasdf"
>> s.match(/^([^%]*)%2F([^%]*)%2F(.*)$/).to_a
=> ["2006%2F10%2Fasdfasdf", "2006", "10", "asdfasdf"]

Or maybe even

>> s.match(/^((?>[^%]*))%2F((?>[^%]*))%2F((?>.*))$/).to_a
=> ["2006%2F10%2Fasdfasdf", "2006", "10", "asdfasdf"]

I have a couple of questions about this; I'm always trying to further my
(currently basic) understanding of regular expressions.

1. Why does my first regex have a potential for disastrous backtracking?
(By disastrous I assume you mean inefficient and CPU-time-consuming,
right?)

2. What does the "?>" do in your second regex? I haven't seen that
before.

Thanks!

···

--
Posted via http://www.ruby-forum.com/.

Robert_K1 · 27 October 2006 16:00

Robert Klemme wrote:

This has a potential for disastrous backtracking with large strings.
This one is better - if you can guarantee that there is no "%" besides
the one preceding the "2F":

>> s = "2006%2F10%2Fasdfasdf"
=> "2006%2F10%2Fasdfasdf"
>> s.match(/^([^%]*)%2F([^%]*)%2F(.*)$/).to_a
=> ["2006%2F10%2Fasdfasdf", "2006", "10", "asdfasdf"]

Or maybe even

>> s.match(/^((?>[^%]*))%2F((?>[^%]*))%2F((?>.*))$/).to_a
=> ["2006%2F10%2Fasdfasdf", "2006", "10", "asdfasdf"]

I have a couple of questions about this; I'm always trying to further my (currently basic) understanding of regular expressions.

If you are really interested in the matter I can recommend "Mastering Regular Expressions". Even I got valuable insights from it, although I would have regarded myself as "senior" with regard to RX.

1. Why does my first regex have a potential for disastrous backtracking? (By disastrous I assume you mean inefficient and CPU-time-consuming, right?)

Correct. The first ".*" will match greedily as far as it can, which means to the end of the string. Then the RX engine (it is an NFA in the case of Ruby) detects that it cannot get an overall match with that because there is no "%2F" following. So it starts backing up, stepping back one character at a time and trying the "%2F" again. This goes on until the first group matches "2006%2F10". Ah, now we can match the first "%2F" in the pattern. Then comes the next greedy ".*" and the game starts over again with that: match to the end, then try to back up. Eventually the engine will find out that with the first group eating up the first "%2F" as well there is no overall match, since there is no more "%2F" in the remaining portion. Then backing up the first group starts again until the first group's match is reduced to "2006".
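
One way to see where the greedy groups end up after all that backing up is to try an input with a third separator. This is only a sketch: the string "a%2Fb%2Fc%2Fd" is a made-up example, not from the thread.

```ruby
# Hypothetical input with three separators, to make the difference visible.
s = "a%2Fb%2Fc%2Fd"

# Greedy groups start at the end of the string and back off only as far
# as needed, so the first group keeps one separator inside it.
greedy = s.match(/^(.*)%2F(.*)%2F(.*)$/).captures

# Lazy groups grow from the left and stop at the first separator they can.
lazy = s.match(/^(.*?)%2F(.*?)%2F(.*)$/).captures

# Negated classes cannot cross a "%" at all.
negated = s.match(/^([^%]*)%2F([^%]*)%2F(.*)$/).captures

p greedy   #=> ["a%2Fb", "c", "d"]
p lazy     #=> ["a", "b", "c%2Fd"]
p negated  #=> ["a", "b", "c%2Fd"]
```

With exactly two separators, as in the original question, all three patterns capture the same groups; only the amount of work differs.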

2. What does the "?>" do in your second regex? I haven't seen that before.

That's an atomic sub RX. Basically it will not give back any characters that it has consumed. Using that in this example with ".*" will make the overall match fail:

>> s.match(/^((?>.*))%2F((?>.*))%2F((?>.*))$/).to_a
=> []

Actually I believe atomic grouping is not needed in this case, as the [^%] cannot match past a "%" and so there is probably no potential for backtracking. A benchmark would probably show the whole picture. It is definitely harmful with ".*" because then the backtracking (see above) cannot start and there will be no overall match.
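
The two atomic variants can be put side by side in a short sketch, using the sample string from the original question (the variable names are illustrative):

```ruby
s = "2006%2F10%2Fasdfasdf"

# Atomic group around ".*": it swallows the whole string and never gives
# anything back, so the following "%2F" can never match.
atomic_dot = s.match(/^((?>.*))%2F((?>.*))%2F((?>.*))$/)

# Atomic group around "[^%]*": it stops at the first "%" on its own,
# so nothing ever has to be given back.
atomic_neg = s.match(/^((?>[^%]*))%2F((?>[^%]*))%2F((?>.*))$/)

p atomic_dot.to_a  #=> []
p atomic_neg.to_a  #=> ["2006%2F10%2Fasdfasdf", "2006", "10", "asdfasdf"]
```

The failed match returns nil, and nil.to_a is an empty array, which is why the irb session prints [] rather than raising an error.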

You can easily see the backtracking with a tool like "Regex Coach" with which you can step graphically through the match.

Kind regards

robert

···

On 27.10.2006 17:15, Chris Gernon wrote:

Forum · 28 October 2006 21:40

/(.*?)%2F(.*?)%2F(.*)/
will be safe.
It was just the greediness which might be dangerous.

Cheers
Robert


I guess a simple

--
The reasonable man adapts himself to the world; the unreasonable one
persists in trying to adapt the world to himself. Therefore all progress
depends on the unreasonable man.

- George Bernard Shaw
