Regular expression

Harry3 · 29 May 2009 08:29

I want to write a regular expression to do the following.

The first character can be any character.
The second character can be anything except the first character.
The third character can be anything except the first two characters.
(This is a problem. There may be other problems.)
The fourth character is the same as the second character.

st = "abcb"
p st =~ /^(.)([^\1])([^\1\2])(\2)$/ # I expected a match

uv = "abbb"
p uv =~ /^(.)([^\1])([^\1\2])(\2)$/ # I did not expect a match

They both match.
How can I write the regular expression?
Somehow I have the syntax wrong.

Harry

--
A Look into Japanese Ruby List in English
http://www.kakueki.com/ruby/list.html

Brian_Candler · 29 May 2009 08:48

Harry Kakueki wrote:

Somehow I have the syntax wrong.

Yes - I don't think a backreference like \1, which could contain any
number of characters, is usable inside a character class [...], which is
a list of individual characters.

irb(main):005:0> /[^\1]/ =~ "a"
=> 0
irb(main):006:0> /[^\1]/ =~ "1"
=> 0
irb(main):007:0> /[^\1]/ =~ "\001"
=> nil

So it seems that inside a character class \1 is read as an octal escape
rather than a backreference, so [^\1\2] means any character apart from
\001 (ctrl-A) or \002 (ctrl-B).
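
The irb session above can be turned into a small check. This is only a sketch: the method names are made up for illustration, and it assumes a Ruby whose regex engine treats \1 inside [...] as an octal escape, as the session shows.

```ruby
# Inside [...], "\1" is read as an octal escape (the character 0x01),
# not as a backreference to capture group 1.
# Hypothetical helper: does the class accept some character of str?
def accepted_by_class?(str)
  /[^\1]/.match?(str)
end

# The pattern from the original question, unchanged.
def original_pattern_matches?(str)
  /^(.)([^\1])([^\1\2])(\2)$/.match?(str)
end

p accepted_by_class?("a")      # an ordinary letter
p accepted_by_class?("1")      # the digit "1" is not excluded either
p accepted_by_class?("\001")   # only ctrl-A itself is excluded

# Both test strings from the question pass, because the classes
# only rule out ctrl-A and ctrl-B.
p original_pattern_matches?("abcb")
p original_pattern_matches?("abbb")
```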

I think you need a negative lookahead assertion.
* Programming Ruby: The Pragmatic Programmer's Guide
* click on "The Ruby Language"
* scroll to "Extensions"
* look for (?!re)

irb(main):009:0> /^(.)(?!\1)(.)(?!\1|\2)(.)(\2)$/ =~ "abcb"
=> 0
irb(main):010:0> /^(.)(?!\1)(.)(?!\1|\2)(.)(\2)$/ =~ "abbb"
=> nil
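
For reuse, the same pattern can be wrapped in a method. A minimal sketch, with a hypothetical method name; it swaps ^ and $ for \A and \z, on the assumption that the whole string should be tested (in Ruby, ^ and $ match at line boundaries, not only at the ends of the string).

```ruby
# (?!...) is a negative lookahead: it checks what follows without consuming it.
# Each lookahead sits just before the "." it constrains, so that "." can only
# match a character that differs from the earlier captures.
def second_repeats_at_fourth?(str)
  /\A(.)(?!\1)(.)(?!\1|\2)(.)\2\z/.match?(str)
end

%w[abcb abbb aacb abab].each do |s|
  puts "#{s}: #{second_repeats_at_fourth?(s)}"
end
```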

--
Posted via http://www.ruby-forum.com/.

Robert_K1 · 29 May 2009 09:49

I second Brian: to solve this with a regular expression is either
extremely costly (large expression) or even impossible. If I
understand you properly then you want to avoid repetitions. How
about:

require 'set'

def unique_chars(s)
  chars = s.scan(/./)  # array of one-character strings ("." skips newlines)
  chars.to_set.size == chars.size
end

Kind regards

robert


--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Mark_Kremer · 1 June 2009 07:52

I'm not sure if you still want the regex, but the following regex should do the trick: ^(.{1})(?!\1)(.{1})(?!\1|\2).{1}\2$


Harry3 · 29 May 2009 10:50

Thanks, Robert.

Sometimes I want to avoid repetition and sometimes I want to have repetition.
My goal here is not just to solve such a problem.
My goal is to learn how to write this type of regular expression using
\1, \2, etc.
I did not understand the syntax.
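
For the "sometimes repetition, sometimes not" case, the two building blocks can be sketched side by side. The method names are hypothetical, and the two-character strings are only an illustration:

```ruby
# Outside a character class, \1 matches the same text that group 1 captured,
# so this demands a repeated character.
def doubled?(str)
  /\A(.)\1\z/.match?(str)
end

# (?!\1) placed before "." demands that the next character differs
# from what group 1 captured.
def two_different?(str)
  /\A(.)(?!\1).\z/.match?(str)
end

p doubled?("aa")
p doubled?("ab")
p two_different?("ab")
p two_different?("aa")
```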

Thanks,

Harry


Harry3 · 29 May 2009 11:12

Brian,

I need to read that section carefully to be sure I understand.
But, I think that does the trick.

Thank you!

Harry


Brian_Candler · 1 June 2009 08:44

Mark Kremer wrote:

I'm not sure if you still want the regex, but the following regex should do
the trick: ^(.{1})(?!\1)(.{1})(?!\1|\2).{1}\2$

Which is the same as the one I posted in my initial reply:

irb(main):009:0> /^(.)(?!\1)(.)(?!\1|\2)(.)(\2)$/ =~ "abcb"
=> 0
irb(main):010:0> /^(.)(?!\1)(.)(?!\1|\2)(.)(\2)$/ =~ "abbb"
=> nil


Robert_K1 · 29 May 2009 11:54

The problem with this approach is that it does not work for strings of
arbitrary length. Even if you adjust it to work for multiple lengths
you always have a fixed upper limit for which it can work.
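
If the requirement were only "no character may occur twice", which is an assumption and not what the original question asks, a single backreference would cover strings of any length. A sketch with a hypothetical method name:

```ruby
# (.) captures one character, .* skips ahead any distance, and \1 looks for
# the same character again. If that succeeds anywhere, there is a repeat.
# The /m flag lets "." match newlines as well.
def no_repeated_chars?(str)
  !str.match?(/(.).*\1/m)
end

p no_repeated_chars?("abcd")
p no_repeated_chars?("abca")
```

This only answers the yes/no question; positional rules such as "the fourth equals the second" still need one group per position.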

Kind regards

robert


Mark_Kremer · 1 June 2009 10:09

Damn, you're right... sorry


Brian_Candler · 29 May 2009 12:20

Robert Klemme wrote:

irb(main):009:0> /^(.)(?!\1)(.)(?!\1|\2)(.)(\2)$/ =~ "abcb"
=> 0
irb(main):010:0> /^(.)(?!\1)(.)(?!\1|\2)(.)(\2)$/ =~ "abbb"
=> nil

The problem with this approach is that it does not work for strings of
arbitrary length. Even if you adjust it to work for multiple lengths
you always have a fixed upper limit for which it can work.

The OP explicitly said that he wanted to match single characters. It
would also make sense for other fixed-width fields, or delimited fields.

With neither fixed sizes nor delimiters, I don't think it makes any
sense. It would become "match any sequence of characters, followed by
any sequence of characters which isn't the same as the first sequence of
characters, followed by any sequence of characters which isn't the first
or second, followed by the second sequence of characters". And there
would be a squillion different ways to try and slice the string to make
it match.
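
For the delimited case, a sketch assuming comma-separated fields (the delimiter and the method name are illustrative, not from the thread). Because each field is [^,]+, there is only one way to slice the string, and each lookahead includes the trailing comma so that it compares whole fields rather than prefixes:

```ruby
# Fields: any, not-first, not-first-or-second, same-as-second.
# Returns the first three fields, or nil if the line does not fit.
def delimited_fields(line)
  m = /\A([^,]+),(?!\1,)([^,]+),(?!(?:\1|\2),)([^,]+),\2\z/.match(line)
  m && m.captures
end

p delimited_fields("foo,wibble,boing,wibble")
p delimited_fields("foo,wibble,wibble,wibble")
p delimited_fields("foo,foobar,baz,foobar")
```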


Robert_K1 · 29 May 2009 12:32

Robert Klemme wrote:

irb(main):009:0> /^(.)(?!\1)(.)(?!\1|\2)(.)(\2)$/ =~ "abcb"
=> 0
irb(main):010:0> /^(.)(?!\1)(.)(?!\1|\2)(.)(\2)$/ =~ "abbb"
=> nil

The problem with this approach is that it does not work for strings of
arbitrary length. Even if you adjust it to work for multiple lengths
you always have a fixed upper limit for which it can work.

The OP explicitly said that he wanted to match single characters. It
would also make sense for other fixed-width fields, or delimited fields.

But he did not state explicitly that the _strings_ that he wanted to
analyze have fixed length. This is where your approach breaks. If
his strings are always of maximum length four or he only needs to make
sure the four first characters do not repeat then it will work.
"Four" could easily be any number up to "nine" but even then naming of
groups might cause trouble; in any case the expression will soon look
ugly, especially if strings can be "up to N" characters long.
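
On the naming point: a sketch using named groups and \k<name> backreferences, assuming Ruby 1.9 or later (where named groups are available). The group and method names are made up for illustration, and the /x flag is used only to spread the pattern over several lines.

```ruby
# Same four-character rule, but each group has a name instead of a number.
def match_named(str)
  pattern = /
    \A
    (?<first>.)
    (?!\k<first>)(?<second>.)
    (?!\k<first>|\k<second>)(?<third>.)
    \k<second>
    \z
  /x
  pattern.match(str)
end

m = match_named("abcb")
puts m[:first], m[:second], m[:third] if m
```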

With neither fixed sizes nor delimiters, I don't think it makes any
sense. It would become "match any sequence of characters, followed by
any sequence of characters which isn't the same as the first sequence of
characters, followed by any sequence of characters which isn't the first
or second, followed by the second sequence of characters". And there
would be a squillion different ways to try and slice the string to make
it match.

That's beyond the power of regular expressions.

Kind regards

robert


Brian_Candler · 29 May 2009 14:19

Robert Klemme wrote:

The OP explicitly said that he wanted to match single characters. It
would also make sense for other fixed-width fields, or delimited fields.

But he did not state explicitly that the _strings_ that he wanted to
analyze have fixed length. This is where your approach breaks.

But he's not just looking for strings with non-repeating characters.
Note the rule that says the fourth character must be equal to the second
character. This is an application-specific rule; I don't think we can
logically extend that to guess what the rule for the fifth or subsequent
characters would be.

With neither fixed sizes nor delimiters, I don't think it makes any
sense. It would become "match any sequence of characters, followed by
any sequence of characters which isn't the same as the first sequence of
characters, followed by any sequence of characters which isn't the first
or second, followed by the second sequence of characters". And there
would be a squillion different ways to try and slice the string to make
it match.

That's beyond the power of regular expressions.

It's perfectly possible to express it as a regular expression, but you
have to beware that there could be many ways for it to match, so you
might not get what you expect.

/^(.+)(?!\1)(.+)(?!\1|\2)(.+)(\2)$/.match("foowibbleboingwibble").to_a

=> ["foowibbleboingwibble", "foowibbl", "e", "boingwibbl", "e"]

/^(.+?)(?!\1)(.+?)(?!\1|\2)(.+?)(\2)$/.match("foowibbleboingwibble").to_a

=> ["foowibbleboingwibble", "foo", "wibble", "boing", "wibble"]


Robert_K1 · 29 May 2009 15:41

Well, yes. I just did it but I see that the conclusion is not
necessarily a valid one. So let's hope for a better requirements spec
next time.

Cheers

robert

