Regex - Exclude Multiple Characters and Global Scanning

Ben_Woodcroft · 21 June 2008 04:49

Hihi,

I have 2 problems.

--------------Question 1-----------------------
Firstly, a Ruby question. I'm confused about how to match a single
regular expression multiple times in a single string. For instance,

'llgllallo'.match(/(ll.)/)[0] #-> 'llg'
'llgllallo'.match(/(ll.)/)[1] #-> 'llg'
'llgllallo'.match(/(ll.)/)[2] #-> nil

How do I access all 3 matches? String#scan will work, but that gives me

'llgllallo'.scan(/(ll.)/) #=> [["llg"], ["lla"], ["llo"]]

But I need the offsets, and this info isn't given to me.

--------------Question 2-----------------------
Now an old gap in my regex understanding. How do I exclude on
consecutive characters? I want something like [^abc], except aba or bbc
is ok, just not 'abc'. Summing this up:

reg = /something/
'abc'.match(reg) #-> no match
'cba'.match(reg) #-> match

And then I want to be able to do OR operations too, like not 'abc' and
not 'bbc', but that is probably another step of complexity.

I don't suppose there is any way to pass a block to the regex to use in
a specific place? That would be cool, though maybe not possible given
optimisations in regex?

Thanks in advance,
ben

--
Posted via http://www.ruby-forum.com/.

David_A_Black1 · 21 June 2008 08:00

Hi --

On Sat, 21 Jun 2008, Ben Woodcroft wrote:

> Hihi,

> I have 2 problems.

> --------------Question 1-----------------------
> Firstly, a Ruby question. I'm confused about how to match a single
> regular expression multiple times in a single string. For instance,

> 'llgllallo'.match(/(ll.)/)[0] #-> 'llg'
> 'llgllallo'.match(/(ll.)/)[1] #-> 'llg'
> 'llgllallo'.match(/(ll.)/)[2] #-> nil

> How do I access all 3 matches? String#scan will work, but that gives me

> 'llgllallo'.scan(/(ll.)/) #=> [["llg"], ["lla"], ["llo"]]

> But I need the offsets, and this info isn't given to me.

You could do:

irb(main):029:0> offsets = []
=> []
irb(main):030:0> str.scan(/ll./) { offsets << $~.offset(0)[1] }
=> "llgllallo"
irb(main):031:0> offsets
=> [3, 6, 9]

(Pending someone coming up with something slicker. I don't like the
temp variable particularly, but anyway.)
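
An alternative that hands whole MatchData objects to a block, sketched on the assumption that a small helper method is acceptable: Regexp#match takes an optional start position, so a loop can walk the string one match at a time. The name each_match is hypothetical; it is not a built-in String or Regexp method.

```ruby
# Hypothetical helper (not part of Ruby's core library): yields a MatchData
# for every non-overlapping match of pattern in str, scanning left to right.
def each_match(str, pattern)
  pos = 0
  while (m = pattern.match(str, pos))
    yield m
    # Continue after this match; step one extra character on an empty match
    # so the loop cannot get stuck at the same position.
    pos = m.end(0) > m.begin(0) ? m.end(0) : m.end(0) + 1
  end
end

found = []
each_match('llgllallo', /ll./) { |m| found << [m[0], m.begin(0), m.end(0)] }
# found => [["llg", 0, 3], ["lla", 3, 6], ["llo", 6, 9]]
```

Each yielded object answers begin(0), end(0) and offset(0), so both the start and end offsets are available, not only the end offsets collected above.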

--------------Question 2-----------------------
> Now an old gap in my regex understanding. How do I exclude on
> consecutive characters? I want something like [^abc], except aba or bbc
> is ok, just not 'abc'. Summing this up:

[^abc] means: match one character that is not 'a', not 'b', and not
'c'. I don't think that's what you mean.

> reg = /something/
> 'abc'.match(reg) #-> no match
> 'cba'.match(reg) #-> match

> And then I want to be able to do OR operations too, like not 'abc' and
> not 'bbc', but that is probably another step of complexity.

You can use (?!), which is negative lookahead.

irb(main):033:0> reg = /(?!abc)[abc]{3}/
=> /(?!abc)[abc]{3}/

So that means: three of a, b, c, as long as we're not looking at
"abc" when we start looking for those three characters.

irb(main):034:0> reg.match("abc")
=> nil
irb(main):035:0> reg.match("abb")
=> #<MatchData:0x69de8>
irb(main):036:0> reg.match("cba")
=> #<MatchData:0x63de4>
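
For the "not 'abc' and not 'bbc'" case, the same lookahead can list both strings as alternatives. The sketch below also anchors the pattern with \A and \z, which is an added assumption (that the whole string should be exactly three characters); without anchors, the pattern above can still match inside a longer string such as "abcb", where it skips the first position and matches "bcb".

```ruby
# Reject both 'abc' and 'bbc' by putting them as alternatives in the lookahead.
# \A and \z are an assumption here: they force the match to span the whole string.
reg = /\A(?!abc|bbc)[abc]{3}\z/

# Map each sample string to true/false depending on whether it matches.
results = %w[abc bbc cba abb].map { |s| [s, reg.match?(s)] }.to_h
# results => {"abc"=>false, "bbc"=>false, "cba"=>true, "abb"=>true}
```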

> I don't suppose there is any way to pass a block to the regex to use in
> a specific place? That would be cool, though maybe not possible given
> optimisations in regex?

Blocks get passed to methods, not objects, and regexes are objects.
Some of the methods that use regexes also take blocks, like scan, sub,
and gsub. I'm not sure what you mean about the specific place, though.

David


Ben_Woodcroft · 22 June 2008 01:51

David A. Black wrote:

> You could do:

> irb(main):029:0> offsets = []
> => []
> irb(main):030:0> str.scan(/ll./) { offsets << $~.offset(0)[1] }
> => "llgllallo"
> irb(main):031:0> offsets
> => [3, 6, 9]

> (Pending someone coming up with something slicker. I don't like the
> temp variable particularly, but anyway.)

That will work, thanks. It would seem intuitive to me that scan (or a
method like it) would iterate over MatchData objects, but anyway. Thanks.

>> --------------Question 2-----------------------
>> Now an old gap in my regex understanding. How do I exclude on
>> consecutive characters? I want something like [^abc], except aba or bbc
>> is ok, just not 'abc'. Summing this up:

> [^abc] means: match one character that is not 'a', not 'b', and not
> 'c'. I don't think that's what you mean.

>> reg = /something/
>> 'abc'.match(reg) #-> no match
>> 'cba'.match(reg) #-> match

>> And then I want to be able to do OR operations too, like not 'abc' and
>> not 'bbc', but that is probably another step of complexity.

> You can use (?!), which is negative lookahead.

> irb(main):033:0> reg = /(?!abc)[abc]{3}/
> => /(?!abc)[abc]{3}/

> So that means: three of a, b, c, as long as we're not looking at
> "abc" when we start looking for those three characters.

> irb(main):034:0> reg.match("abc")
> => nil
> irb(main):035:0> reg.match("abb")
> => #<MatchData:0x69de8>
> irb(main):036:0> reg.match("cba")
> => #<MatchData:0x63de4>

That is exactly what I meant. I was unaware of the negative lookahead
operator. Thanks!

>> I don't suppose there is any way to pass a block to the regex to use in
>> a specific place? That would be cool, though maybe not possible given
>> optimisations in regex?

> Blocks get passed to methods, not objects, and regexes are objects.
> Some of the methods that use regexes also take blocks, like scan, sub,
> and gsub. I'm not sure what you mean about the specific place, though.

My question was not explained very well, sorry. I meant it would be cool
if you could pass a block that became part of the regex itself. For
instance instead of /(?!abc)/ you could somehow tell it
{|s| s != 'abc'}

Just an idea, doesn't really matter now you've fixed my problem.
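
As far as I know, Ruby's regex syntax has no way to call a block in the middle of a match. A rough approximation, assuming it is acceptable to filter after matching, is to let the regex find candidates and apply the predicate in an ordinary block. This is not equivalent to the lookahead: scan consumes each candidate before the predicate runs, so a rejected 'abc' is skipped whole rather than retried one character later. The sample string is made up for illustration.

```ruby
str = 'abcbacabb'

# Step 1: the regex finds every non-overlapping run of three a/b/c characters.
candidates = str.scan(/[abc]{3}/)

# Step 2: an ordinary block plays the role of the {|s| s != 'abc'} predicate.
kept = candidates.reject { |s| s == 'abc' }
# candidates => ["abc", "bac", "abb"]
# kept       => ["bac", "abb"]
```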

Thanks,
ben


_Pena_Botp1 · 23 June 2008 01:32

# David A. Black wrote:
# > irb(main):029:0> offsets = []
# > => []
# > irb(main):030:0> str.scan(/ll./) { offsets << $~.offset(0)[1] }
# > => "llgllallo"
# > irb(main):031:0> offsets
# > => [3, 6, 9]
# That will work, thanks. It would seem intuitive to me that scan (or a
# method like it) would iterate of MatchData objects, but

$~ is the MatchData object for the most recent match, so
you could wrap dBlack's hint in a method if you want something similar to #scan

eg,

class String
  def mapscan pattern
    atemp=[]
    scan(pattern){ atemp << yield($~)}
    atemp
  end
end
#=> nil

s
#=> "llgllallo"

s.mapscan(/ll./){|md| [md[0],md.offset(0)]}
#=> [["llg", [0, 3]], ["lla", [3, 6]], ["llo", [6, 9]]]


#
# >> --------------Question 2-----------------------
# >> Now an old gap in my regex understanding. How do I exclude on
# >> consecutive characters? I want something like [^abc],
# except aba or bbc
# >> is ok, just not 'abc'. Summing this up:
# >
# > [^abc] means: match one character that is not 'a', not 'b', and not
# > 'c'. I don't think that's what you mean.
# >> reg = /something/
# >> 'abc'.match(reg) #-> no match
# >> 'cba'.match(reg) #-> match
# >> And then I want to be able to do OR operations too, like
# not 'abc' and
# >> not 'bbc', but that is probably another step of complexity.
# > You can use (?!), which is negative lookahead.
# > irb(main):033:0> reg = /(?!abc)[abc]{3}/
# > => /(?!abc)[abc]{3}/
# > So that means: three of a, b, c, as long as we're not looking at
# > "abc" when we start looking for those three characters.
# > irb(main):034:0> reg.match("abc")
# > => nil
# > irb(main):035:0> reg.match("abb")
# > => #<MatchData:0x69de8>
# > irb(main):036:0> reg.match("cba")
# > => #<MatchData:0x63de4>
# That is exactly what I meant. I was unaware of the negative lookahead
# operator. Thanks!

if you want to compare sequences, you can build one complete sequence for your case, so you do not end up creating many regex patterns, and then test everything against it.

eg,

SEQALPHA=("a".."z").to_a.join
#=> "abcdefghijklmnopqrstuvwxyz"
SEQALPHA.match "abc"
#=> #<MatchData:0x2906288>
SEQALPHA.match "def"
#=> #<MatchData:0x28ff870>
SEQALPHA.match "xyz"
#=> #<MatchData:0x28fb4a0>
SEQALPHA.match "bac"
#=> nil
SEQALPHA.match "cba"
#=> nil
SEQALPHA.match "yyy"
#=> nil

negating it in your case is simple,

not SEQALPHA.match "bac"
#=> true
not SEQALPHA.match "abc"
#=> false

now using mapscan above, you can do,
SEQALPHA.mapscan(/abc|xyz/){|md| [md[0],md.offset(0)]}
#=> [["abc", [0, 3]], ["xyz", [23, 26]]]

btw, index is faster if you just want a simple string comparison.
SEQALPHA.index "abc"
#=> 0
SEQALPHA.index "def"
#=> 3
SEQALPHA.index "xxx"
#=> nil
SEQALPHA.index "bac"
#=> nil
SEQALPHA.index /abc/
#=> 0
SEQALPHA.index /def/
#=> 3
SEQALPHA.index /efd/
#=> nil

again, negating is simple

not SEQALPHA.index /def/
#=> false
not SEQALPHA.index /fde/
#=> true
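
If only a true/false answer is needed and the offset does not matter, String#include? is another option for plain substrings (it takes a string, not a regex). A minimal sketch reusing the SEQALPHA constant from above:

```ruby
SEQALPHA = ("a".."z").to_a.join

# include? returns true or false directly, so negation reads naturally.
has_abc = SEQALPHA.include?("abc")
has_bac = SEQALPHA.include?("bac")
# has_abc  => true
# !has_bac => true
```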

hth.
kind regards -botp
