Help with a regexp

Daniel_Schierbeck · 12 July 2006 14:10

I'm trying to write a regular expression that matches bencoded strings, i.e. strings on the form x:y, where x is the numeric length of y.

This is valid:

6:foobar

while this is not:

4:foo

I've tried using #{$1} inside the regexp, but it seems $1 is still nil at that point. Here's what I've got so far:

/^([1-9]+\d*):(\w){#{$1}}$/

(it ain't working)

I'm matching with `string =~ regexp'; would it be better if I did it another way?

Cheers,
Daniel

Pit · 12 July 2006 16:39

Daniel Schierbeck schrieb:

I'm trying to write a regular expression that matches bencoded strings, i.e. strings on the form x:y, where x is the numeric length of y.

This is valid:

  6:foobar

while this is not:

  4:foo

I've tried using #{$1} inside the regexp, but it seems $1 is still nil at that point. Here's what I've got so far:

  /^([1-9]+\d*):(\w){#{$1}}$/

(it ain't working)

I'm matching with `string =~ regexp'; would it be better if I did it another way?

Daniel, I'm pretty sure you can't do it with a single regexp test. Why not split the test in two parts, as in

str =~ /^([1-9]\d*):(\w+)$/ && $2.size == $1.to_i

Regards,
Pit

_Stud · 12 July 2006 17:05

Daniel,
When you grab the data it will be in a string format, so you need
to convert it to a number (most likely integer). Then you can compare
it with the size of the second value you grabbed. I would write it
like this (modify as needed):

print "enter data"
a = gets
valid = /^(\d*):(\w*)/

check = valid.match(a)

if(check[1].to_i == check[2].size)

end

I hope this helps.

_Steve

Daniel Schierbeck wrote:

···

I'm trying to write a regular expression that matches bencoded strings,
i.e. strings on the form x:y, where x is the numeric length of y.

This is valid:

   6:foobar

while this is not:

   4:foo

I've tried using #{$1} inside the regexp, but it seems $1 is still nil
at that point. Here's what I've got so far:

   /^([1-9]+\d*):(\w){#{$1}}$/

(it ain't working)

I'm matching with `string =~ regexp'; would it be better if I did it
another way?

Cheers,
Daniel

Logan_Capaldo · 12 July 2006 17:21

I believe that this is not in the set of languages a Regexp can match. As others have already suggested you'll have to do it in 2 steps.

···

On Jul 12, 2006, at 10:10 AM, Daniel Schierbeck wrote:

I'm trying to write a regular expression that matches bencoded strings, i.e. strings on the form x:y, where x is the numeric length of y.

This is valid:

  6:foobar

while this is not:

  4:foo

I've tried using #{$1} inside the regexp, but it seems $1 is still nil at that point. Here's what I've got so far:

  /^([1-9]+\d*):(\w){#{$1}}$/

(it ain't working)

I'm matching with `string =~ regexp'; would it be better if I did it another way?

Cheers,
Daniel

Seth_Thomas_Rasmusse · 13 July 2006 13:20

Hi Daniel,

Daniel Schierbeck wrote:

I'm trying to write a regular expression that matches bencoded strings,
i.e. strings on the form x:y, where x is the numeric length of y.

This is valid:

6:foobar

while this is not:

4:foo

I've tried using #{$1} inside the regexp, but it seems $1 is still nil
at that point.

I think you can do what you want there, but if you're using captures
within the regex they are captured in, you denote them as \1, \2, etc.

%r{<(foo)></\1>} # should match a pair of empty "foo" tags

Peas,

···

---
Seth Thomas Rasmussen

Daniel_Martin · 13 July 2006 14:19

Daniel Schierbeck <daniel.schierbeck@gmail.com> writes:

I'm trying to write a regular expression that matches bencoded
strings, i.e. strings on the form x:y, where x is the numeric length
of y.

This is valid:

6:foobar

while this is not:

4:foo

I don't think that what you want to do is possible with a mere regular
expression.

It might be possible using perl's special
evaluate-code-while-in-regexp (??{ code }) feature, but not with any
language that doesn't allow regular expression evaluations to escape
back into the host language.

The problem is that you want to leave crucial portions of the regexp
uncompiled until the moment that half of the regular expression has
matched, and this is not possible.

But matching bencoded data isn't that hard; here's something I just
whipped up that should handle bencoded data:

require 'strscan'

class BencodeScanner
  def BencodeScanner.scan(str)
    scan = StringScanner.new(str)
    r = BencodeScanner.doscan_internal(scan,false)
    raise "Malformed Bencoded String" unless scan.eos?
    r
  end

  private

  @@string_regexps = Hash.new {|h,k| h[k] = /:.{#{k}}/m}

  def BencodeScanner.doscan_internal(scanner, allow_e=true)
    tok = scanner.scan(/\d+|[ilde]/)
    case tok
      when nil
        raise "Malformed Bencoded String"
      when 'e'
        raise "Malformed Bencoded String" unless allow_e
        return nil
      when 'l'
        retval =
        while arritem = BencodeScanner.doscan_internal(scanner)
          retval << arritem
        end
        return retval
      when 'd'
        retval = {}
        while key = BencodeScanner.doscan_internal(scanner)
          val = BencodeScanner.doscan_internal(scanner,false)
          retval[key] = val
        end
        return retval
      when 'i'
        raise "Malformed Bencoded String" unless scanner.scan(/-?\d+e/)
        return scanner.matched[0,scanner.matched.length-1].to_i
      else
        raise "Malformed Bencoded String" unless scanner.scan(@@string_regexps[tok])
        return scanner.matched[1,tok.to_i]
    end
  end
end

Daniel_Schierbeck · 12 July 2006 18:50

studlee2@gmail.com wrote:

Daniel,
When you grab the data it will be in a string format, so you need
to convert it to a number (most likely integer). Then you can compare
it with the size of the second value you grabbed. I would write it
like this (modify as needed):

print "enter data"
a = gets
valid = /^(\d*):(\w*)/

check = valid.match(a)

if(check[1].to_i == check[2].size)

<Your code here>

end

Thank you for your reply.

My problem is that I have a string like this: "3:foo6:monkey5:sheep", which I need to separate into ["foo", "monkey", "sheep"]. The values can contain numeric values, so splitting at \d won't work. This is what makes it difficult:

"3:ab23:cat5:sheep" => ["ab2", "cat", "sheep"]

I need to grab the number, then read that many characters, then read the next number, etc.

Cheers,
Daniel

David_A_Black3 · 13 July 2006 13:27

Hi --

···

On Thu, 13 Jul 2006, Seth Thomas Rasmussen wrote:

Hi Daniel,

Daniel Schierbeck wrote:

I'm trying to write a regular expression that matches bencoded strings,
i.e. strings on the form x:y, where x is the numeric length of y.

This is valid:

6:foobar

while this is not:

4:foo

I've tried using #{$1} inside the regexp, but it seems $1 is still nil
at that point.

I think you can do what you want there, but if you're using captures
within the regex they are captured in, you denote them as \1, \2, etc.

%r{<(foo)></\1>} # should match a pair of empty "foo" tags

It does, but the issue would be getting it to interpolate and be
pre-processed as a quantifier:

/(\d):\w{#{\1}}/ or something

which doesn't seem to be possible, at least as far as I can tell.

David

--
http://www.rubypowerandlight.com => Ruby/Rails training & consultancy
Ruby for Rails => RUBY FOR RAILS (reviewed on
Slashdot, 7/12/2006!)
http://dablog.rubypal.com => D[avid ]A[. ]B[lack's][ Web]log
dblack@wobblini.net => me

Daniel_Schierbeck · 14 July 2006 18:10

Daniel Martin wrote:

Daniel Schierbeck <daniel.schierbeck@gmail.com> writes:

I'm trying to write a regular expression that matches bencoded
strings, i.e. strings on the form x:y, where x is the numeric length
of y.

This is valid:

  6:foobar

while this is not:

  4:foo

I don't think that what you want to do is possible with a mere regular
expression.

It might be possible using perl's special
evaluate-code-while-in-regexp (??{ code }) feature, but not with any
language that doesn't allow regular expression evaluations to escape
back into the host language.

The problem is that you want to leave crucial portions of the regexp
uncompiled until the moment that half of the regular expression has
matched, and this is not possible.

But matching bencoded data isn't that hard; here's something I just
whipped up that should handle bencoded data:

require 'strscan'

class BencodeScanner
  def BencodeScanner.scan(str)
    scan = StringScanner.new(str)
    r = BencodeScanner.doscan_internal(scan,false)
    raise "Malformed Bencoded String" unless scan.eos?
    r
  end
    private
    @@string_regexps = Hash.new {|h,k| h[k] = /:.{#{k}}/m}
    def BencodeScanner.doscan_internal(scanner, allow_e=true)
    tok = scanner.scan(/\d+|[ilde]/)
    case tok
      when nil
        raise "Malformed Bencoded String"
      when 'e'
        raise "Malformed Bencoded String" unless allow_e
        return nil
      when 'l'
        retval =
        while arritem = BencodeScanner.doscan_internal(scanner)
          retval << arritem
        end
        return retval
      when 'd'
        retval = {}
        while key = BencodeScanner.doscan_internal(scanner)
          val = BencodeScanner.doscan_internal(scanner,false)
          retval[key] = val
        end
        return retval
      when 'i'
        raise "Malformed Bencoded String" unless scanner.scan(/-?\d+e/)
        return scanner.matched[0,scanner.matched.length-1].to_i
      else
        raise "Malformed Bencoded String" unless scanner.scan(@@string_regexps[tok])
        return scanner.matched[1,tok.to_i]
    end
  end
end

Thank you all for your responses!

I've been away for the last two days, so I've only just got an opportunity to reply.

Daniel, I've further developed your solution:

module Bencode
class BencodingError < StandardError; end

     class << self
       def dump(obj)
         obj.bencode
       end

def parse(benc)
require 'strscan'

         scanner = StringScanner.new(benc)
         obj = scan(scanner)
         raise BencodingError unless scanner.eos?
         return obj
       end

alias_method :load, :parse

private

       def scan(scanner)
         case token = scanner.scan(/[ild]|\d+:/)
         when nil
           raise BencodingError
         when "i"
           number = scanner.scan(/0|(-?[1-9][0-9]*)/)
           raise BencodingError unless number
           raise BencodingError unless scanner.scan(/e/)
           return number
         when "l"
           ary =
           until scanner.peek(1) == "e"
             ary.push(scan(scanner))
           end
           scanner.pos += 1
           return ary
         when "d"
           hsh = {}
           until scanner.peek(1) == "e"
             hsh.store(scan(scanner), scan(scanner))
           end
           scanner.pos += 1
           return hsh
         when /\d+:/
           length = token.chop.to_i
           str = scanner.peek(length)
           scanner.pos += length
           return str
         else
           raise BencodingError
         end
       end
     end
   end

Cheers, and thank you all for helping me out!
Daniel Schierbeck

David_A_Black3 · 14 July 2006 18:16

Hi --

···

On Thu, 13 Jul 2006, Daniel Martin wrote:

Daniel Schierbeck <daniel.schierbeck@gmail.com> writes:

I'm trying to write a regular expression that matches bencoded
strings, i.e. strings on the form x:y, where x is the numeric length
of y.

This is valid:

6:foobar

while this is not:

4:foo

I don't think that what you want to do is possible with a mere regular
expression.

It might be possible using perl's special
evaluate-code-while-in-regexp (??{ code }) feature, but not with any
language that doesn't allow regular expression evaluations to escape
back into the host language.

Is ??{ code } in Perl different from #{...} in Ruby? (Not that I was
able to solve Daniel's problem with #{...}, but I'm just curious about
the comparison.)

David

--
http://www.rubypowerandlight.com => Ruby/Rails training & consultancy
Ruby for Rails => RUBY FOR RAILS (reviewed on
Slashdot, 7/12/2006!)
http://dablog.rubypal.com => D[avid ]A[. ]B[lack's][ Web]log
dblack@wobblini.net => me

David_A_Black3 · 12 July 2006 18:54

Hi --

···

On Thu, 13 Jul 2006, Daniel Schierbeck wrote:

My problem is that I have a string like this: "3:foo6:monkey5:sheep", which I need to separate into ["foo", "monkey", "sheep"]. The values can contain numeric values, so splitting at \d won't work. This is what makes it difficult:

"3:ab23:cat5:sheep" => ["ab2", "cat", "sheep"]

I need to grab the number, then read that many characters, then read the next number, etc.

How do you know, in the case of:

2:ab23:cat

which of the two is invalid?

David

--
http://www.rubypowerandlight.com => Ruby/Rails training & consultancy
Ruby for Rails => RUBY FOR RAILS, the Ruby book for
Rails developers
http://dablog.rubypal.com => D[avid ]A[. ]B[lack's][ Web]log
dblack@wobblini.net => me

Marcin_Miel_ynski · 12 July 2006 19:10

Daniel Schierbeck wrote:

My problem is that I have a string like this: "3:foo6:monkey5:sheep", which I need to separate into ["foo", "monkey", "sheep"]. The values can contain numeric values, so splitting at \d won't work. This is what makes it difficult:

"3:ab23:cat5:sheep" => ["ab2", "cat", "sheep"]

I need to grab the number, then read that many characters, then read the next number, etc.

"3:foo6:monkey5:sheep".scan(/(\d+)\:([^\d]+)/){|(num,str)|
     if num.to_i == str.length
         # correct
     else
         # not correct
     end
}

lopex

Tom_Werner1 · 12 July 2006 19:20

Daniel Schierbeck wrote:

Daniel,
    When you grab the data it will be in a string format, so you need
to convert it to a number (most likely integer). Then you can compare
it with the size of the second value you grabbed. I would write it
like this (modify as needed):

print "enter data"
a = gets
valid = /^(\d*):(\w*)/

check = valid.match(a)

if(check[1].to_i == check[2].size)

   <Your code here>

end

Thank you for your reply.

My problem is that I have a string like this: "3:foo6:monkey5:sheep", which I need to separate into ["foo", "monkey", "sheep"]. The values can contain numeric values, so splitting at \d won't work. This is what makes it difficult:

  "3:ab23:cat5:sheep" => ["ab2", "cat", "sheep"]

I need to grab the number, then read that many characters, then read the next number, etc.

Cheers,
Daniel

require 'strscan'

array =
ss = StringScanner.new('3:ab23:cat5:sheep')

while !ss.eos?
  len = ss.scan(/\d+:/).chop.to_i
  array << ss.peek(len)
  ss.pos += len
end

p array

=> ["ab2", "cat", "sheep"]

Tom

···

studlee2@gmail.com wrote:

--
Tom Werner
Helmets to Hardhats
Software Developer
tom@helmetstohardhats.org
www.helmetstohardhats.org

Gavin_Kistner2 · 12 July 2006 19:30

require 'strscan'
s = StringScanner.new( "3:ab23:cat5:sheep" )
words = []
until s.eos?
  if digits = s.scan( /\d+/ )
    digits = digits.to_i
    s.pos += 1
    words << s.peek( digits )
    s.pos += digits
  else
    p words
    abort "Couldn't find digits for: #{s.rest}"
  end
end
p words
#=> ["ab2", "cat", "sheep"]

Daniel_Schierbeck · 14 July 2006 18:15

Daniel Schierbeck wrote:

        when "i"
          number = scanner.scan(/0|(-?[1-9][0-9]*)/)
          raise BencodingError unless number
          raise BencodingError unless scanner.scan(/e/)
          return number

That last line should of course read

return number.to_i

Daniel

Logan_Capaldo · 14 July 2006 18:59

Hi --

Daniel Schierbeck <daniel.schierbeck@gmail.com> writes:

I'm trying to write a regular expression that matches bencoded
strings, i.e. strings on the form x:y, where x is the numeric length
of y.

This is valid:

6:foobar

while this is not:

4:foo

I don't think that what you want to do is possible with a mere regular
expression.

It might be possible using perl's special
evaluate-code-while-in-regexp (??{ code }) feature, but not with any
language that doesn't allow regular expression evaluations to escape
back into the host language.

Is ??{ code } in Perl different from #{...} in Ruby? (Not that I was
able to solve Daniel's problem with #{...}, but I'm just curious about
the comparison.)

According to _Programming Perl_ yes indeedy it is. ??{ } is "Match Time Pattern Interpolation", and it lets you do all sorts of evil (like matching nested parens with a regexp).

So in perl his code would be something like:

% cat mtpi.pl
$s1 = "3:abc";
$s2 = "24:abc";

print "Good\n" if ( $s1 =~ /(\d+):(??{'\w{' . $1 . '}'})/);
print "Bad\n" if ( $s2 =~ /(\d+):(??{'\w{' . $1 . '}'})/);

% perl mtpi.pl
Good

I apologize to any perlers if this isn't idiomatic (or clean) perl, I never had to use this kind of magic in my perl days and I had difficulty getting it to work when I stored the regexp in a variable. But the point is, is that it does work. Which is kind of scary.

···

On Jul 14, 2006, at 2:16 PM, dblack@wobblini.net wrote:

On Thu, 13 Jul 2006, Daniel Martin wrote:

David

--
http://www.rubypowerandlight.com => Ruby/Rails training & consultancy
Ruby for Rails => RUBY FOR RAILS (reviewed on
Slashdot, 7/12/2006!)
http://dablog.rubypal.com => D[avid ]A[. ]B[lack's][ Web]log
dblack@wobblini.net => me

Logan_Capaldo · 12 July 2006 19:13

Hi --

My problem is that I have a string like this: "3:foo6:monkey5:sheep", which I need to separate into ["foo", "monkey", "sheep"]. The values can contain numeric values, so splitting at \d won't work. This is what makes it difficult:

"3:ab23:cat5:sheep" => ["ab2", "cat", "sheep"]

I need to grab the number, then read that many characters, then read the next number, etc.

How do you know, in the case of:

2:ab23:cat

which of the two is invalid?

David

I don't believe you do. OTOH I'm not sure if you should care. If one's invalid, the whole thing is invalid.

Anyway, here's my solution:

% cat n_colon_s.rb
require 'strscan'
def parse_n_colon_s(s)
   scanner = StringScanner.new(s)
   results =
   until scanner.empty?
     if scanner.scan(/(\d+):/)
       n = scanner[1].to_i
       raise 'Malformed String' unless (s = scanner.scan(/\w{#{n}}/))
       results << s
     else
       raise 'Malformed string'
     end
   end
   results
end

p parse_n_colon_s("2:ab3:cat")
p parse_n_colon_s("2:ab23:cat")

% ruby n_colon_s.rb
["ab", "cat"]
-:18:in `parse_n_colon_s': Malformed String (RuntimeError)
from -:28

···

On Jul 12, 2006, at 2:54 PM, dblack@wobblini.net wrote:

On Thu, 13 Jul 2006, Daniel Schierbeck wrote:

--
http://www.rubypowerandlight.com => Ruby/Rails training & consultancy
Ruby for Rails => RUBY FOR RAILS, the Ruby book for
Rails developers
http://dablog.rubypal.com => D[avid ]A[. ]B[lack's][ Web]log
dblack@wobblini.net => me

Ara.T.Howard6 · 12 July 2006 19:31

the second. it's invalid here

2:ab23:cat
^

virtual position 10.

harp:~ > cat a.rb
require 'strscan'

     class List < ::Array
       def initialize s
         parse s
       end
       class ParseError < ::StandardError; end
       def parse s
         clear
         ss = StringScanner.new s.to_s
         until ss.eos?
           n = ss.scan %r/^\d+:/
           raise ParseError, "syntax error beginning @ pos <#{ ss.pos }>" unless n

n = n.to_i
raise RangeError, "bad length <#{ n }>" if n < 0

word = ss.scan(%r/.{0,#{ n }}/).to_s

raise ParseError, "syntax error beginning @ pos <#{ ss.pos }>" unless
word.size == n

           self << word
         end
       end
     end

list = List.new "3:ab23:cat5:sheep"
p list

list = List.new "2:ab23:cat"
p list

     harp:~ > ruby a.rb
     ["ab2", "cat", "sheep"]
     a.rb:20:in `parse': syntax error beginning @ pos <10> (List::ParseError)
             from a.rb:5:in `initialize'
             from a.rb:31

regards.

-a

···

On Thu, 13 Jul 2006 dblack@wobblini.net wrote:

Hi --

On Thu, 13 Jul 2006, Daniel Schierbeck wrote:

My problem is that I have a string like this: "3:foo6:monkey5:sheep", which I need to separate into ["foo", "monkey", "sheep"]. The values can contain numeric values, so splitting at \d won't work. This is what makes it difficult:

"3:ab23:cat5:sheep" => ["ab2", "cat", "sheep"]

I need to grab the number, then read that many characters, then read the next number, etc.

How do you know, in the case of:

2:ab23:cat

which of the two is invalid?

--
suffering increases your inner strength. also, the wishing for suffering
makes the suffering disappear.
- h.h. the 14th dali lama

Gavin_Kistner2 · 12 July 2006 19:45

Here's what I consider a slightly cleaner, more robust version:

require 'strscan'
inp = "3:ab23:cat5:sheep"
s = StringScanner.new( inp )
words = []
until s.eos?
  begin
    unless digits = s.scan( /\d+(?=:)/ )
      raise "I can't find an integer followed by a colon"
    end
    s.pos += 1 # skip the colon we know is there
    digits = digits.to_i
    unless s.rest_size >= digits
      raise "I ran out of characters; looking for #{digits} characters,
#{s.rest_size} left"
    end
    words << s.peek( digits )
    s.pos += digits
  rescue RuntimeError => e
    warn "Looking at #{s.rest.inspect},"
    warn e.message
    abort "Words found so far: #{words.inspect}"
  end
end
puts "Words found: ", words

Daniel_Martin · 15 July 2006 12:43

Logan Capaldo <logancapaldo@gmail.com> writes:

Is ??{ code } in Perl different from #{...} in Ruby? (Not that I was
able to solve Daniel's problem with #{...}, but I'm just curious about
the comparison.)

I apologize to any perlers if this isn't idiomatic (or clean) perl, I
never had to use this kind of magic in my perl days and I had
difficulty getting it to work when I stored the regexp in a variable.
But the point is, is that it does work. Which is kind of scary.

I think you probably had trouble with the \ when you tried storing it
in a variable because of quoting issues. So use qr, the perl
equivalent of ruby's %r:

$s1 = "3:abc";
$s2 = "24:abc";

$regexp = qr/(\d+):(??{'\w{' . $1 . '}'})/;

print "Good\n" if ( $s1 =~ $regexp);
print "Bad\n" if ( $s2 =~ $regexp);

Although, since bencoded strings can contain any character, and not
just word characters, what you really want is:

$regexp = qr/(\d+):(??{".{$1}"})/;

Perl allows bunches of special constructs in regular expressions that
sane languages, which like to keep the matching of regular expressions
away from being able to jump back into the host language. (Note that
perl combines this feature with extra language-level security support,
since most programmers would assume that a user-supplied regexp
couldn't execute arbitrary code)

For more examples, google "perlre".

Incidentally, I've just been able to reproduce as much of bencoding as
I implemented in ruby earlier in a pair of nasty perl regular
expressions.

I won't post it, since this is ruby-talk and not
perl-regex-nastiness-talk, but people who really want to see it can
look at http://paste.lisp.org/display/22637

It doesn't technically decode every possible bencoded string, because
limitations in perl's regexp engine don't let me say .{n} where "n" is
larger than about 32000 while a bencoded string can in theory have a
length up to 65535. But other than that, it should implement the
entire bencode spec.

···

On Jul 14, 2006, at 2:16 PM, dblack@wobblini.net wrote:

Topic		Replies	Views
Help with a regexp ruby-talk	0	68	12 July 2006
Help for a little regexp ruby-talk	5	123	27 June 2012
Matches -> regexp? ruby-talk	5	57	13 December 2005
Ruby regexp Match ruby-talk	4	51	4 December 2007
Regular expressions ruby-talk	26	103	17 April 2003

Help with a regexp

Related Topics