Help with a regexp

Hi --

···

On Thu, 13 Jul 2006, Logan Capaldo wrote:

On Jul 12, 2006, at 2:54 PM, dblack@wobblini.net wrote:

Hi --

On Thu, 13 Jul 2006, Daniel Schierbeck wrote:

My problem is that I have a string like this: "3:foo6:monkey5:sheep", which I need to separate into ["foo", "monkey", "sheep"]. The values can contain numeric values, so splitting at \d won't work. This is what makes it difficult:

"3:ab23:cat5:sheep" => ["ab2", "cat", "sheep"]

I need to grab the number, then read that many characters, then read the next number, etc.

How do you know, in the case of:

2:ab23:cat

which of the two is invalid?

David

I don't believe you do. OTOH I'm not sure if you should care. If one's invalid, the whole thing is invalid.

OK, but how do you know whether or not this is valid?

   3:ab23:cat

David

--
http://www.rubypowerandlight.com => Ruby/Rails training & consultancy
Ruby for Rails => RUBY FOR RAILS, the Ruby book for
                                                     Rails developers
http://dablog.rubypal.com => D[avid ]A[. ]B[lack's][ Web]log
dblack@wobblini.net => me

Hi --

···

On Thu, 13 Jul 2006, ara.t.howard@noaa.gov wrote:

On Thu, 13 Jul 2006 dblack@wobblini.net wrote:

Hi --

On Thu, 13 Jul 2006, Daniel Schierbeck wrote:

My problem is that I have a string like this: "3:foo6:monkey5:sheep", which I need to separate into ["foo", "monkey", "sheep"]. The values can contain numeric values, so splitting at \d won't work. This is what makes it difficult:

"3:ab23:cat5:sheep" => ["ab2", "cat", "sheep"]

I need to grab the number, then read that many characters, then read the next number, etc.

How do you know, in the case of:

2:ab23:cat

which of the two is invalid?

the second. it's invalid here

2:ab23:cat
           ^

virtual position 10.

How do you know? The first one could be wrong and the second one
right.

David

--
http://www.rubypowerandlight.com => Ruby/Rails training & consultancy
Ruby for Rails => RUBY FOR RAILS, the Ruby book for
                                                     Rails developers
http://dablog.rubypal.com => D[avid ]A[. ]B[lack's][ Web]log
dblack@wobblini.net => me

Phrogz wrote:

Here's what I consider a slightly cleaner, more robust version:

require 'strscan'
inp = "3:ab23:cat5:sheep"
s = StringScanner.new( inp )
words =
until s.eos?
  begin
    unless digits = s.scan( /\d+(?=:)/ )
      raise "I can't find an integer followed by a colon"
    end
    s.pos += 1 # skip the colon we know is there
    digits = digits.to_i
    unless s.rest_size >= digits
      raise "I ran out of characters; looking for #{digits} characters,
#{s.rest_size} left"
    end
    words << s.peek( digits )
    s.pos += digits
  rescue RuntimeError => e
    warn "Looking at #{s.rest.inspect},"
    warn e.message
    abort "Words found so far: #{words.inspect}"
  end
end
puts "Words found: ", words

Thanks for the help!

I ended up doing it slightly different (after a lot of pain, and a sneek peek at a Python script.):

This is an early library that handles bencoding. Note that the error handling is pretty much non-existent at this point, and it'll break if you fart on it.

   class String
     def bencode
       "#{length}:#{self}"
     end
   end

   class Numeric
     def bencode
       "i#{floor}e"
     end
   end

   class Array
     def bencode
       "l#{map{|obj| obj.bencode}}e"
     end
   end

   class Hash
     def bencode
       "d#{map{|key, val| [key.bencode, val.bencode]}}e"
     end
   end

   module Bencode
     @filter = {:integer => /(0|-?[1-9]+\d*)e/,
                :string => /(0|[1-9][0-9]*):/}

     class << self
       attr_reader :map

       def dump(obj)
         obj.bencode
       end

       def load(str)
         decode(str, 0).first
       end

       def decode_integer(data, i)
         match = @filter[:integer].match(data[i..-1])
         raise ArgumentError if match.nil?
         return Integer(match[1]), match.end(0) + i
       end

       def decode_string(data, i)
         match = @filter[:string].match(data[i..-1])
         raise ArgumentError if match.nil?
         length = Integer(match[1])
         offset = match.end(0) + length + i
         return data[(match.end(0) + i)...offset], offset
       end

       def decode_list(data, i)
         ary =
         while data[i..i] != "e"
           val, i = decode(data, i)
           ary << val
         end
         return ary, i + 1
       end

       def decode_dict(data, i)
         hsh = {}
         while data[i..i] != "e"
           key, i = decode_string(data, i)
           val, i = decode(data, i)
           hsh[key] = val
         end
         return hsh, i + 1
       end

       def decode(data, i)
         case data[i..i]
         when "i"
           decode_integer(data, i + 1)
         when "l"
           decode_list(data, i + 1)
         when "d"
           decode_dict(data, i + 1)
         else
           decode_string(data, i)
         end
       end
     end
   end

Eventually I'll upload it to RubyForge. If you've got comments, bring 'em on, but remember that I only just started this today.

Cheers,
Daniel

"Phrogz" <gavin@refinery.com> wrote in news:1152733325.895320.247460
@b28g2000cwb.googlegroups.com:

Here's what I consider a slightly cleaner, more robust version:

require 'strscan'
inp = "3:ab23:cat5:sheep"
s = StringScanner.new( inp )
words =
until s.eos?
  begin
    unless digits = s.scan( /\d+(?=:)/ )
      raise "I can't find an integer followed by a colon"
    end
    s.pos += 1 # skip the colon we know is there
    digits = digits.to_i
    unless s.rest_size >= digits
      raise "I ran out of characters; looking for #{digits} characters,
#{s.rest_size} left"
    end
    words << s.peek( digits )
    s.pos += digits
  rescue RuntimeError => e
    warn "Looking at #{s.rest.inspect},"
    warn e.message
    abort "Words found so far: #{words.inspect}"
  end
end
puts "Words found: ", words

I've been experimenting with Ruby since Tuesday, and I'd like to thank
you all for sharing code with us here - it really speeds us forwards in
picking up the spirit of Ruby coding.

I believe I've simplified your code by using the incredibly full set of
built-in methods in the string object, rather than depending on
"require 'strscan'":
-----------------------8<----------------------------
s = "3:a:23:cat5:sheep"
words =
until s.empty?
  begin
    unless digits = s.slice!(/\d+(?=:)/)
      raise "I can't find an integer followed by a colon"
    end
    words << s.slice!(0..digits.to_i)
    unless words.last.size >= digits.to_i
      raise "I ran out of characters; looking for #{digits} characters,
#{s.size} left"
    end
  rescue RuntimeError => e
    warn "Looking at #{s.inspect},"
    warn e.message
    abort "Words found so far: #{words.inspect}"
  end
end
puts "Words found: ", words
-----------------------8<----------------------------

···

--
Chris

Hi --

···

On Sat, 15 Jul 2006, Daniel Martin wrote:

Logan Capaldo <logancapaldo@gmail.com> writes:

On Jul 14, 2006, at 2:16 PM, dblack@wobblini.net wrote:

Is ??{ code } in Perl different from #{...} in Ruby? (Not that I was
able to solve Daniel's problem with #{...}, but I'm just curious about
the comparison.)

<snip - it is, and it's evil>

I apologize to any perlers if this isn't idiomatic (or clean) perl, I
never had to use this kind of magic in my perl days and I had
difficulty getting it to work when I stored the regexp in a variable.
But the point is, is that it does work. Which is kind of scary.

I think you probably had trouble with the \ when you tried storing it
in a variable because of quoting issues. So use qr, the perl
equivalent of ruby's %r:

$s1 = "3:abc";
$s2 = "24:abc";

$regexp = qr/(\d+):(??{'\w{' . $1 . '}'})/;

print "Good\n" if ( $s1 =~ $regexp);
print "Bad\n" if ( $s2 =~ $regexp);

Although, since bencoded strings can contain any character, and not
just word characters, what you really want is:

$regexp = qr/(\d+):(??{".{$1}"})/;

Perl allows bunches of special constructs in regular expressions that
sane languages, which like to keep the matching of regular expressions
away from being able to jump back into the host language. (Note that
perl combines this feature with extra language-level security support,
since most programmers would assume that a user-supplied regexp
couldn't execute arbitrary code)

Ruby does let you jump back, though, with #{...}. But it looks like
Perl does an extra level of compilation. (Unless I've got that
backwards.)

David

--
http://www.rubypowerandlight.com => Ruby/Rails training & consultancy
Ruby for Rails => RUBY FOR RAILS (reviewed on
                                     Slashdot, 7/12/2006!)
http://dablog.rubypal.com => D[avid ]A[. ]B[lack's][ Web]log
dblack@wobblini.net => me

it is valid.

harp:~ > cat a.rb
require 'strscan'

class List < ::Array
   def initialize s
     parse s
   end
   class ParseError < ::StandardError; end
   def parse s
     clear
     ss = StringScanner.new s.to_s
     until ss.eos?
       n = ss.scan %r/^\d+:/
       raise ParseError, "syntax error beginning @ pos <#{ ss.pos }>" unless n
       n = n.to_i
       raise RangeError, "bad length <#{ n }>" if n < 0
       word = ss.scan(%r/.{0,#{ n }}/).to_s
       raise ParseError, "syntax error beginning @ pos <#{ ss.pos }>" unless
         word.size == n
       self << word
     end
   end
end

list = List.new "3:ab23:cat5:sheep"
p list

list = List.new "3:ab23:cat"
p list

harp:~ > ruby a.rb
["ab2", "cat", "sheep"]
["ab2", "cat"]

regards.

-a

···

On Thu, 13 Jul 2006 dblack@wobblini.net wrote:

OK, but how do you know whether or not this is valid?

3:ab23:cat

--
suffering increases your inner strength. also, the wishing for suffering
makes the suffering disappear.
- h.h. the 14th dali lama

I need to grab the number, then read that many characters, then read the next number, etc.

How do you know, in the case of:

2:ab23:cat

which of the two is invalid?

the second. it's invalid here

2:ab23:cat
           ^

virtual position 10.

How do you know? The first one could be wrong and the second one
right.

because that's the specification:

I need to grab the number, then read that many characters, then read the next number, etc.

so, in this case, the parse follows these steps:

   - read '2:'
   - read 2 chars
   - read '23:'
   - read 23 chars -> ERROR

i think you are confusing the syntax of the structure for the the logic of it.
syntax wise '2:ab' is totally valid and so that token pair is consumed. of
course it's the case that the user meant

   3:ab23:cat

it's not up to parsers to detect logic errors, only syntax errors. in this
case the left to right quality (no backtracking specified) makes it
unambiguous as to which token contains the error - it's the second.

yaml has similar issues, for instance this yaml document

   k : v
   - 0
   - 2
   - 3

loads as

   {"k"=>[0, 2, 3]}

which is obviously not what is meant. but is it

   - k : v
   - 0
   - 2
   - 3

or

   k :
   - 0
   - 2
   - 3

both are one char errors. nevertheless it's too much to expect yaml to
determine which was meant or suspect that an error was made.

regards.

-a

···

On Thu, 13 Jul 2006 dblack@wobblini.net wrote:
--
suffering increases your inner strength. also, the wishing for suffering
makes the suffering disappear.
- h.h. the 14th dali lama

Daniel Schierbeck wrote:

If you've got comments, bring
'em on, but remember that I only just started this today.

One comment: you seem to have intermixed the places for decoding
(module methods) and encoding (instance methods of classes). It would
seem cleaner, to me, to either add class methods to the classes
(Array.from_bencode instead of Bencode.decode_list) or only use the
module (Bencode.from_array instead of Array#bencode).

I've been experimenting with Ruby ... Tuesday, ...

I never even noticed! talk about showing your age :slight_smile:

sorry, I caught the colon at the start of each word - should have been :

s = "3:a:23:cat5:sheep"
words =
until s.empty?
  begin
    unless digits = s.slice!(/\d+(?=:)/)
      raise "I can't find an integer followed by a colon"
    end
    s.slice!(0)
    words << s.slice!(0..digits.to_i-1)
    unless words.last.size >= digits.to_i
      raise "I ran out of characters; looking for #{digits} characters,
#{s.size} left"
    end
  rescue RuntimeError => e
    warn "Looking at #{s.inspect},"
    warn e.message
    abort "Words found so far: #{words.inspect}"
  end
end
puts "Words found: ", words

Goodbye,

···

--
chris

dblack@wobblini.net writes:

Perl allows bunches of special constructs in regular expressions that
sane languages, which like to keep the matching of regular expressions
away from being able to jump back into the host language. (Note that
perl combines this feature with extra language-level security support,
since most programmers would assume that a user-supplied regexp
couldn't execute arbitrary code)

Ruby does let you jump back, though, with #{...}. But it looks like
Perl does an extra level of compilation. (Unless I've got that
backwards.)

Ruby lets you do string interpolation with an easy syntax, and lets
you use that syntax even when writing a string that is being compiled
into a regular expression because it's surrounded by //. Perl has
that, too. This however is something different - the execution of
code in the host language at regular expression long after the
expression has been compiled, in the middle of doing a match.

As I said, most languages don't allow this. The usual pattern is:
1 make string in language-specific way
2 hand string to regexp engine and get back some compiled structure
3 store handle to compiled structure in host language
4 get string to match
5 hand string and compiled structure to regexp engine
6 regexp engine walks the string and compiled structure to determine
  if there's a match.

Now, for speed, step 6 is generally done in C. Sometimes, so is step
2. (I haven't looked at the ruby code, but python does steps 1-5 in
python, and only 6 in C. Java's regexp engine of course does all
those steps in java)

Perl, however, lets you interrupt step 6 and evaluate some perl code
in the midst of the C-based matching code.

···

On Sat, 15 Jul 2006, Daniel Martin wrote:

Hi --

···

On Thu, 13 Jul 2006, ara.t.howard@noaa.gov wrote:

On Thu, 13 Jul 2006 dblack@wobblini.net wrote:

OK, but how do you know whether or not this is valid?

3:ab23:cat

it is valid.

I guess what I mean is: how do you know whether it's valid by
coincidence but actually wrong?

David

--
http://www.rubypowerandlight.com => Ruby/Rails training & consultancy
Ruby for Rails => RUBY FOR RAILS, the Ruby book for
                                                     Rails developers
http://dablog.rubypal.com => D[avid ]A[. ]B[lack's][ Web]log
dblack@wobblini.net => me

Phrogz wrote:

Daniel Schierbeck wrote:

If you've got comments, bring
'em on, but remember that I only just started this today.

One comment: you seem to have intermixed the places for decoding
(module methods) and encoding (instance methods of classes). It would
seem cleaner, to me, to either add class methods to the classes
(Array.from_bencode instead of Bencode.decode_list) or only use the
module (Bencode.from_array instead of Array#bencode).

Actually, I'm imitating the behavior of YAML. I think it's very intuitive that an object creates a bencoded copy of itself, while the parser methods are gathered at one place. Maybe make the /decode(_.+)?/ methods private?

Cheers,
Daniel

Hi --

···

On Sun, 16 Jul 2006, Daniel Martin wrote:

dblack@wobblini.net writes:

On Sat, 15 Jul 2006, Daniel Martin wrote:

Perl allows bunches of special constructs in regular expressions that
sane languages, which like to keep the matching of regular expressions
away from being able to jump back into the host language. (Note that
perl combines this feature with extra language-level security support,
since most programmers would assume that a user-supplied regexp
couldn't execute arbitrary code)

Ruby does let you jump back, though, with #{...}. But it looks like
Perl does an extra level of compilation. (Unless I've got that
backwards.)

Ruby lets you do string interpolation with an easy syntax, and lets
you use that syntax even when writing a string that is being compiled
into a regular expression because it's surrounded by //. Perl has
that, too. This however is something different - the execution of
code in the host language at regular expression long after the
expression has been compiled, in the middle of doing a match.

Right, I see what you mean. I think with "extra level of compilation"
that's what I was groping toward -- really a post-compilation
evaluation in a later context.

Thanks --

David

--
http://www.rubypowerandlight.com => Ruby/Rails training & consultancy
Ruby for Rails => RUBY FOR RAILS (reviewed on
                                     Slashdot, 7/12/2006!)
http://dablog.rubypal.com => D[avid ]A[. ]B[lack's][ Web]log
dblack@wobblini.net => me

the same way ruby knows that this

   a = false and must_call_this_function_but_do_not!

is valid but actually wrong -- it doesn't! that's why logic errors are the
worst kind to debug...

fortunately for us too - otherwise we programmers would be out of work!

:wink:

-a

···

On Thu, 13 Jul 2006 dblack@wobblini.net wrote:

Hi --

On Thu, 13 Jul 2006, ara.t.howard@noaa.gov wrote:

On Thu, 13 Jul 2006 dblack@wobblini.net wrote:

OK, but how do you know whether or not this is valid?

3:ab23:cat

it is valid.

I guess what I mean is: how do you know whether it's valid by
coincidence but actually wrong?

--
suffering increases your inner strength. also, the wishing for suffering
makes the suffering disappear.
- h.h. the 14th dali lama

Hi --

···

On Thu, 13 Jul 2006, ara.t.howard@noaa.gov wrote:

On Thu, 13 Jul 2006 dblack@wobblini.net wrote:

Hi --

On Thu, 13 Jul 2006, ara.t.howard@noaa.gov wrote:

On Thu, 13 Jul 2006 dblack@wobblini.net wrote:

OK, but how do you know whether or not this is valid?

3:ab23:cat

it is valid.

I guess what I mean is: how do you know whether it's valid by
coincidence but actually wrong?

the same way ruby knows that this

a = false and must_call_this_function_but_do_not!

is valid but actually wrong -- it doesn't! that's why logic errors are the
worst kind to debug...

fortunately for us too - otherwise we programmers would be out of work!

I think what I'm groping toward (or not) is a case where backtracking
is necessary before the conclusion is drawn that it's invalid. I know
backtracking wasn't specified... but I wonder whether there are cases
that can't be resolved otherwise. (Just thinking out loud, so to
speak.)

David

--
http://www.rubypowerandlight.com => Ruby/Rails training & consultancy
Ruby for Rails => RUBY FOR RAILS, the Ruby book for
                                                     Rails developers
http://dablog.rubypal.com => D[avid ]A[. ]B[lack's][ Web]log
dblack@wobblini.net => me