String + Range = Strange

How does this work?

irb(main):002:0> ("a".."z").to_a
=> ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
"n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z"]
irb(main):003:0> ("aa".."z").to_a
=> []
irb(main):004:0> ("aa".."1z").to_a
=> []
irb(main):005:0> ("aa".."b1").to_a
["aa".."zz" elided]

What I mean is, what algorithm is used to determine that the stream of
successors is not going to hit the endpoint of the range? And could this
be used to do a faster-than-O(n) set memnbership test?

martin

Exactly.

And the same holds true of many a sequence, like even numbers. Now how
can membership be optimized in a custom manner?

T.

Martin wrote:

irb(main):005:0> ("aa".."b1").to_a
["aa".."zz" elided]

Then - Trans wrote:

Exactly.

And the same holds true of many a sequence, like even numbers. Now how can membership be optimized in a custom manner?
  

Maybe I'm about to say something silly (by not being up-to speed) but why on earth could anyone consider the output from the above one-liner to be sane? Not only is upper bound exceeded (under any sensible notion of comparison) but many intermediate values are missing for example "b!" and "a~" as well as more esoteric strings with control characters such as "b\0" and all strings with capital letters in them?
To make matters worse, even if we accept that ranges have a peculiar semantics over strings, they are inconsistent.

irb(main):001:0> ("aa".."b1")==="zz"
=> false
irb(main):002:0> ("aa".."b1").to_a.include? "zz"
=> true

The max method also behaves in a very counter intuitive way without converting to an array:

irb(main):003:0> ("A".."d").max
=> "Z"

Establishing set membership is something for which (in other languages) I've long had an active interest - which brings me to comment on the even numbers question. A really powerful resolution of range would support "modifiers":

Projection- Applying a function to each element
Selection- Filtering values which do not exist
Intersection- Retaining only values present in both
Union - Retaining values present in either
Minus - Retaining values present in the first but not the second.

It is interesting to note that it is possible to efficiently compute membership (without enumerating all the possible values) unless Projection is applied. If projection is applied then it is necessary that the projection function can be inverted - which can't be done in the general case. So, given an appropriate implementation of efficient - the following generate the odd numbers below n.

efficient(0..n).filter { |x| x%2==0 }
efficient(0..(n/2)).project { |x| x=x*2 }
efficient(0..n).minus efficient(0..n).filter { |x| x%2==1 }
... and many, many more possibilities...

For each of these (except project, assuming no-one supplies the hand-verified inverse to the supplied function) there are efficient [i.e. far better than O(n)] ways to establish membership.

Does anyone not claim that ranges are broken with respect to strings?

Steve [RubyTalk] wrote:

Martin wrote:

irb(main):005:0> ("aa".."b1").to_a
["aa".."zz" elided]

Then - Trans wrote:

Exactly.

And the same holds true of many a sequence, like even numbers. Now how can membership be optimized in a custom manner?
  
Maybe I'm about to say something silly (by not being up-to speed) but why on earth could anyone consider the output from the above one-liner to be sane? Not only is upper bound exceeded (under any sensible notion of comparison) but many intermediate values are missing for example "b!" and "a~" as well as more esoteric strings with control characters such as "b\0" and all strings with capital letters in them?
To make matters worse, even if we accept that ranges have a peculiar semantics over strings, they are inconsistent.

I don't know about being sane, but the behavior is explicable by the fact that:

    Range#to_a (Enumerable#to_a) calls
    Range#each, which uses
    String#step (for Strings, anyway), which uses
    String#upto, which calls on
    String#succ (to generate successive entries)
    and String#<=> (to compare against the endpoint)

The key is that String#upto stops iterating as soon as String#succ generates either

   a) a string lexically equal to the endpoint (or the endpoint's successor, if the range excludes the endpoint), or
   b) a longer string.

(see rb_str_upto() in string.c in the Ruby source code)

It's this second factor that's coming into play here.

    "aa".succ => "ab"
    "az".succ => "ba"
    ...
    "zy".succ => "zz"
    "zz".succ => "aaa" # stop iterating here, "aaa" is longer than "zz"

The point is that even though "aa".."b1" is a valid range, calling succ repeatedly from the starting point will never yield the endpoint.

irb(main):001:0> ("aa".."b1")==="zz"
=> false

Right, because this is a check on the endpoints.

irb(main):002:0> ("aa".."b1").to_a.include? "zz"
=> true

Also right, because succ eventually generates 'zz'

The max method also behaves in a very counter intuitive way without converting to an array:

irb(main):003:0> ("A".."d").max
=> "Z"

Again, 'Z'.succ is 'AA', so the iteration stops. 'd' is never reached, and 'Z' is the last element generated.

Note that ('A'..'d').last => 'd'

Does anyone not claim that ranges are broken with respect to strings?

The point is that 'A'..'d' is somewhat nonsensical given the default implementation of String#succ. However, you could define String#succ in such a way that 'A'..'d' was logical.

(very) contrived example:

    class String
      def succ
        self.downcase[0].succ.chr
      end
    end

    ('A'..'d').to_a => ["A", "b", "c", "d"]
    ('A'..'d').max => 'd'

Bob Showalter wrote:

The point is that even though "aa".."b1" is a valid range, calling succ repeatedly from the starting point will never yield the endpoint.

This, I think is the crux of the matter. It seems to me that given that if "aa".."b1" is a valid range, that calling calling succ on "aa" should logically eventually reach "b1". I understand that in Ruby, today, it doesn't.

I suppose I can get the behaviour I expect by doing something like this.

class String
  def succ
    s=String.new self
    (s.length-1).downto 0 do |i|
      s[i]=(s[i]+1)&0xff
      return s unless s[i]==0
    end
    1.chr.concat s
  end
end

Is there any good reason that this isn't the default behaviour for succ - given that Ruby claims an objective is to have "sensible defaults"? I can see a justification for having a method defined how String#succ is defined today - but I don't see why it has to be as the default succ method... especially since it gives such counter-intuitive results when used with ranges. Alternatively - I suppose - it might make sense if ranges didn't use succ?

Ranges in regexps don't work the same "smart" way - wouldn't consistent default behaviour be far preferable?

Steve [RubyTalk] wrote:

Bob Showalter wrote:

The point is that even though "aa".."b1" is a valid range, calling succ repeatedly from the starting point will never yield the endpoint.

This, I think is the crux of the matter. It seems to me that given that if "aa".."b1" is a valid range, that calling calling succ on "aa" should logically eventually reach "b1". I understand that in Ruby, today, it doesn't.

I think that String#succ is of limited usefulness, and will *always* be a
source of strange edge cases like this.

Personally, I think that (perhaps) String ranges should be disallowed as
being mostly nonsensical (unless a rigid order, like character set order,
were imposed).

Hal

Hal Fulton wrote:

Steve [RubyTalk] wrote:

This, I think is the crux of the matter. It seems to me that given that if "aa".."b1" is a valid range, that calling calling succ on "aa" should logically eventually reach "b1". I understand that in Ruby, today, it doesn't.

I think that String#succ is of limited usefulness, and will *always* be a
source of strange edge cases like this.

I can see circumstances where it might be useful to have a method defined as succ currently is (for example, maybe, when generating human-readable indices where character-encoding-order makes no sense. Ruby's notion of String, is far more powerful than that suggests - it allows an encoding of an arbitrary byte sequence - i.e. Ruby Strings (unlike strings in most languages) are appropriate type for manipulating binary "messages" - which is an _extremely_ useful feature (at least it is to me!)

Personally, I think that (perhaps) String ranges should be disallowed as
being mostly nonsensical (unless a rigid order, like character set order,
were imposed).

I hope that they are not. Obviously, the current mish-mash is far from desirable but I, for one, will want to determine if a given binary byte sequence is "between" two other byte sequences. The present situation is that I can do this using ranges if I replace succ with a "more sensible" (if likely less efficient) version to the default. I wouldn't want to loose the support for byte-sequence ranges using strings - though I would be wholly in favour of any suggestion to rename succ "alpha_num_succ" (or similar) and to implement succ as iterating of byte sequence. I can't see any reason that, by default, any standard type should not support a consistent total ordering... I hope that, in future, my wish might become reality. :slight_smile:

Steve