Strings vs arrays

Hi.

I am wanting to parse large files (of source code) by reading byte by
byte and converting certain patterns into different strings. I can think
of a couple of ways to do this (starting with the source code in a
String):
     1. String#split(//) -> array, then use Array#shift (maybe slow to
        convert to array?)
     2. Keep a counter and use String#[] (which seems unrubyish to me)

Which of these would be preferable? Does anyone know of any better ways?

···

--
Luke Worth

Regex. You know what these patterns are in advance.

-austin

···

On 7/9/05, Luke Worth <luke@worth.id.au> wrote:

I am wanting to parse large files (of source code) by reading byte by
byte and converting certain patterns into different strings. I can think
of a couple of ways to do this (starting with the source code in a
String):
     1. String#split(//) -> array, then use Array#shift (maybe slow to
        convert to array?)
     2. Keep a counter and use String# (which seems unrubyish to me)

--
Austin Ziegler * halostatue@gmail.com
               * Alternate: austin@halostatue.ca

"Luke Worth" <luke@worth.id.au> schrieb im Newsbeitrag news:1120914144.28438.10.camel@fish.bwnet.com.au...

Hi.

I am wanting to parse large files (of source code) by reading byte by
byte and converting certain patterns into different strings. I can think
of a couple of ways to do this (starting with the source code in a
String):
    1. String#split(//) -> array, then use Array#shift (maybe slow to
       convert to array?)
    2. Keep a counter and use String# (which seems unrubyish to me)

Which of these would be preferable? Does anyone know of any better ways?

I'd do none of them. If you know that your patterns stay on single lines and don't extend to following lines I'd read the file line by line and use String#gsub. Something along the lines of

while ( line = gets )
  line.gsub!( /pattern/, 'replacement' )
  # or line.gsub!( /pattern/ ) {|match| create replacement }
  puts line
end

If patterns extend lines I'd first try to slurp the whole file into mem and then use gsub (possibly with option /m for multiline). Only if that does not work (because of file size for example) I'd resort to more complicated solutions like the ones you described.

Kind regards

    robert

Luke Worth <luke@worth.id.au> writes:

Hi.

I am wanting to parse large files (of source code) by reading byte by
byte and converting certain patterns into different strings. I can think
of a couple of ways to do this (starting with the source code in a
String):
     1. String#split(//) -> array, then use Array#shift (maybe slow to
        convert to array?)
     2. Keep a counter and use String# (which seems unrubyish to me)

Which of these would be preferable? Does anyone know of any better ways?

Why not use String#shift, which is O(1)?

···

--
Daniel Brockman <daniel@brockman.se>

Because it doesn't exist :slight_smile:

irb(main):001:0> "aoeu".shift
NoMethodError: undefined method `shift' for "aoeu":String
        from (irb):1

To the people suggesting regex, sorry I did think of that. However it
won't work because of the problem: I want to convert all instances of
dot-space into a single space, and all instances of space-dot into a
single dot. All dots not followed or preceded by spaces (excluding
newline) must be turned into newlines.
For example, how would you solve the following:
"bla.h+aoeu.++test.\n+.hello" (where + represents a space)
i think it's hard.

Thanks for your help everyone, I think I've figured out how to do it
with a queue structure though.

···

On Sun, 2005-07-10 at 00:06 +0900, Daniel Brockman wrote:

Why not use String#shift, which is O(1)?

--
Luke Worth

Daniel Brockman wrote:

Why not use String#shift, which is O(1)?

EDIT: Daniel later redacted this to Array#shift. My question remains, however.

Many times when I hear an answer, I am more interested in how the answer was obtained than the answer itself, so that I can go find the answers to related questions myself. This is a perfect example of this.

How do you know Array#shift is O(1)? I'm not challenging you; I'm asking because I really want to know how to find this out. Is it documented somewhere? Did you profile it? Did you read the source code?

-dB

···

--
David Brady
ruby-talk@shinybit.com
I'm having a really surreal day... OR AM I?

Will this work? (I used + instead of space as in your example)

str = "bla.h+aoeu.++test.\n+.hello"
p str.gsub(/\.\+|\+\.|\./) { |s|
  case s
  when '+.' then '.'
  when '.+' then '+'
  else "\n"
  end
}
==> "bla\nh+aoeu++test\n\n.hello"

···

On Saturday 09 July 2005 11:25 am, Luke Worth wrote:

On Sun, 2005-07-10 at 00:06 +0900, Daniel Brockman wrote:
> Why not use String#shift, which is O(1)?

Because it doesn't exist :slight_smile:

irb(main):001:0> "aoeu".shift
NoMethodError: undefined method `shift' for "aoeu":String
        from (irb):1

To the people suggesting regex, sorry I did think of that. However it
won't work because of the problem: I want to convert all instances of
dot-space into a single space, and all instances of space-dot into a
single dot. All dots not followed or preceded by spaces (excluding
newline) must be turned into newlines.
For example, how would you solve the following:
"bla.h+aoeu.++test.\n+.hello" (where + represents a space)
i think it's hard.

--
-- Jim Weirich jim@weirichhouse.org http://onestepback.org
-----------------------------------------------------------------
"Beware of bugs in the above code; I have only proved it correct,
not tried it." -- Donald Knuth (in a memo to Peter van Emde Boas)

Luke Worth <luke@worth.id.au> writes:

···

On Sun, 2005-07-10 at 00:06 +0900, Daniel Brockman wrote:

Why not use String#shift, which is O(1)?

Because it doesn't exist :slight_smile:

Oh, that's as good a reason as any. :slight_smile:

Who said there were no subtle differences between String and Array?
I would like that person to explain this logic.

--
Daniel Brockman <daniel@brockman.se>

David Brady <ruby_talk@shinybit.com> writes:

Daniel Brockman wrote:

Why not use String#shift, which is O(1)?

EDIT: Daniel later redacted this to Array#shift. My question
remains, however.

Well, not really. My answer is unaffected, however. :slight_smile:

Many times when I hear an answer, I am more interested in how the
answer was obtained than the answer itself, so that I can go find
the answers to related questions myself. This is a perfect example
of this.

I'm sorry. Had there not been a thread about this particular subject
only a week ago, I would have explained.

How do you know Array#shift is O(1)? I'm not challenging you; I'm
asking because I really want to know how to find this out.

I appreciate you making this clear. :slight_smile:

Is it documented somewhere? Did you profile it? Did you read the
source code?

I did not profile it, and I don't think it's documented --- although I
would agree that it should be.

I know because Eric Mahurin posted about this a week ago in

   <20050629195248.75747.qmail@web41115.mail.yahoo.com>,

and a quick glance at the source confirms it:

   VALUE
   rb_ary_shift(ary)
       VALUE ary;
   {
       VALUE top;
   
       rb_ary_modify_check(ary);
       if (RARRAY(ary)->len == 0) return Qnil;
       top = RARRAY(ary)->ptr[0];
       ary_make_shared(ary);
       RARRAY(ary)->ptr++; /* shift ptr */
       RARRAY(ary)->len--;
   
       return top;
   }

Note also that Array#unshift is amortized O(1); it allocates new
memory in chunks (allocating one additional byte would be insane).

Eric's thread was about similar calls, such as Array#slice!(0, n),
having needlessly large O(n) complexity.

···

--
Daniel Brockman <daniel@brockman.se>

    So really, we all have to ask ourselves:
    Am I waiting for RMS to do this? --TTN.

Hey that's cool, I didn't know you could do that.
I think i'll stick with my queue solution though. The thing i failed to
mention is that i want to tokenize across spaces not preceded or
followed by a dot. I'm also forming it into a tree structure at the same
time....
Thanks for the gsub tip though!

···

On Sun, 2005-07-10 at 00:47 +0900, Jim Weirich wrote:

Will this work? (I used + instead of space as in your example)

str = "bla.h+aoeu.++test.\n+.hello"
p str.gsub(/\.\+|\+\.|\./) { |s|
  case s
  when '+.' then '.'
  when '.+' then '+'
  else "\n"
  end
}
==> "bla\nh+aoeu++test\n\n.hello"

--
Luke Worth

I agree. We should have shift for String. But, more
importantly, slice!(0) and slice!(0,m) should be O(1) like
Array#shift is. But, I've discussed this before in a previous
thread.

I do have an implementation written in ruby for Array/String
where all insertions/deletions at the front of are O(1), but
this doesn't show its advantage until you get to about 50K
elements or so. This is because the ruby interpret time still
dominates over the fast C copy until that many elements. But,
after that point it is quite obvious.

···

--- Daniel Brockman <daniel@brockman.se> wrote:

Luke Worth <luke@worth.id.au> writes:

> On Sun, 2005-07-10 at 00:06 +0900, Daniel Brockman wrote:
>
>> Why not use String#shift, which is O(1)?
>
> Because it doesn't exist :slight_smile:

Oh, that's as good a reason as any. :slight_smile:

Who said there were no subtle differences between String and
Array?
I would like that person to explain this logic.

____________________________________________________
Sell on Yahoo! Auctions – no fees. Bid on great items.

Luke Worth wrote:

Hey that's cool, I didn't know you could do that.
I think i'll stick with my queue solution though. The thing i failed to
mention is that i want to tokenize across spaces not preceded or
followed by a dot. I'm also forming it into a tree structure at the same
time....
Thanks for the gsub tip though!

Consider StringScanner.

Devin

Disagree. What wold the semantics of String#shift be?
What would it shift? Bytes? Characters? Words? Lines?

There is no clear notion of an "Element" in a String.

You can always use split or implement shift
yourself if you need it for a particular application.

-Levin

···

Eric Mahurin <eric_mahurin@yahoo.com> wrote:

I agree. We should have shift for String.

Eric Mahurin <eric_mahurin@yahoo.com> writes:

Who said there were no subtle differences between String and Array?
I would like that person to explain [why there is no String#shift].

I agree. We should have shift for String. But, more importantly,
slice!(0) and slice!(0,m) should be O(1) like Array#shift is.
But, I've discussed this before in a previous thread.

Yes, I agree with this as well.

I do have an implementation written in ruby for Array/String
where all insertions/deletions at the front of are O(1),

I would like to see this implementation. Particularly the Ruby
implementation of an O(1) String#shift.

···

Daniel Brockman <daniel@brockman.se> wrote:

but this doesn't show its advantage until you get to about 50K
elements or so. This is because the ruby interpret time still
dominates over the fast C copy until that many elements.
But, after that point it is quite obvious.

--
Daniel Brockman <daniel@brockman.se>

    So really, we all have to ask ourselves:
    Am I waiting for RMS to do this? --TTN.

"Levin Alexander" <levin@grundeis.net> writes:

I agree. We should have shift for String.

Disagree. What wold the semantics of String#shift be?
What would it shift? Bytes? Characters?

Whatever String# and all the other String methods index, of course.

Words?

Please --- don't be silly.

Lines?

I still can't believe String#each does this by default.

There is no clear notion of an "Element" in a String.

If this is true, then we have a serious problem. Before doing much
anything about its API, we need to decide whether String is a byte
array or a character array. (Presumably, matz & co. already have.)

You can always use split or implement shift
yourself if you need it for a particular application.

Implementing an O(1) String#shift seems non-trivial to me.
(Eric said he has done it, but did not yet post the code.)

Anyway, this is irrelevant since, after all, you can implement most
of the Ruby standard library yourself ``if you need it.''

···

Eric Mahurin <eric_mahurin@yahoo.com> wrote:

--
Daniel Brockman <daniel@brockman.se>

    So really, we all have to ask ourselves:
    Am I waiting for RMS to do this? --TTN.

Below is what I have now. You make one of these objects by
just passing an array or a string to Indexed2.new. It only
uses methods in common to Array and String so it doesn't care
which you use (a nice example of taking advantage of duck
typing). From then on you can treat one of these Indexed2
objects like an Array or String (using only the common
methods).

This class also has algorithmic speedup for all inserts/deletes
in the first half of the array. Unfortunately, since Array and
String don't have an efficient way to copy data within the
object, you only see an advantage for about the first quarter
of the sequence. If all of this was written C, inserts/deletes
on both ends of the Array/String would have the same speed
(O(1)) and the average insert/delete time at any position would
be cut in half. Because it is not written in C, you don't
start seeing an advantage until you get to 50K+ elements.

class Indexed2
    Growth = 0.5
    Max = 1
    Shrink = 0.25
    BreakPoint = 1.0/4 # should be 0.5 with a good copy method
    def initialize(data=@@data_class.new)
        @data = data
        @offset = 0
    end
    def class
        @@data_class = @data.class
        super
    end
    def size
        @data.size-@offset
    end
    alias length size
    def empty?
        (@data.size-@offset).zero?
    end
    def << (v)
        @data << v
    end
    def slice(index,*len)
        index = size+index.to_i if
(index.nonzero?||1.0/index)<0
        @data.slice(index+@offset,*len)
    end
    alias slice
    def slice!(index,*len)
        slice(index,*len)
    ensure
        self[index,len[0]||1] = @data.class.new
    end
    def dup
        self.class.new(@data.dup)
    end
    def =(index,*len_value)
        value = len_value.pop
        return(@data[index+@offset] = value) if
len_value.empty?
        len = len_value[0]
        size = @data.size-@offset
        index = size+index.to_i if
(index.nonzero?||1.0/index)<0
        delta = value.size-len
        if delta.nonzero? and index<(size*BreakPoint).to_i
            size += delta
            len += delta
            offset0 = (offset = @offset-delta)
            if delta<0
                if offset>Max*size
                    offset = (Shrink*size).to_i
                    @data[offset,index] = @data[@offset,index]
                    @data[index+offset,offset0-offset+len] =
value
                else
                    @data[offset,index] = @data[@offset,index]
                    @data[index+offset,len] = value
                end
            else
                if offset<0
                    offset = (Growth*size).to_i
                    @data[index+len+offset,0] =
                        @data[index+@offset,offset-offset0]
                end
                @data[offset,index] = @data[@offset,index]
                @data[index+offset,len] = value
            end
            @offset = offset
        else
            @data[index+@offset,len] = value
        end
    end
    def shift
        slice!(0)
    end
    def pop
        slice!(-1)
    end
    def push(*args)
        args.each { |v| @data << v }
        self
    end
    def unshift(*args)
        value = @data.class.new
        args.each { |v| value << v }
        self[0,0] = value
        self
    end
end

···

--- Daniel Brockman <daniel@brockman.se> wrote:

Eric Mahurin <eric_mahurin@yahoo.com> writes:
> I do have an implementation written in ruby for
Array/String
> where all insertions/deletions at the front of are O(1),

I would like to see this implementation. Particularly the
Ruby
implementation of an O(1) String#shift.

__________________________________
Yahoo! Mail for Mobile
Take Yahoo! Mail with you! Check email on your mobile phone.
http://mobile.yahoo.com/learn/mail

Consider StringScanner.

Ah, thanks. This is much better than converting into an array.

···

--
Luke Worth

Daniel Brockman wrote:

Whatever String# and all the other String methods index, of course.

Depending on the parameter you pass, # can return a String or an Integer.

There is no clear notion of an "Element" in a String.
   

If this is true, then we have a serious problem. Before doing much
anything about its API, we need to decide whether String is a byte
array or a character array. (Presumably, matz & co. already have.)

There's no such thing as a character in Ruby. (See any discussion on Unicode, etc. in Ruby.) Strings are Objects (stored in C as char*s, I'd guess). Call the right methods on them, and you can get an Integer representing the byte value at a given position ("Hello"[0]), or another String object representing some manipulation of the String ("Hello"[0..0]). Those are your only means of inspection. That was just a long-winded way of saying "this is true."

Anything I didn't reply to, I probably agree with. Since the String methods don't have a consistent notion of an "element," it doesn't seem it would hurt to choose whichever notion we want for a potential #shift method.

Devin

There is no clear notion of an "Element" in a String.

If this is true, then we have a serious problem. Before doing much
anything about its API, we need to decide whether String is a byte
array or a character array. (Presumably, matz & co. already have.)

There's no such thing as a character in Ruby. (See any discussion on Unicode, etc. in Ruby.)

What about Regular Expressions?

   str = "ä"
   str.scan(/./).length #=> 2
   $KCODE = 'u'
   str.scan(/./).length #=> 1

Anything I didn't reply to, I probably agree with. Since the String methods don't have a consistent notion of an "element," it doesn't seem it would hurt to choose whichever notion we want for a potential #shift method.

The possibilities would be:

   (a)
     str = "foo"
     c = str.shift #=> 102
     c.class #=> Fixnum
     str #=> "oo"

   (b)
     str = "foo"
     c = str.shift #=> "f"
     c.class #=> String
     str #=> "oo"

   (c)
     # make String#shift act like split(/./).shift

(a) would probably most consistent.

-Levin

···

Devin Mullins <twifkak@comcast.net> wrote:

Devin Mullins <twifkak@comcast.net> writes:

Daniel Brockman wrote:

Whatever String# and all the other String methods index, of course.

Depending on the parameter you pass, # can return a String or an
an Integer.

Correct.

There is no clear notion of an "Element" in a String.

If this is true, then we have a serious problem. Before doing much
anything about its API, we need to decide whether String is a byte
array or a character array. (Presumably, matz & co. already have.)

There's no such thing as a character in Ruby.

Not currently, no.

(See any discussion on Unicode, etc. in Ruby.)

Indeed, you have managed to poke my interest, and I'm going to look
into this further.

Strings are Objects (stored in C as char*s, I'd guess).

Obviously.

Call the right methods on them, and you can get an Integer
representing the byte value at a given position ("Hello"[0]),

Yes, but is this subject to change? If so, I guess String is destined
to become a character array. If not, a byte array.

or another String object representing some manipulation of the
String ("Hello"[0..0]).

I'm not talking about subranges.

Those are your only means of inspection. That was just a
long-winded way of saying "this is true."

Anything I didn't reply to, I probably agree with. Since the String
methods don't have a consistent notion of an "element," it doesn't
seem it would hurt to choose whichever notion we want for a
potential #shift method.

No, it does not. If the semantics of String are normalized in a
future version, the issue of String#shift will be one among many.

···

--
Daniel Brockman <daniel@brockman.se>

    So really, we all have to ask ourselves:
    Am I waiting for RMS to do this? --TTN.