IO#Foreach -- Max line length

Tristin_Davis · 6 March 2008 21:05

I'm trying to emulate, in Ruby 1.8.6, the new feature in 1.9 that lets
you specify the maximum length of a line read. Can anyone help?

--
Posted via http://www.ruby-forum.com/.

7stud · 6 March 2008 22:37

Tristin Davis wrote:

I'm trying to emulate the new feature in 1.9 that allows you to specify
the maximum length of a line read in Ruby 1.8.6. Can anyone help?

max = 3
count = 0

IO.foreach('data.txt') do |line|
  if count == max
    break
  else
    count += 1
  end

  puts line
end

Tristin_Davis · 6 March 2008 22:54

But by the time you actually get count, isn't the line already read into
memory? If the line is 7 gigabytes, it'll probably crash the system.

Arlen_Christian_Mar1 · 6 March 2008 23:44

Hi,

On Fri, Mar 7, 2008 at 9:37 AM, 7stud -- <bbxx789_05ss@yahoo.com> wrote:

max = 3
count = 0

IO.foreach('data.txt') do |line|
  if count == max
    break
  else
    count += 1
  end

  puts line
end

Not quite the solution. This reads a number of lines, as opposed to
limiting the length of a single line read.

Arlen
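
For reference, a minimal sketch of the 1.9 behaviour being emulated: from Ruby 1.9 on, each_line (like gets and IO.foreach) accepts an integer limit and yields at most that many bytes per piece, stopping early at a line break. The helper name limited_lines and the sample data are made up for illustration, and StringIO stands in for a real file. This does not run on 1.8.6, where no limit argument exists.

```ruby
require 'stringio'

# Hypothetical helper: collects the pieces that each_line yields when it is
# given an integer limit (Ruby 1.9 and later only).
def limited_lines(io, limit)
  pieces = []
  io.each_line(limit) { |piece| pieces << piece }
  pieces
end

# StringIO stands in for a file so the example is self-contained.
io = StringIO.new("short\na much longer line\n")
p limited_lines(io, 8)
# => ["short\n", "a much l", "onger li", "ne\n"]
```

A long line comes back as several pieces, so no single read ever holds more than the limit in memory.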

7stud · 6 March 2008 23:52

Tristin Davis wrote:

But by the time you actually get count, isn't the line already read in
memory. So if the line is 7 gigabytes, it'll probably crash the system.

Is this what you are looking for:

max_bytes = 30
text = IO.read('data.txt', max_bytes)
puts text

_Pena_Botp1 · 7 March 2008 02:34

On Behalf Of Tristin Davis:
# But by the time you actually get count, isn't the line
# already read in
# memory. So if the line is 7 gigabytes, it'll probably crash
# the system.

read accepts an argument specifying how many bytes to read.

so how about,

irb(main):040:0> File.open "test.rb" do |f| f.read end
=> "a=(1..2)\n\na\nputs a\n\nputs a.each{|x| puts x}"

irb(main):041:0> File.open "test.rb" do |f| f.read 2 end
=> "a="

irb(main):042:0> File.open "test.rb" do |f| f.read 2; f.read 2 end
=> "(1"

irb(main):043:0> File.open "test.rb" do |f| while x=f.read(2); p x; end; end
"a="
"(1"
".."
"2)"
"\n\n"
"a\n"
"pu"
"ts"
" a"
"\n\n"
"pu"
"ts"
" a"
".e"
"ac"
"h{"
"|x"
"| "
"pu"
"ts"
" x"
"}"
=> nil

kind regards -botp

Adam_Shelly · 7 March 2008 17:58

On 3/6/08, Peña, Botp <botp@delmonte-phil.com> wrote:

On Behalf Of Tristin Davis:
# But by the time you actually get count, isn't the line
# already read in
# memory. So if the line is 7 gigabytes, it'll probably crash
# the system.

read will accept arg on how many bytes to read.

so how about,

...

irb(main):043:0> File.open "test.rb" do |f| while x=f.read(2); p x; end; end

That solution essentially ignores linebreaks.
If you want to read up to a linebreak or N characters, whichever comes
first, you could use one of these:

------
class IO
  #read by characters
  def for_eachA(linelen)
    c=0
    while (c)
      buf=''
      linelen.times {
        break unless c=getc
        buf<<c
        break if c.chr== $/
      }
      yield buf
    end
  end

  #read by lines
  def for_eachB(linelen)
    re = Regexp.new(".*?#{Regexp.escape($/)}")
    buf=''
    while (line = read(linelen-buf.length))
      buf = (buf+line).gsub(re){|l| yield l;''}
      if buf.length == linelen
        yield buf
        buf=''
      end
    end
    yield buf
  end
end

File.open("foreach.rb") do |f|
  f.for_eachA(10){|l| p l}
end

File.open("foreach.rb") do |f|
  f.for_eachB(10){|l| p l}
end
------

I'd guess the second version would be faster, but I didn't time it.

-Adam

Tristin_Davis · 8 March 2008 21:31

Thanks for the ideas, Adam. I thought someone might be able to use this,
so I figured I'd post it. It processed about 675,000 records of 1100+
bytes each in an hour. Not fantastic performance, but it works. If
someone can tell me how to improve the performance, have at it.

module Util

  def too_large?(buffer,max=10)
    return true if buffer.length >= max
    false
  end
end

include Util

file = ARGV.shift #"C:/Documents and Settings/trdavi/Desktop/a1-1k.aa"
buf=''
record = 1
frequency = 100

f = File.open(file,'r')

while c=f.getc
    buf << c

    if too_large?(buf,max=102400)
        p "record #{record} is too long, skipping to end"
        while(x=f.getc)
          if x.chr == $/
            buf=''
            record += 1
            p "At record #{record}" if( (record % frequency ) == 0 )
            break
          end
        end
    end

    if c.chr == $/
       record += 1
       print "At record #{record}" if( (record % frequency ) == 0 )
       buf = ''
    end
end

# If we still have something in the buffer, then it is probably the last record.
unless buf.empty?
  #record += 1
  p "Last record is:" + buf
end

f.close
p record

7stud · 9 March 2008 05:03

Tristin Davis wrote:

Thanks for the ideas Adam. I thought someone might be able to use it so
I figured i'd post it. It processed about 675,000 1100+ byte records in
an hour. Not fantastic performance, but it works. If someone can tell
me how to improve the performance then have at it.

module Util

  def too_large?(buffer,max=10)
    return true if buffer.length >= max
    false
  end
end

include Util

file = ARGV.shift #"C:/Documents and Settings/trdavi/Desktop/a1-1k.aa"
buf=''
record = 1
frequency = 100

f = File.open(file,'r')

while c=f.getc

  if buf.length < max #(but what if you find a '\n' before max?)
    buf << c
  else
    buf = ''
    f.gets
  end

Tristin_Davis · 9 March 2008 10:45

That's what the second if statement is for: catching the delimiter when
the buffer isn't too large. I can't use gets because I may exhaust all
the memory before the whole line is read. I'm reading variable-length
records, but some of them are bad data and exceed a max length of 100k.
That's what the script is scanning for.
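
On Ruby 1.9 or later (not 1.8.6), a scan like this could lean on the limit argument of gets, which never returns more than the requested number of bytes. The sketch below is one possible shape under that assumption; the method name oversized_records, its parameters, and the sample data are hypothetical, and it assumes a single-byte delimiter.

```ruby
require 'stringio'

# Hypothetical helper: returns the 1-based numbers of records whose content
# (delimiter excluded) is longer than max_len bytes. At most
# max_len + delimiter.bytesize bytes of a record are held at once.
# Assumes a single-byte delimiter and Ruby 1.9+ (gets with a limit).
def oversized_records(io, max_len, delimiter = "\n")
  limit = max_len + delimiter.bytesize
  bad = []
  record = 0
  while (piece = io.gets(delimiter, limit))
    record += 1
    # A piece that ends with the delimiter, or is shorter than the limit
    # (last record at end of file), fits within max_len.
    next if piece.end_with?(delimiter) || piece.bytesize < limit
    bad << record
    # Discard the rest of the oversized record in bounded pieces.
    piece = io.gets(delimiter, limit) until piece.nil? || piece.end_with?(delimiter)
  end
  bad
end

# StringIO stands in for the data file; record 2 is 50 bytes long.
io = StringIO.new("ok\n" + "x" * 50 + "\nfine\nlast")
p oversized_records(io, 10)
# => [2]
```

Because gets stops at the limit, a 7-gigabyte record is consumed a small piece at a time rather than loaded whole.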

7stud · 9 March 2008 21:34

Tristin Davis wrote:

That's what the 2nd if statement is; for catching the delimiter if the
buffer isn't too large. I can't use gets b/c I may expend all the
memory before the actual line is read.

Look. A string and a file are really no different--except reading from
a file is slow. Therefore, to speed things up, read in the maximum every
time you read from the file and store it in a string. Process the
string just like you would the file. Then read from the file again.
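
One way to read this advice, sketched under some assumptions: pull the file in fixed-size chunks with read, locate delimiters inside each chunk with String#index instead of stepping one character at a time, and carry the length of any unfinished record over to the next chunk. The name scan_in_chunks, its parameters, and the sample data are hypothetical; a single-byte delimiter is assumed so that it can never straddle two chunks.

```ruby
require 'stringio'

# Hypothetical helper: scans io in chunk_size-byte reads and returns the
# 1-based numbers of records whose content (delimiter excluded) is longer
# than max_len bytes. Only one chunk is held in memory at a time.
def scan_in_chunks(io, max_len, delimiter = "\n", chunk_size = 1 << 20)
  bad = []
  record = 1
  current_len = 0                      # bytes seen so far in this record
  while (chunk = io.read(chunk_size))  # read returns nil at end of file
    pos = 0
    while (hit = chunk.index(delimiter, pos))
      current_len += hit - pos
      bad << record if current_len > max_len
      record += 1
      current_len = 0
      pos = hit + 1
    end
    current_len += chunk.length - pos  # partial record continues in next chunk
  end
  bad << record if current_len > max_len  # final record without a delimiter
  bad
end

# StringIO stands in for the data file; a tiny chunk size forces records
# to span several chunks.
io = StringIO.new("ok\n" + "x" * 50 + "\nfine\nlast")
p scan_in_chunks(io, 10, "\n", 7)
# => [2]
```

Only the running length of the current record is tracked, so an oversized record costs no more memory than a normal one.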

Tristin_Davis · 9 March 2008 22:01

Gotcha, I'll post the code once I revamp it.

Tristin_Davis · 10 March 2008 06:22

Here are the benchmarks for the old and new code:
Old: 5.484000 0.031000 5.515000 ( 5.782000)
New: 5.094000 0.047000 5.141000 ( 5.407000)

=cut

require 'strscan'

module DataVerifier

  def too_large?(buffer,max=1024)
    return true if buffer.length >= max
    false
  end

  # cache_size is the number of bytes read from the file at a time.
  def verify_vbl(file,frequency,max,delimiter,out,cache_size=1048576)
    $/=delimiter

    buffer=''
    buf=''
    record = 1
    o = File.new(out,"w")
    f = File.open(file,'r')

    # Read the file one chunk at a time and scan each chunk in memory.
    while(buffer=f.read(cache_size))
      cache=StringScanner.new(buffer)

      while(c = cache.getch)
        buf << c

        if too_large?(buf,max)
          o.print "record #{record} is too long, skipping to end\n"
          while(x=cache.getch)
            if x == $/
              buf=''
              record += 1
              print "At record #{record}\n" if( (record % frequency ) == 0 ) unless frequency.nil?
              break
            end
          end
        end

        if c == $/
          record += 1
          print "At record #{record}\n" if( (record % frequency ) == 0 ) unless frequency.nil?
          buf = ''
        end
      end
    end
    f.close
    o.close
    record
  end
end
