How to quickly find a string towards the end of a large io object

How do I scan starting at the end of a big io object to find a string
that is always closer to then end?

here's very roughly how the contents of the IO object look:

Lots's of stuff... lot's of stuff....lot's of
stuff...."my_string_i_need"...lot's of stuff

In other words, I have a massive IO object that I cannot reverse and I
need to very quickly find a string towards the end (but still fairly
far from the end). What's the fastest way to do this?

Thanks in advance!

binary search using seek. in all honestly though, it'd have to be really big to beat a regex if the entire file contents. define big?

a @ http://codeforpeople.com/

···

On Nov 5, 2008, at 11:03 PM, bwv549 wrote:

How do I scan starting at the end of a big io object to find a string
that is always closer to then end?

here's very roughly how the contents of the IO object look:

Lots's of stuff... lot's of stuff....lot's of
stuff...."my_string_i_need"...lot's of stuff

In other words, I have a massive IO object that I cannot reverse and I
need to very quickly find a string towards the end (but still fairly
far from the end). What's the fastest way to do this?

Thanks in advance!

--
we can deny everything, except that we have the possibility of being better. simply reflect on that.
h.h. the 14th dalai lama

How do I scan starting at the end of a big io object to find a string
that is always closer to then end?

here's very roughly how the contents of the IO object look:

Lots's of stuff... lot's of stuff....lot's of
stuff...."my_string_i_need"...lot's of stuff

In other words, I have a massive IO object that I cannot reverse and I

What is a "massive IO object"? Are we talking about a large file? A
socket or pipe which yields a lot data?

need to very quickly find a string towards the end (but still fairly
far from the end). What's the fastest way to do this?

Start scanning at the beginning and travel forward in time. :slight_smile:

No seriously, we need a bit more detail.

Kind regards

robert

···

2008/11/6 bwv549 <jtprince@gmail.com>:

--
remember.guy do |as, often| as.you_can - without end

bwv549 wrote:

How do I scan starting at the end of a big io object to find a string
that is always closer to then end?

here's very roughly how the contents of the IO object look:

Lots's of stuff... lot's of stuff....lot's of
stuff...."my_string_i_need"...lot's of stuff

In other words, I have a massive IO object that I cannot reverse and I
need to very quickly find a string towards the end (but still fairly
far from the end). What's the fastest way to do this?

Thanks in advance!

They are correct that more info is needed. If it is just a huge amount
of text and it absolutely *had* to run faster, in other languages I
would break up the string and have separate threads search.

···

--
Posted via http://www.ruby-forum.com/\.

ara.t.howard wrote:

> In other words, I have a massive IO object that I cannot reverse and I
> need to very quickly find a string towards the end (but still fairly
> far from the end). What's the fastest way to do this?
>
> Thanks in advance!

binary search using seek.

He didn't say his data was sorted, did he?

···

--
Jabber: sepp2k@jabber.org
ICQ: 205544826

bwv549 wrote:
> How do I scan starting at the end of a big io object to find a string
> that is always closer to then end?

        [...]

> In other words, I have a massive IO object that I cannot reverse and I
> need to very quickly find a string towards the end (but still fairly
> far from the end). What's the fastest way to do this?
>
> Thanks in advance!

They are correct that more info is needed. If it is just a huge amount
of text and it absolutely *had* to run faster, in other languages I
would break up the string and have separate threads search.

Other things to take into account would include:
* Will be big string be kept for some time?
* Will there be multiple searches on it?
* Is it bigger than RAM + swap?

You might find that if you're doing multiple searches that a suffix array
(see Programming Pearls) is useful. You may wish to look at Ferret. You
may want to call out to another program to do this if is is obtained from
a file.

ri String#rindex gives:

---------------------------------------------------------- String#rindex
     str.rindex(substring [, fixnum]) => fixnum or nil
     str.rindex(fixnum [, fixnum]) => fixnum or nil
     str.rindex(regexp [, fixnum]) => fixnum or nil

···

On Thu, 6 Nov 2008, Lloyd Linklater wrote:
------------------------------------------------------------------------
     Returns the index of the last occurrence of the given _substring_,
     character (_fixnum_), or pattern (_regexp_) in _str_. Returns +nil+
     if not found. If the second parameter is present, it specifies the
     position in the string to end the search---characters beyond this
     point will not be considered.

        "hello".rindex('e') #=> 1
        "hello".rindex('l') #=> 3
        "hello".rindex('a') #=> nil
        "hello".rindex(101) #=> 1
        "hello".rindex(/[aeiou]/, -2) #=> 1

Does this actually search from the end backwards? Not checked the source
(recently enough || at all) to know, but I'd expect so.

        HTH
        Hugh

I really appreciate all the comments. They are very helpful. I
should have been more specific:

In this case, I am dealing with a file that will usually be >= 100
MB. It may easily be up to 1GB or more. The file is in binary form,
but about 9/10ths of the way there is a small text file embedded into
it. I need to check a parameter in the file before processing the
complete binary format (which I've decoded). At this point I only
expect to see files, but I was hoping to make it generic to any IO
object. I would settle on a solution for a file (rather than an IO
object), but I would like it to be cross-platform and prefer the
solution in ruby if it is fast enough (who wouldn't).

Again, the basic layout of the file:

................................my_param = X........

I need the value 'X' following the 'my_para = '

Thanks again!

Here's roughly what I'm using right now (sort of an amalgamation of
everyone's thoughts):

File.open(filename) do |handle|
      halfway = handle.stat.size / 2
      handle.seek halfway
      last_half = handle.read
      params_start_index = last_half.rindex(my_substring) + halfway
end

using rindex this takes .17 sec on an 85MB file
using index it takes .45 sec on the same file

Those numbers argue that rindex indeed seeks from the end.

Thanks for everyone's help and I will check back to see if there are
any more thoughts.

For simplicity reasons I'd start with this:

def get_param(file_name)
  size = File.size file_name
  File.open file_name do |io|
    io.seek(-(size * 0.1).to_i, IO::SEEK_END)
    io.read[%r{my_param\s*=\s*(\S+)}, 1]
  end
end

Kind regards

robert

···

2008/11/6 bwv549 <jtprince@gmail.com>:

I really appreciate all the comments. They are very helpful. I
should have been more specific:

In this case, I am dealing with a file that will usually be >= 100
MB. It may easily be up to 1GB or more. The file is in binary form,
but about 9/10ths of the way there is a small text file embedded into
it. I need to check a parameter in the file before processing the
complete binary format (which I've decoded). At this point I only
expect to see files, but I was hoping to make it generic to any IO
object. I would settle on a solution for a file (rather than an IO
object), but I would like it to be cross-platform and prefer the
solution in ruby if it is fast enough (who wouldn't).

Again, the basic layout of the file:

................................my_param = X........

I need the value 'X' following the 'my_para = '

--
remember.guy do |as, often| as.you_can - without end

I really appreciate all the comments. They are very helpful. I
should have been more specific:

In this case, I am dealing with a file that will usually be >= 100
MB. It may easily be up to 1GB or more. The file is in binary form,
but about 9/10ths of the way there is a small text file embedded into
it. I need to check a parameter in the file before processing the
complete binary format (which I've decoded). At this point I only
expect to see files, but I was hoping to make it generic to any IO

You can do that with read.

object. I would settle on a solution for a file (rather than an IO
object), but I would like it to be cross-platform and prefer the
solution in ruby if it is fast enough (who wouldn't).

Other languages may be faster of course, but first make it work...

Again, the basic layout of the file:

................................my_param = X........

let the length of your data ^^^^^^^^^^^^ be k

First try reading the whole lot. If it's too big you'll get an exception.
Otherwise use rindex on the buffer from the successful read.

If it's a stream you'll have to read it on order anyway,
so read it in chunks with a 2*k overlap so you can definitely
fit the whole string in a chunk, and boundaries between reads
don't matter. That should work on 10GB files. You might want
to read in multiples of 512 bytes on Unix, maybe 4k (cluster size??)
on Windows, not sure.

I need the value 'X' following the 'my_para = '

Thanks again!

        Hugh

···

On Fri, 7 Nov 2008, bwv549 wrote:

cfp:~ > cat a.rb
def tail_search io, re, options = {}
   io = open io unless io.respond_to?(:read)
   percent = Float(options['percent']||options[:percent]||0.10)

   buf = ''
   size = io.stat.size
   pagesize = Integer(size * percent)
   pos = 0

   loop do
     pos -= pagesize
     break if pos.abs > size
     io.seek(pos, IO::SEEK_END)
     buf = buf + io.read(pagesize)
     match = re.match(buf)
     return match if match
   end

   return nil
ensure
   io.close rescue nil
end

match = tail_search(__FILE__, /(key)=(val)/, :percent => 0.02)
p match.to_a if match

p tail_search(__FILE__, /(key)=(value)/)

__END__
key=val

cfp:~ > ruby a.rb
["key=val", "key", "val"]
nil

you may want to modify to use rindex instead of a regex - up to you.

a @ http://codeforpeople.com/

···

On Nov 6, 2008, at 10:18 AM, bwv549 wrote:

Here's roughly what I'm using right now (sort of an amalgamation of
everyone's thoughts):

File.open(filename) do |handle|
     halfway = handle.stat.size / 2
     handle.seek halfway
     last_half = handle.read
     params_start_index = last_half.rindex(my_substring) + halfway
end

using rindex this takes .17 sec on an 85MB file
using index it takes .45 sec on the same file

Those numbers argue that rindex indeed seeks from the end.

Thanks for everyone's help and I will check back to see if there are
any more thoughts.

--
we can deny everything, except that we have the possibility of being better. simply reflect on that.
h.h. the 14th dalai lama

this one never reads more than pagesize into memory and deals with the fact that a needle could straddle two pages by keeping the current page plus the previous page's first bit (only need maximum of needle size bytes) as the search target.

the extra code is just showing you that it does, in fact, find it's target.

you can up the percent for speed or crank it down to save on memory.

cfp:~ > cat a.rb
def tail_search io, needle, options = {}
   io = open io unless io.respond_to?(:read)
   percent = Float(options['percent']||options[:percent]||0.10)

   buf = ''
   size = io.stat.size
   pagesize = Integer(size * percent)
   pos = 0

   loop do
     pos -= pagesize
     break if pos.abs > size
     io.seek(pos, IO::SEEK_END)
     buf = io.read(pagesize) + buf[0, needle.size]
     relative_index = buf.index(needle)
     if relative_index
       absolute_index = size + pos + relative_index
       return absolute_index
     end
   end

   return nil
ensure
   io.close rescue nil
end

needle = 'key=val'
index = tail_search(__FILE__, needle, :percent => 0.02)
if index
   open(__FILE__) do |fd|
     fd.seek index
     puts fd.read(needle.size)
   end
end

needle = 'io.close rescue nil'
index = tail_search(__FILE__, needle, :percent => 0.02)
if index
   open(__FILE__) do |fd|
     fd.seek index
     puts fd.read(needle.size)
   end
end

__END__
key=val

cfp:~ > ruby a.rb
key=val
io.close rescue nil

a @ http://codeforpeople.com/

···

On Nov 6, 2008, at 10:18 AM, bwv549 wrote:

Here's roughly what I'm using right now (sort of an amalgamation of
everyone's thoughts):

File.open(filename) do |handle|
     halfway = handle.stat.size / 2
     handle.seek halfway
     last_half = handle.read
     params_start_index = last_half.rindex(my_substring) + halfway
end

using rindex this takes .17 sec on an 85MB file
using index it takes .45 sec on the same file

Those numbers argue that rindex indeed seeks from the end.

Thanks for everyone's help and I will check back to see if there are
any more thoughts.

--
we can deny everything, except that we have the possibility of being better. simply reflect on that.
h.h. the 14th dalai lama