Fast searching of large files

Hey all,

Could anyone advise me on a fast way to search a single, very large
file (1 GB) for a string of text? Also, is there a library to
identify the file offset at which this string was found within the file?

Thanks

···

--
Posted via http://www.ruby-forum.com/.

You can use IO#grep like this:
File.open('qimo-2.0-desktop.iso', 'r:BINARY') do |io|
  io.grep(/apiKey/) { |m| p io.pos => m }
end

The pos is the position where the match ended, so just subtract the string length.
The example above used a 700 MB file; it took around 40s the first
time and 2s on subsequent runs, so disk I/O is the limiting factor in terms of
speed (as usual).
Oh, and also don't use binary encoding if you are dealing with another one ;)

···

On Thu, Jul 1, 2010 at 6:47 PM, Stuart Clarke <stuart.clarke1986@gmail.com> wrote:

Hey all,

Could anyone advise me on a fast way to search a single, but very large
file (1Gb) quickly for a string of text? Also, is there a library to
identify the file offset this string was found within the file?

--
Michael Fellinger
CTO, The Rubyists, LLC

Stuart Clarke wrote:

Hey all,

Could anyone advise me on a fast way to search a single, but very large
file (1Gb) quickly for a string of text? Also, is there a library to
identify the file offset this string was found within the file?

A fast way is to do it in C :)

Here are a few other helpers, though:

1.9 has faster regexes.
Boost regexes: the ruby-boost-regex gem (GitHub, michaeledgar/ruby-boost-regex)
wraps Boost::Regex in a Ruby binding (you could probably optimize it further
than it currently is, as well...)

Rubinius also might help.

Also, make sure to open your file in binary mode if you're on 1.9; that
reads much faster. If that's an option, anyway.
GL.
-rp

···

--
Posted via http://www.ruby-forum.com/.

If you only need to know whether the string occurs in the file, you can do

found = File.foreach("foo").any? {|line| /apiKey/ =~ line}

This will stop searching as soon as the sequence is found.
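A quick way to see the short-circuit in action (the file contents here are made up for the demo):

```ruby
require 'tempfile'

# Write a small throwaway file, then count how many lines the search
# actually touches before #any? bails out at the first match.
Tempfile.create('demo') do |f|
  f.write("one\ntwo apiKey\nthree\nfour\n")
  f.flush
  scanned = 0
  found = File.foreach(f.path).any? { |line| scanned += 1; /apiKey/ =~ line }
  p found    # => true
  p scanned  # => 2 -- lines after the hit are never read
end
```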

"fgrep -l foo" is likely faster.

Kind regards

robert

···

2010/7/1 Michael Fellinger <m.fellinger@gmail.com>:

On Thu, Jul 1, 2010 at 6:47 PM, Stuart Clarke <stuart.clarke1986@gmail.com> wrote:

Hey all,

Could anyone advise me on a fast way to search a single, but very large
file (1Gb) quickly for a string of text? Also, is there a library to
identify the file offset this string was found within the file?

You can use IO#grep like this:
File.open('qimo-2.0-desktop.iso', 'r:BINARY') do |io|
  io.grep(/apiKey/) { |m| p io.pos => m }
end

The pos is the position where the match ended, so just subtract the string length.
The example above used a 700 MB file; it took around 40s the first
time and 2s on subsequent runs, so disk I/O is the limiting factor in terms of
speed (as usual).

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Thanks.

This seems to be pretty much the best approach for me; however, it takes a
good 20 minutes to scan a 2 GB file.

Any ideas?

Thanks

Michael Fellinger wrote:

···

On Thu, Jul 1, 2010 at 6:47 PM, Stuart Clarke <stuart.clarke1986@gmail.com> wrote:

Hey all,

Could anyone advise me on a fast way to search a single, but very large
file (1Gb) quickly for a string of text? Also, is there a library to
identify the file offset this string was found within the file?

You can use IO#grep like this:
File.open('qimo-2.0-desktop.iso', 'r:BINARY') do |io|
  io.grep(/apiKey/) { |m| p io.pos => m }
end

The pos is the position where the match ended, so just subtract the string length.
The example above used a 700 MB file; it took around 40s the first
time and 2s on subsequent runs, so disk I/O is the limiting factor in terms of
speed (as usual).
Oh, and also don't use binary encoding if you are dealing with another one ;)

--
Posted via http://www.ruby-forum.com/.
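For what it's worth, one common way to speed up a scan like this is to read the file in large binary chunks instead of line by line; a minimal sketch, with the method name, chunk size, and overlap handling chosen arbitrarily here rather than taken from the thread:

```ruby
# Scan `path` for every byte offset of the fixed string `needle`,
# reading in big blocks. The carried-over tail (needle length - 1
# bytes) catches matches that straddle a block boundary.
def scan_offsets(path, needle, chunk = 1 << 20)
  offsets = []
  overlap = needle.bytesize - 1
  File.open(path, 'rb') do |io|
    carry = ''.b   # unsearched tail of the previous block
    start = 0      # absolute file offset of carry's first byte
    while (block = io.read(chunk))
      buf = carry + block
      from = 0
      while (i = buf.index(needle, from))
        offsets << start + i
        from = i + 1
      end
      # Keep just enough tail to complete a match split across blocks.
      carry = buf.bytesize > overlap ? buf[-overlap, overlap] : buf
      start += buf.bytesize - carry.bytesize
    end
  end
  offsets
end
```

For a fixed string this also avoids regexp overhead entirely, since String#index does a plain substring search.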

Michael Fellinger wrote:

···

On Thu, Jul 1, 2010 at 6:47 PM, Stuart Clarke <stuart.clarke1986@gmail.com> wrote:

Hey all,

Could anyone advise me on a fast way to search a single, but very large
file (1Gb) quickly for a string of text? Also, is there a library to
identify the file offset this string was found within the file?

You can use IO#grep like this:
File.open('qimo-2.0-desktop.iso', 'r:BINARY') do |io|
  io.grep(/apiKey/) { |m| p io.pos => m }
end

The pos is the position where the match ended

Actually, pos will be the position of the end of the line on which the match was found, because #grep works line by line.
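Given that, the exact byte offset can still be recovered from io.pos; a minimal sketch (the method name is mine, and it assumes binary mode so String#index counts bytes; it reports only the first hit per line):

```ruby
# For each line containing `needle`, io.pos sits at the end of that
# line after the read, so back up by the line's byte length and add
# the column at which the hit starts within the line.
def match_offsets(path, needle)
  offsets = []
  File.open(path, 'rb') do |io|
    io.each_line do |line|
      col = line.index(needle)
      offsets << io.pos - line.bytesize + col if col
    end
  end
  offsets
end
```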

Could anyone advise me on a fast way to search a single, but very large
file (1Gb) quickly for a string of text? Also, is there a library to
identify the file offset this string was found within the file?

You can use IO#grep like this:
File.open('qimo-2.0-desktop.iso', 'r:BINARY') do |io|
  io.grep(/apiKey/) { |m| p io.pos => m }
end

The pos is the position where the match ended, so just subtract the string length.
The example above used a 700 MB file; it took around 40s the first
time and 2s on subsequent runs, so disk I/O is the limiting factor in terms of
speed (as usual).

If you only need to know whether the string occurs in the file you can do
found = File.foreach("foo").any? {|line| /apiKey/ =~ line}
This will stop searching as soon as the sequence is found.

"fgrep -l foo" is likely faster.

`fgrep -l waters /usr/share/dict/words`.size > 0
=> true

`fgrep -l watershed /usr/share/dict/words`.size > 0
=> true

`fgrep -l watershedz /usr/share/dict/words`.size > 0
=> false

`fgrep -ob waters /usr/share/dict/words`.split.map{|s| s.split(':').first}
=> ["153088", "153102", "204143", "234643", "472357", "856441",
"913606", "913613", "913623", "913635", "913646", "913656", "913668",
"913679", "913690", "913703"]

`fgrep -ob watershed /usr/share/dict/words`.split.map{|s| s.split(':').first}
=> ["913613", "913623", "913635"]

`fgrep -ob watershedz /usr/share/dict/words`.split.map{|s| s.split(':').first}
=> []
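If you want those offsets as plain integers from Ruby, the -ob output can be wrapped in a small helper (the helper name is mine; it assumes fgrep is on the PATH):

```ruby
# Byte offsets of a fixed string in a file, via `fgrep -ob`.
# Each output line looks like "153088:waters"; keep only the number.
# IO.popen with an argument array avoids shell-escaping issues.
def fgrep_offsets(text, path)
  out = IO.popen(['fgrep', '-ob', text, path], &:read)
  out.lines.map { |l| Integer(l.split(':', 2).first) }
end
```

On no match, fgrep prints nothing and the helper returns an empty array.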

···

On Thu, Jul 1, 2010 at 7:03 AM, Robert Klemme <shortcutter@googlemail.com> wrote:

2010/7/1 Michael Fellinger <m.fellinger@gmail.com>:

On Thu, Jul 1, 2010 at 6:47 PM, Stuart Clarke <stuart.clarke1986@gmail.com> wrote: