Could anyone advise me on a fast way to search a single, very large
file (1 GB) for a string of text? Also, is there a library that can
report the byte offset at which the string was found within the file?
You can use IO#grep (which comes from Enumerable) like this:
File.open('qimo-2.0-desktop.iso', 'rb'){|io|
io.grep(/apiKey/){|m| p io.pos => m } }
Note that grep yields whole lines, so io.pos is the byte offset just
past the end of the matched line, not of the match itself; subtract the
line's bytesize and add the match's index within the line.
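That offset arithmetic can be packaged into a small helper; a minimal sketch, where `match_offsets` is a name chosen here for illustration (it reports the first match in each matching line):

```ruby
# Return the byte offset of the first occurrence of `needle` in each
# matching line of the file at `path`. io.grep yields whole lines;
# io.pos is the offset just past the matched line, so subtract the
# line's bytesize and add the match's index within the line.
def match_offsets(path, needle)
  offsets = []
  File.open(path, 'rb') do |io|
    io.grep(Regexp.new(Regexp.escape(needle))) do |line|
      offsets << io.pos - line.bytesize + line.index(needle)
    end
  end
  offsets
end
```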
The example above ran against a roughly 700 MB file and took around
40 s the first time, then about 2 s on subsequent runs, so disk I/O is
the limiting factor in terms of speed (as usual).
If you only need to know whether the string occurs in the file you can do
found = File.foreach("foo").any? {|line| /apiKey/ =~ line}
This will stop searching as soon as the sequence is found.
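One caveat with the line-based approaches: in a binary file such as an ISO, a "line" can be arbitrarily long if newlines are rare, which can blow up memory. Reading fixed-size chunks with a small overlap avoids that; a sketch under that idea, with the method name and chunk size chosen here for illustration:

```ruby
# Find the byte offset of the first occurrence of `needle` in the file
# at `path` by reading fixed-size chunks. The buffer retains the last
# needle.bytesize - 1 bytes of the previous chunk so a match straddling
# a chunk boundary is still found. Returns nil if the string is absent.
def find_offset(path, needle, chunk_size = 1 << 20)
  overlap = needle.bytesize - 1
  base = 0            # file offset of the first byte held in `buffer`
  buffer = ''.b
  File.open(path, 'rb') do |io|
    while (chunk = io.read(chunk_size))
      buffer << chunk
      if (i = buffer.index(needle))
        return base + i
      end
      next unless buffer.bytesize > overlap
      # keep only the trailing `overlap` bytes for boundary matches
      base += buffer.bytesize - overlap
      buffer = overlap.zero? ? ''.b : buffer[-overlap, overlap]
    end
  end
  nil
end
```

This never holds more than one chunk (plus the overlap) in memory, regardless of where newlines fall in the file.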
"fgrep -l foo" is likely faster.
`fgrep -l waters /usr/share/dict/words`.size > 0
=> true
`fgrep -l watershed /usr/share/dict/words`.size > 0
=> true
`fgrep -l watershedz /usr/share/dict/words`.size > 0
=> false
`fgrep -ob waters /usr/share/dict/words`.split.map{|s| s.split(':').first}
=> ["153088", "153102", "204143", "234643", "472357", "856441",
"913606", "913613", "913623", "913635", "913646", "913656", "913668",
"913679", "913690", "913703"]
`fgrep -ob watershed /usr/share/dict/words`.split.map{|s| s.split(':').first}
=> ["913613", "913623", "913635"]
`fgrep -ob watershedz /usr/share/dict/words`.split.map{|s| s.split(':').first}
=> []
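Those backtick calls can be wrapped in a small helper that returns integer byte offsets; a sketch assuming a grep that supports -b, -o, and -F (e.g. GNU grep; `offsets` is a name chosen here for illustration):

```ruby
require 'shellwords'

# Shell out to `grep -obF` (fixed-string search, one match per output
# line, prefixed with its byte offset) and return the offsets as
# integers. Assumes a grep supporting -o/-b/-F is on the PATH.
def offsets(string, path)
  out = `grep -obF -- #{Shellwords.escape(string)} #{Shellwords.escape(path)}`
  out.each_line.map { |l| l.split(':', 2).first.to_i }
end
```

Shellwords.escape keeps the search string safe to interpolate into the shell command, so it can contain spaces or metacharacters.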
On Thu, Jul 1, 2010 at 7:03 AM, Robert Klemme <shortcutter@googlemail.com> wrote:
2010/7/1 Michael Fellinger <m.fellinger@gmail.com>:
On Thu, Jul 1, 2010 at 6:47 PM, Stuart Clarke <stuart.clarke1986@gmail.com> wrote: