Regex. How to return multiple lines

Damjan_Rems · 8 August 2008 10:54

I have a text file containing some data. Rows are normally delimited with
a newline character. I assume the file would be read into memory in one chunk.

Is it possible to use a regex to return all lines containing some
string, and if so, how?

My first thought was to read the file into an array and process each line, but I
guess that, if possible, operating on a single string would be faster.

by
TheR

--
Posted via http://www.ruby-forum.com/.

Robert_K1 · 8 August 2008 11:07

13:07:47 Temp$ ./l.rb
["aaa\n", "a\n", "aaa\n", "a\n"]
13:07:53 Temp$ cat x
aaa
b
c
dd
a
aaa
s
a
13:07:56 Temp$ cat l.rb
#!/bin/env ruby

c = File.read "x"

p c.grep /a+/

13:08:49 Temp$

There are of course other possible approaches and depending on what
you want to do they might be more efficient.

Kind regards

robert

--
use.inject do |as, often| as.you_can - without end

Robert_K1 · 8 August 2008 11:09

PS: one bit of explanation: my piece of code works because String#each
yields each line individually, so grep tests the pattern one line at a time.
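
A note for readers on current Ruby versions: String stopped being Enumerable in Ruby 1.9, so String#each and String#grep are no longer available. A minimal sketch of the same idea goes through each_line instead. The sample text below is assumed and only mirrors the contents of the file "x" shown above.

```ruby
# Sample contents standing in for File.read("x"); assumed for illustration.
text = "aaa\nb\nc\ndd\na\naaa\ns\na\n"

# each_line without a block returns an Enumerator, which supports grep.
matches = text.each_line.grep(/a+/)

p matches  # prints ["aaa\n", "a\n", "aaa\n", "a\n"]
```

text.lines.grep(/a+/) should give the same result, at the cost of building the full array of lines first.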

Damjan_Rems · 8 August 2008 12:49

There are of course other possible approaches and depending on what
you want to do they might be more efficient.

I would basically like to have some kind of full-text search.

Your example finds a 4-letter word in a 7.5 MB file containing 65,000 lines in
0.13 seconds on a C2DUO 2.4 GHz and 0.35 seconds on a PIII 1.3 GHz, which is
quite good.
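
For anyone who wants to repeat this kind of measurement, here is a rough sketch using Benchmark.realtime from the standard library. The data is synthetic (65,000 made-up lines with one line containing the search word), so the timing will not match the figures above; it only shows the shape of the measurement.

```ruby
require "benchmark"

# Synthetic data standing in for the 65,000-line file; contents are made up.
lines = Array.new(65_000) { |i| "record #{i} lorem ipsum dolor sit amet\n" }
lines[12_345] = "record 12345 contains the word ruby\n"
text = lines.join

hits = nil
# Benchmark.realtime returns the elapsed wall-clock time in seconds.
elapsed = Benchmark.realtime do
  hits = text.each_line.grep(/ruby/)
end

puts "found %d line(s) in %.3f seconds" % [hits.size, elapsed]
```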

by
TheR

Sebastian_Hungereck1 · 10 August 2008 13:49

Robert Klemme wrote:

c = File.read "x"

p c.grep /a+/

Why not:
p File.open("x") {|f| f.grep /a+/}
Avoids loading the whole file into memory.
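
To try this without an existing file, the following sketch writes assumed sample data to a temporary file first (Tempfile is from the standard library; the contents mirror the file "x" from earlier in the thread). It relies on IO including Enumerable, so grep reads the file line by line.

```ruby
require "tempfile"

matches = nil

# Temporary file standing in for "x"; the contents are assumed sample data.
Tempfile.create("x") do |tmp|
  tmp.write("aaa\nb\nc\ndd\na\naaa\ns\na\n")
  tmp.flush

  # IO includes Enumerable and iterates by line, so grep pulls one line at a
  # time from the file instead of building one big string first.
  matches = File.open(tmp.path) { |f| f.grep(/a+/) }
end

p matches  # prints ["aaa\n", "a\n", "aaa\n", "a\n"]
```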

HTH,
Sebastian

--
NP: Moonspell - Crystal gazing
Jabber: sepp2k@jabber.org
ICQ: 205544826

Robert_K1 · 8 August 2008 13:15

If you just want to find all lines with certain words and print them,
I'd prefer the streamed approach, since it works for arbitrarily large
files and does not require the whole file to be in memory:

File.foreach "x.dat" do |line|
  puts line if /a+/ =~ line
end

Of course, using grep or egrep would be even faster.

Kind regards

robert

Gavin_Kistner3 · 8 August 2008 13:53

In case it helps, here is my own ruby script for searching for files
with names and/or contents matching a particular regex:

Slim2:~ phrogz$ cat /usr/local/bin/findfile
#!/usr/bin/env ruby

USAGE = <<ENDUSAGE
Usage:
  findfile [-d max_depth] [-a] [-c] [-i] name_regexp [content_regexp]
  -d,--depth the maximum depth to recurse to (defaults to no limit)
  -a,--showall with content_regexp, show every match per file
              (defaults to only show the first-match per file)
  -c,--usecase with content_regexp, use case-sensitive matching
              (defaults to case-insensitive)
  -i,--includedirs also find directories matching name_regexp
              (defaults to files only; not with content_regexp)
  -h,--help show some help examples
ENDUSAGE

EXAMPLES = <<ENDEXAMPLES

Examples:
  findfile foo
  # Print the path to all files with 'foo' in the name

  findfile -i foo
  # Print the path to all files and directories with 'foo' in the name

  findfile js$
  # Print the path to all files whose name ends in "js"

  findfile js$ vector
  # Print the path to all files ending in "js" with "Vector" or "vector"
  # (or "vEcTOr", "VECTOR", etc.) in the contents, and print some of the
  # first line that has that content.

  findfile js$ -c Vector
  # Like above, but must match exactly "Vector"
  # (not 'vector' or 'VECTOR').

  findfile . vector -a
  # Print the path to every file with "Vector" (any case) in it somewhere
  # printing every line (with line numbers) with that content.

  findfile -d 0 .
  # Print the path to every file that is in the current directory.

  findfile -d 1 .
  # Print the path to every file that is in the current directory or any
  # of its child directories (but no subdirectories of the children).
ENDEXAMPLES

ARGS = {}
UNFLAGGED_ARGS = [ :name_regexp, :content_regexp ]
next_arg = UNFLAGGED_ARGS.first
ARGV.each{ |arg|
  case arg
    when '-d','--depth'
     next_arg = :max_depth
    when '-a','--showall'
     ARGS[:showall] = true
    when '-c','--usecase'
     ARGS[:usecase] = true
    when '-i','--includedirs'
     ARGS[:includedirs] = true
    when '-h','--help'
     ARGS[:help] = true
    else
     if next_arg
      if next_arg==:max_depth
        arg = arg.to_i + 1
      end
      ARGS[next_arg] = arg
      UNFLAGGED_ARGS.delete( next_arg )
     end
     next_arg = UNFLAGGED_ARGS.first
  end
}

if ARGS[:help] or !ARGS[:name_regexp]
  puts USAGE
  puts EXAMPLES if ARGS[:help]
  exit
end

class Dir
  def self.crawl(path,max_depth=nil,include_directories=false,depth=0,&blk)
    return if max_depth && depth > max_depth
    begin
     if File.directory?( path )
      yield( path, depth ) if include_directories
      files = Dir.entries( path ).select{ |f| true unless f=~/^\.{1,2}$/ }
      unless files.empty?
        files.collect!{ |file_path|
         Dir.crawl( path+'/'+file_path, max_depth,
                    include_directories, depth+1, &blk )
        }.flatten!
      end
      return files
     else
      yield( path, depth )
     end
    rescue SystemCallError => the_error
     warn "ERROR: #{the_error}"
    end
  end

end

start_time = Time.new
name_match = Regexp.new(ARGS[:name_regexp], true )
content_match = ARGS[:content_regexp] &&
  Regexp.new( ".{0,20}#{ARGS[:content_regexp]}.{0,20}", !ARGS[:usecase] )

file_count = 0
matching_count = 0
Dir.crawl(
  '.',
  ARGS[:max_depth],
  ARGS[:includedirs] && !content_match
){ |file_path, depth|
  if File.split( file_path )[ 1 ] =~ name_match
    if content_match
     if ARGS[:showall]
      shown_file = false
      IO.readlines( file_path ).each_with_index{ |line_text,line_number|
        if match = line_text[content_match]
         unless shown_file
          puts file_path
          matching_count += 1
          shown_file = true
         end
         puts ( "%5d: " % (line_number+1) ) + match
        end
      }
      puts " " if shown_file
     elsif IO.read( file_path ) =~ content_match
      puts file_path," #{$~}"," "
      matching_count += 1
     end
    else
     puts file_path
     matching_count += 1
    end
  end
  file_count += 1
}
elapsed = Time.new - start_time
puts "Found %d file%s (out of %d) in %.2f seconds" % [
  matching_count,
  matching_count==1 ? '' : 's',
  file_count,
  elapsed
]
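
The part of the script most relevant to this thread is the content_match pattern, which wraps the user's expression in .{0,20} on both sides so that each hit comes back with up to 20 characters of context. A small sketch of just that step follows; the search term and the sample line are hypothetical, and Regexp::IGNORECASE is used here in place of the boolean flag the script passes.

```ruby
# Hypothetical search term and sample line, chosen only for illustration.
term = "vector"
content_match = Regexp.new(".{0,20}#{term}.{0,20}", Regexp::IGNORECASE)

line = "function makeUnitVector(x, y) { return new Vector(x, y).normalize(); }"

# String#[] with a Regexp returns the matched substring, or nil if none.
snippet = line[content_match]
p snippet

# A line without the term yields nil, which the script treats as "no match".
p "nothing relevant here"[content_match]
```

Because . does not match a newline by default, the context never extends past the line containing the hit.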

Robert_K1 · 10 August 2008 14:42

The original request stated "I assume that file would be read in one chunk into memory." which I choose to respect. But see my disclaimer at the end and also my other reply that hinted at this.

Cheers

robert

Sebastian_Hungereck1 · 10 August 2008 15:03

Robert Klemme wrote:

The original request stated "I assume that file would be read in one
chunk into memory." which I choose to respect.

Oh, sorry, I did not notice that.

Damjan_Rems · 11 August 2008 06:53

Thank you very much guys.

As I wrote, I assume that the data would reside in memory for the life of
the program. Since the search argument would be entered by the user, the final
result looks close to this:

what = get_from_input()
r = Regexp.new(what, true)
a = s.grep(r)
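
A slightly fuller sketch of the same idea, for current Ruby versions where String#grep no longer exists. The helper names build_pattern and search are made up for this example, and the sample string stands in for the file contents held in s. It also guards against user input that is not a valid regular expression.

```ruby
# Build a case-insensitive Regexp from user input; returns nil if the
# input is not a valid pattern. (Hypothetical helper, not from the thread.)
def build_pattern(what)
  Regexp.new(what, Regexp::IGNORECASE)
rescue RegexpError
  nil
end

# Return every line of the in-memory string s that matches what.
def search(s, what)
  pattern = build_pattern(what)
  return [] unless pattern
  s.each_line.grep(pattern)
end

s = "Ruby rocks\nperl too\nRUBY again\n"  # assumed sample data

p search(s, "ruby")  # prints ["Ruby rocks\n", "RUBY again\n"]
p search(s, "(")     # invalid pattern, prints []
```

If the input should be treated as literal text rather than as a pattern, passing it through Regexp.escape before Regexp.new is one option.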

by
TheR
