Howto get array.agrep (NOT array.grep)

People,

Is there some way to get agrep working with Ruby arrays? - agrep has
some nice, useful features that grep doesn't . .

Thanks,

Phil.

--
Philip Rhoades

Pricom Pty Limited (ACN 003 252 275 ABN 91 003 252 275)
GPO Box 3411
Sydney NSW 2001
Australia
Fax: +61:(0)2-8221-9599
E-mail: phil@pricom.com.au

Phil Rhoades wrote:

> Is there some way to get agrep working with Ruby arrays? - agrep has
> some nice, useful features that grep doesn't . .

Perhaps if you explained what this mysterious 'agrep' was, we might
help.
Something from another language? A unix utility?

Give us a sample array, and what you'd like the result to be after
calling this method on that array.

NAME
       agrep - print lines approximately matching a pattern

SYNOPSIS
       agrep [OPTION]... PATTERN [FILE]...

DESCRIPTION
       Searches for approximate matches of PATTERN in each FILE or
       standard input. Example: 'agrep -2 optimize foo.txt' outputs all
       lines in file 'foo.txt' that match "optimize" within two errors.
       E.g. lines which contain "optimise", "optmise", and "opitmize"
       all match.

On Sat, 2008-04-26 at 13:15 +0900, Phrogz wrote:

> Perhaps if you explained what this mysterious 'agrep' was, we might
> help.
> Something from another language? A unix utility?
>
> Give us a sample array, and what you'd like the result to be after
> calling this method on that array.

* Phil Rhoades <phil@pricom.com.au> (06:24) wrote:

> NAME
>        agrep - print lines approximately matching a pattern

Enumerable#grep can do that, if you pass it the right block. When you pass
a block to grep it's the block's job to match the elements.

Now the interesting question is: What would that block look like?

Regards, simon .... l

> > NAME
> >       agrep - print lines approximately matching a pattern
>
> Enumerable#grep can do that, if you pass it the right block. When you pass
> a block to grep it's the block's job to match the elements.

no.

     enum.grep(pattern) => array
     enum.grep(pattern) {| obj | block } => array
------------------------------------------------------------------------
     Returns an array of every element in _enum_ for which +Pattern ===
     element+. If the optional _block_ is supplied, each matching
     element is passed to it, and the block's result is stored in the
     output array.

The block just morphs the result, it doesn't morph the match.
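
A small sketch of that difference, using only core Ruby. `ShorterThan` is a made-up matcher for illustration only; the point is that grep calls `pattern === element`, so any object with a suitable `===` (for instance one that computes an edit distance) can decide what matches, while the block only transforms the hits.

```ruby
words = %w[apple banana cherry]

# The block only transforms elements that already matched /an/.
upcased = words.grep(/an/) { |w| w.upcase }   # => ["BANANA"]

# To change *what* matches, pass any object that responds to ===.
# ShorterThan is a hypothetical matcher, not part of Ruby.
ShorterThan = Struct.new(:limit) do
  def ===(str)
    str.length < limit
  end
end

short = words.grep(ShorterThan.new(6))        # => ["apple"]
```

An approximate matcher would follow the same shape: put the distance check inside `===` and hand the object to grep.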

On Apr 26, 2008, at 03:35 , Simon Krahnke wrote:

hi phil!

if all you want is getting all the strings within a certain edit
distance of your pattern, have a look at [1]. it doesn't support
regular expressions in the pattern because i don't know how to achieve
that easily without re-implementing agrep's algorithm :wink: it's
really just a quick hack that might get you started, hopefully.
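
For readers without the gem at hand, the edit-distance filter described above can be sketched in plain Ruby. This is an illustration under assumptions, not the ruby-nuggets code: `levenshtein` is a hypothetical helper name, the word list borrows the examples from the agrep man page, and whole strings are compared rather than searching inside lines as agrep does.

```ruby
# Classic dynamic-programming Levenshtein distance: insertions,
# deletions and substitutions each cost 1. Hypothetical helper; this
# is not agrep's own algorithm.
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

words = %w[optimize optimise optmise opitmize minimize]

# Keep every word within two errors of the pattern.
close = words.select { |w| levenshtein(w, 'optimize') <= 2 }
# => ["optimize", "optimise", "optmise", "opitmize"]
```

Note that a transposition such as "opitmize" counts as two errors here, which still fits the limit of two.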

[1]
<http://prometheus.rubyforge.org/ruby-nuggets/classes/Enumerable.html#M000091>

cheers
jens

--
Jens Wille, Dipl.-Bibl. (FH)
prometheus - Das verteilte digitale Bildarchiv für Forschung & Lehre
Kunsthistorisches Institut der Universität zu Köln
Albertus-Magnus-Platz, D-50923 Köln
Tel.: +49 (0)221 470-6668, E-Mail: jens.wille@uni-koeln.de
http://www.prometheus-bildarchiv.de/

jens,

> hi phil!
>
> if all you want is getting all the strings within a certain edit
> distance of your pattern, have a look at [1]. it doesn't support
> regular expressions in the pattern because i don't know how to achieve
> that easily without re-implementing agrep's algorithm :wink: it's
> really just a quick hack that might get you started, hopefully.
>
> [1]
> <http://prometheus.rubyforge.org/ruby-nuggets/classes/Enumerable.html#M000091>

This might work but it would be more difficult without regexs - the
current application does a system call to agrep but of course it is very
slow for large numbers of calls. A typical call is something like:

  agrep -2 "Smith\|J.*12345" list1.txt list2.txt list3.txt

This allows two differences on a minimum amount of information
consisting of last name, first initial and zip code. If I use the
Enumerable version, I would have to use the whole, delimited, name &
address string and increase the differences/distance number?

Did you just do that hack now? - how do I get/install it? (Fedora 8).

Thanks,

Phil.
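
On the whole-string question above: one way to avoid raising the distance is to look for the pattern anywhere inside the line, which is roughly what agrep does. A rough sketch follows, assuming a literal pattern (no regex support); `min_substring_distance` and the sample records are hypothetical and only meant to show the idea.

```ruby
# Smallest edit distance between `pattern` and any substring of `text`.
# Same table as Levenshtein, except row 0 is all zeros (a match may
# start anywhere) and the result is the minimum of the last row
# (a match may end anywhere).
def min_substring_distance(pattern, text)
  prev = Array.new(text.length + 1, 0)
  pattern.each_char.with_index(1) do |pc, i|
    curr = [i]
    text.each_char.with_index(1) do |tc, j|
      cost = pc == tc ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.min
end

# Invented sample records in a delimited name-and-address layout.
lines = [
  'Smyth|John|1 High St|Sydney|12345',
  'Jones|Mary|2 Low Rd|Perth|54321'
]

# Keep lines that contain "Smith" with at most one error.
hits = lines.select { |line| min_substring_distance('Smith', line) <= 1 }
```

Because the rest of the line is ignored, the error budget applies only to the part that lines up with the pattern.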

On Sat, 2008-04-26 at 23:15 +0900, Jens Wille wrote:

Phil Rhoades [2008-04-26 19:13]:

> This might work but it would be more difficult without regexs -
> the current application does a system call to agrep but of course
> it is very slow for large numbers of calls. A typical call is
> something like:
>
>   agrep -2 "Smith\|J.*12345" list1.txt list2.txt list3.txt
>
> This allows two differences on a minimum amount of information
> consisting of last name, first initial and zip code. If I use
> the Enumerable version, I would have to use the whole, delimited,
> name & address string and increase the differences/distance
> number?

i think something like that could work in your case (requires the
Text gem):

  require 'text' # provides Text::Levenshtein (gem install text)

  File.readlines('list1.txt').select { |line|
    # extract name and zip code from line; skip lines that don't match
    next false unless line =~ /\A(.*?\|.).*\b(\d{5})\b/ # adjust appropriately!

    # name may have two errors, zip only one -- or whatever...
    Text::Levenshtein.distance($1, 'Smith|J') <= 2 &&
      Text::Levenshtein.distance($2, '12345') <= 1
  }

> Did you just do that hack now?

that's right. but i just read a bit on agrep's algorithm and it
might be fun to implement it in ruby (though a bit slow, probably).
as an alternative, it might be even worth writing ruby bindings to
agrep. who knows, if time permits... :wink:

> - how do I get/install it? (Fedora 8).

well, i don't think that particular implementation suits your needs
and is obviously easily adapted (after all, it's just a select with
an appropriate block utilizing Text::Levenshtein.distance). but you
can get ruby-nuggets from rubyforge (gem install ruby-nuggets), or,
if the new version hasn't found its way onto the mirrors yet, from
our own gem server at http://prometheus.khi.uni-koeln.de/rubygems/.

cheers
jens

jens,

Phil Rhoades [2008-04-26 19:13]:
> This might work but it would be more difficult without regexs -
> the current application does a system call to agrep but of course
> it is very slow for large numbers of calls. A typical call is
> something like:
>
> agrep -2 "Smith\|J.*12345" list1.txt list2.txt list3.txt
>
> This allows two differences on a minimum amount of information
> consisting of last name, first initial and zip code. If I use
> the Enumerable version, I would have to use the whole, delimited,
> name & address string and increase the differences/distance
> number?

> i think something like that could work in your case (requires the
> Text gem):
>
>   File.open('list1.txt').select { |line|
>     # extract name and zip code from line
>     line =~ /\A(.*?\|.).*\b(\d{5})\b/ # adjust appropriately!
>
>     # name may have two errors, zip only one -- or whatever...
>     Text::Levenshtein.distance($1, 'Smith|J') <= 2 &&
>       Text::Levenshtein.distance($2, '12345') <= 1
>   }

I see what you are doing but this would have to be repeated for the
three different lists (list1.txt, list2.txt, list3.txt) - I guess that
should still be faster than a single system call . .

> Did you just do that hack now?
> that's right. but i just read a bit on agrep's algorithm and it
> might be fun to implement it in ruby (though a bit slow, probably).

I don't know if it helps but there is this:

http://www.koders.com/ruby/fidCEAEDCAA28D4A59A76ADF20A0DA2A3858438834D.aspx

> as an alternative, it might be even worth writing ruby bindings to
> agrep. who knows, if time permits... :wink:

I was wondering about something like that but I have never created a
Ruby binding before . .

> - how do I get/install it? (Fedora 8).
> well, i don't think that particular implementation suits your needs
> and is obviously easily adapted (after all, it's just a select with
> an appropriate block utilizing Text::Levenshtein.distance). but you
> can get ruby-nuggets from rubyforge (gem install ruby-nuggets), or,
> if the new version hasn't found its way onto the mirrors yet, from
> our own gem server at http://prometheus.khi.uni-koeln.de/rubygems/.

Thanks!

Phil.

On Sun, 2008-04-27 at 02:50 +0900, Jens Wille wrote:

Phil Rhoades [2008-04-26 22:26]:

> I see what you are doing but this would have to be repeated for
> the three different lists (list1.txt, list2.txt, list3.txt)

well, yeah. but that's not really a problem, is it?

  %w[list1.txt list2.txt list3.txt].inject([]) { |matches, file|
    matches + File.open(file).select { |line|
      # ...same as before...
    }
  }
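
The same multi-file collection can also be written as a self-contained sketch. `select_lines` is a hypothetical helper, the sample data is invented, and the exact prefix check merely stands in for whatever approximate matcher is used in practice.

```ruby
require 'tmpdir'

# Hypothetical helper: gather matching lines from several files,
# reading each file line by line so the handles get closed.
def select_lines(paths, &matcher)
  paths.flat_map { |path| File.foreach(path).select(&matcher) }
end

hits = Dir.mktmpdir do |dir|
  # Invented sample files standing in for list1.txt, list2.txt, ...
  paths = %w[list1.txt list2.txt].map { |name| File.join(dir, name) }
  File.write(paths[0], "Smith|John|12345\nJones|Mary|54321\n")
  File.write(paths[1], "Smith|Jane|12345\n")

  # Placeholder matcher: an exact prefix test, not an approximate one.
  select_lines(paths) { |line| line.start_with?('Smith|J') }
end
```

`flat_map` avoids the explicit accumulator that inject needs, but either form works.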

> I don't know if it helps but there is this:
>
> http://www.koders.com/ruby/fidCEAEDCAA28D4A59A76ADF20A0DA2A3858438834D.aspx

=> http://amatch.rubyforge.org

silly me!! totally forgot about that one :wink: thanks for the reminder!

maybe i'll be able to come up with something that wraps flori's
Amatch into (Enumerable|File)#agrep.

> I was wondering about something like that but I have never
> created a Ruby binding before . .

neither have i. but that shouldn't stop us, right? :wink:

cheers
jens

Jens Wille [2008-04-26 22:45]:

> maybe i'll be able to come up with something that wraps flori's
> Amatch into (Enumerable|File)#agrep.

that was actually pretty easy and is definitely an improvement (see
ruby-nuggets v0.1.9), but it still won't give us support for regular
expression patterns :frowning:

i also added IO::agrep, so you would now be able to do:

  %w[list1.txt list2.txt list3.txt].inject([]) { |matches, file|
    matches + File.agrep(file, /Smith\|J.*12345/, 2)
  }

-- if only you had regular expressions at your disposal!

cheers
jens

jens,

On Sun, 2008-04-27 at 07:03 +0900, Jens Wille wrote:

Jens Wille [2008-04-26 22:45]:
> maybe i'll be able to come up with something that wraps flori's
> Amatch into (Enumerable|File)#agrep.
> that was actually pretty easy and is definitely an improvement (see
> ruby-nuggets v0.1.9), but it still won't give us support for regular
> expression patterns :frowning:
>
> i also added IO::agrep, so you would now be able to do:
>
>   %w[list1.txt list2.txt list3.txt].inject([]) { |matches, file|
>     matches + File.agrep(file, /Smith\|J.*12345/, 2)
>   }
>
> -- if only you had regular expressions at your disposal!

Yes, that would be nice! . . I guess it will be there sometime.

Thanks for looking at this!

Regards,

Phil.