Using reg expr with array.index

['aaaa', '>bbbb', 'cccc'].find { | e | e =~ /^>/ }

Regards,
Jordan

···

On Dec 26, 4:32 pm, Esmail <ebonak_de...@hotmail.com> wrote:

If I have an array ar of strings that contains
for instance

aaaa
bbbb
cccc
>dddd
eeee
dddd
cccc
etc.

is there a way to use ar.index with a regular
expression to get the index of the line >dddd

I've tried ar.index(/^>/) and (/^\>/) without
much luck.

In other words, I'm trying to match on the first
character which is a >

Thanks.

Hi,

There's no built-in way that I'm aware of. You have to iterate over
the array yourself. If you want all the indices you could do something
like...

indices = []
['aaaa', '>bbbb', '>cccc'].each_with_index { | e, i |
  indices << i if e =~ /^>/
}
p indices # => [1, 2]

But given the description of what you're trying to do in the other
thread, you probably just want to use Array#reject...

a = ['aaaa', '>bbbb', 'cccc'].reject { | e | e =~ /^>/ }
p a # => ["aaaa", "cccc"]

Regards,
Jordan

···

On Dec 26, 10:20 pm, Esmail <ebonak_de...@hotmail.com> wrote:

MonkeeSage wrote:
> On Dec 26, 4:32 pm, Esmail <ebonak_de...@hotmail.com> wrote:
>> If I have an array ar of strings that contains
>> for instance

>> aaaa
>> bbbb
>> cccc
>> >dddd
>> eeee
>> dddd
>> cccc
>> etc.

>> is there a way to use ar.index with a regular
>> expression to get the index of the line >dddd

>> I've tried ar.index(/^>/) and (/^\>/) without
>> much luck.

>> In other words, I'm trying to match on the first
>> character which is a >

>> Thanks.

> ['aaaa', '>bbbb', 'cccc'].find { | e | e =~ /^>/ }

Hi Jordan,

Is there a way to use this regular expression to return the
index value of the position where this string is found? That
is the main thing I am interested in.

It seems there ought to be an easy way ('cept I don't know it :-)

Esmail

How about this one - kind of pseudo built in. :-)

irb(main):007:0> a=['aaaa', '>bbbb', 'cccc']
=> ["aaaa", ">bbbb", "cccc"]
irb(main):008:0> a.to_enum(:each_with_index).find {|e,i| /^>/ =~ e}.last
=> 1

A similar approach also works when looking for multiple indexes:

irb(main):009:0> a.to_enum(:each_with_index).select {|e,i| /^>|c+/ =~ e}.map {|e,i| i}
=> [1, 2]

But I agree, indexes are fairly seldom needed with Arrays.
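For what it's worth, newer Rubies (1.8.7 and later, if I remember the version right) let Array#index take a block, which answers the original question directly:

```ruby
ar = ['aaaa', '>bbbb', 'cccc']

# Array#index with a block returns the index of the first matching element
idx = ar.index { |e| e =~ /^>/ }   # => 1

# collecting every matching index is a one-liner as well
all = (0...ar.size).select { |i| ar[i] =~ /^>/ }   # => [1]
```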

Kind regards

robert

···

2007/12/27, MonkeeSage <MonkeeSage@gmail.com>:

On Dec 26, 10:20 pm, Esmail <ebonak_de...@hotmail.com> wrote:
> MonkeeSage wrote:
> > On Dec 26, 4:32 pm, Esmail <ebonak_de...@hotmail.com> wrote:
> >> If I have an array ar of strings that contains
> >> for instance
>
> >> aaaa
> >> bbbb
> >> cccc
> >> >dddd
> >> eeee
> >> dddd
> >> cccc
> >> etc.
>
> >> is there a way to use ar.index with a regular
> >> expression to get the index of the line >dddd
>
> >> I've tried ar.index(/^>/) and (/^\>/) without
> >> much luck.
>
> >> In other words, I'm trying to match on the first
> >> character which is a >
>
> >> Thanks.
>
> > ['aaaa', '>bbbb', 'cccc'].find { | e | e =~ /^>/ }
>
> Hi Jordan,
>
> Is there a way to use this regular expression to return the
> index value of the position where this string is found? That
> is the main thing I am interested in.
>
> It seems there ought to be an easy way ('cept I don't know it :-)
>
> Esmail

Hi,

There's no built-in way that I'm aware of.

--
use.inject do |as, often| as.you_can - without end

Hi Jordan,

I didn't know about each_with_index until after I posted my last
message and read more on Ruby .. clearly I have to do more reading,
but I have found one of the best ways to learn is to do :-)

> There's no built-in way that I'm aware of. You have to iterate over
> the array yourself. If you want all the indices you could do something
> like...

> indices = []
> ['aaaa', '>bbbb', '>cccc'].each_with_index { | e, i |
>   indices << i if e =~ /^>/
> }
> p indices # => [1, 2]

> But given the description of what you're trying to do in the other
> thread, you probably just want to use Array#reject...

> a = ['aaaa', '>bbbb', 'cccc'].reject { | e | e =~ /^>/ }
> p a # => ["aaaa", "cccc"]

This would delete only the one element, but I am trying to delete a range
of data (a record). I may have duplicate records, so I am trying to get
rid of them. They have different identifiers, each starting with a '>'.
Here's a test file that mimics this:

>88888/Bla08/the/rest8
888888888888888
888888888888888
888888888888888
888888888888888
888888888888888
88888 -- last line --
>77777/Bla07/the/rest7
777777777777777
777777777777777
777777777777777
777777777777777
777777777777777
77777 -- last line --
>66666/Bla06/the/rest6
666666666666666
666666666666666
666666666666666
666666666666666
666666666666666
66666 -- last line --
>77777/Bla07/the/rest7
777777777777777
777777777777777
777777777777777
777777777777777
777777777777777
77777 -- last line --
>

(I add the last > and later remove it)

So, this is what I came up with (with suggestions from you):

######################################
# delete duplicate records
######################################
def deleteDuplicates(data, dups)

   dups.each do |name|
     puts "\n****deleting duplicate \"#{name}\"...\n"
     s = data.index(name)
     e = 0
     data[s+1..-1].each_with_index{ |v, i|
       if v =~ /^>/
         e = i
         break
       end
     }

     puts "deleting ... ", data[s..s+e], "..done"
     data.slice!(s..s+e)
   end

   data
end
######################################

What do you think? It seems to work, but I'm always interested in
learning to do things better.
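An alternative sketch for deduplicating whole records (this assumes a newer Ruby where Enumerable#slice_before exists; the idea is to group lines into records at each '>' header line and then rely on Array#uniq):

```ruby
lines = [">77777/Bla07", "777", ">66666/Bla06", "666",
         ">77777/Bla07", "777"]

# each record starts at a line beginning with '>'
records = lines.slice_before { |l| l.start_with?(">") }.to_a

# Array#uniq compares whole records, so duplicate records collapse to one
deduped = records.uniq.flatten
```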

Thanks again!

Esmail

Hi Esmail,

A couple points:

- It's not very efficient to do all that iteration and slicing.

- The regexp won't work since #each and #each_with_index iterate over
lines and not characters (so v == " >...", meaning /^ >/ would be needed).

- #index returns nil if there is no matching index (error when you get
to s+1 in that case).

How about using Array#uniq, as in:

def no_dups(path)
  IO.read(path).split(" >").uniq.join(" >")
end
fixed = no_dups("testfile")
puts fixed

# =>
>88888/Bla08/the/rest8
888888888888888
888888888888888
888888888888888
888888888888888
888888888888888
88888 -- last line --
>77777/Bla07/the/rest7
777777777777777
777777777777777
777777777777777
777777777777777
777777777777777
77777 -- last line --
>66666/Bla06/the/rest6
666666666666666
666666666666666
666666666666666
666666666666666
666666666666666
66666 -- last line --
>

Regards,
Jordan

···

On Dec 27, 7:17 am, Esmail <ebonak_de...@hotmail.com> wrote:

MonkeeSage wrote:

> def no_dups(path)
> IO.read(path).split(" >").uniq.join(" >")
> end
> fixed = no_dups("testfile")
> puts fixed

One more quick question (ha .. see, that's what you get
for being so helpful) - please feel free to ignore this.

No problem. What is really cool about the ruby community is that
everyone is willing to help (and to learn!). I've never found a better
programming community. So don't feel like any question is dumb or that
you're asking too much. :-)

For the above solution which I really like, is there an
easy way to get the duplicate records? (I'd like to display
the name lines ie the ones that start with > as a possible
check of what I am eliminating from the original data).

I know how to do this if I reread the file again and traverse
it but that's certainly not an efficient way to do this.

I am doing a lot of reading on Ruby right now, so I may
come across the solution, so only reply if you are bored :-)

(ps: I suppose if there was a to_set and to_array functionality
      in Ruby - for all I know there is - it would have provided yet
      another approach to solve the original problem)

Without making assumptions about ordering and such, I'm not sure it's
possible to avoid multiple iterations (and probably polynomial time)
if you want to roll your own #uniq method to return an array of (or
otherwise process) duplicate elements. Off the cuff, I'd say that
something like this is probably the most efficient (but please correct
if there's a better way):

def no_dups(path)
  seen = []
  dups = []
  IO.read(path).split(">").each { | item |
    if seen.include?(item)
      dups << item
      # or, for example...
      # puts %{Removed dup: >#{item.split("\n")[0]}}
    else
      seen << item
    end
  }
  [seen.join(">"), dups]
end
fixed, dups = no_dups("testfile")
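One efficiency note on the sketch above: Array#include? scans the whole array on every call, so with many records the loop degrades quadratically. The stdlib Set (a Hash underneath) keeps each membership check near constant time; a variant, with made-up method names:

```ruby
require 'set'

def split_dups(items)
  seen = Set.new
  dups = []
  uniq = []
  items.each do |item|
    if seen.add?(item)   # Set#add? returns nil when the item was already present
      uniq << item
    else
      dups << item
    end
  end
  [uniq, dups]
end

uniq, dups = split_dups(["a", "b", "a"])   # => [["a", "b"], ["a"]]
```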

Ps. I think google is indenting the ">" because it thinks it's the
start of a quote.

Regards,
Jordan

···

On Dec 27, 7:17 pm, Esmail <ebonak_de...@hotmail.com> wrote:

Esmail (btw, is that a real name?), since you are dealing with files I would like to revisit this. Since files a) are slower to read and b) are potentially large - especially larger than main memory - you might want to look for different solutions. Basically, unless you need the file's contents otherwise you should try to avoid having to store the whole file in memory at one point in time and strive to process the whole file as seldom as possible (ideally only once).

Here are two typical approaches:

1. if you know the file is ordered

File.open("foo") do |io|
   last = nil

   io.each do |line|
     if line == last
       $stderr.puts "Duplicate line no #{io.lineno}"
     else
       puts line
       last = line
     end
   end
end

2. if the file can be unordered

File.open("foo") do |io|
   dups = Hash.new 0

   io.each do |line|
     line.freeze # optimization for Hash key
     c = (dups[line] += 1)

     if c > 1
       $stderr.puts "#{c}. occurrence of line at #{io.lineno}"
     else
       puts line
     end
   end
end

Advantage of both these approaches is that the file has to be read only once. However, the second solution still has the whole file's contents in memory at some point in time. If you know more about your data (for example, that repetitions always occur within n lines) you can create more efficient algorithms (with the mentioned restriction it is sufficient to just remember the last n lines, similar to the first approach).
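To illustrate the "remember only the last n lines" idea (a sketch with an assumed window size; it only works if you know duplicates always occur within n lines of each other):

```ruby
# Drop a line if the same line was already emitted within the last
# `window` emitted lines; the names here are made up for illustration.
def drop_near_dups(lines, window)
  recent = []   # the last `window` emitted lines, oldest first
  seen   = {}
  out    = []
  lines.each do |line|
    next if seen[line]          # duplicate within the window: skip it
    out << line
    recent << line
    seen[line] = true
    seen.delete(recent.shift) if recent.size > window
  end
  out
end
```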

An alternative would be to store a more compact representation of lines, e.g. as a MD5 hash and do the lookups of the second solution based on hash codes of lines. However this approach is less strict, i.e. although unlikely there might be lines reported as duplicates because they accidentally yield the same hash code despite having different content. But if you need to manually edit the file anyway this approach might be sufficient for large files.
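The digest idea might look roughly like this (using the stdlib Digest::MD5; same caveat that two different lines could, with vanishingly small probability, collide):

```ruby
require 'digest/md5'

# Keep only 16-byte digests in memory instead of whole lines.
def first_occurrences(lines)
  seen = {}
  lines.select do |line|
    key = Digest::MD5.digest(line)
    seen.key?(key) ? false : (seen[key] = true)
  end
end
```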

Once you dive into the matter, all sorts of interesting problems surface. :-) But it is generally good to know the nature of the data; with that knowledge, great optimizations can often be done.

Kind regards

  robert

···

On 28.12.2007 02:17, Esmail wrote:

MonkeeSage wrote:

def no_dups(path)
  IO.read(path).split(" >").uniq.join(" >")
end
fixed = no_dups("testfile")
puts fixed

One more quick question (ha .. see, that's what you get
for being so helpful) - please feel free to ignore this.

For the above solution which I really like, is there an
easy way to get the duplicate records? (I'd like to display
the name lines ie the ones that start with > as a possible
check of what I am eliminating from the original data).

I know how to do this if I reread the file again and traverse
it but that's certainly not an efficient way to do this.

I am doing a lot of reading on Ruby right now, so I may
come across the solution, so only reply if you are bored :-)

(ps: I suppose if there was a to_set and to_array functionality
     in Ruby - for all I know there is - it would have provided yet
     another approach to solve the original problem)

MonkeeSage wrote:

> No problem. What is really cool about the ruby community is that
> everyone is willing to help (and to learn!). I've never found a better
> programming community. So don't feel like any question is dumb or that
> you're asking too much. :-)

Thanks, I know that's the spirit of usenet, but there are some
groups where that idea has been lost unfortunately. I can hack
code together to get it to work in Ruby, but I want it to be a
good solution too :-) Ruby is sufficiently different from the
other languages I have experience with.

For me, one of the greatest things about the ruby community is that
the people who write books on the language, and even the language
designer himself (matz), take the time to answer questions and
interact with the community. Of course, you'll see a few RTFM replies
now and then, in response to "do my CompSci homework for me" type
posts; but on the whole, there really is no "ivory tower" in the ruby
community. We're all just trying to learn and grow as programmers, and
it's pretty much a level playing field.

> Without making assumptions about ordering and such, I'm not sure it's
> possible to avoid multiple iterations (and probably polynomial time)
> if you want to roll your own #uniq method to return an array of (or
> otherwise process) duplicate elements. Off the cuff, I'd say that
> something like this is probably the most efficient (but please correct
> if there's a better way):

> def no_dups(path)
>   seen = []
>   dups = []
>   IO.read(path).split(">").each { | item |
>     if seen.include?(item)
>       dups << item
>       # or, for example...
>       # puts %{Removed dup: >#{item.split("\n")[0]}}
>     else
>       seen << item
>     end
>   }
>   [seen.join(">"), dups]
> end
> fixed, dups = no_dups("testfile")

> Ps. I think google is indenting the ">" because it thinks it's the
> start of a quote.

> Regards,
> Jordan

This looks very much like what I wrote:

GMTA, heh. ;-)

######################################
# strip \n from data, find name of
# sequences and duplicate names
######################################
def processNames(data)

   names = []
   dups  = []

   data.each do |line|

     line.chomp!

     # find line with >
     if line[0,1] == '>'
       if !names.include?(line)
         names.push(line)
       else
         dups.push(line)
       end
     end

   end #do

   return names, dups
end
######################################

begin
   if ARGV.length != 1
     puts "need to supply one command line arg"
     exit 1
   else
     file=File.open(ARGV[0])
     data=file.readlines
   end
rescue
   puts "Could not open file \"#{ARGV[0]}\""
   exit 1
end

#find names and duplicates
names, dups = processNames(data)

I am using this code to process bioinformatics data in fasta format
(in case anyone's curious). I know there's a bioruby somewhere (I think)
but I am using this also as an opportunity to learn more Ruby.

Esmail

Ps. ruby will normally close open file handles on garbage collection
or in finalization, but just in case of some catastrophic failure
(what, I'm not sure), it's usually considered good practice to close
file handles manually:

...
     file=File.open(ARGV[0])
     data=file.readlines
     file.close
...

Regards,
Jordan

···

On Dec 30, 7:55 am, Esmail <ebonak_de...@hotmail.com> wrote:

Robert Klemme wrote:
Hi there Robert,

Esmail (btw, is that a real name?),

Yes, it is :-) .. there are a lot of different alternate spellings
of this, Ismael, Ismail are quite common. In the US I have also
seen (and unfortunately heard) Ishmael (as written in Moby Dick I
believe). Unfortunately, that pronunciation sounds to me as if
someone is saying "Shteve" instead of "Steve" :-)

Ah, ok. I knew "Ismael" but did not know that it could also be spelled with an "E" in front. Thanks for the explanation! Learn something new every day...

Ok, back to Ruby ...

since you are dealing with files I would like to revisit this. Since files a) are slower to read and b) are potentially large - especially larger than main memory - you might want to look for different solutions. Basically, unless you need the file's contents otherwise you should try to avoid having to store the whole file in memory at one point in time and strive to process the whole file as seldom as possible (ideally only once).

Agreed, file I/O is a bottleneck and it may not even be possible
sometimes to store a huge file in RAM. In this case I'm working with
the knowledge (? .. or pretty good assumption) that the files will fit
into memory, so the ease of implementation becomes a factor.

Absolutely.

But I am always looking for alternative ways of doing this, just in
case I run into trouble with one way of solving/tackling a problem.

The files contain multiple DNA sequences. Each sequence starts on a
line with a '>' in column 1 and its header, and then is followed by an unknown number
of lines with data. If there is another record, it will start with
a '>' in column one etc.

The problem is that since I am concatenating a number of different sequences
into a large file, the possibility of duplicate sequences exists, which I
need to identify and eliminate.

I could scan the file once to determine which sequences are duplicates
and then process the file a 2nd time eliminating those. In fact that was
my first approach, but then Jordan's suggestions were so much cleaner
and simpler that I went with them. My files aren't very large but it's
good to have some other approaches in mind.

If the volume of data always fits into memory (and if you think about it, it *has* to be in memory for the duplicate detection), then I'd probably write a class (or find a class somewhere, probably in RAA) that represents a sequence and has proper comparison methods (#==, #eql?, #hash etc.). That way you can even use a Hash for fast duplicate checks. And you might even be able to represent those sequences internally with less memory (compressed, encoded, or whatever suits you best). For example, since you need just two bits to represent one element of the sequence, you can easily achieve a compression factor of 4 by storing two bits per entry instead of a whole char.
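A minimal sketch of such a sequence class (class and method names are mine, purely for illustration):

```ruby
class Sequence
  attr_reader :header, :data

  def initialize(header, data)
    @header = header
    @data   = data
  end

  # two sequences count as duplicates when their data matches,
  # regardless of header
  def ==(other)
    other.is_a?(Sequence) && data == other.data
  end
  alias eql? ==

  # Hash lookups (and Array#uniq) use #hash together with #eql?
  def hash
    data.hash
  end
end

seqs   = [Sequence.new(">a", "ACGT"), Sequence.new(">b", "ACGT")]
unique = seqs.uniq   # => one element, since both carry the same data
```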

Kind regards

  robert

···

On 02.01.2008 05:52, Esmail wrote:

... in which case I'd rather use the block form because then the FH is closed under all circumstances.

File.open ARGV[0] do |file|
   data = file.readlines
end

Although in this case you can simply do

data = File.readlines ARGV[0]

:-)

Kind regards

  robert

···

On 31.12.2007 08:14, MonkeeSage wrote:

Ps. ruby will normally close open file handles on garbage collection
or in finalization, but just in case of some catastrophic failure
(what, I'm not sure), it's usually considered good practice to close
file handles manually:

...
     file=File.open(ARGV[0])
     data=file.readlines
     file.close
...

MonkeeSage wrote:

Ps. ruby will normally close open file handles on garbage collection
or in finalization, but just in case of some catastrophic failure
(what, I'm not sure), it's usually considered good practice to close
file handles manually:

...
     file=File.open(ARGV[0])
     data=file.readlines
     file.close

Oops, absolutely, that sort of housekeeping needs to be done. I'm used to
doing this with C and Java .. not sure why I didn't think about it with Ruby.

By the way, I know if I do something like

File.open("bla").each { |line|
  stuff
}

the file gets automagically closed at the end of the block.

I'm sorry, but this is wrong. You either need to do

File.open("bla") do |io|
   io.each do |line|
     # stuff
   end
end

or

File.foreach "bla" do |line|
   # stuff
end

What about the example you gave, data=IO.read("bla") ... I assume the
file gets opened, read and closed in one fell swoop, correct?

This is correct.

Kind regards

  robert

···

On 02.01.2008 05:40, Esmail wrote:

Good morning!

Robert Klemme wrote:

I could scan the file once to determine which sequences are duplicates
and then process the file a 2nd time eliminating those. In fact that was
my first approach, but then Jordan's suggestions were so much cleaner
and simpler that I went with them. My files aren't very large but it's
good to have some other approaches in mind.

If the volume of data always fits into memory (and if you think about it, it *has* to be in memory for the duplicate detection).

Actually, while it is easier this way, I don't think it has to be
all in memory at once. The two-step procedure above would work too,
I believe.

I.e., first do the equivalent of a grep for > and identify and
store the duplicate headers this way.

I thought you needed the whole sequence to detect duplicates. If you can do it on headers only then that's probably also good.

Then re-read the file, line by line and stop copying input to
output file when one of the duplicate headers is read, until
the next legitimate header is found.

You could even store file offsets along with headers from the first reading pass. That way you can faster skip duplicates during the second pass.

Not perhaps the most efficient method, but it would work with
any size file regardless of RAM since it essentially works on
a line-by-line basis.

Well, it would work with significantly larger files. "Any size" is a dangerous term - what do you do if there are 2^40 headers in there? :-) But I guess in those cases you would rather resort to using some kind of hashing anyway.

Then I'd probably write a class (or find a class somewhere, probably in RAA)

Ah .. thanks for the pointer to RAA, I just googled it and
hadn't come across it before. It looks like a useful resource.

Definitively is.

that will represent a sequence and have proper comparison methods (#==, #eql?, #hash etc.).

The equivalent of compareTo in Java, right? Yes, that would make
a lot of sense, though again, for this simple script a simple
string comparison of the header is all that's needed.

No, #== and #eql? are like Java's equals(), and #hash is like Java's hashCode(). Java's compareTo() is called #<=> in Ruby. Please also look at module Comparable, which will give you all the other comparison operators for free.
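A toy example of the Comparable point (not from the thread, just an illustration): define #<=> once, include Comparable, and <, <=, ==, >= etc. come for free:

```ruby
class Version
  include Comparable
  attr_reader :num

  def initialize(num)
    @num = num
  end

  # Comparable derives all the other comparison operators from this
  def <=>(other)
    num <=> other.num
  end
end

smaller = Version.new(1) < Version.new(2)   # => true
```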

That way you can even use a Hash for fast duplicate checks. And you might even be able to internally represent those sequences with less memory (compressed, encoded or whatever suits you best). For example, since you just need two bits to represent one element of the sequence you can achieve compression of factor 4 easily by not using a char per entry but just two bits.

Neat, you are right, there are lots of ways to work this.

Yep, and I'd add that this is what makes software engineering so interesting. There are often so many solutions, and a slight variation of the requirements and preconditions can make another solution much better than the one you had. As we say so often, "it all depends"... :-)
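To make the two-bits-per-base idea concrete, a rough sketch (my own illustration; real sequence data would also need to handle 'N' and other ambiguity codes):

```ruby
CODES = { "A" => 0b00, "C" => 0b01, "G" => 0b10, "T" => 0b11 }

# pack a DNA string into one integer, two bits per base
def pack(seq)
  seq.each_char.inject(0) { |bits, c| (bits << 2) | CODES.fetch(c) }
end

# unpacking needs the length, since leading "A"s are leading zero bits
def unpack(bits, length)
  bases = CODES.invert
  (1..length).map { |i| bases[(bits >> (2 * (length - i))) & 0b11] }.join
end
```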

Kind regards

  robert

···

On 02.01.2008 13:45, Esmail wrote:

Robert Klemme wrote:

MonkeeSage wrote:

Ps. ruby will normally close open file handles on garbage collection
or in finalization, but just in case of some catastrophic failure
(what, I'm not sure), it's usually considered good practice to close
file handles manually:

...
     file=File.open(ARGV[0])
     data=file.readlines
     file.close

Oops, absolutely, that sort of housekeeping needs to be done. I'm used to
doing this with C and Java .. not sure why I didn't think about it with Ruby.

By the way, I know if I do something like

File.open("bla").each { |line|
  stuff
}

the file gets automagically closed at the end of the block.

I'm sorry, but this is wrong.

I have used this code snippet in the past without problems in
several scripts.

This is where I got it originally from:

File Access

Note that this page does not claim automated closing of the IO object for this code snippet as far as I can see. It just uses this to demonstrate that you can use #each with an IO object.

After looking at this I suggest you rather use the Pickaxe as your reference - even the first version (which is online) is better than what this page suggests. There are various issues with the code on the page you referred to, e.g. not closing IO objects under all circumstances; the example with "bla.txt" in particular is overly complex.

http://ruby-doc.org/docs/ProgrammingRuby/

IO is here:
http://ruby-doc.org/docs/ProgrammingRuby/html/tut_io.html

Are you saying I am wrong about the file being closed at the
end of the block,

Exactly.

or the whole construct?

Well, the former kind of obsoletes the whole construct, doesn't it? ;-)

You either need to do

File.open("bla") do |io|
  io.each do |line|
    # stuff
  end
end

or

File.foreach "bla" do |line|
  # stuff
end

What about the example you gave, data=IO.read("bla") ... I assume the
file gets opened, read and closed in one fell swoop, correct?

This is correct.

Cool .. thanks for confirming this.

You're welcome.

Kind regards

  robert

···

On 02.01.2008 13:30, Esmail wrote:

On 02.01.2008 05:40, Esmail wrote: