The issue here is that we could have duplicate entries. A duplicate entry
is when both the ID & the INSTANCE from one record are found on another
record. My task is to identify all dups and create a file with them. A
dup can occur at any time during the 24-hour period.
My idea was to take each ID and do a grep against the file. That will
certainly find a match because, at the very minimum, it will find itself.
That's no good because I only want to output the record when I find two or
more instances of the ID/INSTANCE pair. I also thought about extracting just
the ID/INSTANCE and creating an array such as this:
The idea was to take each element of the first column and walk the
entire file looking for dups. I'm also stuck finding a good way to do
this. The other problem is whether I will be able to create an array that
large. Remember that there could potentially be 2 million records.
I'd create a DB table with fields for ID and INSTANCE and put a
unique constraint on them. Then insert your records, rescuing the
ActiveRecord::RecordNotUnique errors and writing those entries
to your dup file (or another DB table).
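A minimal sketch of that approach, assuming a hypothetical Entry model backed by an in-memory SQLite database, and assuming ID and INSTANCE are the first two whitespace-separated fields of each record (the real file layout isn't shown in this thread; the file names are placeholders):

  require "active_record"

  # Hypothetical connection; in practice point this at a real database.
  ActiveRecord::Base.establish_connection(adapter: "sqlite3", database: ":memory:")

  ActiveRecord::Schema.define do
    create_table :entries do |t|
      t.string :record_id    # named record_id to avoid clashing with the primary key
      t.string :instance
    end
    # The unique index is what turns a duplicate insert into an error we can rescue.
    add_index :entries, [:record_id, :instance], unique: true
  end

  class Entry < ActiveRecord::Base; end

  File.open("dups.txt", "w") do |dups|
    File.foreach("records.txt") do |line|
      id, instance = line.split
      begin
        Entry.create!(record_id: id, instance: instance)
      rescue ActiveRecord::RecordNotUnique
        dups.puts line    # second and later occurrences end up in the dup file
      end
    end
  end

One INSERT per record is slow for two million rows, but it keeps the bookkeeping in the database rather than in the Ruby process.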
FWIW,
···
On Thu, Oct 24, 2013 at 11:17 AM, Ruby Student <ruby.student@gmail.com> wrote:
The issue here is that we could have duplicate entries. A duplicate entry is
when both the ID & the INSTANCE from one record are found on another record.
My task is to identify all dups and create a file with them.
I think you are close, but what you are considering is searching the entire file for each [ID, Instance] pair, in essence doing n*n comparisons. This is not good, especially when n = 1000000.
Instead, go through the file once and maintain a Set containing each [ID, Instance] pair. For each record, check if the pair is already in the Set. If it is, you have a duplicate. If not, add it to the Set and move on.
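A rough sketch of that single pass, with placeholder file names and assuming ID and INSTANCE are the first two whitespace-separated fields of each record:

  require "set"

  seen = Set.new
  File.open("dups.txt", "w") do |dups|
    File.foreach("records.txt") do |line|
      pair = line.split.first(2)    # [ID, INSTANCE]
      if seen.include?(pair)
        dups.puts line              # already seen: this record is a duplicate
      else
        seen << pair                # first occurrence: remember it and move on
      end
    end
  end

Memory grows with the number of distinct pairs rather than with n*n comparisons, so two million records should stay manageable.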
-Justin
···
On 10/24/2013 11:17 AM, Ruby Student wrote:
Hello Team,
I have a file with over a million records in chronological order. A
piece of the file looks like this:
The issue here is that we could have duplicate entries. A duplicate
entry is when both the ID & the INSTANCE from one record are found on
another record. My task is to identify all dups and create a file
with them. A dup can occur at any time during the 24-hour period.
My idea was to take each ID and do a grep against the file. That will
certainly find a match because, at the very minimum, it will find itself.
That's no good because I only want to output the record when I find two or
more instances of the ID/INSTANCE pair. I also thought about extracting just
the ID/INSTANCE and creating an array such as this:
The idea was to take each element of the first column and walk the
entire file looking for dups. I'm also stuck finding a good way to
do this. The other problem is whether I will be able to create an array
that large. Remember that there could potentially be 2 million records.
My task is to identify all dups and create a file with them.
Your program just writes the second, third, and any further occurrences.
Btw. you can simplify lines 21 to 25 to
puts line unless entries.add? entry
alternatively
entries.add? entry or puts line
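Guessing at the surrounding loop, since the original program isn't reproduced in this thread, the Set#add? form reads roughly like:

  require "set"

  entries = Set.new
  File.foreach("records.txt") do |line|    # placeholder file name
    entry = line.split.first(2)            # assumes [ID, INSTANCE] are the first two fields
    # Set#add? returns nil when the element was already present,
    # so only the second and later occurrences get printed.
    puts line unless entries.add? entry
  end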
Agree that you could save a lot of memory by considering a day at a time
instead of all entries.
It's not about saving memory. With large files it may not even be
possible to run a program that holds all the file content in memory,
because it dies from memory exhaustion. If OTOH there is enough
memory in the machine, then chances are that the second run through the
file is done from cached file content and will be much cheaper than
the first one.
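For what it's worth, a sketch of the day-at-a-time variant, under the assumptions that the file is chronological, that duplicates only need to be caught within the same day, and that the date is the first whitespace-separated field (none of which the thread confirms):

  require "set"

  seen = Set.new
  current_day = nil

  File.open("dups.txt", "w") do |dups|
    File.foreach("records.txt") do |line|
      day, id, instance = line.split    # hypothetical layout: DATE ID INSTANCE ...
      if day != current_day
        seen.clear                      # new day: earlier pairs can be forgotten
        current_day = day
      end
      dups.puts line unless seen.add?([id, instance])
    end
  end

This keeps only one day's worth of pairs in memory at a time.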
Cheers
robert
···
On Fri, Oct 25, 2013 at 6:12 PM, Justin Collins <justincollins@ucla.edu> wrote:
On 10/25/2013 01:59 AM, Robert Klemme wrote:
On Thu, Oct 24, 2013 at 11:00 PM, Reid Thompson <Reid.Thompson@ateb.com> wrote: