Text processing

Stephen_Smith · 30 March 2007 03:37

Hi all,

So I have about a half million records in a non-standard, non-delimited
format. The fields are fixed widths, each record is separated by a new line.
I need to get this data into csv format so I can import it into a mysql
database for analysis.

I figure a script is the best way, but I'm stumped since the fields aren't
delimited.

Any tips, suggestions, pointers or a script that does something similar
would be GREATLY appreciated.

I am very new to Ruby, but I am impressed so far.

And I know there's a way to do this, I just don't know how it's done.

Best,
Steve (ruby noob)

Harry4 · 30 March 2007 05:25

This is sloppy but you get the idea.
Try something like this. You can improve it.

arr = ["abcdefgh","12345678","867-5309"]
mydata = ""
arr.each do |x|
x =~ /(...)(.)(....)/ #grabs 3 characters, then 1, then 4
mydata << $1 + "," + $2 + "," + $3 + "\n"
end

puts mydata

Harry

···

On 3/30/07, Stephen Smith <4fires@gmail.com> wrote:

The fields are fixed widths, each record is separated by a new line.
I need to get this data into csv format so I can import it into a mysql
database for analysis.

I figure a script is the best way, but I'm stumped since the fields aren't
delimited.

Any tips, suggestions, pointers or a script that does something similar
would be GREATLY appreciated.

--

Japanese Ruby List Subjects in English

James_Edward_Gray_II · 30 March 2007 11:59

So I have about a half million records in a non-standard, non-delimited
format. The fields are fixed widths, each record is separated by a new line.

Here's another option:

>> xs, ys, zs = "XXXYYYYYZZ".unpack("A3A5A2")
=> ["XXX", "YYYYY", "ZZ"]
>> xs
=> "XXX"
>> ys
=> "YYYYY"
>> zs
=> "ZZ"

Hope that helps.

James Edward Gray II

···

On Mar 29, 2007, at 10:37 PM, Stephen Smith wrote:

Kiran_Kumar_K · 30 March 2007 21:14

Stephen,

You could iterate on each line using the each method then use the string as
an array as the fields are fixed length.

File.open(filename) |f|
  f.each |line|
   field1 = line[0..2]
   field2 = line[3..4]
   .......
  end
end

Kiran Kumar

···

On 3/30/07, Stephen Smith <4fires@gmail.com> wrote:

Hi all,

So I have about a half million records in a non-standard, non-delimited
format. The fields are fixed widths, each record is separated by a new
line.
I need to get this data into csv format so I can import it into a mysql
database for analysis.

I figure a script is the best way, but I'm stumped since the fields aren't
delimited.

Any tips, suggestions, pointers or a script that does something similar
would be GREATLY appreciated.

I am very new to Ruby, but I am impressed so far.

And I know there's a way to do this, I just don't know how it's done.

Best,
Steve (ruby noob)

Gary_Wright · 30 March 2007 05:49

mydata << /(...)(.)(....)/.match(x).captures.join(',') << "\n"

Gary Wright

···

On Mar 30, 2007, at 1:25 AM, Harry wrote:

x =~ /(...)(.)(....)/ #grabs 3 characters, then 1, then 4
mydata << $1 + "," + $2 + "," + $3 + "\n"

Andrew_Stewart · 30 March 2007 14:14

I like the way your mind works. Nice!

Regards,
Andy Stewart

···

On 30 Mar 2007, at 12:59, James Edward Gray II wrote:

Here's another option:

>> xs, ys, zs = "XXXYYYYYZZ".unpack("A3A5A2")
=> ["XXX", "YYYYY", "ZZ"]
>> xs
=> "XXX"
>> ys
=> "YYYYY"
>> zs
=> "ZZ"

Harry4 · 30 March 2007 06:24

'captures' is new to me.
Thanks.

Harry

···

On 3/30/07, Gary Wright <gwtmp01@mac.com> wrote:

On Mar 30, 2007, at 1:25 AM, Harry wrote:
> x =~ /(...)(.)(....)/ #grabs 3 characters, then 1, then 4
> mydata << $1 + "," + $2 + "," + $3 + "\n"

mydata << /(...)(.)(....)/.match(x).captures.join(',') << "\n"

Gary Wright

--

Japanese Ruby List Subjects in English

Andrew_Stewart · 30 March 2007 08:28

If your fixed-width fields are more than, say, five characters wide, your regexp would be clearer by specifying the number of occurrences of each group rather than writing a . for each occurrence. I.e.:

mydata << /(.{3})(.)(.{4})/.match(x).captures.join(',') << "\n"

Regards,
Andy Stewart

···

On 30 Mar 2007, at 06:49, Gary Wright wrote:

On Mar 30, 2007, at 1:25 AM, Harry wrote:

x =~ /(...)(.)(....)/ #grabs 3 characters, then 1, then 4
mydata << $1 + "," + $2 + "," + $3 + "\n"

mydata << /(...)(.)(....)/.match(x).captures.join(',') << "\n"

Stephen_Smith · 30 March 2007 19:31

Thanks James. Thanks Andy.

So I've cleaned up the regular expression, and I like the simplicity of the
upack message.

But I think that the format Harry and Gary suggested clearly represents the
pattern I'm matching.

Since the pattern may change, and/or we may get other data dumps from this
supplier in the future, I think keeping it represented clearly in one place
will help with maintenance.

Here's what I wound up using:

line = ""
record = ""
oldLog = File.open("filename.txt")
newLog = File.new("filename_formatted.csv", "w")
hdr = "field1, field2, field3,...."
newLog << hdr
arr = oldLog.readlines
arr.each do |line|
record <<
/(.{8})(.{6})(.{15})(.{3})(.{30})(.{4})(.{15})(.{1})(.{12})(.{9})/.match(line).captures.join(',')
<< "\n"
newLogg << record
record = ""
end

···

+

LOAD DATA INFILE '*/filename.csv*'
INTO TABLE *data_table*
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(field1, field2, field3,...);

And now I can happily munge my monthly data dump to my boss's hearts
content.

Thanks again guys.

I officially love Ruby.

Steve

On 3/30/07, Andrew Stewart <boss@airbladesoftware.com> wrote:

On 30 Mar 2007, at 12:59, James Edward Gray II wrote:
> Here's another option:
>
> >> xs, ys, zs = "XXXYYYYYZZ".unpack("A3A5A2")
> => ["XXX", "YYYYY", "ZZ"]
> >> xs
> => "XXX"
> >> ys
> => "YYYYY"
> >> zs
> => "ZZ"

I like the way your mind works. Nice!

Regards,
Andy Stewart

Stephen_Smith · 30 March 2007 07:43

Nice. You just made my month.

Thank you Harry. Thank you Gary.

Freekin' elegant.

Steve

···

On 3/29/07, Harry <ruby.hardware@gmail.com> wrote:

On 3/30/07, Gary Wright <gwtmp01@mac.com> wrote:
>
> On Mar 30, 2007, at 1:25 AM, Harry wrote:
> > x =~ /(...)(.)(....)/ #grabs 3 characters, then 1, then 4
> > mydata << $1 + "," + $2 + "," + $3 + "\n"
>
> mydata << /(...)(.)(....)/.match(x).captures.join(',') << "\n"
>
> Gary Wright
>

'captures' is new to me.
Thanks.

Harry

--

http://www.kakueki.com/ruby/list.html
Japanese Ruby List Subjects in English

James_Edward_Gray_II · 30 March 2007 19:47

Just FYI, when speed matters it may be worth using unpack:

#!/usr/bin/env ruby -w

require "benchmark"

TESTS = 100_000
LINE = "XYZ" * 100
Benchmark.bmbm do |results|
   results.report("regex:") do
     TESTS.times do
       /(.{8})(.{6})(.{15})(.{3})(.{30})(.{4})(.{15})(.{1})(.{12})(.{9})/.match(LINE).captures.join(",")
     end
   end
   results.report("unpack:") do
     TESTS.times do
       LINE.unpack("A8A6A15A3A30A4A15A1A12A9").join(",")
     end
   end
end
# >> Rehearsal -------------------------------------------
# >> regex: 2.060000 0.000000 2.060000 ( 2.067150)
# >> unpack: 0.760000 0.000000 0.760000 ( 0.762482)
# >> ---------------------------------- total: 2.820000sec
# >>
# >> user system total real
# >> regex: 2.040000 0.000000 2.040000 ( 2.046938)
# >> unpack: 0.770000 0.000000 0.770000 ( 0.763620)

__END__

James Edward Gray II

···

On Mar 30, 2007, at 2:31 PM, Stephen Smith wrote:

But I think that the format Harry and Gary suggested clearly represents the pattern I'm matching.

Gary_Wright · 30 March 2007 20:15

How about making it so you don't have to type that pattern next time:

def columns(*widths)
Regexp.new(widths.map {|count| "(.{#{count}})" }.join)
end
pattern = columns(6,15,3,30,4,15,1,12,9)

You also can get rid of the intermediate strings and the potentially huge
internal array (from slurping up your oldLog into an array):

oldLog.each { |line|
newLog.print pattern.match(line).captures.join(',')
}

···

On Mar 30, 2007, at 3:31 PM, Stephen Smith wrote:

line = ""
record = ""
oldLog = File.open("filename.txt")
newLog = File.new("filename_formatted.csv", "w")
hdr = "field1, field2, field3,...."
newLog << hdr
arr = oldLog.readlines
arr.each do |line|
record <<
/(.{8})(.{6})(.{15})(.{3})(.{30})(.{4})(.{15})(.{1})(.{12})(.{9})/.match(line).captures.join(',')
<< "\n"
newLogg << record
record = ""
end

Brian_Candler · 31 March 2007 08:19

Which could be a constant at the top of the source:

LINE_PATTERN = /^(...)(..)(....)/

Then later on in your code you can say:

record << LINE_PATTERN.match(line).captures.join(',') << "\n"

Regards,

Brian.

···

On Sat, Mar 31, 2007 at 04:31:56AM +0900, Stephen Smith wrote:

So I've cleaned up the regular expression, and I like the simplicity of the
upack message.

But I think that the format Harry and Gary suggested clearly represents the
pattern I'm matching.

Since the pattern may change, and/or we may get other data dumps from this
supplier in the future, I think keeping it represented clearly in one place
will help with maintenance.

_Pena_Botp1 · 30 March 2007 10:52

# Nice. You just made my month.
# Thank you Harry. Thank you Gary.
# Freekin' elegant.
# Steve

···

From: Stephen Smith [mailto:4fires@gmail.com] :
#
# On 3/29/07, Harry <ruby.hardware@gmail.com> wrote:
# > On 3/30/07, Gary Wright <gwtmp01@mac.com> wrote:
# > > On Mar 30, 2007, at 1:25 AM, Harry wrote:
# > > > x =~ /(...)(.)(....)/ #grabs 3 characters, then 1, then 4
# > > > mydata << $1 + "," + $2 + "," + $3 + "\n"
# > >
# > > mydata << /(...)(.)(....)/.match(x).captures.join(',') << "\n"

ruby is fun, elegant, and powerful.
some more samples.

irb(main):030:0> x
=> "01234567890"
irb(main):031:0> x[/(..)...(....)/]
=> "012345678"
irb(main):032:0> $1
=> "01"
irb(main):033:0> $2
=> "5678"
irb(main):034:0> x[0..1]
=> "01"
irb(main):035:0> x[5..8]
=> "5678"
irb(main):036:0> field1=0..1
=> 0..1
irb(main):037:0> field2=5..8
=> 5..8
irb(main):038:0> x[field1]
=> "01"
irb(main):039:0> x[field2]
=> "5678"
irb(main):040:0>

kind regards -botp

Stephen_Smith · 30 March 2007 20:09

Sweet!

Thanks James.

This will probably help quite a bit as our user base (fingers crossed)
grows.

Steve

···

On 3/30/07, James Edward Gray II <james@grayproductions.net> wrote:

On Mar 30, 2007, at 2:31 PM, Stephen Smith wrote:

> But I think that the format Harry and Gary suggested clearly
> represents the pattern I'm matching.

Just FYI, when speed matters it may be worth using unpack:

#!/usr/bin/env ruby -w

require "benchmark"

TESTS = 100_000
LINE = "XYZ" * 100
Benchmark.bmbm do |results|
   results.report("regex:") do
     TESTS.times do
       /(.{8})(.{6})(.{15})(.{3})(.{30})(.{4})(.{15})(.{1})(.{12})(.
{9})/.match(LINE).captures.join(",")
     end
   end
   results.report("unpack:") do
     TESTS.times do
       LINE.unpack("A8A6A15A3A30A4A15A1A12A9").join(",")
     end
   end
end
# >> Rehearsal -------------------------------------------
# >> regex: 2.060000 0.000000 2.060000 ( 2.067150)
# >> unpack: 0.760000 0.000000 0.760000 ( 0.762482)
# >> ---------------------------------- total: 2.820000sec
# >>
# >> user system total real
# >> regex: 2.040000 0.000000 2.040000 ( 2.046938)
# >> unpack: 0.770000 0.000000 0.770000 ( 0.763620)

__END__

James Edward Gray II

Stephen_Smith · 30 March 2007 20:39

Gary,

Thanks for the tip. I love the native pattern matching. I had no idea
regular expressions could be such powerful objects. And I can't think of a
good reason not to use Ruby around the house from now on. I'm going to sit
down with a book tonight.

And I killed the internal array, btw. I didn't know Ruby iterates through
files as easily as arrays.

This is like going back to school.

Grinning,
Steve

···

On 3/30/07, Gary Wright <gwtmp01@mac.com> wrote:

On Mar 30, 2007, at 3:31 PM, Stephen Smith wrote:
> line = ""
> record = ""
> oldLog = File.open("filename.txt")
> newLog = File.new("filename_formatted.csv", "w")
> hdr = "field1, field2, field3,...."
> newLog << hdr
> arr = oldLog.readlines
> arr.each do |line|
> record <<
> /(.{8})(.{6})(.{15})(.{3})(.{30})(.{4})(.{15})(.{1})(.{12})(.
> {9})/.match(line).captures.join(',')
> << "\n"
> newLogg << record
> record = ""
> end

How about making it so you don't have to type that pattern next time:

def columns(*widths)
Regexp.new(widths.map {|count| "(.{#{count}})" }.join)
end
pattern = columns(6,15,3,30,4,15,1,12,9)

You also can get rid of the intermediate strings and the potentially
huge
internal array (from slurping up your oldLog into an array):

oldLog.each { |line|
newLog.print pattern.match(line).captures.join(',')
}

Bernard_Kenik · 30 March 2007 21:10

your LINE is 300 characters but you are only unpacking 103 ???

···

On Mar 30, 3:47 pm, James Edward Gray II <j...@grayproductions.net> wrote:

On Mar 30, 2007, at 2:31 PM, Stephen Smith wrote:

> But I think that the format Harry and Gary suggested clearly
> represents the pattern I'm matching.

Just FYI, when speed matters it may be worth using unpack:

#!/usr/bin/env ruby -w

require "benchmark"

TESTS = 100_000
LINE = "XYZ" * 100
Benchmark.bmbm do |results|
   results.report("regex:") do
     TESTS.times do
       /(.{8})(.{6})(.{15})(.{3})(.{30})(.{4})(.{15})(.{1})(.{12})(.
{9})/.match(LINE).captures.join(",")
     end
   end
   results.report("unpack:") do
     TESTS.times do
       LINE.unpack("A8A6A15A3A30A4A15A1A12A9").join(",")
     end
   end
end
# >> Rehearsal -------------------------------------------
# >> regex: 2.060000 0.000000 2.060000 ( 2.067150)
# >> unpack: 0.760000 0.000000 0.760000 ( 0.762482)
# >> ---------------------------------- total: 2.820000sec
# >>
# >> user system total real
# >> regex: 2.040000 0.000000 2.040000 ( 2.046938)
# >> unpack: 0.770000 0.000000 0.770000 ( 0.763620)

__END__

James Edward Gray II

Ruby_Kk · 7 March 2009 02:38

Hi there,
I am new to ruby and trying to create a flat file with fixed column
length. i have a header,footer and records that i want in the file.

any ideas?

Cheers!

Gary Wright wrote:

···

On Mar 30, 2007, at 3:31 PM, Stephen Smith wrote:

{9})/.match(line).captures.join(',')
<< "\n"
newLogg << record
record = ""
end

How about making it so you don't have to type that pattern next time:

def columns(*widths)
Regexp.new(widths.map {|count| "(.{#{count}})" }.join)
end
pattern = columns(6,15,3,30,4,15,1,12,9)

You also can get rid of the intermediate strings and the potentially
huge
internal array (from slurping up your oldLog into an array):

oldLog.each { |line|
newLog.print pattern.match(line).captures.join(',')
}

--
Posted via http://www.ruby-forum.com/\.

James_Edward_Gray_II · 30 March 2007 21:32

Yeah, I was too lazy to count it, so I just picked something big enough.

James Edward Gray II

···

On Mar 30, 2007, at 4:10 PM, bbiker wrote:

your LINE is 300 characters but you are only unpacking 103 ???

Chris6 · 7 March 2009 04:38

sprintf allows you specify formatting that can be used to print fixed
width.
An example from the docs (ruby-doc.org) is:
sprintf("%08b '%4s'", 123, 123) #=> "01111011 ' 123'"
Note how the %4s puts 3 characters in a space 4 characters wide.

cheers

···

On Mar 6, 9:38 pm, Ruby Kk <kumar...@gmail.com> wrote:

Hi there,
I am new to ruby and trying to create a flat file with fixed column
length. i have a header,footer and records that i want in the file.

any ideas?

Cheers!

Gary Wright wrote:
> On Mar 30, 2007, at 3:31 PM, Stephen Smith wrote:
>> {9})/.match(line).captures.join(',')
>> << "\n"
>> newLogg << record
>> record = ""
>> end

> How about making it so you don't have to type that pattern next time:

> def columns(*widths)
> Regexp.new(widths.map {|count| "(.{#{count}})" }.join)
> end
> pattern = columns(6,15,3,30,4,15,1,12,9)

> You also can get rid of the intermediate strings and the potentially
> huge
> internal array (from slurping up your oldLog into an array):

> oldLog.each { |line|
> newLog.print pattern.match(line).captures.join(',')
> }

--
Posted viahttp://www.ruby-forum.com/.

Topic		Replies	Views
Text processing library ruby-talk	9	63	18 April 2007
Can someone point me in the right direction? Pipe Delimited Files & Hashes ruby-talk	7	108	7 March 2007
String Spliting/CSV question ruby-talk	2	113	29 June 2010
Converting CSV ruby-talk	13	109	19 April 2012
Read column from text or csv file and append "varchar," ruby-talk	2	163	1 August 2013

Text processing

Related topics