So I have about a half million records in a non-standard, non-delimited
format. The fields are fixed widths, each record is separated by a new line.
I need to get this data into csv format so I can import it into a mysql
database for analysis.
I figure a script is the best way, but I'm stumped since the fields aren't
delimited.
Any tips, suggestions, pointers or a script that does something similar
would be GREATLY appreciated.
I am very new to Ruby, but I am impressed so far.
And I know there's a way to do this, I just don't know how it's done.
The fields are fixed widths, each record is separated by a new line.
I need to get this data into csv format so I can import it into a mysql
database for analysis.
I figure a script is the best way, but I'm stumped since the fields aren't
delimited.
Any tips, suggestions, pointers or a script that does something similar
would be GREATLY appreciated.
So I have about a half million records in a non-standard, non-delimited
format. The fields are fixed widths, each record is separated by a new
line.
I need to get this data into csv format so I can import it into a mysql
database for analysis.
I figure a script is the best way, but I'm stumped since the fields aren't
delimited.
Any tips, suggestions, pointers or a script that does something similar
would be GREATLY appreciated.
I am very new to Ruby, but I am impressed so far.
And I know there's a way to do this, I just don't know how it's done.
If your fixed-width fields are more than, say, five characters wide, your regexp would be clearer by specifying the number of occurrences of each group rather than writing a . for each occurrence. I.e.:
So I've cleaned up the regular expression, and I like the simplicity of the
upack message.
But I think that the format Harry and Gary suggested clearly represents the
pattern I'm matching.
Since the pattern may change, and/or we may get other data dumps from this
supplier in the future, I think keeping it represented clearly in one place
will help with maintenance.
Here's what I wound up using:
line = ""
record = ""
oldLog = File.open("filename.txt")
newLog = File.new("filename_formatted.csv", "w")
hdr = "field1, field2, field3,...."
newLog << hdr
arr = oldLog.readlines
arr.each do |line|
record <<
/(.{8})(.{6})(.{15})(.{3})(.{30})(.{4})(.{15})(.{1})(.{12})(.{9})/.match(line).captures.join(',')
<< "\n"
newLogg << record
record = ""
end
···
+
LOAD DATA INFILE '*/filename.csv*'
INTO TABLE *data_table*
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(field1, field2, field3,...);
And now I can happily munge my monthly data dump to my boss's hearts
content.
Just FYI, when speed matters it may be worth using unpack:
#!/usr/bin/env ruby -w
require "benchmark"
TESTS = 100_000
LINE = "XYZ" * 100
Benchmark.bmbm do |results|
results.report("regex:") do
TESTS.times do
/(.{8})(.{6})(.{15})(.{3})(.{30})(.{4})(.{15})(.{1})(.{12})(.{9})/.match(LINE).captures.join(",")
end
end
results.report("unpack:") do
TESTS.times do
LINE.unpack("A8A6A15A3A30A4A15A1A12A9").join(",")
end
end
end
# >> Rehearsal -------------------------------------------
# >> regex: 2.060000 0.000000 2.060000 ( 2.067150)
# >> unpack: 0.760000 0.000000 0.760000 ( 0.762482)
# >> ---------------------------------- total: 2.820000sec
# >>
# >> user system total real
# >> regex: 2.040000 0.000000 2.040000 ( 2.046938)
# >> unpack: 0.770000 0.000000 0.770000 ( 0.763620)
__END__
James Edward Gray II
···
On Mar 30, 2007, at 2:31 PM, Stephen Smith wrote:
But I think that the format Harry and Gary suggested clearly represents the pattern I'm matching.
Which could be a constant at the top of the source:
LINE_PATTERN = /^(...)(..)(....)/
Then later on in your code you can say:
record << LINE_PATTERN.match(line).captures.join(',') << "\n"
Regards,
Brian.
···
On Sat, Mar 31, 2007 at 04:31:56AM +0900, Stephen Smith wrote:
So I've cleaned up the regular expression, and I like the simplicity of the
upack message.
But I think that the format Harry and Gary suggested clearly represents the
pattern I'm matching.
Since the pattern may change, and/or we may get other data dumps from this
supplier in the future, I think keeping it represented clearly in one place
will help with maintenance.
Thanks for the tip. I love the native pattern matching. I had no idea
regular expressions could be such powerful objects. And I can't think of a
good reason not to use Ruby around the house from now on. I'm going to sit
down with a book tonight.
And I killed the internal array, btw. I didn't know Ruby iterates through
files as easily as arrays.
sprintf allows you specify formatting that can be used to print fixed
width.
An example from the docs (ruby-doc.org) is:
sprintf("%08b '%4s'", 123, 123) #=> "01111011 ' 123'"
Note how the %4s puts 3 characters in a space 4 characters wide.
Hi there,
I am new to ruby and trying to create a flat file with fixed column
length. i have a header,footer and records that i want in the file.
any ideas?
Cheers!
Gary Wright wrote:
> On Mar 30, 2007, at 3:31 PM, Stephen Smith wrote:
>> {9})/.match(line).captures.join(',')
>> << "\n"
>> newLogg << record
>> record = ""
>> end
> How about making it so you don't have to type that pattern next time: