So I have about a half million records in a non-standard, non-delimited
format. The fields are fixed widths, each record is separated by a new line.
I need to get this data into csv format so I can import it into a mysql
database for analysis.
I figure a script is the best way, but I'm stumped since the fields aren't
delimited.
Any tips, suggestions, pointers or a script that does something similar
would be GREATLY appreciated.
I am very new to Ruby, but I am impressed so far.
And I know there's a way to do this, I just don't know how it's done.
This is sloppy but you get the idea.
Try something like this. You can improve it.
arr = ["abcdefgh","12345678","867-5309"]
mydata = ""
arr.each do |x|
x =~ /(...)(.)(....)/ #grabs 3 characters, then 1, then 4
mydata << $1 + "," + $2 + "," + $3 + "\n"
end
puts mydata
Harry
···
On 3/30/07, Stephen Smith <4fires@gmail.com> wrote:
The fields are fixed widths, each record is separated by a new line.
I need to get this data into csv format so I can import it into a mysql
database for analysis.
I figure a script is the best way, but I'm stumped since the fields aren't
delimited.
Any tips, suggestions, pointers or a script that does something similar
would be GREATLY appreciated.
You could iterate on each line using the each method then use the string as
an array as the fields are fixed length.
File.open(filename) |f|
f.each |line|
field1 = line[0..2]
field2 = line[3..4]
.......
end
end
Kiran Kumar
···
On 3/30/07, Stephen Smith <4fires@gmail.com> wrote:
Hi all,
So I have about a half million records in a non-standard, non-delimited
format. The fields are fixed widths, each record is separated by a new
line.
I need to get this data into csv format so I can import it into a mysql
database for analysis.
I figure a script is the best way, but I'm stumped since the fields aren't
delimited.
Any tips, suggestions, pointers or a script that does something similar
would be GREATLY appreciated.
I am very new to Ruby, but I am impressed so far.
And I know there's a way to do this, I just don't know how it's done.
If your fixed-width fields are more than, say, five characters wide, your regexp would be clearer by specifying the number of occurrences of each group rather than writing a . for each occurrence. I.e.:
So I've cleaned up the regular expression, and I like the simplicity of the
upack message.
But I think that the format Harry and Gary suggested clearly represents the
pattern I'm matching.
Since the pattern may change, and/or we may get other data dumps from this
supplier in the future, I think keeping it represented clearly in one place
will help with maintenance.
Here's what I wound up using:
line = ""
record = ""
oldLog = File.open("filename.txt")
newLog = File.new("filename_formatted.csv", "w")
hdr = "field1, field2, field3,...."
newLog << hdr
arr = oldLog.readlines
arr.each do |line|
record <<
/(.{8})(.{6})(.{15})(.{3})(.{30})(.{4})(.{15})(.{1})(.{12})(.{9})/.match(line).captures.join(',')
<< "\n"
newLogg << record
record = ""
end
···
+
LOAD DATA INFILE '*/filename.csv*'
INTO TABLE *data_table*
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(field1, field2, field3,...);
And now I can happily munge my monthly data dump to my boss's hearts
content.
Thanks again guys.
I officially love Ruby.
Steve
On 3/30/07, Andrew Stewart <boss@airbladesoftware.com> wrote:
On 30 Mar 2007, at 12:59, James Edward Gray II wrote:
> Here's another option:
>
> >> xs, ys, zs = "XXXYYYYYZZ".unpack("A3A5A2")
> => ["XXX", "YYYYY", "ZZ"]
> >> xs
> => "XXX"
> >> ys
> => "YYYYY"
> >> zs
> => "ZZ"
Just FYI, when speed matters it may be worth using unpack:
#!/usr/bin/env ruby -w
require "benchmark"
TESTS = 100_000
LINE = "XYZ" * 100
Benchmark.bmbm do |results|
results.report("regex:") do
TESTS.times do
/(.{8})(.{6})(.{15})(.{3})(.{30})(.{4})(.{15})(.{1})(.{12})(.{9})/.match(LINE).captures.join(",")
end
end
results.report("unpack:") do
TESTS.times do
LINE.unpack("A8A6A15A3A30A4A15A1A12A9").join(",")
end
end
end
# >> Rehearsal -------------------------------------------
# >> regex: 2.060000 0.000000 2.060000 ( 2.067150)
# >> unpack: 0.760000 0.000000 0.760000 ( 0.762482)
# >> ---------------------------------- total: 2.820000sec
# >>
# >> user system total real
# >> regex: 2.040000 0.000000 2.040000 ( 2.046938)
# >> unpack: 0.770000 0.000000 0.770000 ( 0.763620)
__END__
James Edward Gray II
···
On Mar 30, 2007, at 2:31 PM, Stephen Smith wrote:
But I think that the format Harry and Gary suggested clearly represents the pattern I'm matching.
Which could be a constant at the top of the source:
LINE_PATTERN = /^(...)(..)(....)/
Then later on in your code you can say:
record << LINE_PATTERN.match(line).captures.join(',') << "\n"
Regards,
Brian.
···
On Sat, Mar 31, 2007 at 04:31:56AM +0900, Stephen Smith wrote:
So I've cleaned up the regular expression, and I like the simplicity of the
upack message.
But I think that the format Harry and Gary suggested clearly represents the
pattern I'm matching.
Since the pattern may change, and/or we may get other data dumps from this
supplier in the future, I think keeping it represented clearly in one place
will help with maintenance.
Thanks for the tip. I love the native pattern matching. I had no idea
regular expressions could be such powerful objects. And I can't think of a
good reason not to use Ruby around the house from now on. I'm going to sit
down with a book tonight.
And I killed the internal array, btw. I didn't know Ruby iterates through
files as easily as arrays.
This is like going back to school.
Grinning,
Steve
···
On 3/30/07, Gary Wright <gwtmp01@mac.com> wrote:
On Mar 30, 2007, at 3:31 PM, Stephen Smith wrote:
> line = ""
> record = ""
> oldLog = File.open("filename.txt")
> newLog = File.new("filename_formatted.csv", "w")
> hdr = "field1, field2, field3,...."
> newLog << hdr
> arr = oldLog.readlines
> arr.each do |line|
> record <<
> /(.{8})(.{6})(.{15})(.{3})(.{30})(.{4})(.{15})(.{1})(.{12})(.
> {9})/.match(line).captures.join(',')
> << "\n"
> newLogg << record
> record = ""
> end
How about making it so you don't have to type that pattern next time:
def columns(*widths)
Regexp.new(widths.map {|count| "(.{#{count}})" }.join)
end
pattern = columns(6,15,3,30,4,15,1,12,9)
You also can get rid of the intermediate strings and the potentially
huge
internal array (from slurping up your oldLog into an array):
sprintf allows you specify formatting that can be used to print fixed
width.
An example from the docs (ruby-doc.org) is:
sprintf("%08b '%4s'", 123, 123) #=> "01111011 ' 123'"
Note how the %4s puts 3 characters in a space 4 characters wide.
cheers
···
On Mar 6, 9:38 pm, Ruby Kk <kumar...@gmail.com> wrote:
Hi there,
I am new to ruby and trying to create a flat file with fixed column
length. i have a header,footer and records that i want in the file.
any ideas?
Cheers!
Gary Wright wrote:
> On Mar 30, 2007, at 3:31 PM, Stephen Smith wrote:
>> {9})/.match(line).captures.join(',')
>> << "\n"
>> newLogg << record
>> record = ""
>> end
> How about making it so you don't have to type that pattern next time: