Text processing

Hi all,

So I have about a half million records in a non-standard, non-delimited
format. The fields are fixed widths, each record is separated by a new line.
I need to get this data into csv format so I can import it into a mysql
database for analysis.

I figure a script is the best way, but I'm stumped since the fields aren't
delimited.

Any tips, suggestions, pointers or a script that does something similar
would be GREATLY appreciated.

I am very new to Ruby, but I am impressed so far.

And I know there's a way to do this, I just don't know how it's done.

Best,
Steve (ruby noob)

This is sloppy but you get the idea.
Try something like this. You can improve it.

arr = ["abcdefgh","12345678","867-5309"]
mydata = ""
arr.each do |x|
x =~ /(...)(.)(....)/ #grabs 3 characters, then 1, then 4
mydata << $1 + "," + $2 + "," + $3 + "\n"
end

puts mydata

Harry

···

On 3/30/07, Stephen Smith <4fires@gmail.com> wrote:

The fields are fixed widths, each record is separated by a new line.
I need to get this data into csv format so I can import it into a mysql
database for analysis.

I figure a script is the best way, but I'm stumped since the fields aren't
delimited.

Any tips, suggestions, pointers or a script that does something similar
would be GREATLY appreciated.

--

Japanese Ruby List Subjects in English

So I have about a half million records in a non-standard, non-delimited
format. The fields are fixed widths, each record is separated by a new line.

Here's another option:

>> xs, ys, zs = "XXXYYYYYZZ".unpack("A3A5A2")
=> ["XXX", "YYYYY", "ZZ"]
>> xs
=> "XXX"
>> ys
=> "YYYYY"
>> zs
=> "ZZ"

Hope that helps.

James Edward Gray II

···

On Mar 29, 2007, at 10:37 PM, Stephen Smith wrote:

Stephen,

You could iterate on each line using the each method then use the string as
an array as the fields are fixed length.

File.open(filename) |f|
  f.each |line|
   field1 = line[0..2]
   field2 = line[3..4]
   .......
  end
end

Kiran Kumar

···

On 3/30/07, Stephen Smith <4fires@gmail.com> wrote:

Hi all,

So I have about a half million records in a non-standard, non-delimited
format. The fields are fixed widths, each record is separated by a new
line.
I need to get this data into csv format so I can import it into a mysql
database for analysis.

I figure a script is the best way, but I'm stumped since the fields aren't
delimited.

Any tips, suggestions, pointers or a script that does something similar
would be GREATLY appreciated.

I am very new to Ruby, but I am impressed so far.

And I know there's a way to do this, I just don't know how it's done.

Best,
Steve (ruby noob)

mydata << /(...)(.)(....)/.match(x).captures.join(',') << "\n"

Gary Wright

···

On Mar 30, 2007, at 1:25 AM, Harry wrote:

x =~ /(...)(.)(....)/ #grabs 3 characters, then 1, then 4
mydata << $1 + "," + $2 + "," + $3 + "\n"

I like the way your mind works. Nice!

Regards,
Andy Stewart

···

On 30 Mar 2007, at 12:59, James Edward Gray II wrote:

Here's another option:

>> xs, ys, zs = "XXXYYYYYZZ".unpack("A3A5A2")
=> ["XXX", "YYYYY", "ZZ"]
>> xs
=> "XXX"
>> ys
=> "YYYYY"
>> zs
=> "ZZ"

'captures' is new to me.
Thanks.

Harry

···

On 3/30/07, Gary Wright <gwtmp01@mac.com> wrote:

On Mar 30, 2007, at 1:25 AM, Harry wrote:
> x =~ /(...)(.)(....)/ #grabs 3 characters, then 1, then 4
> mydata << $1 + "," + $2 + "," + $3 + "\n"

mydata << /(...)(.)(....)/.match(x).captures.join(',') << "\n"

Gary Wright

--

Japanese Ruby List Subjects in English

If your fixed-width fields are more than, say, five characters wide, your regexp would be clearer by specifying the number of occurrences of each group rather than writing a . for each occurrence. I.e.:

   mydata << /(.{3})(.)(.{4})/.match(x).captures.join(',') << "\n"

Regards,
Andy Stewart

···

On 30 Mar 2007, at 06:49, Gary Wright wrote:

On Mar 30, 2007, at 1:25 AM, Harry wrote:

x =~ /(...)(.)(....)/ #grabs 3 characters, then 1, then 4
mydata << $1 + "," + $2 + "," + $3 + "\n"

mydata << /(...)(.)(....)/.match(x).captures.join(',') << "\n"

Thanks James. Thanks Andy.

So I've cleaned up the regular expression, and I like the simplicity of the
upack message.

But I think that the format Harry and Gary suggested clearly represents the
pattern I'm matching.

Since the pattern may change, and/or we may get other data dumps from this
supplier in the future, I think keeping it represented clearly in one place
will help with maintenance.

Here's what I wound up using:

line = ""
record = ""
oldLog = File.open("filename.txt")
newLog = File.new("filename_formatted.csv", "w")
hdr = "field1, field2, field3,...."
newLog << hdr
arr = oldLog.readlines
arr.each do |line|
record <<
/(.{8})(.{6})(.{15})(.{3})(.{30})(.{4})(.{15})(.{1})(.{12})(.{9})/.match(line).captures.join(',')
<< "\n"
newLogg << record
record = ""
end

···

+

LOAD DATA INFILE '*/filename.csv*'
INTO TABLE *data_table*
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(field1, field2, field3,...);

And now I can happily munge my monthly data dump to my boss's hearts
content.

Thanks again guys.

I officially love Ruby.

Steve

On 3/30/07, Andrew Stewart <boss@airbladesoftware.com> wrote:

On 30 Mar 2007, at 12:59, James Edward Gray II wrote:
> Here's another option:
>
> >> xs, ys, zs = "XXXYYYYYZZ".unpack("A3A5A2")
> => ["XXX", "YYYYY", "ZZ"]
> >> xs
> => "XXX"
> >> ys
> => "YYYYY"
> >> zs
> => "ZZ"

I like the way your mind works. Nice!

Regards,
Andy Stewart

Nice. You just made my month.

Thank you Harry. Thank you Gary.

Freekin' elegant.

Steve

···

On 3/29/07, Harry <ruby.hardware@gmail.com> wrote:

On 3/30/07, Gary Wright <gwtmp01@mac.com> wrote:
>
> On Mar 30, 2007, at 1:25 AM, Harry wrote:
> > x =~ /(...)(.)(....)/ #grabs 3 characters, then 1, then 4
> > mydata << $1 + "," + $2 + "," + $3 + "\n"
>
> mydata << /(...)(.)(....)/.match(x).captures.join(',') << "\n"
>
> Gary Wright
>

'captures' is new to me.
Thanks.

Harry

--

http://www.kakueki.com/ruby/list.html
Japanese Ruby List Subjects in English

Just FYI, when speed matters it may be worth using unpack:

#!/usr/bin/env ruby -w

require "benchmark"

TESTS = 100_000
LINE = "XYZ" * 100
Benchmark.bmbm do |results|
   results.report("regex:") do
     TESTS.times do
       /(.{8})(.{6})(.{15})(.{3})(.{30})(.{4})(.{15})(.{1})(.{12})(.{9})/.match(LINE).captures.join(",")
     end
   end
   results.report("unpack:") do
     TESTS.times do
       LINE.unpack("A8A6A15A3A30A4A15A1A12A9").join(",")
     end
   end
end
# >> Rehearsal -------------------------------------------
# >> regex: 2.060000 0.000000 2.060000 ( 2.067150)
# >> unpack: 0.760000 0.000000 0.760000 ( 0.762482)
# >> ---------------------------------- total: 2.820000sec
# >>
# >> user system total real
# >> regex: 2.040000 0.000000 2.040000 ( 2.046938)
# >> unpack: 0.770000 0.000000 0.770000 ( 0.763620)

__END__

James Edward Gray II

···

On Mar 30, 2007, at 2:31 PM, Stephen Smith wrote:

But I think that the format Harry and Gary suggested clearly represents the pattern I'm matching.

How about making it so you don't have to type that pattern next time:

def columns(*widths)
   Regexp.new(widths.map {|count| "(.{#{count}})" }.join)
end
pattern = columns(6,15,3,30,4,15,1,12,9)

You also can get rid of the intermediate strings and the potentially huge
internal array (from slurping up your oldLog into an array):

oldLog.each { |line|
   newLog.print pattern.match(line).captures.join(',')
}

···

On Mar 30, 2007, at 3:31 PM, Stephen Smith wrote:

line = ""
record = ""
oldLog = File.open("filename.txt")
newLog = File.new("filename_formatted.csv", "w")
hdr = "field1, field2, field3,...."
newLog << hdr
arr = oldLog.readlines
arr.each do |line|
record <<
/(.{8})(.{6})(.{15})(.{3})(.{30})(.{4})(.{15})(.{1})(.{12})(.{9})/.match(line).captures.join(',')
<< "\n"
newLogg << record
record = ""
end

Which could be a constant at the top of the source:

  LINE_PATTERN = /^(...)(..)(....)/

Then later on in your code you can say:

  record << LINE_PATTERN.match(line).captures.join(',') << "\n"

Regards,

Brian.

···

On Sat, Mar 31, 2007 at 04:31:56AM +0900, Stephen Smith wrote:

So I've cleaned up the regular expression, and I like the simplicity of the
upack message.

But I think that the format Harry and Gary suggested clearly represents the
pattern I'm matching.

Since the pattern may change, and/or we may get other data dumps from this
supplier in the future, I think keeping it represented clearly in one place
will help with maintenance.

# Nice. You just made my month.
# Thank you Harry. Thank you Gary.
# Freekin' elegant.
# Steve

···

From: Stephen Smith [mailto:4fires@gmail.com] :
#
# On 3/29/07, Harry <ruby.hardware@gmail.com> wrote:
# > On 3/30/07, Gary Wright <gwtmp01@mac.com> wrote:
# > > On Mar 30, 2007, at 1:25 AM, Harry wrote:
# > > > x =~ /(...)(.)(....)/ #grabs 3 characters, then 1, then 4
# > > > mydata << $1 + "," + $2 + "," + $3 + "\n"
# > >
# > > mydata << /(...)(.)(....)/.match(x).captures.join(',') << "\n"

ruby is fun, elegant, and powerful.
some more samples.

irb(main):030:0> x
=> "01234567890"
irb(main):031:0> x[/(..)...(....)/]
=> "012345678"
irb(main):032:0> $1
=> "01"
irb(main):033:0> $2
=> "5678"
irb(main):034:0> x[0..1]
=> "01"
irb(main):035:0> x[5..8]
=> "5678"
irb(main):036:0> field1=0..1
=> 0..1
irb(main):037:0> field2=5..8
=> 5..8
irb(main):038:0> x[field1]
=> "01"
irb(main):039:0> x[field2]
=> "5678"
irb(main):040:0>

kind regards -botp

Sweet!

Thanks James.

This will probably help quite a bit as our user base (fingers crossed)
grows. :slight_smile:

Steve

···

On 3/30/07, James Edward Gray II <james@grayproductions.net> wrote:

On Mar 30, 2007, at 2:31 PM, Stephen Smith wrote:

> But I think that the format Harry and Gary suggested clearly
> represents the pattern I'm matching.

Just FYI, when speed matters it may be worth using unpack:

#!/usr/bin/env ruby -w

require "benchmark"

TESTS = 100_000
LINE = "XYZ" * 100
Benchmark.bmbm do |results|
   results.report("regex:") do
     TESTS.times do
       /(.{8})(.{6})(.{15})(.{3})(.{30})(.{4})(.{15})(.{1})(.{12})(.
{9})/.match(LINE).captures.join(",")
     end
   end
   results.report("unpack:") do
     TESTS.times do
       LINE.unpack("A8A6A15A3A30A4A15A1A12A9").join(",")
     end
   end
end
# >> Rehearsal -------------------------------------------
# >> regex: 2.060000 0.000000 2.060000 ( 2.067150)
# >> unpack: 0.760000 0.000000 0.760000 ( 0.762482)
# >> ---------------------------------- total: 2.820000sec
# >>
# >> user system total real
# >> regex: 2.040000 0.000000 2.040000 ( 2.046938)
# >> unpack: 0.770000 0.000000 0.770000 ( 0.763620)

__END__

James Edward Gray II

Gary,

Thanks for the tip. I love the native pattern matching. I had no idea
regular expressions could be such powerful objects. And I can't think of a
good reason not to use Ruby around the house from now on. I'm going to sit
down with a book tonight.

And I killed the internal array, btw. I didn't know Ruby iterates through
files as easily as arrays.

This is like going back to school.

Grinning,
Steve

···

On 3/30/07, Gary Wright <gwtmp01@mac.com> wrote:

On Mar 30, 2007, at 3:31 PM, Stephen Smith wrote:
> line = ""
> record = ""
> oldLog = File.open("filename.txt")
> newLog = File.new("filename_formatted.csv", "w")
> hdr = "field1, field2, field3,...."
> newLog << hdr
> arr = oldLog.readlines
> arr.each do |line|
> record <<
> /(.{8})(.{6})(.{15})(.{3})(.{30})(.{4})(.{15})(.{1})(.{12})(.
> {9})/.match(line).captures.join(',')
> << "\n"
> newLogg << record
> record = ""
> end

How about making it so you don't have to type that pattern next time:

def columns(*widths)
   Regexp.new(widths.map {|count| "(.{#{count}})" }.join)
end
pattern = columns(6,15,3,30,4,15,1,12,9)

You also can get rid of the intermediate strings and the potentially
huge
internal array (from slurping up your oldLog into an array):

oldLog.each { |line|
   newLog.print pattern.match(line).captures.join(',')
}

your LINE is 300 characters but you are only unpacking 103 ???

···

On Mar 30, 3:47 pm, James Edward Gray II <j...@grayproductions.net> wrote:

On Mar 30, 2007, at 2:31 PM, Stephen Smith wrote:

> But I think that the format Harry and Gary suggested clearly
> represents the pattern I'm matching.

Just FYI, when speed matters it may be worth using unpack:

#!/usr/bin/env ruby -w

require "benchmark"

TESTS = 100_000
LINE = "XYZ" * 100
Benchmark.bmbm do |results|
   results.report("regex:") do
     TESTS.times do
       /(.{8})(.{6})(.{15})(.{3})(.{30})(.{4})(.{15})(.{1})(.{12})(.
{9})/.match(LINE).captures.join(",")
     end
   end
   results.report("unpack:") do
     TESTS.times do
       LINE.unpack("A8A6A15A3A30A4A15A1A12A9").join(",")
     end
   end
end
# >> Rehearsal -------------------------------------------
# >> regex: 2.060000 0.000000 2.060000 ( 2.067150)
# >> unpack: 0.760000 0.000000 0.760000 ( 0.762482)
# >> ---------------------------------- total: 2.820000sec
# >>
# >> user system total real
# >> regex: 2.040000 0.000000 2.040000 ( 2.046938)
# >> unpack: 0.770000 0.000000 0.770000 ( 0.763620)

__END__

James Edward Gray II

Hi there,
I am new to ruby and trying to create a flat file with fixed column
length. i have a header,footer and records that i want in the file.

any ideas?

Cheers!

Gary Wright wrote:

···

On Mar 30, 2007, at 3:31 PM, Stephen Smith wrote:

{9})/.match(line).captures.join(',')
<< "\n"
newLogg << record
record = ""
end

How about making it so you don't have to type that pattern next time:

def columns(*widths)
   Regexp.new(widths.map {|count| "(.{#{count}})" }.join)
end
pattern = columns(6,15,3,30,4,15,1,12,9)

You also can get rid of the intermediate strings and the potentially
huge
internal array (from slurping up your oldLog into an array):

oldLog.each { |line|
   newLog.print pattern.match(line).captures.join(',')
}

--
Posted via http://www.ruby-forum.com/\.

Yeah, I was too lazy to count it, so I just picked something big enough. :wink:

James Edward Gray II

···

On Mar 30, 2007, at 4:10 PM, bbiker wrote:

your LINE is 300 characters but you are only unpacking 103 ???

sprintf allows you specify formatting that can be used to print fixed
width.
An example from the docs (ruby-doc.org) is:
sprintf("%08b '%4s'", 123, 123) #=> "01111011 ' 123'"
Note how the %4s puts 3 characters in a space 4 characters wide.

cheers

···

On Mar 6, 9:38 pm, Ruby Kk <kumar...@gmail.com> wrote:

Hi there,
I am new to ruby and trying to create a flat file with fixed column
length. i have a header,footer and records that i want in the file.

any ideas?

Cheers!

Gary Wright wrote:
> On Mar 30, 2007, at 3:31 PM, Stephen Smith wrote:
>> {9})/.match(line).captures.join(',')
>> << "\n"
>> newLogg << record
>> record = ""
>> end

> How about making it so you don't have to type that pattern next time:

> def columns(*widths)
> Regexp.new(widths.map {|count| "(.{#{count}})" }.join)
> end
> pattern = columns(6,15,3,30,4,15,1,12,9)

> You also can get rid of the intermediate strings and the potentially
> huge
> internal array (from slurping up your oldLog into an array):

> oldLog.each { |line|
> newLog.print pattern.match(line).captures.join(',')
> }

--
Posted viahttp://www.ruby-forum.com/.