Parsing non-delimited text file

This post is a bit lengthy. I am not expecting fully working code that
solves the problem, just some pointers on how to approach the parsing in
this example.

I have one or more large text files that I need to parse into a more
usable form. The contents are output from a legacy ERP system.

The source files look like the example file I have attached. The report
lists shipped amounts per week range, by customer, by customer department,
by product, and by weekday. Page breaks are hard-coded, so the customer and
week info can appear several times for a given week if there are a lot of
products.

I am hoping to get the data into a form where each line includes all the
relevant information for a given product on a given day. The columns
could be, for example:

Year, Week, Weekday, CustomerID, Customer, Sub-custID, Sub-cust,
ProductID, Product, Quantity
->
2008, 39, MON, 97, CUSTOMER A, 999, DEPARTMENT A, 123, PRODUCT A, 150
2008, 39, TUE, 97, CUSTOMER A, 999, DEPARTMENT A, 123, PRODUCT A, 50

I am pretty much a noob with Ruby, so bear with me, but I had this idea
of how it could work.

First, the file is read into an array and the array is fed to a method
that does the work. Finished rows are saved in an array of arrays, and
that array is later written to a file.

Check whether the row includes 'Customer' and compare it to the current
customer to see if the customer has changed. Check for 'Sub-cust' to see if
the department has changed, and save these in variables. Do the same
checks for week and year.

If a row starts with a number, it is a product heading row, and that row
includes the quantity for Monday (this is consistent). The six rows after
that start with TUE-SUN, except when a page break interrupts the flow. The
quantity is always the first number after the day.

I have managed to extract the customer ID, and I think the other heading
info won't differ much from that, but I got stuck when I tried to
determine whether a string starts with a number, to see if that row
contains product info. I tried start_with?(/\d+/)[0], but that didn't work.

def readfile(file)
  IO.readlines(file)
end

def getcust(arr)
  custs = []
  arr.each do |fl|
    if fl.include? 'Customer'
      custs.push fl.scan(/\d+/)[0]
    end
  end
  custs
end

lines = readfile 'source_example'
customers = getcust lines
puts customers

All comments that move me forward are appreciated.

Attachments:
http://www.ruby-forum.com/attachment/2757/source_example

--
Posted via http://www.ruby-forum.com/.

First, the file is read into an array and the array is fed to a method
that does the work.

I'm not sure how big the actual data is, but if it's very large you may want to consider not using readlines(). You "slurp" all of this data into memory, but then walk through it one line at a time. Instead, you could just read it a line at a time, using foreach() instead of readlines().
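To make the difference concrete, here is a minimal sketch of the line-at-a-time approach. The helper name customer_ids and the sample text are made up for illustration; the real report layout may differ.

```ruby
require "tempfile"

# Hypothetical helper: collects the first number on every line that
# mentions 'Customer', reading one line at a time with File.foreach
# instead of loading the whole file with readlines.
def customer_ids(path)
  ids = []
  File.foreach(path) do |line|
    ids << line[/\d+/] if line.include?("Customer")
  end
  ids
end

# Made-up sample data, not taken from the attached file.
SAMPLE = "Customer: 97 CUSTOMER A\nsome other line 12\nCustomer: 98 CUSTOMER B\n"

Tempfile.create("report") do |f|
  f.write(SAMPLE)
  f.flush
  p customer_ids(f.path)   # ["97", "98"]
end
```

Only the current line is held in memory at any moment, so the size of the source file stops mattering much.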

I have managed to extract the customer ID, and I think the other heading
info won't differ much from that, but I got stuck when I tried to
determine whether a string starts with a number, to see if that row
contains product info. I tried start_with?(/\d+/)[0], but that didn't work.

To check if a String begins with a number you can use:

   if str =~ /\A\d/
     # handle string that starts with a number here
   end
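
As for why the original attempt failed: start_with? returns true or false, so indexing the result with [0] cannot work, and older Ruby versions only accept string prefixes there (regexp arguments are accepted from Ruby 2.5 on). A small sketch of the alternatives follows; the helper name starts_with_digit? and the sample strings are made up.

```ruby
# Hypothetical helper wrapping the =~ check so it returns a plain boolean.
def starts_with_digit?(str)
  !!(str =~ /\A\d/)
end

puts starts_with_digit?("123 PRODUCT A 150")   # true
puts starts_with_digit?("TUE 50")              # false

# Leading whitespace defeats \A, so strip first if rows are indented.
puts starts_with_digit?("  123 PRODUCT A")        # false
puts starts_with_digit?("  123 PRODUCT A".strip)  # true

# Newer Rubies offer these as well:
puts "123 PRODUCT A".match?(/\A\d/)     # true (Ruby 2.4+)
puts "123 PRODUCT A".start_with?(/\d/)  # true (Ruby 2.5+)
```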

Hope that helps.

James Edward Gray II

On Oct 2, 2008, at 7:40 AM, Panu Kinnari wrote:

I'm not sure how big the actual data is, but if it's very large you
may want to consider not using readlines().

It is large; the example I provided is an excerpt from a 450-page, 2 MB
text file.

You "slurp" all of this data into memory, but then walk through it one
line at a time. Instead, you could just read it a line at a time, using
foreach() instead of readlines().

I will look into this.

To check if a String begins with a number you can use:

   if str =~ /\A\d/
     # handle string that starts with a number here
   end

Thanks, I'll try this

Hope that helps.

It sure does.

I was wondering, would it be feasible to construct a regexp for each line
type? They do follow a certain convention to some extent, after all.
At least the lines that matter do.

Thanks again.

I think that's a fine strategy, yes.
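
A rough sketch of what that could look like is below. The four patterns and the classify helper are assumptions made up for illustration; the layouts they expect (a "Customer:" label, the MON label on the product row, and so on) are guesses and would need adjusting to the real report.

```ruby
# Assumed line layouts, one regexp per line type (all hypothetical).
CUSTOMER = /\ACustomer\s*:\s*(\d+)\s+(.+?)\s*\z/
SUBCUST  = /\ASub-cust\s*:\s*(\d+)\s+(.+?)\s*\z/
PRODUCT  = /\A(\d+)\s+(.+?)\s+MON\s+(\d+)/
WEEKDAY  = /\A(TUE|WED|THU|FRI|SAT|SUN)\s+(\d+)/

# Returns a tag for the line type plus the captured fields.
def classify(line)
  text = line.strip
  case text
  when CUSTOMER then [:customer, $1, $2]
  when SUBCUST  then [:subcust, $1, $2]
  when PRODUCT  then [:product, $1, $2, $3]
  when WEEKDAY  then [:weekday, $1, $2]
  else [:other]
  end
end

p classify("Customer: 97 CUSTOMER A\n")
p classify("  123 PRODUCT A MON 150 0 0\n")
p classify("  TUE 50 0 0\n")
p classify("---- page 2 ----\n")
```

Anything that matches none of the patterns (page headers, separators, blank lines) falls through to :other and can simply be skipped.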

James Edward Gray II

On Oct 2, 2008, at 11:07 AM, Panu Kinnari wrote:

I was wondering, would it be feasible to construct a regexp for each line
type? They do follow a certain convention to some extent, after all.
At least the lines that matter do.

I think that's a fine strategy, yes.

Here is my progress so far. I am now able to extract all relevant info
from lines with product id, name and quantity.

I didn't look into foreach() yet, I feel that I need to get these
regexps in order first.

def getcust(arr)
  custs = []
  arr.each do |fl|
    if fl.include? 'Customer'
      custs.push fl.scan(/\d+/)[0]
    elsif fl.strip =~ /\A\d/
      # remove whitespace
      product = fl.strip
      # extract productid and add it to array
      productid = product[/\A\d+/]
      custs.push productid
      # extract beginning words and remove last one (doesn't belong to
      # product name) and add to array
      productname = product.scan(/[A-Za-z]+/)
      productname.pop
      custs.push productname.join(" ")
      # extract numbers and select third from last (same number as first
      # one after product name, but this felt easier) and add to array
      qty = product.scan(/\d+/)
      custs.push qty[qty.length-3]
    end
  end
  custs
end

It is really easy to re-factor from array based to file based.
Change the first three lines of your code to this...

def getcust(file_name)
  custs = []
  IO.foreach(file_name) do |fl|

Then update your calling code to pass the method a file name
rather than an array. The rest of your method can be used as is.

I didn't look into foreach() yet, I feel that I need to get these
regexps in order first.

def getcust(arr)
  custs = []
  arr.each do |fl|

i3w wrote:

It is really easy to re-factor from array based to file based.
Change the first three lines of your code to this...

Yes, that's what I did. I finished the next iteration of my code
yesterday. Now it basically does everything I need and writes the new
lines to a text file. What I'd like to do next is convert the
year-week-weekday combo to a proper timestamp, but I haven't looked into
it yet.
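
One possible way to do that conversion is Date.commercial from the standard library, sketched below. This assumes the ERP system numbers its weeks the ISO 8601 way (weeks start on Monday), which is a guess; the helper name delivery_date and the WEEKDAYS table are made up for illustration.

```ruby
require "date"

# ISO weekday order: MON is day 1, SUN is day 7.
WEEKDAYS = %w[MON TUE WED THU FRI SAT SUN].freeze

# Hypothetical helper: turns year, week and weekday strings into a Date.
# Assumes ISO 8601 week numbering, which may not match the ERP system.
def delivery_date(year, week, weekday)
  index = WEEKDAYS.index(weekday.upcase)
  raise ArgumentError, "unknown weekday: #{weekday}" if index.nil?
  Date.commercial(year.to_i, week.to_i, index + 1)
end

puts delivery_date("2008", "39", "MON")   # 2008-09-22
puts delivery_date("2008", "39", "TUE")   # 2008-09-23
```

If the legacy system counts weeks differently (for example starting on Sunday, or starting week 1 on January 1), the result will be off by a few days, so it is worth checking a handful of known dates first.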

Anyway, comments are again appreciated.

# Write orderline to file
def writefile(file, *linedata)
  linedata.each do |line|
    file << line.join(",") + "\n"
  end
end

# Read file line-by-line and extract relevant information
def readfile(file, outputfile)
  out = File.new(outputfile, "w+")
  info = []
  wline = ['year', 'week', 'day', 'customerid', 'customer', 'subcustid',
           'subcust', 'prodid', 'prod', 'qty']
  IO.foreach(file) { |line|
    # strip up front; strip! returns nil when there is nothing to strip,
    # so it is not safe to use inside a condition
    line = line.strip
    numbers = line.scan(/\d+/)
    if line =~ /Customer/
      wline[3] = line.split(":")[1].scan(/\d+/)
      wline[4] = line.split(":")[1].scan(/[a-zA-Z]+/).join(" ")
    elsif line =~ /Sub-cust/
      wline[5] = line.split(":")[1].scan(/\d+/)
      wline[6] = line.split(":")[1].scan(/[a-zA-Z]+/).join(" ")
    elsif line =~ /Year/
      wline[0] = numbers[0]
    elsif line =~ /Week/
      wline[1] = numbers[0]
    elsif line =~ /\A\d/
      wline[7] = line.scan(/\A\d+/)
      temp = line.scan(/[A-Za-z]+/)
      wline[2] = temp.pop #later used for delivery day
      wline[8] = temp.join(" ")
      wline[9] = numbers[numbers.length-13]
      writefile(out, wline)
    elsif line =~ /MON|TUE|WED|THU|FRI|SAT|SUN/
      wline[2] = line.scan(/[A-Za-z]+/)
      wline[9] = numbers[numbers.length-13]
      writefile(out, wline)
    end
  }
  out.close
end

readfile('source_example', 'output.txt')
