[SUMMARY] To Excel (#17)

Probably the first question to ask when looking at this quiz is, what makes this
report so unfriendly, not just to Excel, but to work with in general?

1. There's a lot of junk we don't need. Headers, footers, dashed lines. All
that needs to go, so we can focus on the data.

2. The data itself isn't pure. Have a look at these two fragments:

  50,589 358 %%%% 56,652 430 %%%%
  924,553 704,028 31 1,612K 1,226K

Those aren't numbers in there. Yuck.

3. This report is ALMOST a fixed width file, which would help a lot, but it
fails in a few areas:

  TEE_X_101 26-Mar-2002 15:26 1: GE

···

---------------
                                           Current LastYr
  Part Code Description Period Period
  --------------- ------------------------ ------- -------
    Salesperson 00 NOBODY
      Customer 1036 COMPANY 501
        SA Sort Code 1.43 WATER DOLLS
  78-143FS 17/8# SS MODEL 10 0
                                           ------- -------
        SA Sort Code subtotals 10 0
  
        SA Sort Code 3.3 REMOTE CONTROL CARS
  74270 Model 35357-DBL 0 0
  921137-73 LARGE 19 X 18 X 14 30 0
                                           ------- -------
        SA Sort Code subtotals 30 0
                                           ------- -------
      Customer subtotals 40 0
  
      Customer 14457 COMPANY 518
        SA Sort Code 11.5 KITCHEN SETS
  943437 19/8# SS MODEL 0 0
                                           ------- -------
        SA Sort Code subtotals 0 0
                                           ------- -------
      Customer subtotals 0 0
  
      Customer 1824 COMPANY 529
        SA Sort Code 19.4 SLIDES
  8394 2.5 OZ 0 20
  8341 .21 SIZE PLASTIC, NO BATT 0 10
  18363 .29 SIZE PLASTIC, NO BATT 0 24
                                           ------- -------
        SA Sort Code subtotals 0 54
                                           ------- -------
      Customer subtotals 0 54

Use the dashed lines as a guide and you can see how the sub-headers
(Salesperson, Customer, and SA Sort Code) don't fit into columns. Subtotal
lines have the same problem. Also notice that the Description content is
allowed to overflow it's column by one character. That's a mess.

4. This one is less obvious, but equally real. The data isn't very useful in
this format. Christian Neukirchen did a nice job of recognizing this and taking
steps to correct it.

Now that we've identified the issues, let's work through them. When I began
playing with a foreign report like this, I do just that: Play around a bit. My
first script may not end up being the one I keep, but I need to get familiar
with the data. I'll show you what works for me, step by step.

Whenever I think, "I need to translate this into..." I mentally switch that to:

  while line = ARGF.gets # traditional Unix filter
  
    print line
  end

It may not be much, but it's a start. Now, let's tackle problem number 1 and
jettison the junk. The majority of it is those silly page headers. Those
should be easy enough to drop. We stop printing when we enter a header and
start up again when we leave.

I think in patterns, so I would then start looking for a way to spot that I'm
entering a header. I'm leery to anchor against arbitrary text, if I don't have
too, so I would like to find something better. If you're familiar with these
page print style reports, you might have a good guess of what comes before that
header. Even if you're not, it pays to look at a foreign document with the
"invisible characters" turned on, if your editor can do that. Here's what a
header really looks like:

        SA Sort Code subtotals 20 0
  
  \f <-- Look what's hiding here! A form-feed character.
  TEE_X_101 26-Mar-2002 15:26 1: GE
  
                                           ---------------
                                           Current LastYr
  Part Code Description Period Period
  --------------- ------------------------ ------- -------
        SA Sort Code 5.207 BUILDING BLOCKS

That makes spotting the headers easy. The very first one doesn't have the
form-feed, but that's not hard to work around.

Once we're in a header, the last line is just a solid line of dashes and spaces.
  Just be careful to distinguish it from the other dashed line in the header and
you can build a solution:

  header = true # start in header
  while line = ARGF.gets
    if header # we're inside the header
      header = false if line =~ /^-[- ]+-$/ # watch for the end
    else # we're not in the header
      if line =~ /\f/ # watch for header beginning
        header = true
        next
      end
      
      print line
    end
  end

If you run that on the report, you'll see that we did indeed remove all the page
headers. I showed (and even submitted) that version because it's easy to break
down and understand, but if all that code bothers you, we can shorten it up:

  while line = ARGF.gets
    # the next line skips all headers
    next if ($. == 1 or line =~ /\f/) .. line =~ /^-[- ]+-$/
      
    print line
  end

That does exactly the same thing, making use of Ruby's versatile Ranges. The
range acts as a toggle here. When the first condition becomes true, the Range
will evaluate to true (skipping lines with next()) until the second condition
becomes true. Then the process begins again. My first condition searches for
line one of the report or a form-feed character. The second condition looks for
the solid line of dashes. Again, the code works the same, so you can use
whatever you are comfortable with.

We're not done dropping junk yet! Let's toss out dashed lines and the report
footer. Both are trivial:

  while line = ARGF.gets
    next if ($. == 1 or line =~ /\f/) .. line =~ /^-[- ]+-$/
    next if line =~ /--$/ # skip dashed lines
      
    print line
  
    break if line =~ /^Report Totals/ # stop with "Report Totals"
  end

What's the report look like now?

    Salesperson 00 NOBODY
      Customer 1036 COMPANY 501
        SA Sort Code 1.43 WATER DOLLS
  78-143FS 17/8# SS MODEL 10 0
        SA Sort Code subtotals 10 0
  
        SA Sort Code 3.3 REMOTE CONTROL CARS
  74270 Model 35357-DBL 0 0
  921137-73 LARGE 19 X 18 X 14 30 0
        SA Sort Code subtotals 30 0
      Customer subtotals 40 0
  
      Customer 14457 COMPANY 518
        SA Sort Code 11.5 KITCHEN SETS
  943437 19/8# SS MODEL 0 0
        SA Sort Code subtotals 0 0
      Customer subtotals 0 0

We're making progress. It's almost readable.

There are four types of lines left in this report that we need to deal with.
Yes, I'm sure it's four and not three. Sub-headers, product sales, subtotals
and blank lines. At this point, I would try to build selectors to handle each
of them:

  while line = ARGF.gets
    next if ($. == 1 or line =~ /\f/) .. line =~ /^-[- ]+-$/
    next if line =~ /--$/
    
    # handle blank lines...
    if line !~ /\S/
      print " BLANK: ", line
    
    # handle subtotals
    elsif line =~ /^.+?totals(?:\s+(?:-?[\d,]+K?|%+)){12}\s*$/i
      print " TOTAL: ", line
    
    # hande product sales
    elsif line =~ /^\S/
      print "PRODUCT: ", line
    
    # handle subheaders
    else
      print " HEADER: ", line
    end
  
    break if line =~ /^Report Totals/
  end

Uh oh, scary Regexp in there. Let's break it down:

  ^ # the beginning of the line
      .+? # skip some of the beginning looking for...
      totals # the word totals (to match subtotals or Report Totals with /i)
      (?: # grouping
          \s+ # some whitespace
          (?: # start group
              -? # an optional minus
              [\d,]+ # a comma separated digit sequence
              K? # an optional trailing K
              > # or ...
              %+ # a run of % signs
          ) # end group
      ){12} # end group -- find exactly 12 of those
      \s* # optional trailing whitespace
  $ # the end of the line

In short, it's just looking for the word "totals" followed by 12 things that
could be numbers in this report. You really don't even need an expression that
complex, in this case, but since the headers can hold arbitrary text (company
names) I want to make sure I'm matching what I'm looking for. Ruby will even
let you drop that commented Regexp in your code with /x if you like.

Notice that I'm being careful to use generic patterns. For example, I never
match Salesperson, Company or SA Sort Code. What if the company happens to have
another report that includes Brokers? (Trick question. I know it to be true in
this case!) These patterns will treat that like any other sub-header and just
work as expected.

So does that successfully locate everything?

   HEADER: Salesperson 00 NOBODY
   HEADER: Customer 1036 COMPANY 501
   HEADER: SA Sort Code 1.43 WATER DOLLS
  PRODUCT: 78-143FS 17/8# SS MODEL 10 0
    TOTAL: SA Sort Code subtotals 10 0
    BLANK:
   HEADER: SA Sort Code 3.3 REMOTE CONTROL CARS
  PRODUCT: 74270 Model 35357-DBL 0 0
  PRODUCT: 921137-73 LARGE 19 X 18 X 14 30 0
    TOTAL: SA Sort Code subtotals 30 0
    TOTAL: Customer subtotals 40 0
    BLANK:
   HEADER: Customer 14457 COMPANY 518
   HEADER: SA Sort Code 11.5 KITCHEN SETS
  PRODUCT: 943437 19/8# SS MODEL 0 0
    TOTAL: SA Sort Code subtotals 0 0
    TOTAL: Customer subtotals 0 0

Sure does. Now it's easy to just transform that data into CSV, which will fix
our not-quite-fixed-width-columns problem:

  require "csv"
  
  while line = ARGF.gets
    next if ($. == 1 or line =~ /\f/) .. line =~ /^-[- ]+-$/
    next if line =~ /--$/
    
    if line !~ /\S/
      puts CSV.generate_line([""])
  
    elsif line =~ /^(.+?totals)((?:\s+(?:-?[\d,]+K?|%+)){12})\s*$/i
      puts CSV.generate_line(["", $1.lstrip, *$2.split(" ")])
  
    elsif line =~ /^(\S+)\s+(.+?)((?:\s+(?:-?[\d,]+K?|%+)){12})\s*$/
      puts CSV.generate_line([$1, $2, *$3.split(" ")])
  
    else
      puts CSV.generate_line(["", line.strip])
    end
  
    break if line =~ /^Report Totals/
  end

I snuck another nasty Regexp in there didn't I? Actually, if you look closely,
you'll see that it's nearly identical to the other one, save that it also
catches product descriptions. It's just there to help me locate all the parts
of the line. After capturing the parts, I use split() to break them up, strip()
to clean them, and let Ruby's CSV handle the rest.

If you open the CSV output from this in Excel, you'll see that I've just shifted
sub-headers and subtotals into the second column. That pretty much maintains
the familiar report format, while eliminating the column breakup issues.

That's really a lot of the clean up work done already. The nagging problem is
that Excel doesn't consider 1,612K or %%%% numbers. We can fix the first one
easy enough, but what the heck is causing the %%%%s?

More detective work is needed.

Those %%%% entries only happen in the small "Pct Var" column, which is four
characters wide. "Var" huh? Variance maybe? If you look around a bit, the
formula is easy to calculate from a row like this:

                                           Current LastYr Pct
  Part Code Description Period Period Var
  --------------- ------------------------ ------- ------- ----
  
  613433 .22 SIZE PLASTIC, NO BATT 65 1 6400

That looks like (65 - 1) * 100. That doesn't make much sense though, because
we're probably talking about a percentage. More likely is (65 - 1) / 1 * 100.
You can test that on other rows to prove that it is indeed correct.

Okay, so why are we getting %%%%s?

  Current LastYr Pct
      YTD YTD Var
  ------- ------- ----
  
      125 1 %%%%

(125 - 1) / 1 * 100 = 12400. That's five characters and we're only allowed
four. It's a column overflow problem. Our CSV approach has no such limitation,
so we can recalculate those numbers and add them to the output.

Here's a routine to clean up the numbers:

  def clean( numbers )
    # turn them back into numbers...
    numbers.map! do |n|
      n.gsub!(",", "") # drop commas
      if n.sub!(/K$/, "") # when there's a K...
        n.to_i * 1000 # multiple by 1,000
      elsif n !~ /%/
        n.to_i # numify
      else
        n
      end
    end
    
    # recalculate %%%% columns...
    numbers.each_with_index do |n, i|
      if n.to_s =~ /%/
        numbers[i] = ( (numbers[i - 2] - numbers[i - 1]) /
                 numbers[i - 1].to_f * 100 ).to_i
      end
    end
    
    numbers # return results
  end

To use that, we just call it twice:

  while line = ARGF.gets
    next if ($. == 1 or line =~ /\f/) .. line =~ /^-[- ]+-$/
    next if line =~ /--$/
    
    if line !~ /\S/
      puts CSV.generate_line([""])
  
    elsif line =~ /^(.+?totals)((?:\s+(?:-?[\d,]+K?|%+)){12})\s*$/i
      # first call...
      puts CSV.generate_line(["", $1.lstrip, *clean($2.split(" "))])
  
    elsif line =~ /^(\S+)\s+(.+?)((?:\s+(?:-?[\d,]+K?|%+)){12})\s*$/
      # second call...
      puts CSV.generate_line([$1, $2, *clean($3.split(" "))])
  
    else
      puts CSV.generate_line(["", line.strip])
    end
  
    break if line =~ /^Report Totals/
  end

The rest of my solution (not shown) was boring header parsing, to print out the
column labels and the report "Period". You can follow the link in the sidebar
if you want to examine that.

At this point, we've solved problems 1, 2 and 3 from the beginning of this
summary. I want to talk a little about problem 4, then I promise to end this
ridiculously long message.

Christian Neukirchen's short solution (not shown) does away with sub-header and
subtotal lines. That filter tacks on the Salesperson, Customer and SA Sort Code
to each product entry. This makes all the needed information available on every
line. This is a big win for using Excel's filters, making the report a lot
easier to examine and change.

When I did this project for work, I did the same thing. However, I put the
metadata at the back of the line and left in the sub-headers, subtotals and
blanks. I also added a column to identify each type of line: Header, total,
data or blank. This gave me all the same benefits as Christian's code has,
while keeping the familiar format. I could filter down to the "data" lines to
have nearly identical results. This feature turned out to be a big hit,
drastically increasing company productivity.

Sorry for sending in a quiz that "looked too much like real work". I promise,
we're back to fun and games tomorrow...

James Edward Gray II <james@grayproductions.net> writes:

Christian Neukirchen's short solution (not shown) does away with
sub-header and subtotal lines. That filter tacks on the Salesperson,
Customer and SA Sort Code to each product entry. This makes all the
needed information available on every line. This is a big win for
using Excel's filters, making the report a lot easier to examine and
change.

When I did this project for work, I did the same thing. However, I
put the metadata at the back of the line and left in the sub-headers,
subtotals and blanks. I also added a column to identify each type of
line: Header, total, data or blank. This gave me all the same
benefits as Christian's code has, while keeping the familiar format.
I could filter down to the "data" lines to have nearly identical
results. This feature turned out to be a big hit, drastically
increasing company productivity.

Now, that's a really nice idea...

···

--
Christian Neukirchen <chneukirchen@gmail.com> http://chneukirchen.org

Bravo, James. I rarely have time to try the quizzes, but your tutorial summaries are outstanding.

This, in my opinion, is one thing we should do to increase Ruby's visibility: write articles and books that take the time to explain code in the context of a specific application and and strategy.

Self-documenting code is a myth. It tells you everything about the wasp but why [1-2].

Steve

[1] Semi-obscure literary references are no fun in the age of Google.
[2] 'why', not '_why'.