Howdy, Gavin. Thanks for taking the time to answer!
Tobias responded to a lot of your comments already; I just wanted to add a
little more:
Gavin Sinclair wrote:
So one might start out with
class LogLine
attr_reader :month, :day, :hour, :minute, :second,
:host, :program, :pid, :message
[...]
end
I’d use a Time object instead of :month, :day, etc.
We’re thinking along the same lines; part of that […] is a method that
returns a Time object. But I did it this way for a couple of reasons. One
is that because time in the standard syslog file is ambiguous (you’ll note
there’s no year or tz field), it’s important to offer the uninterpreted
data. Anotherissue is one of speed: I’d rather defer creation of the Time
object until somebody actually needs it.
I wouldn’t try to make subclasses of LogLine straight away, unless you
already know what you want to do with them.
Happily, I do. I need to process my mail logs to see which DNS-based
blacklists would cut spam without yielding false positives. This struck me
as a fine project to learn Ruby with. But once I have a some decent
classes, they will undoubtedly be reused for other log analysis.
Passing the string (minus the date) is all I can think you can do. You
can’t slice the string in a way that’s meaningful to all subclasses.
Well, that’s not entirely true. All LogLine subclasses agree on the basic
fields in the LogLine class, so I want to slice those up first. Other
subclasses may be able to ferret out additional data from the message, so
each subclass may do further slicing.
Done naively, this would mean that each subclass would parse the line,
but that would be wasteful. So obviously, the superclass should do the
basic parsing once and pass the details down. Passing as an Array is
brittle and not very OO. Passing as a Hash is better, but still not
right.
BTW There’s nothing non-OO about arrays in Ruby. They’re not like Java!
Ah, I should have explained more what I meant.
In OO analysis and design, you find groups of data that travel together and
call them objects. If I break a log line up into an array of fields
["Aug", "24", "12", "31", "01", "myhost", "sendmail", "12345", "foo"]
and pass that around, that’s not very OO because there is an implicit
structure that isn’t expressed in my representation. Arrays are just a row
of things, and I know more about the log line than that. So I could use a
hash to represent the structure and the data together:
{ "month"=>"Aug", "day"=>"24", "hour"=>"12", [...] }
Which is better, right? With the array if I added a field in the middle (say
syslog starts logging microseconds) then everything breaks. But with the
hash, it’s more flexible.
But then I’ve got to say to myself, “Gosh, that hash looks a lot like an
object.” And if I turn it into a full-fledged object, I can add behavior,
like a call to get a Time object.
So that’s what I meant when I said that passing arrays like this around
isn’t such good OO design.
Basically, I think you should put all the logic of what subclass best
represents the line into LogLine, like: […]
case line
when SENDMAIL_REGEX
return SendMailLogLine.new(...)
when NAMED_REGEX
return NamedLogLine.new(...)
else
return BasicLogLine.new(line)
end
end
end
That’s a good start, but you will run into trouble with that over time.
The case statement approach works for a few LogLine types, but is brittle.
If you only have a couple of subclasses, then keeping the superclass in
sync with the subclasses is easy. But with 20, it’s a big pain, exactly the
kind of pain that OO is supposed to save us from.
And if you want other people to be able to write dynamically loadable log
parsing modules that extend your current functionality, the case statement
approach breaks down entirely.
It’s been my experience that often when I write a complicated if or case
statement, there’s a set of subclasses waiting to be born. Or if they
already exist, then there’s an opportunity to move the code into the
subclasses.
It may be a bit awkward for the client of LogLine to know what to do with
the return value - it’ll probably have to type-check it - but that’s
because of the subclassing, not because of my approach.
I don’t think awkward is quite the right word; one of the points of object
orientation is that you have typing built in to the language.
In my case, some parts of the code are interested only in certain types of
LogLines; they will indeed check for type. But many parts of the code will
work with any LogLine at all. So a ReportGenerator object would accept a
collection of LogLine objects, but the SendmailReportGenerator would only
choose to process things of the SendmailLogLine type.
Thanks again for taking the time to lend me a hand!
William