Final iteration lost?

Leam_Hall · 8 August 2019 11:07

I'm parsing a data file with each_line, each section starts with a line that ends in ")". Each section has 2-3 subsections, delimited by blank lines. The original test for a new section was:

if line.end_with?(')')

However, that means the last section is never dealt with, since the "each_line" ends at EOF. What I currently have is a line to process the data once the each_line block is finished.

  if data.count > 0
    jobs << Job.new(data)
  end

Full code below.

Thanks!

Leam

### Sample job section

Linux Admin, Spidernet s.r.l, Repubblica Italiana, (1997 - 1998)

Supported Linux and Solaris in a multi-lingual environment. Configured Apache to support application development team's needs. Took ownership of OS and hardware production issues and channeled application issues to specialists.

Solaris 2.5, Red Hat Linux 3, 4, 5, 6, Sendmail, Apache, Tomcat

### parser

$VERBOSE = true

$LOAD_PATH << File.expand_path('../../lib/', __FILE__)
require 'job_detail'

jobs_file = File.open('input/leamhall_jobs.txt')
jobs = Array.new
in_job = false

def set_dates(dates)
   date_re = %r{$(.*) - (.*)$}
   d = date_re.match(dates)
   start_date = long_years(d[1])
   end_date = long_years(d[2])
   return start_date, end_date
end

def long_years(date)
   dates = date.split(' ')
   if dates[0] == "present"
     return date
   end
   if dates.count == 1
     year = dates[0].to_i
   else
     year = dates[1].to_i
   end
   if year > 70 and year < 1900
     year += 1900
   elsif year < 70
     year += 2000
   end
   if dates.count == 1
     date = "#{year}"
   else
     date = "#{dates[0]} #{year}"
   end
end

data = Hash.new
jobs_file.each_line { |line|
   line.chomp!.strip!
   next if line.length == 0
   if line.end_with?(')')
     if data.count > 0
       jobs << Job.new(data)
     end
     data = Hash.new
     header_array = line.split(',')
     title = header_array.shift
     dates = header_array.pop
     start_date, end_date = set_dates(dates)
     data['start_date'] = start_date
     data['end_date'] = end_date
     data['title'] = title
     data['employer'] = header_array.join(' ').strip!
   elsif data['blurb'].nil?
     data['blurb'] = line
   elsif data['tech'].nil?
     data['tech'] = line
   else
     data['extra'] = line
   end
}
if data.count > 0
   jobs << Job.new(data)
end

jobs.each { |j|
     puts "#{j.employer}"
     puts "#{j.title}: #{j.start_date} - #{j.end_date}"
     puts "#{j.blurb}" unless j.blurb.nil?
     puts "#{j.tech}" unless j.tech.nil?
     puts
}

John_W_Higgins1 · 8 August 2019 19:59

The better option here is to create a "job" and have it process the rows as
opposed to processing rows and then creating a job

jobs =
current_job = nil

jobs_file.each_line do |line|
  if line.end_with? ')'
    current_job = Job.new
    jobs << current_job
  end
  current_job.process_line(line)
end

class Job
  def initialize
    @data = {}
  end

  def process_line(line)
    if @data['header'].nil?
      process_header(line)
   elsif @data['blurb'].nil?
     @data['blurb'] = line
   elsif @data['tech'].nil?
     @data['tech'] = line
   else
     @data['extra'] = line
    end
  end

  def process_header(line)
    .....
  end
end

John

···

On Thu, Aug 8, 2019 at 4:07 AM Leam Hall <leamhall@gmail.com> wrote:

I'm parsing a data file with each_line, each section starts with a line
that ends in ")". Each section has 2-3 subsections, delimited by blank
lines. The original test for a new section was:

if line.end_with?(')')

Robert_K1 · 10 August 2019 10:17

I will only focus on the line parsing. The pattern I use for this uses case
is like this:

require 'pp'

def process_sections(sections)
  sections || return
  # whatever
  pp sections
end

sections = nil

ARGF.each_line do |line|
  case line
  when /\($/ # beginning of section
    process_sections(sections)
    sections = [line]
  when /^$/ # new sub section
    sections and sections << ""
  else
    sections and sections.last << line
  end
end

process_sections(sections)

I find case with regular expressions much easier to digest than if elsif
cascades. And of course you need one final processing step after the loop
to make sure you process outstanding data.

Kind regards

robert

···

On Thu, Aug 8, 2019 at 1:07 PM Leam Hall <leamhall@gmail.com> wrote:

I'm parsing a data file with each_line, each section starts with a line
that ends in ")". Each section has 2-3 subsections, delimited by blank
lines. The original test for a new section was:

        if line.end_with?(')')

However, that means the last section is never dealt with, since the
"each_line" ends at EOF. What I currently have is a line to process the
data once the each_line block is finished.

        if data.count > 0
          jobs << Job.new(data)
        end

Full code below.

--
[guy, jim, charlie].each {|him| remember.him do |as, often| as.you_can -
without end}
http://blog.rubybestpractices.com/

Leam_Hall · 11 August 2019 10:53

John, just a quick reply on this. I'm still very much a Ruby beginner, your solution took me a couple reads to begin to understand. I think it's brilliant, and will work to understand it more.

Thank you for twisting my brain around!

Leam

···

On 8/8/19 3:59 PM, John W Higgins wrote:

On Thu, Aug 8, 2019 at 4:07 AM Leam Hall <leamhall@gmail.com > <mailto:leamhall@gmail.com>> wrote:

    I'm parsing a data file with each_line, each section starts with a line
    that ends in ")". Each section has 2-3 subsections, delimited by blank
    lines. The original test for a new section was:

      if line.end_with?(')')

The better option here is to create a "job" and have it process the rows as opposed to processing rows and then creating a job

Robert_K1 · 11 August 2019 13:53

I would add one caveat: John's approach adds knowledge about the formatting
to class Job. If I put this into a class, I would probably rather have
classes Job and TextFileJobParser where the latter has knowledge about the
specific formatting (section begins with a line with "(" at end etc.). That
way you can keep class Job for the business logic and have it independent
of file formats. You might store jobs in a JSON or XML format at a later
point in time. If you then lump processing for all the formats into class
Job it is quite likely that it is becoming messy.

Another reason to keep the parsing out of class Job is that you might
require parsing state in the instance which is not otherwise needed for
Job's functionality. So you would have transitional state (the parsing
state) as well as the state required to do Job's job in the same instance.
Not nice, can be confusing and use more resources as necessary.

The grain of salt is that if this is a small application and there is just
the one format it might be OK to put the parsing code into the Job class.

Kind regards

robert

···

On Sun, Aug 11, 2019 at 12:53 PM Leam Hall <leamhall@gmail.com> wrote:

On 8/8/19 3:59 PM, John W Higgins wrote:

> On Thu, Aug 8, 2019 at 4:07 AM Leam Hall <leamhall@gmail.com > > <mailto:leamhall@gmail.com>> wrote:
>
> I'm parsing a data file with each_line, each section starts with a
line
> that ends in ")". Each section has 2-3 subsections, delimited by
blank
> lines. The original test for a new section was:
>
> if line.end_with?(')')
>
> The better option here is to create a "job" and have it process the rows
> as opposed to processing rows and then creating a job

John, just a quick reply on this. I'm still very much a Ruby beginner,
your solution took me a couple reads to begin to understand. I think
it's brilliant, and will work to understand it more.

--
[guy, jim, charlie].each {|him| remember.him do |as, often| as.you_can -
without end}
http://blog.rubybestpractices.com/

Leam_Hall · 12 August 2019 11:31

> The better option here is to create a "job" and have it process
the rows as opposed to processing rows and then creating a job

Leam replied:

    John, just a quick reply on this. I'm still very much a Ruby beginner,
    your solution took me a couple reads to begin to understand. I think
    it's brilliant, and will work to understand it more.

I would add one caveat: John's approach adds knowledge about the formatting to class Job.

To which Leam replies:

Understood. My biggest hurdle was to get the object to accumulate all the sub-sections without losing the last one, and preferably without additional post-processing. There are a variable number of sub-sections, and that was my trouble-spot.

The regex parsing before, and the presentation after, should not be a part of the Job class.

Thanks!

Leam

···

On 8/8/19 3:59 PM, John W Higgins wrote:
On 8/11/19 9:53 AM, Robert Klemme then wrote:

Leam_Hall · 15 August 2019 11:31

Here's the current iteration of that section. The Job class initialization takes some data and sets other values to nil. This is due to some older jobs not having those data sections. The sections are always in order.

···

On 8/11/19 9:53 AM, Robert Klemme wrote:

On Sun, Aug 11, 2019 at 12:53 PM Leam Hall <leamhall@gmail.com > <mailto:leamhall@gmail.com>> wrote:

    On 8/8/19 3:59 PM, John W Higgins wrote:

     > On Thu, Aug 8, 2019 at 4:07 AM Leam Hall <leamhall@gmail.com > <mailto:leamhall@gmail.com> > > <mailto:leamhall@gmail.com>> wrote:
     >
     > I'm parsing a data file with each_line, each section starts
    with a line
     > that ends in ")". Each section has 2-3 subsections, delimited
    by blank
     > lines. The original test for a new section was:
     >
     > if line.end_with?(')')
     >
     > The better option here is to create a "job" and have it process
    the rows
     > as opposed to processing rows and then creating a job

    John, just a quick reply on this. I'm still very much a Ruby beginner,
    your solution took me a couple reads to begin to understand. I think
    it's brilliant, and will work to understand it more.

I would add one caveat: John's approach adds knowledge about the formatting to class Job. If I put this into a class, I would probably rather have classes Job and TextFileJobParser where the latter has knowledge about the specific formatting (section begins with a line with "(" at end etc.). That way you can keep class Job for the business logic and have it independent of file formats. You might store jobs in a JSON or XML format at a later point in time. If you then lump processing for all the formats into class Job it is quite likely that it is becoming messy.

Another reason to keep the parsing out of class Job is that you might require parsing state in the instance which is not otherwise needed for Job's functionality. So you would have transitional state (the parsing state) as well as the state required to do Job's job in the same instance. Not nice, can be confusing and use more resources as necessary.

The grain of salt is that if this is a small application and there is just the one format it might be OK to put the parsing code into the Job class.

Kind regards

robert

###

data = Hash.new
jobs_file = 'input/leamhall_jobs.txt'
File.foreach(jobs_file) { |line|
   line.chomp!.strip!
   next if line.empty?
   if line.end_with?(')')
     data.clear
     header_array = line.split(',')
     title = header_array.shift
     dates = header_array.pop
     start_date, end_date = set_dates(dates)
     employer = header_array.join(' ').strip!
     data = { start_date: start_date, end_date: end_date,
               title: title, employer: employer }
     current_job = Job.new(data)
     jobs << current_job
   elsif current_job.blurb.nil?
     current_job.blurb = line
   elsif current_job.tech.nil?
     current_job.tech = line
   else
     current_job.extra = line
   end
}

Hassan_Schroeder · 15 August 2019 14:20

Not everyone loves regexes but a possible alternative to the above:

irb(main):008:0> line = "Linux Admin, Spidernet s.r.l, Repubblica
Italiana, (1997 - 1998)"
=> "Linux Admin, Spidernet s.r.l, Repubblica Italiana, (1997 - 1998)"
irb(main):009:0> /\A(?<title>[\w
]+),\s*(?<employer>[^,]+).+$(?<start_date>.+) -
(?<end_date>.+)$\z/.match(line).named_captures
=> {"title"=>"Linux Admin", "employer"=>"Spidernet s.r.l",
"start_date"=>"1997", "end_date"=>"1998"}
irb(main):010:0>

FWIW!

···

On Thu, Aug 15, 2019 at 4:31 AM Leam Hall <leamhall@gmail.com> wrote:

     header_array = line.split(',')
     title = header_array.shift
     dates = header_array.pop
     start_date, end_date = set_dates(dates)
     employer = header_array.join(' ').strip!
     data = { start_date: start_date, end_date: end_date,
               title: title, employer: employer }

--
Hassan Schroeder ------------------------ hassan.schroeder@gmail.com
twitter: @hassan
Consulting Availability : Silicon Valley or remote

Robert_K1 · 15 August 2019 15:10

Here's the current iteration of that section.

So it is not the final iteration?

The Job class

initialization takes some data and sets other values to nil. This is due

to some older jobs not having those data sections. The sections are
always in order.

###

data = Hash.new
jobs_file = 'input/leamhall_jobs.txt'
File.foreach(jobs_file) { |line|
   line.chomp!.strip!
   next if line.empty?
   if line.end_with?(')')
     data.clear
     header_array = line.split(',')
     title = header_array.shift
     dates = header_array.pop
     start_date, end_date = set_dates(dates)
     employer = header_array.join(' ').strip!
     data = { start_date: start_date, end_date: end_date,
               title: title, employer: employer }
     current_job = Job.new(data)
     jobs << current_job
   elsif current_job.blurb.nil?
     current_job.blurb = line
   elsif current_job.tech.nil?
     current_job.tech = line
   else
     current_job.extra = line
   end
}

I would definitively move parsing of the header line and creation of a new
job into a separate function. Makes the line - section logic much more
readable.

Kind regards

robert

···

On Thu, Aug 15, 2019 at 1:31 PM Leam Hall <leamhall@gmail.com> wrote:

--
[guy, jim, charlie].each {|him| remember.him do |as, often| as.you_can -
without end}
http://blog.rubybestpractices.com/

Robert_K1 · 15 August 2019 15:19

Sent too early.

Does this even work? current_job will be nil on each iteration because it
is not declared outside. Then this will happen:

irb(main):008:0> 5.times {|i| puts i; if rand(2) == 0 then puts "assign";
foo = Time.now end; p foo}
0
assign
2019-08-15 17:17:22 +0200
1
nil
2
assign
2019-08-15 17:17:22 +0200
3
nil
4
nil
=> 5

Also, you do not need data here. You always overwrite it with a new Hash
all the time. And you do not need to declare it outside the loop.

As an example

def Job.parse_header(line)
  header_array = line.split(',')
  title = header_array.shift
  dates = header_array.pop
  start_date, end_date = set_dates(dates)
  employer = header_array.join(' ').strip!

data = { start_date: start_date, end_date: end_date,
title: title, employer: employer }

current_job = new(data)
end

Kind regards

robert

···

On Thu, Aug 15, 2019 at 5:10 PM Robert Klemme <shortcutter@googlemail.com> wrote:

On Thu, Aug 15, 2019 at 1:31 PM Leam Hall <leamhall@gmail.com> wrote:

Here's the current iteration of that section.

So it is not the final iteration?

The Job class

initialization takes some data and sets other values to nil. This is due

to some older jobs not having those data sections. The sections are
always in order.

###

data = Hash.new
jobs_file = 'input/leamhall_jobs.txt'
File.foreach(jobs_file) { |line|
   line.chomp!.strip!
   next if line.empty?
   if line.end_with?(')')
     data.clear
     header_array = line.split(',')
     title = header_array.shift
     dates = header_array.pop
     start_date, end_date = set_dates(dates)
     employer = header_array.join(' ').strip!
     data = { start_date: start_date, end_date: end_date,
               title: title, employer: employer }
     current_job = Job.new(data)
     jobs << current_job
   elsif current_job.blurb.nil?
     current_job.blurb = line
   elsif current_job.tech.nil?
     current_job.tech = line
   else
     current_job.extra = line
   end
}

I would definitively move parsing of the header line and creation of a new
job into a separate function. Makes the line - section logic much more
readable.

--
[guy, jim, charlie].each {|him| remember.him do |as, often| as.you_can -
without end}
http://blog.rubybestpractices.com/

Leam_Hall · 15 August 2019 15:23

I write Ruby to get a job writing Ruby. I'll keep iterating until that happens.

···

On 8/15/19 11:10 AM, Robert Klemme wrote:

On Thu, Aug 15, 2019 at 1:31 PM Leam Hall <leamhall@gmail.com > <mailto:leamhall@gmail.com>> wrote:

Here's the current iteration of that section.

So it is not the final iteration?

Topic		Replies	Views
Question about Ruby syntax ruby-talk	9	68	18 June 2008
Detecting the end of a text file ruby-talk	3	85	25 July 2007
Regexp.newbie? ruby-talk	2	61	22 January 2007
Can you terminate an each_line statement ruby-talk	5	122	14 October 2007
Newbie - Q: working with files ruby-talk	2	81	20 July 2006

Final iteration lost?

Related Topics