Newbie regexp question

Hello,

I'm trying to split a formatted text file into four separate columns.
The data is comprised of lines of text that are bundled into four
distinct columns, corresponding to a "Required versus Optional"
variable, a requirement number, a requirement classification (R1=Rev 1,
F=Future, I=Internal), and a textual description of the requirement.

My raw data looks like this in the input text file:

R [01] R1 The system shall support "emergency call processing"
R [02] R1 The system shall support "local call processing"
R [08] F The system shall provide a command-line user interface
R [723] F The system shall provide 6 10/100/1000 Ethernet interfaces
R [11] F The system shall support VoIP networks
R [398] R1 The system shall contain 2 control boards
O [327] I The system should support hotswapping of all internal boards
R [19] I The system shall be able to detect transmission errors
R [631] F The system shall continue processing data as long as a call is active.

I've set up a loop to process each line in the input file. For each line,
I'd like to end up with four separate variables holding the data from the
four columns. The problem is that my regexp experience is next to nothing,
and I can't figure out how to extract the data I want, since my fourth
column contains whitespace (otherwise I'd have used whitespace as my
column separator).

Here's my loop:

File.open(textfile, "r") do |input_file|
  while line = input_file.gets
    output_file << line
  end
end

What can I replace the simple copy statement (output_file << line) with
in order to get what I want?

Thanks in advance, I hope this question makes some sense.

James


try this one

open("file").read.scan(/(\w)\s+(.+?)\s+(\w+)\s+(.*?)\n?$/){|req,num,cls,dsc| ...}

lopex

My wife, Dana Gray, is still learning Ruby so I gave her this problem as a test. ;) She suggests the code below.

James Edward Gray II

DATA.each do |line|
   line =~ /^(\w)\s+(\S+)\s+(\S+)\s+(.+)/
   p [$1, $2, $3, $4]
end

__END__
R [01] R1 The system shall support "emergency call processing"
R [02] R1 The system shall support "local call processing"
R [08] F The system shall provide a command-line user interface
R [723] F The system shall provide 6 10/100/1000 Ethernet interfaces
R [11] F The system shall support VoIP networks
R [398] R1 The system shall contain 2 control boards
O [327] I The system should support hotswapping of all internal boards
R [19] I The system shall be able to detect transmission errors
R [631] F The system shall continue processing data as long as a call is active.
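
To connect this back to the loop in the original question, here is a minimal sketch that wraps the same pattern in a helper. The name `parse_requirement` and the constant `REQUIREMENT_LINE` are made up for illustration, and returning nil for lines that do not match is an assumption about how malformed input should be handled. In the real loop, `line` would come from `input_file.gets` rather than from a sample string.

```ruby
# Hypothetical helper (not from the thread): splits one requirement line
# into its four columns using the same pattern as the snippet above.
REQUIREMENT_LINE = /^(\w)\s+(\S+)\s+(\S+)\s+(.+)/

def parse_requirement(line)
  match = REQUIREMENT_LINE.match(line)
  return nil unless match   # assumption: lines that do not fit are skipped

  match.captures            # [required_flag, number, classification, description]
end

sample = <<~TEXT
  R [01] R1 The system shall support "emergency call processing"
  O [327] I The system should support hotswapping of all internal boards
TEXT

sample.each_line do |line|
  fields = parse_requirement(line)
  next if fields.nil?

  req, num, cls, dsc = fields
  puts "#{req} | #{num} | #{cls} | #{dsc}"
end
```

Because `.` does not match a newline, the description comes back without the trailing line break, so no extra `chomp` is needed here.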


On Sep 14, 2006, at 6:37 PM, James Calivar wrote:

What can I replace the simple copy statement (output_file << line) with
in order to get what I want?

You have a number of options. If your data is tab delimited (i.e. the first "two" columns are really one), you can split on tabs:

# tabs are written as \t here so the delimiters are visible
s = "R [01]\tR1\tThe system shall support \"emergency call processing\""
p s.split(/\t/)

=> ["R [01]", "R1", "The system shall support \"emergency call processing\""]

or you can just split on whitespace and specify a limit on the number of fields:

s = 'R [01] R1 The system shall support "emergency call processing"'
p s.split(/\s+/, 4)

=> ["R", "[01]", "R1", "The system shall support \"emergency call processing\""]
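
As a rough sketch of how the limited split could sit inside the per-line loop: the helper name `split_requirement` is hypothetical, and both stripping the brackets from the number and returning nil for short lines are assumptions, not something the original post asked for.

```ruby
# Sketch: split on runs of whitespace, but stop after four fields so the
# description keeps its internal spaces.
def split_requirement(line)
  # chomp removes the trailing newline that gets leaves on each line
  fields = line.chomp.split(/\s+/, 4)
  return nil unless fields.size == 4   # assumption: ignore blank or short lines

  req, num, cls, dsc = fields
  # Assumption: only the digits inside the brackets are wanted.
  [req, num[/\d+/], cls, dsc]
end

p split_requirement("R [723] F The system shall provide 6 10/100/1000 Ethernet interfaces\n")
```

The limit of 4 is what keeps the description intact; without it the description would be broken into one field per word.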

Or you can use a regex (ick ;-)

Hope this helps,

Mike

--

Mike Stok <mike@stok.ca>
http://www.stok.ca/~mike/

The "`Stok' disclaimers" apply.

I suck at regex too. I tried this as an exercise and came up with the code below. It's less concise than the previous solutions, but it works as far as I can tell:

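Row = Struct.new(:col1, :col2, :col3, :col4)
rows = []
# flag, [number], classification, then the rest of the line as the description
regex = /^([A-Z])\s+(\[[0-9]+\])\s+([A-Z0-9]+)\s+(.+)/

File.open("file.txt") do |file|
  while (line = file.gets)
    m = line.match(regex)
    next unless m  # skip blank or malformed lines instead of failing on nil
    rows << Row.new(m[1], m[2], m[3], m[4])
  end
end

puts rows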

#output =>

#<struct Row col1="R", col2="[01]", col3="R1", col4="The system shall support \"emergency call processing\"">
#<struct Row col1="R", col2="[02]", col3="R1", col4="The system shall support \"local call processing\"">
#<struct Row col1="R", col2="[08]", col3="F", col4="The system shall provide a command-line user interface">
#<struct Row col1="R", col2="[723]", col3="F", col4="The system shall provide 6 10/100/1000 Ethernet interfaces">
#<struct Row col1="R", col2="[11]", col3="F", col4="The system shall support VoIP networks">
#<struct Row col1="R", col2="[398]", col3="R1", col4="The system shall contain 2 control boards">
#<struct Row col1="O", col2="[327]", col3="I", col4="The system should support hotswapping of all internal boards">
#<struct Row col1="R", col2="[19]", col3="I", col4="The system shall be able to detect transmission errors">
#<struct Row col1="R", col2="[631]", col3="F", col4="The system shall continue processing data as long as a call is active.">

-Steven
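
A variation on the Struct idea, offered only as a sketch: named captures let the fields be read by name instead of by position. The names `Requirement`, `REQUIREMENT_PATTERN` and `to_requirement` are invented for this example, and converting the number to an Integer is an assumption about what is useful downstream.

```ruby
# Hypothetical names throughout; the column meanings come from the question.
Requirement = Struct.new(:required, :number, :classification, :description)

REQUIREMENT_PATTERN = /
  ^(?<required>[A-Z])\s+            # R or O
  \[(?<number>\d+)\]\s+             # requirement number inside brackets
  (?<classification>[A-Z0-9]+)\s+   # R1, F, I
  (?<description>.+)                # rest of the line
/x

def to_requirement(line)
  m = REQUIREMENT_PATTERN.match(line)
  return nil unless m   # assumption: non-matching lines are ignored

  # Assumption: the number is more useful as an Integer ("08" becomes 8).
  Requirement.new(m[:required], m[:number].to_i, m[:classification], m[:description])
end

req = to_requirement('R [08] F The system shall provide a command-line user interface')
puts "#{req.number}: #{req.description}" if req
```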

Marcin Mielżyński wrote:

Ooops,

the newline in regexp is not needed...

try this one

open("file").read.scan(/(\w)\s+(.+?)\s+(\w+)\s+(.*?)$/){|req,num,cls,dsc| ...}

lopex
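
For anyone who wants to try the scan version without a file on disk, here is a small sketch that runs the same pattern over an in-memory string. The hash keys are made up for illustration; the thread itself leaves the block body open.

```ruby
# Sketch of the scan approach on a string instead of open("file").read.
text = <<~DATA
  R [01] R1 The system shall support "emergency call processing"
  R [08] F The system shall provide a command-line user interface
DATA

rows = []
text.scan(/(\w)\s+(.+?)\s+(\w+)\s+(.*?)$/) do |req, num, cls, dsc|
  # Illustrative key names, not from the thread.
  rows << { required: req, number: num, classification: cls, description: dsc }
end

rows.each { |row| p row }
```

One thing to keep in mind with this approach: `\s+` can also match a newline, so on malformed input a match may run across two lines. Processing the file line by line avoids that.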
