Regexp Parsing -- What's the right way?

Skelastic · 12 August 2006 05:50

Greetings,

I'm trying to parse the following line:

"00608 P 135 001 LEC Tu 2-5P 210 WHEELER Information Tech and Soceity 3
LAGUERRE"

I've constructed the following regexp:
/(\d{5}).(\D\s\w{2,4}).(\d{1,4}\s\D{3}).(\D{1,4}\s\d+).*(\d{1,4}\s\D{1,9}).(\w+).*(\d?).*(\w{1,14}).*/

With an input file I've produced the following output:
control# 00608 ---- correct
course#: P 135 ---- correct
section#: 001 LEC ---- correct
day-hour#: Tu 2 ---- missing '-5P'
room#: 3 LAGUERRE, ---- should be 210 WHEELER
course-name#: M --- IT and Soceity
credits#: --- should be 3
prof#: 5 --- should be LAGUERRE

I'm a novice to Ruby and regexps. I would like to know if I'm taking the
right approach.
I'll eventually nail it, but any hints or suggestions would be useful.

Appreciate the help.

Simon_Kroger · 12 August 2006 07:32

I would go with split in this case:

t = "00608 P 135 001 LEC Tu 2-5P 210 WHEELER Information Tech and Soceity 3 LAGUERRE"
a = t.split
# take the fixed fields from the front
control = a.shift
course = a.shift + ' ' + a.shift
section = a.shift + ' ' + a.shift
hour = a.shift + ' ' + a.shift
room = a.shift + ' ' + a.shift
# take the last two fields from the back
prof = a.pop
credits = a.pop
# whatever is left is the course name
coursen = a.join(' ')

puts "control: #{control}"
puts "course: #{course}"
puts "section: #{section}"
puts "hour: #{hour}"
puts "room: #{room}"
puts "coursen: #{coursen}"
puts "credits: #{credits}"
puts "prof: #{prof}"
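The same idea can be wrapped in a method that returns a Hash. This is only a minimal sketch: the method name parse_course_line, the Hash keys, and the 12-token minimum are assumptions made for illustration, and the keyword-style Hash syntax needs Ruby 1.9 or later.

```ruby
# Hypothetical helper (not from the original post): split-based parsing
# wrapped in a method that returns a Hash, or nil for lines that are too short.
def parse_course_line(line)
  tokens = line.split
  # 9 leading tokens + credits + professor = 11; the name needs at least one more.
  return nil if tokens.size < 12

  fields = {
    control: tokens.shift,
    course:  tokens.shift(2).join(' '),
    section: tokens.shift(2).join(' '),
    hour:    tokens.shift(2).join(' '),
    room:    tokens.shift(2).join(' ')
  }
  # take the last two fields from the back; the rest is the course name
  fields[:prof]    = tokens.pop
  fields[:credits] = tokens.pop
  fields[:name]    = tokens.join(' ')
  fields
end

line = "00608 P 135 001 LEC Tu 2-5P 210 WHEELER Information Tech and Soceity 3 LAGUERRE"
course = parse_course_line(line)
puts "room: #{course[:room]}"
puts "name: #{course[:name]}"
```

The size check is the only validation here; a line with the right number of tokens but the wrong content would still be accepted.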

cheers

Simon

Jan_Svitok · 12 August 2006 09:09

Hi,

Although in this case I'd prefer the String#split solution, here's how
it can be done in case you really need a regex.
Below are incremental versions of the regex, and a test to check them.
Save to a file and enjoy!

Jano

#!/usr/bin/ruby
require 'test/unit'

DATA = "00608 P 135 001 LEC Tu 2-5P 210 WHEELER IT and Society 3 LAGUERRE"

REGEX1 = /(\d{5}).(\D\s\w{2,4}).(\d{1,4}\s\D{3}).(\D{1,4}\s\d+).*(\d{1,4}\s\D{1,9}).(\w+).*(\d?).*(\w{1,14}).*/

# add /x and # comments,
REGEX2 = /
  (\d{5}). # control
  (\D\s\w{2,4}). # course
  (\d{1,4}\s\D{3}). # section
  (\D{1,4}\s\d+).* # day-hour
  (\d{1,4}\s\D{1,9}). # room
  (\w+).* # course-name
  (\d?).* # credits
  (\w{1,14}).* # professor
/x

# now we'll fix the day-hour: add -\d[AP] - that will match a hyphen,
# a digit and either 'A' or 'P'
# and fix the room: replace \D{1,9} with \w+
REGEX3 = /
  (\d{5}). # control
  (\D\s\w{2,4}). # course
  (\d{1,4}\s\D{3}). # section
  (\D{1,4}\s\d+-\d[AP]). # day-hour
  (\d{1,4}\s\w+). # room
  (\w+).* # course-name
  (\d?).* # credits
  (\w{1,14}).* # professor
/x

# To fix the course name, the previous tricks aren't enough --
# there are many words, of different lengths. So what will we do?
# We'll parse the things at the end: credits and professor

#
# To see the results, temporarily comment out the lines
# that check the course name and credits in the test
# and run it with REGEX3.
#
# To fix the professor, we'll say that it's the last word on the line:
# notice the \s+ before the professor group - there has to be something
# fixed that separates the name from the rest - .* won't do it.
REGEX4 = /
  (\d{5}). # control
  (\D\s\w{2,4}). # course
  (\d{1,4}\s\D{3}). # section
  (\D{1,4}\s\d+-\d[AP]). # day-hour
  (\d{1,4}\s\w+). # room
  (\w+).* # course-name
  (\d?)\s+ # credits
  (\w+)\s*$ # professor
/x

# Now we can try the remaining two pieces: uncomment credits and
# course name in the test.
#
# Credits come back empty: the greedy .* swallows the digit and \d?
# is happy to match nothing. Only the first word of the course name
# appears. So we'll move .* inside the parentheses, require \d+ for
# the credits, and add a separating \s+
#
# Finally some small touches:
# replace separating . with \s+
REGEX5 = /
  (\d{5})\s+ # control
  (\D\s\w{2,4})\s+ # course
  (\d{1,4}\s\D{3})\s+ # section
  (\D{1,4}\s\d+-\d[AP])\s+ # day-hour
  (\d{1,4}\s\w+)\s+ # room
  (.*)\s+ # course-name
  (\d+)\s+ # credits
  (\w+)\s*$ # professor
/x

class TestRegex < Test::Unit::TestCase
  def test_regex
    assert DATA =~ REGEX1 # <--- change number here
    assert_equal "00608", $1
    assert_equal "P 135", $2
    assert_equal "001 LEC", $3
    assert_equal "Tu 2-5P", $4
    assert_equal "210 WHEELER", $5
    assert_equal "IT and Society", $6
    assert_equal "3", $7
    assert_equal "LAGUERRE", $8
  end
end
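
Once the final pattern works, the numbered globals ($1 to $8) can be swapped for named groups. The following is a sketch only: it assumes Ruby 1.9 or later (named captures were not available when this thread was written), and the constant name COURSE_LINE and the group names are made up for illustration. The pattern itself is REGEX5 with names added.

```ruby
# REGEX5 from above with named groups; COURSE_LINE and the group names
# are illustrative choices, not part of the original post.
COURSE_LINE = /
  (?<control>\d{5})\s+              # control
  (?<course>\D\s\w{2,4})\s+         # course
  (?<section>\d{1,4}\s\D{3})\s+     # section
  (?<hour>\D{1,4}\s\d+-\d[AP])\s+   # day-hour
  (?<room>\d{1,4}\s\w+)\s+          # room
  (?<name>.*)\s+                    # course-name
  (?<credits>\d+)\s+                # credits
  (?<prof>\w+)\s*$                  # professor
/x

line = "00608 P 135 001 LEC Tu 2-5P 210 WHEELER IT and Society 3 LAGUERRE"
m = COURSE_LINE.match(line)
if m
  puts "room: #{m[:room]}"
  puts "name: #{m[:name]}"
  puts "prof: #{m[:prof]}"
else
  puts "line does not match the expected format"
end
```

Reading m[:room] instead of $5 means the extraction code does not have to change when a group is added or reordered.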

Robert_K1 · 12 August 2006 10:35

Looks pretty OK to me, except that I'd use \s instead of . to match the whitespace separating entries.
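
A small sketch of why the separator matters, using only the first two groups of the pattern. The two test strings and the variable names are made up for illustration; they are not from the input file.

```ruby
# Only the control and course groups, with the two kinds of separator.
dot_sep   = /(\d{5}).(\D\s\w{2,4})/
space_sep = /(\d{5})\s+(\D\s\w{2,4})/

two_spaces = "00608  P 135"  # fields separated by two spaces
no_space   = "00608XP 135"   # a stray character instead of whitespace

# "." consumes exactly one character, so the second space ends up inside
# the course group and the match fails; "\s+" absorbs both spaces.
dot_on_two_spaces   = !dot_sep.match(two_spaces).nil?
space_on_two_spaces = !space_sep.match(two_spaces).nil?

# "." accepts any character, so the malformed line is matched anyway;
# "\s+" rejects it.
dot_on_no_space   = !dot_sep.match(no_space).nil?
space_on_no_space = !space_sep.match(no_space).nil?

puts "two spaces: .  -> #{dot_on_two_spaces}, \\s+ -> #{space_on_two_spaces}"
puts "no space:   .  -> #{dot_on_no_space}, \\s+ -> #{space_on_no_space}"
```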

robert

sukhchander · 12 August 2006 20:30

Hi,

I worked on the regexp some more before I saw everyone's response.
I was able to extract all parts except for the day hour. I was treating
- as "-" and the literals A as "A" and P as "P" so I didn't hit any
matches.

I see you created line breaks with each component of the REGEXP. I will
follow that convention from now on.

I also now understand the difference between .* and \s+ as many of you
have pointed out.
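
For anyone else reading along, here is a minimal sketch of that difference on the tail of the sample line. The variable names loose and strict are just labels chosen for the example.

```ruby
tail = "IT and Society 3 LAGUERRE"

# Greedy .* outside the group, optional digit: .* runs as far as it can,
# so it takes the "3" and \d? settles for matching nothing.
loose = tail.match(/(\w+).*(\d?)\s+(\w+)\s*$/)

# .* inside the group, with a mandatory digit and \s+ separators: the
# engine has to back off until a digit and the last word are left over.
strict = tail.match(/(.*)\s+(\d+)\s+(\w+)\s*$/)

puts "loose:  name=#{loose[1].inspect} credits=#{loose[2].inspect} prof=#{loose[3].inspect}"
puts "strict: name=#{strict[1].inspect} credits=#{strict[2].inspect} prof=#{strict[3].inspect}"
```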

I'm new to Ruby as well and will continue to experiment with it some
more.

Thanks for your responses.
[sukhchander]

sukhchander · 12 August 2006 20:30

Hi Simon,

That's pretty cool.
I was looking for a utility similar to Java's StringTokenizer. You just
pointed it out.
Ruby has so many things built in. It's very comprehensive.

For larger regexp I assume you prefer the split/tokenize method?

I went with the Regexp approach because it occurred to me first.

Thanks.
[sukhchander]

Robert_K1 · 12 August 2006 22:15

sukhchander wrote:

> For larger regexp I assume you prefer the split/tokenize method?
>
> I went with the Regexp approach because it occurred to me first.

Personally I'd stick with the regexp approach as it has these advantages:

- probably faster because you don't have to split and then combine again

- more precise with regard to matching, i.e. you can better define where to match plus you get the info whether the input string is properly formatted
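
The second point can be sketched like this. The pattern is shortened to the control and course fields for brevity, and the sample lines and variable names are made up for illustration.

```ruby
# Shortened pattern: control number and course only.
pattern = /\A(\d{5})\s+(\D\s\w{2,4})\s+/

good = "00608 P 135 001 LEC"
bad  = "P 135 001 LEC"  # control number missing

# The regexp tells us whether the line has the expected shape.
good_ok = !pattern.match(good).nil?
bad_ok  = !pattern.match(bad).nil?
puts "good line accepted: #{good_ok}"
puts "bad line accepted:  #{bad_ok}"

# split has no notion of format: it hands back whatever token comes first.
control = bad.split.first
puts "split reports control = #{control.inspect}"
```

With split, a check like this has to be written by hand for each field; with the regexp, a failed match already signals a malformed line.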

Btw, if you want to dive into regexps I can recommend "Mastering Regular Expressions". It's probably best to get some basic knowledge of RX first, but if you want to know how to build efficient RX etc. then that book is definitely a great help. Ah, I get carried away...

There are also tools that help in understanding RX visually: RegexBuddy and Regex-Coach.

Kind regards

robert
