Hi,
although in this case I'd prefer the split solution, here's how
it can be done in case you really need a regex.
Below are incremental versions of the regex, and a test to check them.
Save to a file and enjoy!
Jano
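For comparison, a split-based version might look roughly like the sketch below. It is only an illustration: the field positions and the `record` hash are assumptions based on the single sample line used in the script, not something taken from the original question.

```ruby
# Sample line, same as DATA in the script.
line = "00608 P 135 001 LEC Tu 2-5P 210 WHEELER IT and Society 3 LAGUERRE"

# Assumption: course, section, day-hour and room are always two tokens each,
# and credits and professor are always the last two tokens.
tokens = line.split

record = {
  control:   tokens[0],
  course:    tokens[1..2].join(' '),
  section:   tokens[3..4].join(' '),
  day_hour:  tokens[5..6].join(' '),
  room:      tokens[7..8].join(' '),
  name:      tokens[9..-3].join(' '),   # whatever is left in the middle
  credits:   tokens[-2],
  professor: tokens[-1]
}

p record[:name]       # prints "IT and Society"
p record[:professor]  # prints "LAGUERRE"
```

Counting from both ends means the course name can have any number of words, which is the part the regex versions struggle with.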
#!/usr/bin/ruby
require 'test/unit'
DATA = "00608 P 135 001 LEC Tu 2-5P 210 WHEELER IT and Society 3 LAGUERRE"
REGEX1 = /(\d{5}).(\D\s\w{2,4}).(\d{1,4}\s\D{3}).(\D{1,4}\s\d+).*(\d{1,4}\s\D{1,9}).(\w+).*(\d?).*(\w{1,14}).*/
# Add the /x flag and # comments:
REGEX2 = /
(\d{5}). # control
(\D\s\w{2,4}). # course
(\d{1,4}\s\D{3}). # section
(\D{1,4}\s\d+).* # day-hour
(\d{1,4}\s\D{1,9}). # room
(\w+).* # course-name
(\d?).* # credits
(\w{1,14}).* # professor
/x
# Now we'll fix the day-hour: add -\d[AP], which will match a hyphen,
# a digit and either 'A' or 'P'.
# And a fix for the room: replace \D{1,9} with \w+
REGEX3 = /
(\d{5}). # control
(\D\s\w{2,4}). # course
(\d{1,4}\s\D{3}). # section
(\D{1,4}\s\d+-\d[AP]). # day-hour
(\d{1,4}\s\w+). # room
(\w+).* # course-name
(\d?).* # credits
(\w{1,14}).* # professor
/x
# To fix the course name, the previous tricks aren't enough:
# there are many words, of different lengths. So what will we do?
# We'll parse the things at the end first: credits and professor.
# On 8/12/06, skelastic@gmail.com <skelastic@gmail.com> wrote:
# > Greetings,
# > I'm trying to parse the following line:
# > ...
#
# To see the results, temporarily comment out the lines
# that check the course name and credits in the test
# and run it with REGEX3.
#
# To fix the professor, we'll say that it's the last word on the line.
# Notice the \s+ before the professor group: there has to be something
# fixed that separates the name from the rest; .* won't do it.
REGEX4 = /
(\d{5}). # control
(\D\s\w{2,4}). # course
(\d{1,4}\s\D{3}). # section
(\D{1,4}\s\d+-\d[AP]). # day-hour
(\d{1,4}\s\w+). # room
(\w+).* # course-name
(\d?)\s+ # credits
(\w+)\s*$ # professor
/x
# Now we can try the remaining two pieces: uncomment the credits and
# course-name checks again.
#
# Credits come back empty: \d? is allowed to match nothing, so the greedy .*
# in front of it swallows the "3". Requiring at least one digit (\d+) fixes it.
#
# Only the first word of the course name appears. So we'll move .* inside
# the parentheses and add a separating \s+
#
# Finally some small touches:
# replace the separating . with \s+
REGEX5 = /
(\d{5})\s+ # control
(\D\s\w{2,4})\s+ # course
(\d{1,4}\s\D{3})\s+ # section
(\D{1,4}\s\d+-\d[AP])\s+ # day-hour
(\d{1,4}\s\w+)\s+ # room
(.*)\s+ # course-name
(\d+)\s+ # credits
(\w+)\s*$ # professor
/x
class TestRegex < Test::Unit::TestCase
  def test_regex
    assert DATA =~ REGEX1 # <--- change number here
    assert_equal "00608", $1
    assert_equal "P 135", $2
    assert_equal "001 LEC", $3
    assert_equal "Tu 2-5P", $4
    assert_equal "210 WHEELER", $5
    assert_equal "IT and Society", $6
    assert_equal "3", $7
    assert_equal "LAGUERRE", $8
  end
end
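
A side note on the credits group, in case it helps. REGEX4 writes it as (\d?) and REGEX5 as (\d+), and the difference matters when a greedy .* sits in front of it. The snippet below is a minimal sketch of that effect; the shortened `tail` string and the two stripped-down patterns are made up for illustration and only mirror the end of the full regexes.

```ruby
# Hypothetical shortened input: just the tail of the sample line.
tail = "IT and Society 3 LAGUERRE"

# Credits as in REGEX4: the digit is optional.
optional_digit = /(\w+).*(\d?)\s+(\w+)\s*$/
# Credits as in REGEX5: at least one digit is required.
required_digit = /(.*)\s+(\d+)\s+(\w+)\s*$/

m_optional = optional_digit.match(tail)
m_required = required_digit.match(tail)

# The greedy .* backs off only as far as it must. With \d? it can stop
# right after the "3", leaving the credits group empty.
p m_optional.captures  # prints ["IT", "", "LAGUERRE"]

# With \d+ the engine has to back off further so that the digit lands in the group.
p m_required.captures  # prints ["IT and Society", "3", "LAGUERRE"]
```

The same reasoning applies to any optional group placed after .*: if it may match nothing, the greedy part usually takes the text first.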