Hi All,
I am trying to extract a list of elements that match a given regular expression from a set of XML files. I am sure there is a way to do this using an XML parsing library, but I thought it might be just as easy to do with regular expressions.
My thought was to do the following:
Iterate through a set of files in a directory.
Search each file for a set of lines which match a given regular expression.
Add the capture group in each match to an array.
Sort the array and remove any duplicate values.
Print the results.
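For reference, here is a minimal sketch of the five steps above. It is only an illustration, not code I have run against my real files: the helper name font_families and the sample files are made up, it assumes a current Ruby (File.write and Dir.mktmpdir), and it assumes the files are in an ASCII-compatible encoding such as UTF-8.

```ruby
require 'tmpdir'

# Pattern from this post; the capture group holds the font family name.
FONT_FAMILY = /<Font-family codeSet="\w*" fontId="\d*">(\w*)<\/Font-family>/

# Hypothetical helper: returns the sorted, de-duplicated captures found
# in every .xml file directly inside dir.
def font_families(dir)
  names = []
  Dir.glob(File.join(dir, '*.xml')).sort.each do |path|
    File.foreach(path) do |line|
      # scan returns one single-element array per match, hence flatten
      names.concat(line.scan(FONT_FAMILY).flatten)
    end
  end
  names.sort.uniq
end

# Demo on throwaway files (made-up sample data, assumed UTF-8)
Dir.mktmpdir do |dir|
  helvetica = %q(<Font-family codeSet="Roman" fontId="0">Helvetica</Font-family>)
  arial     = %q(<Font-family codeSet="Roman" fontId="1">Arial</Font-family>)
  File.write(File.join(dir, 'a.xml'), helvetica + "\n")
  File.write(File.join(dir, 'b.xml'), arial + "\n" + helvetica + "\n")
  puts font_families(dir)
end
```

With the two sample files, the duplicate Helvetica entry is collapsed and the names print in sorted order.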
Here are the steps I have tried in building my script:
First, I tested to make sure my regular expression actually matched against the pattern I was seeking. This seemed to work as expected.
_________
regexp = Regexp.new(/<Font-family codeSet=\"\w*\" fontId=\"\d*\">(\w*)<\/Font-family>/m)
string = %q(<Font-family codeSet="Roman" fontId="0">Helvetica</Font-family>)
if string =~ regexp
  puts "yes, there is a match. #{$1}"
end
_________
Returns >> yes, there is a match. Helvetica
Then, I tested a different method which would add the matches to an array. This also seemed to work as expected.
_________
regexp = Regexp.new(/<Font-family codeSet=\"\w*\" fontId=\"\d*\">(\w*)<\/Font-family>/m)
string = %q(<Font-family codeSet="Roman" fontId="0">Helvetica</Font-family>)
a = regexp.match(string)
puts a[1]
_________
Returns >> Helvetica
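As an aside, Regexp#match only returns the first match. A hedged sketch of how every capture in a string could be collected into a real array with String#scan is below; the helper name font_names and the second (Arial) element are made up for illustration.

```ruby
# Pattern from this post; the capture group holds the font family name.
FONT_RE = /<Font-family codeSet="\w*" fontId="\d*">(\w*)<\/Font-family>/

# Hypothetical helper: returns every captured font name in text as a flat array.
def font_names(text)
  # scan yields one single-element array per match, so flatten the result
  text.scan(FONT_RE).flatten
end

# Made-up sample: the original test string plus a second, invented element
sample = %q(<Font-family codeSet="Roman" fontId="0">Helvetica</Font-family>) +
         %q(<Font-family codeSet="Roman" fontId="1">Arial</Font-family>)

puts font_names(sample).inspect
```

For the sample string this gives an array holding both names, in the order they appear.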
Next, I tested opening a file and returning all lines. This seemed to work as well.
_________
file = File.new('/Users/donlevan/Desktop/DDRs/Apple Dealer Price List.xml')
file.each do |line|
  puts line
end
_________
Returns >> <?xml version="1.0" encoding="UTF-16"?>
<FMPReport link="Summary.xml" type="Report" version="8.5v1" creationDate="6/26/2007" creationTime="10:54:46 AM">
<File name="Apple Dealer Price List" path="10.100.0.10">
<BaseTableCatalog>
<BaseTable id="32769" name="Apple Dealer Price List" records="235">
<FieldCatalog> ... end of file
Where I am getting stuck is the next code fragment, in which I test each line to see if there is a match. There should be one, since the string I used above for testing was pulled directly from one line of the file. Unfortunately, I get an error and no matches.
_________
regexp = Regexp.new(/<Font-family codeSet=\"\w*\" fontId=\"\d*\">(\w*)<\/Font-family>/m)
file = File.new('/Users/donlevan/Desktop/DDRs/Apple Dealer Price List.xml')
file.each do |string|
  if string =~ regexp
    puts "yes, there is a match. #{$1}"
  end
end
_________
Returns >>
RubyMate r6354 running Ruby r1.8.6 (/usr/local/bin/ruby)
>>> untitled
/Users/donlevan/Library/Application Support/TextMate/Support/lib/scriptmate.rb:29: warning: Insecure world writable dir /Users/donlevan/Library/Application Support in PATH, mode 040706
Program exited.
I would be grateful for any assistance. Thanks so much.
Don Levan
Brooklyn, New York