RegExp & File read help

Hi All,

I am trying to parse out a list of elements that match a given regular expression from a set of XML files. I am sure there is probably a way to do this using an XML parsing library, but I thought it might be just as easy to do so with regular expressions.

My thought was to do the following:

Iterate through a set of files in a directory.
Search each file for a set of lines which match a given regular expression.
Add the capture group in each match to an array.
Sort the array and remove any duplicate values.
Print the results.
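
The steps above can be sketched roughly as follows. This is only an illustration, not a tested solution for the real files: the helper name `font_families` and the `*.xml` glob are assumptions, and it presumes each file can be read as plain text in Ruby's default encoding.

```ruby
# Pattern from this post. Inside a /.../ literal the double quotes need no
# escaping, and \w* only captures single-word names such as "Helvetica".
FONT_FAMILY = /<Font-family codeSet="\w*" fontId="\d*">(\w*)<\/Font-family>/

# Hypothetical helper: collects the capture group from every matching line
# of every .xml file directly inside +dir+, then de-duplicates and sorts.
def font_families(dir)
  names = []
  Dir.glob(File.join(dir, '*.xml')).each do |path|
    File.foreach(path) do |line|
      # scan returns one array of captures per match on the line
      names.concat(line.scan(FONT_FAMILY).flatten)
    end
  end
  names.uniq.sort
end
```

Calling `puts font_families('/Users/donlevan/Desktop/DDRs')` would then print one name per line.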

Here are the steps I have tried in building my script:

First, I tested to make sure my regular expression actually matched against the pattern I was seeking. This seemed to work as expected.

_________
regexp = Regexp.new(/<Font-family codeSet=\"\w*\" fontId=\"\d*\">(\w*)<\/Font-family>/m)
string = %q(<Font-family codeSet="Roman" fontId="0">Helvetica</Font-family>)

if string =~ regexp
  puts "yes, there is a match. #{$1}"
end
_________

Returns >> yes, there is a match. Helvetica

Then, I tested a different method which would add the matches to an array. This also seemed to work as expected.
_________
regexp = Regexp.new(/<Font-family codeSet=\"\w*\" fontId=\"\d*\">(\w*)<\/Font-family>/m)
string = %q(<Font-family codeSet="Roman" fontId="0">Helvetica</Font-family>)

a = regexp.match(string)
puts a[1]
_________

Returns >> Helvetica

Next, I tested opening a file and returning all lines. This seemed to work as well.
_________
file = File.new('/Users/donlevan/Desktop/DDRs/Apple Dealer Price List.xml')

file.each do |line|
   puts line
end
_________
Returns >> <?xml version="1.0" encoding="UTF-16"?>
<FMPReport link="Summary.xml" type="Report" version="8.5v1" creationDate="6/26/2007" creationTime="10:54:46 AM">
<File name="Apple Dealer Price List" path="10.100.0.10">
<BaseTableCatalog>
  <BaseTable id="32769" name="Apple Dealer Price List" records="235">
    <FieldCatalog> ... end of file

Where I am getting stuck is the next code fragment, in which I test each line to see if there is a match. There should be one, since the string I used above for testing was pulled directly from one line of the file. Unfortunately, I get an error and no matches.

_________
regexp = Regexp.new(/<Font-family codeSet=\"\w*\" fontId=\"\d*\">(\w*)<\/Font-family>/m)
file = File.new('/Users/donlevan/Desktop/DDRs/Apple Dealer Price List.xml')

file.each do |string|
   if string =~ regexp
      puts "yes, there is a match. #{$1}"
   end
end
_________
Returns >>

RubyMate r6354 running Ruby r1.8.6 (/usr/local/bin/ruby)
>>> untitled

/Users/donlevan/Library/Application Support/TextMate/Support/lib/scriptmate.rb:29: warning: Insecure world writable dir /Users/donlevan/Library/Application Support in PATH, mode 040706
Program exited.

I would be grateful for any assistance. Thanks so much.

Don Levan
Brooklyn, New York

Hi Don,

I am sure there is probably a
way to do this using an xml parsing library, but I thought it might
be just as easy to do so with regular expressions.

Hpricot is a good choice.
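
As a rough idea of what the parser-based route looks like, here is a sketch. It uses REXML rather than Hpricot, only because REXML is normally bundled with Ruby and so needs no extra gem; the helper name and the sample XML in the usage note are made up for illustration.

```ruby
require 'rexml/document'

# Hypothetical helper: returns the sorted, unique text content of every
# <Font-family> element found anywhere in the given XML string.
def font_families_from_xml(xml)
  doc = REXML::Document.new(xml)
  names = []
  # '//Font-family' is an XPath expression selecting the element at any depth
  doc.elements.each('//Font-family') { |element| names << element.text }
  names.compact.uniq.sort
end
```

For example, `font_families_from_xml(File.read(path))` could be called once per file. Unlike the `(\w*)` capture, this also picks up names that contain spaces.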

Where I am getting stuck is in the next code fragment, in which I am
testing each line to see if there is a match. There should be as the
string I used above for testing was pulled directly from one line of
the file. Unfortunately, I get an error and no matches.

_________
regexp = Regexp.new(/<Font-family codeSet=\"\w*\" fontId=\"\d*\">(\w*)

In the regexp, you need to escape the minus sign as well; otherwise
it is interpreted as a range of characters, i.e. f-t = ['f','g',...,'t']

regexp = Regexp.new(/<Font\-family codeSet=\"\w*\" fontId=\"\d*\">(\w*)<\/Font\-family>/m)

Best regards,

Axel

Hi --

On Wed, 27 Jun 2007, Axel Etzold wrote:

In the regexp, you need to escape the minus sign as well; otherwise
it is interpreted as a range of characters, i.e. f-t = ['f','g',...,'t']

Only inside a character class. Otherwise it's just a minus sign:

irb(main):017:0> Regexp.new(/a-z/).match("a")
=> nil
irb(main):018:0> Regexp.new(/a-z/).match("literal a-z")
=> #<MatchData:0x312ce8>
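
The same point as a small script, in case it is easier to run than irb. The constant names are just for illustration, and `match?` assumes Ruby 2.4 or later.

```ruby
# Outside [...] a hyphen is an ordinary character: this matches only "f-t".
LITERAL_HYPHEN = /f-t/
# Inside [...] a hyphen denotes a range: this matches any one of f, g, ..., t.
RANGE_IN_CLASS = /[f-t]/

# Escaping the hyphen outside a character class is harmless but unnecessary.
PLAIN   = /Font-family/
ESCAPED = /Font\-family/

['g', 'f-t', '<Font-family>'].each do |sample|
  puts "#{sample.inspect}: " \
       "literal=#{LITERAL_HYPHEN.match?(sample)} " \
       "class=#{RANGE_IN_CLASS.match?(sample)} " \
       "plain=#{PLAIN.match?(sample)} " \
       "escaped=#{ESCAPED.match?(sample)}"
end
```

So the unescaped hyphen in the original pattern is not what prevents the match.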

David

--
* Books:
   RAILS ROUTING (new! http://www.awprofessional.com/title/0321509242)
   RUBY FOR RAILS (http://www.manning.com/black)
* Ruby/Rails training
     & consulting: Ruby Power and Light, LLC (http://www.rubypal.com)

Hi David and Axel,

Thanks for your help. I don't have it solved yet, but you both have cleared up the confusion.

Thanks,

Don
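
One more thing that may be worth checking: the first line of the file shown earlier declares encoding="UTF-16". Assuming the bytes on disk really are UTF-16 (little-endian is a guess), every ASCII character is stored with an extra zero byte, so a pattern written in plain ASCII would not match the raw lines even though it matches a string pasted into the script. The sketch below converts the text to UTF-8 before scanning. It needs a Ruby with encoding support (1.9 or later, not the 1.8.6 shown in the output above), and the helper names and default encoding are illustrative assumptions.

```ruby
FONT_FAMILY_RE = /<Font-family codeSet="\w*" fontId="\d*">(\w*)<\/Font-family>/

# Hypothetical helper: read a UTF-16 file and return its contents as UTF-8.
# The little-endian default is an assumption; pass 'UTF-16BE' if needed.
def read_utf16_as_utf8(path, source_encoding = 'UTF-16LE')
  raw = File.binread(path)                       # raw bytes, no transcoding
  text = raw.force_encoding(source_encoding).encode('UTF-8')
  text.sub(/\A\uFEFF/, '')                       # drop a leading byte order mark
end

# Returns every captured font family name in the file, in document order.
def font_families_in_utf16_file(path)
  read_utf16_as_utf8(path).scan(FONT_FAMILY_RE).flatten
end
```

If the files turn out to be plain ASCII or UTF-8 after all, this conversion is the wrong tool and the cause lies elsewhere.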
