I posted this to c.l.ruby the other day. Any comments?
I’ve been working on extending the sgml-parser included with the
standard Windows distribution and have run into several issues.
- Tags with XML namespace qualifiers would get split: dc:language
  would end up with tag="dc" and an attribute of "language".
- Attributes cannot contain namespace qualifiers, e.g. <l:link
  l:rel="http://purl.org/rss/1.0/modules/proposed/link/#permalink">
  would end up as [{l => ""}, {rel => "http…"}].
- Directives such as <![CDATA[ were not being recognized as "Special".
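The three symptoms above can be reproduced outside the parser with a minimal sketch like the one below. The patterns are my reading of the stock sgml-parser.rb, only the name-matching prefix of Attrfind is used, and the OLD_* constant names and sample strings are made up for this example. The CDATA case shown is one way the problem can appear, assuming the section text contains a "<".

```ruby
# Patterns as in the stock parser (an assumption; constant names are hypothetical).
OLD_TAGFIND  = /[a-zA-Z][a-zA-Z0-9.-]*/
OLD_ATTRNAME = /[\s,]*([a-zA-Z_][a-zA-Z_0-9.-]*)/
OLD_SPECIAL  = /<![^<>]*>/

# The tag name stops at the colon, so "language" is left over
# and later gets read as an attribute.
old_tag = "dc:language"[OLD_TAGFIND]

# The attribute name stops at the colon as well.
old_attr = ' l:rel="x"'[OLD_ATTRNAME, 1]

# A CDATA section whose text contains "<" never reaches the ">"
# that the pattern requires, so there is no match at all.
old_special = OLD_SPECIAL.match("<![CDATA[a < b]]>")

p old_tag      # "dc"
p old_attr     # "l"
p old_special  # nil
```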
Here are my diffs; all three items are working for me now.
17c17
< Special = /<![^<>]*>/
---
> Special = /<!\[/
20,21c20,21
< Tagfind = /[a-zA-Z][a-zA-Z0-9.-]*/
< Attrfind = Regexp.compile('[\s,]*([a-zA-Z_][a-zA-Z_0-9.-]*)' +
---
> Tagfind = /[a-zA-Z][-:.a-zA-Z0-9]*/
> Attrfind = Regexp.compile('[\s,]*([a-zA-Z_][-:.a-zA-Z_0-9]*)' +
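As a rough check of the revised patterns, here is a small standalone sketch. The NEW_* constant names and the sample strings are hypothetical, and only the name-matching prefix of Attrfind is reproduced. The hyphen is kept at the front of each character class so that it is read literally rather than as part of a range.

```ruby
# Revised patterns (constant names are made up for this example).
NEW_TAGFIND  = /[a-zA-Z][-:.a-zA-Z0-9]*/
NEW_ATTRNAME = /[\s,]*([a-zA-Z_][-:.a-zA-Z_0-9]*)/
NEW_SPECIAL  = /<!\[/

# The namespace prefix now stays attached to the tag name.
new_tag = "dc:language"[NEW_TAGFIND]

# The same holds for a qualified attribute name.
new_attr = ' l:rel="x"'[NEW_ATTRNAME, 1]

# Special now only looks for the opening "<![" and returns its offset.
cdata_at = "text <![CDATA[a < b]]>" =~ NEW_SPECIAL

p new_tag   # "dc:language"
p new_attr  # "l:rel"
p cdata_at  # 5
```

Note that the revised Special only finds where the directive opens; locating the closing "]]>" would still be up to the parser and is not part of this diff.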
Are these changes perceived as valuable? Any comments, gotchas?
-Jeff