_why wrote:
[snip]
> A really fantastic HTML parser library is HTree by Tanaka Akira.
I'm glad you brought that in because I tried it last year and saw
that it was a serious "heavy horse" and perhaps a little bit
_more_ than I was looking for.
It was adding XHTML namespace prefixes to all tags, so a
horizontal rule, for example, became:
<{XHTML namespace}hr>
> It's completely forgiving of bad HTML and you can import the
> document into REXML through the HTree parser.
>
>   require 'htree'
>   HTree.parse( "<b>Bad markup" ).to_rexml
It may have been in its early stages of development, but my
assumption that HTree would be too strict is under review.
Applying your example, I get the result I was expecting
without all that namespace stuff.
> The only downside is that you'll need to install the iconv library,
> which can be a bit of a pain to track down on Windows.
Instead of hunting around for that, I made a dummy Ruby version
(no conversion functionality, for those who don't need any).
In my old version of HTree there are two files which require 'iconv'
("text.rb" and "encoder.rb"). In each, I changed that require to:
  begin
    require 'iconv'
  rescue LoadError
    require 'htree/iconv_dummy'
  end
Then, add this dummy file as:
lib\ruby\site_ruby\1.8\htree\iconv_dummy.rb
#-------------------------------------------------------------
class Iconv
  ## For testing : Not part of the HTree package ##
  warn "Using dummy iconv lib: #{__FILE__}"
  IC_DUMMY = true

  def Iconv.open(to, from)
    # Pass the encodings through; initialize requires both arguments.
    inst = Iconv.new(to, from)
    block_given? ? yield(inst) : inst
  end

  # No conversion is done: the strings are simply joined.
  def Iconv.iconv(to, from, *strs)
    strs.join
  end

  # No conversion is done: the string is returned unchanged.
  def Iconv.conv(to, from, str)
    str
  end

  def Iconv.list
    raise 'No Iconv.list'
  end

  def initialize(to, from)
  end

  def close
    ''
  end

  def iconv(str, strt = 0, len = -1)
    return '' if str.nil?   # a nil string means "flush"; nothing is buffered here
    len = str.size - strt if len.nil? || len < 0
    str[strt, len]
  end

  module Failure
    def initialize(*args) # 3
    end

    def success
    end

    def failed
    end

    def inspect
    end
  end

  # class InvalidEncoding < ArgumentError; end
  # class IllegalSequence < ArgumentError; end
  # class InvalidCharacter < ArgumentError; end
  # class OutOfRange < RuntimeError; end

  def Iconv.charset_map
    raise 'No Iconv.charset_map'
  end
end
#-------------------------------------------------------------
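
For anyone who wants to check which implementation was picked up, here is a
rough, self-contained sketch of the same fallback idea. The inline stub is a
trimmed stand-in for the dummy file above, and the dummy_iconv? helper is my
own invention for illustration, not part of HTree or iconv. It relies only on
the IC_DUMMY marker constant defined in the dummy class.

```ruby
# Sketch only: try the real iconv, fall back to an inline stub.
begin
  require 'iconv'
rescue LoadError
  # Trimmed stand-in for htree/iconv_dummy.rb (assumption: one method is
  # enough for this demonstration).
  class Iconv
    IC_DUMMY = true

    # Returns the string unchanged; no conversion takes place.
    def self.conv(to, from, str)
      str
    end
  end
end

# Hypothetical helper: true when the stub is in use, false when the
# real iconv library was loaded.
def dummy_iconv?
  Iconv.const_defined?(:IC_DUMMY)
end

# Plain ASCII text survives either way, since it is identical in both encodings.
converted = Iconv.conv('UTF-8', 'ISO-8859-1', 'plain ascii')

puts "dummy iconv in use: #{dummy_iconv?}"
puts converted
```

Bear in mind that with the dummy in place, non-ASCII text goes through
unconverted, so this only suits documents where no real transcoding is needed.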
> _why
daz