I've been going through a similar situation with my current project. I
was initially using Hpricot, and was very frustrated by the lack of
documentation and some of the lingering bugs. I've now switched to
nokogiri and have been very impressed with it.
I'm now running into some of the robustness issues you face when
processing data from the open web, as Dan alluded to. I'm using
Nokogiri's SAX implementation, and I've run into some problems with
handling HTML entities, whether they are preserved or decoded into UTF-8.
In both cases, Nokogiri stops calling my start and end element
handlers after an entity is seen, but continues to call my characters
handler. Specifically, I've noticed this behavior when it sees and
…. Has anyone else experienced this, and do you have any advice to share?
I appreciate it!
-lance
(here's my code)
require 'nokogiri'
require 'tidy'

# Reopens Nokogiri's SAX document class to collect a stripped-down copy
# of the markup in @rhtml.
class Nokogiri::XML::SAX::Document
  attr_accessor :rhtml

  def initialize
    @rhtml = ""
    @keep_text = true
    @keep_elements = %w{ br p img ul ol title li div table head body
                         meta base blockquote }
  end

  def start_element name, attrs = []
    puts "start element called: " + name
    if @keep_elements.include?(name)
      puts "keeping: #{name}"
      @rhtml << "<#{name}>\n"
    end
    # Drop text inside script and style elements.
    if ['script', 'style'].include? name
      @keep_text = false
    end
  end

  def characters text
    #@rhtml << @coder.decode( text ) if @keep_text
    @rhtml << text if @keep_text
    puts text
  end

  def end_element name
    puts "end element called: " + name
    if @keep_elements.include?(name)
      @rhtml << "</#{name}>\n"
    end
    if ['script', 'style'].include? name
      @keep_text = true
    end
  end
end

html = File.read(ARGV[0])
#coder = HTMLEntities.new
#html = coder.decode(html)

Tidy.path = '/usr/lib/libtidy-0.99.so.0'
xml = Tidy.open(:show_warnings=>true) do |tidy|
  tidy.options.output_xml = true
  #tidy.options.char_encoding = 'utf8'
  tidy.options.preserve_entities = true
  xml = tidy.clean(html)
end

doc = Nokogiri::XML::SAX::Document.new
parser = Nokogiri::XML::SAX::Parser.new(doc)
parser.parse(xml)

puts "doc:"
puts doc.rhtml.gsub(/\n+/, "\n")
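
One thing that might be worth trying, sketched here only as a guess: if the trouble comes from HTML-only named entities that an XML parser has no definition for, rewriting them as numeric character references before parsing could sidestep it. The helper name numeric_entities and the small entity table below are hypothetical and purely illustrative; they are not part of Nokogiri or Tidy, and I haven't confirmed this is the actual cause.

```ruby
# Hypothetical helper (not part of Nokogiri or Tidy): rewrite a few
# HTML-only named entities as numeric character references, which any
# XML parser can resolve without a DTD.

# Illustrative subset only; a real table would be much larger.
HTML_ENTITY_CODEPOINTS = {
  'nbsp'   => 160,
  'hellip' => 8230,
  'mdash'  => 8212,
  'copy'   => 169
}.freeze

# The five entities XML predefines; these are left untouched.
XML_PREDEFINED = %w[amp lt gt quot apos].freeze

def numeric_entities(markup)
  markup.gsub(/&([A-Za-z][A-Za-z0-9]*);/) do |match|
    name = Regexp.last_match(1)
    if XML_PREDEFINED.include?(name)
      match
    elsif HTML_ENTITY_CODEPOINTS.key?(name)
      format('&#%d;', HTML_ENTITY_CODEPOINTS[name])
    else
      match # unknown name: leave it for the parser to report
    end
  end
end

puts numeric_entities('a&nbsp;b &amp; c&hellip;')
```

In the script above this would be applied to the Tidy output just before parser.parse(xml). Nokogiri also ships a Nokogiri::HTML::SAX::Parser, which may cope better with HTML entities than the XML one, though I haven't verified that it behaves differently here.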
--
Posted via http://www.ruby-forum.com/.