Best way to parse/update HTML file?

Bucco · 25 June 2005 01:15

Sorry for the newbie question. I am trying to find the best method for
parsing an HTML file and changing one tag/item. Unfortunately, REXML
chokes on the file because of the incomplete tags. Completing the tags
is not an option either. What is the best way to find a specific tag
in an HTML file and change its text and attribute settings?

Thanks:)

SA

daz · 26 June 2005 09:40

Bucco wrote:

Sorry for the newbie question.

This has been answered once or twice before by this group.

I am trying to find the best method for parsing an HTML file
and changing one tag/item. Unfortunately, REXML chokes on
the file because of the incomplete tags. Completing the tags
is not an option either. What is the best way to find a specific
tag in an HTML file and change its text and attribute settings?

Thanks:)

SA

The best way tends to involve using a package although
you /could/ work your way through it using regular expressions.
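
For a one-off change, the regular-expression route might look roughly like the sketch below. It uses only core Ruby; the `InputURL` id is borrowed from the example further down, and the replacement value is invented for illustration. Regexes like these are fragile on real-world HTML, so treat it as a starting point rather than a general solution.

```ruby
# Sketch only: assumes double-quoted attributes and no '>' inside them.
html = <<EOS
<form method="post">
  <input id="InputURL" type="text" size="40" name="url" value="">
  <input type="submit" name="action" value="Create">
</form>
EOS

# Find the one <input ...> tag whose id is InputURL ...
updated = html.sub(/<input\b[^>]*\bid="InputURL"[^>]*>/) do |tag|
  # ... and rewrite only its value attribute, leaving the rest alone.
  tag.sub(/\bvalue="[^"]*"/, 'value="http://example.com/"')
end

puts updated
```

The unclosed `<input>` tags are left exactly as they were, which matters when "completing the tag" is not an option.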

If you're likely to be doing this kind of thing in the
future, you'll be glad you spent a bit of time installing
one; then it's always available.

As I recall, a different package is often
recommended but I don't know which is best.

This is what some of us use:
http://ruby-htmltools.rubyforge.org/ (Ned Konz +)

Examples are included but here's another ...

#-----------------------------------------------------------------
EXAMPLE = <<EOX
<html lang="en">
  <head>
    <title>Page title</title>
  </head>
  <body>
    <div id="Header">
      <h1><a href="xxxx.net - This website is for sale! - xxxx Resources and Information.;
      <p>For When You Want a Quick URL</p>
    </div>
    <hr>
    <div id="Content">
      <form action="xxxx.net - This website is for sale! - xxxx Resources and Information.; method="post">
        <fieldset>
          <legend>Enter a <abbr title="Uniform Resource Locator">URL</abbr> to make into a xxxx:</legend>
          <input id="InputURL" type="text" size="40" maxlength="65535" name="url" value="">
          <input type="submit" name="action" value="Create xxxx">
        </fieldset>
      </form>
    </div>
    <hr>
    <a href="xxxx.net - This website is for sale! - xxxx Resources and Information.; -
    <a href="xxxx.net - This website is for sale! - xxxx Resources and Information.; -
    <a href="xxxx.net - This website is for sale! - xxxx Resources and Information.; -
    <a href="xxxx.net - This website is for sale! - xxxx Resources and Information. of Use</a> -
    <a href="xxxx.net - This website is for sale! - xxxx Resources and Information.; -
    <a href="xxxx.net - This website is for sale! - xxxx Resources and Information.;
  </body>
</html>
EOX

require 'html/tree' # http://ruby-htmltools.rubyforge.org/

verbose = true

exa = HTMLTree::Parser.new(verbose, !false)
#exa.parse_file_named('xxxx_net.html')
exa.feed(EXAMPLE) # replaces '.parse_file_named'

item_a = exa.html.select {|ea| ea.tag == 'a'}
item_a.each {|ea| p [:ahref, ea['href']]}
puts '+'*100

exa.html.each do |ea|
  p [ea.tag, ea['href']]
  ea.each do |item|
    if item.data?
      p [:data, item.to_s]
    elsif item.tag == 'a'
      item['href'].sub!(/xxxx/, 'mysite')
    end
  end
  puts '='*100
  ea.dump
end

### exa.html.dump

#-----------------------------------------------------------------

Output from the script above is too long to post here,
so I've uploaded it to:
http://www.d10.karoo.net/ruby/example_html_parse.txt

Hope this is of some use,

daz
--
JARH (Nihon-style) qurl.net - This website is for sale! - qurl Resources and Information.

mathew · 27 June 2005 16:45

Bucco wrote:

Sorry for the newbie question. I am trying to find the best method for
parsing an HTML file and changing one tag/item. Unfortunately, REXML
chokes on the file because of the incomplete tags. Completing the tags
is not an option either. What is the best way to find a specific tag
in an HTML file and change its text and attribute settings?

For invalid "tag soup" HTML, your best bet is probably to use html/htmltokenizer.

<URL:http://rubyforge.org/projects/htmltokenizer/>

It'll search for specified 'tags', returning the text skipped over, which you can put into a buffer. Then you can get the attributes of the 'tag', and modify them, and put the result in the buffer. Finally, you can slurp in the rest of the pseudo-HTML.
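
The buffer-based flow described above can be sketched with the standard library's `StringScanner`. This is a stand-in for the technique, not htmltokenizer's own API; the sample markup and the `new.html` value are invented for illustration.

```ruby
require 'strscan'

# Deliberately invalid "tag soup": unclosed <p> and <br>.
soup = '<p>Intro<br><a href="old.html">link</a><p>Unclosed tail'

scanner = StringScanner.new(soup)
buffer  = String.new

# Copy everything up to (but not including) the first <a ...> tag.
if (skipped = scanner.scan_until(/(?=<a\b)/))
  buffer << skipped
  # Take the tag itself, modify its attribute, and append the result.
  tag = scanner.scan(/<a\b[^>]*>/)
  buffer << tag.sub(/href="[^"]*"/, 'href="new.html"')
end

# "Slurp" whatever is left, untouched.
buffer << scanner.rest

puts buffer
```

Nothing outside the one tag is parsed or normalized, so the broken markup around it passes through unchanged.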

mathew

why_the_lucky_stiff1 · 28 June 2005 20:27

Bucco wrote:

Sorry for the newbie question. I am trying to find the best method for
parsing an HTML file and changing one tag/item. Unfortunately, REXML
chokes on the file because of the incomplete tags. Completing the tags
is not an option either. What is the best way to find a specific tag
in an HTML file and change its text and attribute settings?

Hi. I know I'm a bit late to the discussion, so 'sokay if you have an answer already.

A really fantastic HTML parser library is HTree by Tanaka Akira.

<http://cvs.m17n.org/~akr/htree/>

It's completely forgiving of bad HTML and you can import the document into REXML through the HTree parser.

require 'htree'
HTree.parse( "<b>Bad markup" ).to_rexml

The only downside is that you'll need to install the iconv library, which can be a bit of a pain to track down on Windows. Other than that, it's a great package.

_why

Brad_Wilson · 27 June 2005 19:35

If you're comfortable "cleaning it up", why not tidy it to XHTML then
use the XML parser? This is the approach I took recently when I needed
it.
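
Once the markup is well-formed, the REXML half of that approach might look like the sketch below. It assumes the tidy step has already been run; the XHTML snippet, the XPath expression, and the new values are all made up for illustration.

```ruby
require 'rexml/document'

# Assumes this is the already-tidied, well-formed XHTML.
xhtml = <<EOS
<html>
  <body>
    <div id="Header"><h1><a href="old.html">Old text</a></h1></div>
    <hr />
  </body>
</html>
EOS

doc  = REXML::Document.new(xhtml)
link = REXML::XPath.first(doc, '//div[@id="Header"]//a')

link.attributes['href'] = 'new.html'  # change an attribute
link.text = 'New text'                # change the element's text

puts doc.to_s
```

Note that the output is the tidied document, not the original tag soup, so this only suits cases where cleaning up the file is acceptable.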

On 6/27/05, mathew <meta@pobox.com> wrote:

For invalid "tag soup" HTML, your best bet is probably to use
html/htmltokenizer.

daz · 29 June 2005 07:00

_why wrote:

[snip]

A really fantastic HTML parser library is HTree by Tanaka Akira.

I'm glad you brought that in because I tried it last year and saw
that it was a serious "heavy horse" and perhaps a little bit
_more_ than I was looking for.
It was adding XHTML namespace prefixes to all tags, so a
horizontal rule, for example, became:

<{XHTML namespace}hr>

It's completely forgiving of bad HTML and you can import the
document into REXML through the HTree parser.

require 'htree'
HTree.parse( "<b>Bad markup" ).to_rexml

It may have been in its early stages of development, but my
assumption that HTree would be too strict is under review.
Applying your example, I get the result I was expecting
without all that namespace stuff.

The only downside is that you'll need to install the iconv library,
which can be a bit of a pain to track down on Windows.

Instead of hunting around for that, I'd made a dummy Ruby version
(no functionality for those who don't need any):

In my old version there are two files which require 'iconv' -
("text.rb" and "encoder.rb") which I changed to:

  begin
    require 'iconv'
  rescue LoadError
    require 'htree/iconv_dummy'
  end

Then, add this dummy file as:
lib\ruby\site_ruby\1.8\htree\iconv_dummy.rb

#-------------------------------------------------------------
class Iconv

  ## For testing : Not part of the HTree package ##
  warn "Using dummy iconv lib: #{__FILE__}"
  IC_DUMMY = true

  def Iconv.open(to, from)
    inst = Iconv.new
    block_given? ? yield(inst) : inst
  end
  def Iconv.iconv(to, from, *strs)
    strs.join
  end
  def Iconv.conv(to, from, str)
    str
  end
  def Iconv.list
    raise 'No Iconv.list'
  end
  def initialize(to, from)
  end
  def close
    ''
  end
  def iconv(str, strt = 0, len = -1)
    (len and !( len < 0 )) or len = str.size - strt
    str[strt, len]
  end

  module Failure
    def initialize(*args) # 3
    end
    def success
    end
    def failed
    end
    def inspect
    end
  end

# class InvalidEncoding < ArgumentError; end
# class IllegalSequence < ArgumentError; end
# class InvalidCharacter < ArgumentError; end
# class OutOfRange < RuntimeError; end

  def Iconv.charset_map
    raise 'No Iconv.charset_map'
  end
end
#-------------------------------------------------------------

daz

Bill_Guindon1 · 9 July 2005 13:17

Bucco wrote:

>Sorry for the newbie question. I am trying to find the best method for
>parsing an HTML file and changing one tag/item. Unfortunately, REXML
>chokes on the file because of the incomplete tags. Completing the tags
>is not an option either. What is the best way to find a specific tag
>in an HTML file and change its text and attribute settings?
>
Hi. I know I'm a bit late to the discussion, so 'sokay if you have an
answer already.

A really fantastic HTML parser library is HTree by Tanaka Akira.

  <http://cvs.m17n.org/~akr/htree/>

It's completely forgiving of bad HTML and you can import the document
into REXML through the HTree parser.

  require 'htree'
  HTree.parse( "<b>Bad markup" ).to_rexml

The only downside is that you'll need to install the iconv library,
which can be a bit of a pain to track down on Windows. Other than that,
it's a great package.

There's a page on the Rails site that covers the iconv installation on Windows:
http://wiki.rubyonrails.com/rails/show/iconv

Once I had the iconv.so in a library path, and iconv.dll in
windows\system32, I ran the test-all.rb. Got an error due to a lack
of /dev/null, but that was fixed by creating a dev directory, and
adding an empty 'null' file to it.

Should swap that out to have it point to a temp dir, but with that
setup, all of the htree tests passed.
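
As a side note on the /dev/null problem: later Ruby releases expose a portable name for the null device as `File::NULL`. Whether htree's test script can be pointed at it is an assumption; the sketch below only shows the constant in use.

```ruby
# File::NULL is "/dev/null" on Unix-like systems and "NUL" on Windows.
File.open(File::NULL, 'w') do |sink|
  sink.puts 'discarded output'   # written to the null device, i.e. thrown away
end

puts File::NULL
```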

On 6/28/05, why the lucky stiff <ruby-talk@whytheluckystiff.net> wrote:

_why

--
Bill Guindon (aka aGorilla)

Bucco · 28 June 2005 00:20

I have a couple of more questions then:

1. I tried the example for the htmltokenizer and got an error around
assert. Where/what is the assert method?

2. What do you mean by "slurp" in the rest of the text?

3. Any better examples how to use htmltokenizer?

Thanks:)
SA

mathew · 28 June 2005 16:25

Bucco wrote:

1. I tried the example for the htmltokenizer and got an error around
assert. Where/what is the assert method?

An error around "assert" is likely an internal error of some kind. Assertions are pieces of code placed in software to detect invalid arguments to methods, internal data structure inconsistencies, and so on.

For example, consider the Ruby URI library. It doesn't support all kinds of URI. So, it would be a good idea if it were to assert that the URI it is being passed is one of the kinds it actually knows how to parse. That way, someone innocently using the library with the wrong kind of URI will discover the problem immediately, rather than being passed back bad data, or having some bizarre error occur in the middle of the library code.

So it could be that you're passing an invalid argument to a method of htmltokenizer. It's also possible that you're triggering a bug in the library.
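
To make the idea concrete, here is a minimal sketch of an assertion-style argument check. The `checked_http_uri` helper is hypothetical (it is not part of the URI library or of htmltokenizer); it simply refuses any URI kind it does not know how to handle, raising `ArgumentError` in the role of the assertion.

```ruby
require 'uri'

# Hypothetical helper: accept only http/https URIs and fail loudly
# on anything else, instead of returning bad data later.
def checked_http_uri(str)
  uri = URI.parse(str)
  unless uri.is_a?(URI::HTTP)   # URI::HTTPS is a subclass of URI::HTTP
    raise ArgumentError, "unsupported URI scheme: #{uri.scheme.inspect}"
  end
  uri
end

puts checked_http_uri('http://example.com/page').host

begin
  checked_http_uri('mailto:someone@example.com')
rescue ArgumentError => e
  puts "rejected: #{e.message}"
end
```

The caller finds out about the wrong kind of URI at the point of the call, which is the behaviour described above.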

2. What do you mean by "slurp" in the rest of the text?

"slurp" meaning "pull in the entire content of the file from the current file pointer onwards, without performing any processing on it".

As in:

  file = File.new("something.gif")
  data = file.read # slurp!
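
The "from the current file pointer onwards" part is easy to miss, so here is a small self-contained sketch. `StringIO` stands in for a real file, and the sample text is invented.

```ruby
require 'stringio'

# StringIO behaves like an open file for the purposes of this sketch.
io = StringIO.new("header line\nbody line 1\nbody line 2\n")

header = io.gets   # process the first line normally
rest   = io.read   # slurp: everything from the current position to the end

puts header
puts rest
```

Because one line has already been consumed, `read` returns only what follows it.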

<URL:http://www.retrologic.com/jargon/S/slurp.html>

3. Any better examples how to use htmltokenizer?

require 'html/htmltokenizer'

#[...]

     # Parse all the images and links out of the web page
     tokenizer = HTMLTokenizer.new(@body)
     @images = Array.new
     @links = Array.new
     lastlink = ''
     while tag = tokenizer.getTag('img', 'a')
       if tag.tag_name == 'img'
         url = tag.attr_hash['src']
         uri = @uri.merge(url)
         @images.push([uri.to_s, lastlink])
       else
         url = tag.attr_hash['href']
         uri = @uri.merge(url)
         @links.push(uri.to_s)
         lastlink = uri.to_s
       end
     end

That's the only time I've used it, I'm afraid. Still, it might give you some ideas.

mathew
