A simple Hpricot text setter

If anyone is trying to use Hpricot to clean up the actual content of a site while leaving the markup alone, they might find the following tiny method useful:

class Hpricot::Text
  # Adds a simple Hpricot method to change
  # the text embedded in an HTML document
  #
  # Example of use:
  #   body.traverse_text do |text|
  #     text_out = text.to_s
  #     # manipulate text_out
  #     text.set(text_out)
  #   end
  def set(string)
    @content = string
    self.raw_string = string
  end
end

The trick is to set both @content in Hpricot::Text and @raw_string in its parent.
--
The folly of mistaking a paradox for a discovery, a metaphor for a proof, a torrent of verbiage for a spring of capital truths, and oneself for an oracle, is inborn in us.
-Paul Valery, poet and philosopher (1871-1945)

You can also use Elements#inner_html= and Element#inner_html= for this.

  (body/:a).inner_html = "New Link Text"

Also: set, html, remove, append, prepend, before, after, and wrap, which all
work just like their jQuery cousins.[1]

Thank you for using Hpricot; it helps all the horses' hearts when you do.

_why

[1] jQuery API Documentation

···

On Fri, Aug 11, 2006 at 03:19:13AM +0900, Chris Gehlker wrote:

If anyone is trying to use Hpricot to clean up the actual content of
a site while leaving the markup alone, they might find the following
tiny method useful:

class Hpricot::Text
# Adds a simple Hpricot method to change
# the text embedded in an HTML document
#
# Example of use:
# body.traverse_text do |text|
# text_out = text.to_s
# manipulate text_out
# text.set(text_out)
# end
  def set(string)
    @content = string
    self.raw_string = string
  end
end

Thanks for responding, why, and thanks very much for Hpricot.

I'm a long way from completely understanding Hpricot, but I did try to use inner_html in what I thought was the correct way.

Here is a little sample program:

require 'rubygems'
require_gem 'hpricot'

doc = Hpricot(open('TestFile.html'))
body = doc.search('body')
body.each {|elmnt| elmnt.inner_html}
body.inner_html
(body/:a).inner_html = "New Link Text"
puts doc

The output is:
testHpricot.rb:6: undefined method `inner_html' for #<Hpricot::Elem:0x7546bc> (NoMethodError)
         from testHpricot.rb:6:in `each'
         from testHpricot.rb:6

If I comment out the body.each... line I get:

testHpricot.rb:7: undefined method `inner_html' for #<Hpricot::Elements:0x753d48> (NoMethodError)

If I comment out that line, I get:

testHpricot.rb:8: undefined method `inner_html=' for :Array (NoMethodError)

What may be related is that the file text.rb is at:
/usr/local/lib/ruby/gems/1.8/gems/hpricot-0.3/lib/hpricot/text.rb
but it is not actually being required anywhere in Hpricot. When I tried to require it manually, I found that it was requiring files that the gem didn't give me. This is all in Hpricot 0.3.

Thanks again for both your time and Hpricot.

···

On Aug 11, 2006, at 5:20 PM, why the lucky stiff wrote:

On Fri, Aug 11, 2006 at 03:19:13AM +0900, Chris Gehlker wrote:

If anyone is trying to use Hpricot to clean up the actual content of
a site while leaving the markup alone, they might find the following
tiny method useful:

class Hpricot::Text
# Adds a simple Hpricot method to change
# the text embedded in an HTML document
#
# Example of use:
# body.traverse_text do |text|
# text_out = text.to_s
# manipulate text_out
# text.set(text_out)
# end
  def set(string)
    @content = string
    self.raw_string = string
  end
end

You can also use Elements#inner_html= and Element#inner_html= for this.

  (body/:a).inner_html = "New Link Text"

Also: set, html, remove, append, prepend, before, after, and wrap, which all
work just like their jQuery cousins.[1]

--
Seven Deadly Sins? I thought it was a to-do list!

Okay, yeah, you'll need the latest Hpricot (0.4.43):

  gem install hpricot --source code.whytheluckystiff.net

Also, don't forget to remove `require_gem 'hpricot'` and use, instead,
`require 'hpricot'`.

_why

···

On Sat, Aug 12, 2006 at 11:23:14AM +0900, Chris Gehlker wrote:

What may be related is that the file text.rb is at:
/usr/local/lib/ruby/gems/1.8/gems/hpricot-0.3/lib/hpricot/text.rb
but it is not actually being required anywhere in Hpricot. When I
tried to require it manually, I found that it was requiring files
that the gem didn't give me. This is all in Hpricot 0.3.

You seem to be making great progress with Hpricot, committing changes every day.

Yep, 'require_gem' no longer works. Just using 'require' seems better.

I don't know that I communicated my idea behind adding a set method for Hpricot::Text. There are times when one wants to scan and potentially change everything that's *not* markup. The markup should be left unchanged or modified only in trivial ways, such as changing the order of attribute declarations.

Hpricot::Traverse#traverse_text is great for finding all the stuff that's *not* markup, the pcdata, in an HTML file. I just added a method to change that data.

You suggested using inner_html=, but the only way I can see that working is to walk the tree looking for those elements which have only Hpricot::Text children and then use inner_html= on them. That would essentially mean recreating Hpricot::Traverse#traverse_text to find such elements, although the common code could mostly be factored out.
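
To make the idea concrete, here is a minimal sketch in plain Ruby. It does not use Hpricot at all: TextNode and ElemNode are hypothetical stand-ins for a parsed document, and their traverse_text and set methods only imitate the behaviour described above. It is meant to show the shape of the approach, not how Hpricot implements it.

```ruby
# Toy stand-ins for a parsed document. These are NOT Hpricot classes;
# they only imitate traverse_text and the proposed set method.
TextNode = Struct.new(:content) do
  # Replace the text in place, analogous to the proposed Hpricot::Text#set.
  def set(string)
    self.content = string
  end

  def to_s
    content
  end
end

ElemNode = Struct.new(:name, :children) do
  # Yield every text node in document order, descending into child elements.
  def traverse_text(&block)
    children.each do |child|
      if child.is_a?(TextNode)
        block.call(child)
      else
        child.traverse_text(&block)
      end
    end
  end

  def to_s
    "<#{name}>#{children.map(&:to_s).join}</#{name}>"
  end
end

# A body with mixed content: text, an element, then more text.
body = ElemNode.new("body", [
  TextNode.new("hello "),
  ElemNode.new("a", [TextNode.new("link text")]),
  TextNode.new(" bye")
])

# Change every run of text; element names and nesting are never touched.
body.traverse_text { |text| text.set(text.to_s.upcase) }

result = body.to_s
puts result
```

In this toy tree the body element has mixed children, so a search limited to elements whose children are all text would reach only the text inside the link, while the text-node walk reaches all three runs.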

···

On Aug 14, 2006, at 9:29 AM, why the lucky stiff wrote:

Okay, yeah, you'll need the latest Hpricot (0.4.43):

  gem install hpricot --source code.whytheluckystiff.net

Also, don't forget to remove `require_gem 'hpricot'` and use, instead,
`require 'hpricot'`.

_why

--
And those who were seen dancing were thought to be insane by those who could not hear the music.
-Friedrich Wilhelm Nietzsche, philosopher (1844-1900)

Okay, I get it. I guess I need to get //div[contains(text(), '...')]
working. Be assured, traverse_text will stick around.

_why

···

On Wed, Aug 16, 2006 at 01:04:46PM +0900, Chris Gehlker wrote:

Hpricot::Traverse#traverse_text is great for finding all the stuff
that's *not* markup, the pcdata, in an HTML file. I just added a
method to change that data.

Okay, I get it. I guess I need to get //div[contains(text(), '...')]
working.

Works for me!

Be assured, traverse_text will stick around.

Thanks why!

···

On Aug 16, 2006, at 12:25 PM, why the lucky stiff wrote:
--
Egotism is the anesthetic that dulls the pain of stupidity.
-Frank William Leahy, football coach (1908-1973)