Hpricot innerTEXT?

Hi

I'm using hpricot to parse the following file.

<item
rdf:about="http://del.icio.us/url/50666d1a3fe2b942b20819ec2919d2b7#morwyn">
<title>[from morwyn] * HTML for the Conceptually Challenged</title>
<link>http://del.icio.us/url/50666d1a3fe2b942b20819ec2919d2b7#morwyn</link>
<description>HTML for the Conceptually Challenged. Very basic tutorial,
plainly worded for people who hate to read instructions.</description>
<dc:creator>morwyn</dc:creator>
<dc:date>2006-10-10T07:28:28Z</dc:date>
<dc:subject>html imported webpagedesign</dc:subject>
<taxo:topics>
  <rdf:Bag>
    <rdf:li resource="http://del.icio.us/tag/imported" />
    <rdf:li resource="http://del.icio.us/tag/html" />
    <rdf:li resource="http://del.icio.us/tag/webpagedesign" />
  </rdf:Bag>
</taxo:topics>
</item>

I'm trying to get the content from <dc:subject> like this

doc = Hpricot.parse(File.read("965.xhtml"))

(doc/"item").each do |t|

  puts (t/"dc:subject").innerTEXT

end

but I got

<dc:subject>html internet tutorial web</dc:subject>

while I only need "html internet tutorial web"

Does anyone know the right method to call?

Thanks

···

--
Posted via http://www.ruby-forum.com/.

replace innerTEXT by inner_html:

(doc/"item").each do |t|
   puts (t/"dc:subject").inner_html
end

regards
Lionel

···

Lionel Orry wrote:

> replace innerTEXT by inner_html:
>
> (doc/"item").each do |t|
>   puts (t/"dc:subject").inner_html
> end
>
> regards
> Lionel

Thanks for your response, but I still get
<dc:subject>html internet tutorial web</dc:subject>


In fact, inner_text works as well. But you should have a look at the
warnings from ruby! The inner_text or inner_html method is applied
to the return value of 'puts (t/"dc:subject")', which is nil.
So a warning appears:
rdf.rb:6: undefined method `inner_html' for nil:NilClass (NoMethodError)

but 'puts (t/"dc:subject")' is executed, and so '<dc:subject>html
internet tutorial web</dc:subject>' is displayed anyway. Therefore I
recommend using a few parentheses there:

puts((t/"dc:subject").inner_text)

and it should work well this time.

Next time, look at the warnings! ;)

regards
Lionel

···

> In fact, inner_text works as well. But you should have a look at the
> warnings from ruby! The inner_text or inner_html function is applied
> to 'puts (t/"dc:subject")' return object, which is nil.
> So a warning appears:
> rdf.rb:6: undefined method `inner_html' for nil:NilClass
> (NoMethodError)

That's not a warning, that's an exception, and the program will terminate at
that point. The OP didn't mention any errors.

> but 'puts (t/"dc:subject")' is executed, and so '<dc:subject>html
> internet tutorial web</dc:subject>' is displayed anyway. Therefore I
> recommend using a few parentheses there:
>
> puts((t/"dc:subject").inner_text)
>
> and it should work well this time.
>
> Next time, look at the warnings!!! :wink:

Good point, but it was OK the way he wrote it, with a space after puts.

irb(main):003:0> p (1+3).to_s
"4"
=> nil
irb(main):004:0> p(1+3).to_s
4
=> ""

In the first case, this is p( (1+3).to_s )

In the second case, this is ( p(1+3) ).to_s # i.e. nil.to_s
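The two parses can be checked without relying on what irb echoes back. This is only a minimal sketch: `show` is a made-up stand-in that returns nil the way `p` did on Ruby 1.8 (newer Rubies return the argument from `p`, so the second irb result above would look different there), and `SEEN` and `RESULT` are names chosen for illustration.

```ruby
# Hypothetical stand-in for 1.8's `p`: records its argument and returns nil.
SEEN = []

def show(value)
  SEEN << value
  nil
end

# With a space, the grouped expression plus the method call is the argument:
# parsed as show((1 + 3).to_s)
show (1 + 3).to_s

# Without a space, the parentheses are the argument list:
# parsed as (show(1 + 3)).to_s, i.e. nil.to_s
RESULT = show(1 + 3).to_s

p SEEN    # the first call received the string "4", the second the number 4
p RESULT  # nil.to_s is the empty string
```

Run with warnings enabled, some Ruby versions may also complain about the space before the parenthesis on the first call.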

···

> That's not a warning, that's an exception, and the program will terminate at
> that point. The OP didn't mention any errors.

Indeed, I used the term 'warning' very loosely; I apologize for that.
This is an exception and nothing else.
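To make the distinction concrete: calling a method on nil raises NoMethodError, which ends the program unless something rescues it, whereas a warning only prints a message and execution carries on. A minimal sketch, where `call_on_nil` and `OUTCOME` are names made up for illustration:

```ruby
# Hypothetical helper: calls a missing method on nil and reports what happened.
def call_on_nil
  nil.inner_text          # nil has no such method, so this line raises
  :no_error               # never reached
rescue NoMethodError => e
  e.class                 # reached only because the exception is rescued here
end

OUTCOME = call_on_nil
puts OUTCOME              # prints the exception class name
```

Without the rescue clause the script would stop at the first line of the method body.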

> but 'puts (t/"dc:subject")' is executed, and so '<dc:subject>html
> internet tutorial web</dc:subject>' is displayed anyway. Therefore I
> recommend using a few parentheses there:

> puts((t/"dc:subject").inner_text)

> and it should work well this time.

> Next time, look at the warnings!!! :wink:

> Good point, but it was OK the way he wrote it, with a space after puts.
>
> irb(main):003:0> p (1+3).to_s
> "4"
> => nil
> irb(main):004:0> p(1+3).to_s
> 4
> => ""
>
> In the first case, this is p( (1+3).to_s )
>
> In the second case, this is ( p(1+3) ).to_s # i.e. nil.to_s

Hmm, interesting. It seems that the problem only arises inside a
block:

# output text in comments...
require 'hpricot'

doc = Hpricot(File.open("rdf.xhtml"))

puts (doc/"item"/"dc:subject").inner_text
# html imported webpagedesign

(doc/"item").each do |t|
   puts((t/"dc:subject").inner_text)
end
# html imported webpagedesign

(doc/"item").each do |t|
   puts (t/"dc:subject").inner_text
end
# <dc:subject>html imported webpagedesign</dc:subject>
# rdf.rb:12: warning: don't put space before argument parentheses
# rdf.rb:12: undefined method `inner_text' for nil:NilClass (NoMethodError)
# from rdf.rb:11:in `each'
# from rdf.rb:11

I am wondering what the difference is between the last two blocks.
Any ideas?

Lionel

···

Hmm, looks like this should be something that can be replicated without
hpricot.

$ cat x.rb
x = 3
puts (x-5).abs

1.times do
  puts (x-5).abs
end
$ ruby -v
ruby 1.8.4 (2005-12-24) [i486-linux]
$ ruby x.rb
x.rb:5: warning: don't put space before argument parentheses
2
-2
x.rb:5: undefined method `abs' for nil:NilClass (NoMethodError)
        from x.rb:4
$

Congratulations, I think you've found a bug in the parser :) I'll post this
example to ruby-core.

Regards,

Brian.

···

Thanks for your help. I have the same output with this version:

ruby 1.8.6 (2007-03-13 patchlevel 0) [i386-mswin32]

regards,
Lionel

···

Inside a do-end or {} block, use this:
puts((x - 5).abs)
It is more explicit, but it is correct and it works.

So

(doc/"item").each do |t|
   puts (t/"dc:subject").inner_html
end

will work when written as

(doc/"item").each do |t|
  puts((t/"dc:subject").inner_html)
end
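The same idea applied to the hpricot-free example from earlier in the thread, as a minimal sketch. `emit` is a made-up stand-in for puts that records what it was given, so the result can be inspected; `LINES` and `distance` are likewise names chosen for illustration.

```ruby
LINES = []

# Hypothetical stand-in for puts: records its argument and returns nil.
def emit(value)
  LINES << value
  nil
end

x = 3

1.times do
  # The outer parentheses belong to the call, the inner ones group the expression.
  emit((x - 5).abs)

  # Alternative: name the value first, so no parenthesis follows the method name.
  distance = (x - 5).abs
  emit distance
end

p LINES
```

Either form leaves nothing for the parser to guess about where the argument list starts, whether inside or outside a block.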

···

On Apr 13, 2007, at 10:48 PM, Brian Candler wrote:

On Fri, Apr 13, 2007 at 10:40:05PM +0900, chickenkiller wrote:

I prefer this version for the initial problem:

irb(main):045:0> elements = doc.search('dc:subject/text()')
=> #<Hpricot::Elements["html imported webpagedesign"]>

irb(main):048:0> elements.first.to_s
=> "html imported webpagedesign"
irb(main):049:0> elements.first.parent
=> {elem <dc:subject> "html imported webpagedesign" </dc:subject>}
