Hi all,
I have the following HTML fragment.
I want to get the inner text of a <p> tag that contains an <img>
(<p> ... <img> </p>), not the text of a plain <p> </p> tag.
For example, in the fragment below I want to get the result
"this is fun". I don't want the result to include "NO FUN".
How can I do this with Hpricot?
Example HTML fragment:
----------------------
<p class=posted>
this is fun
<img src="" class="dhans"/>
</p>
<p class=posted>
NO FUN
</p>
I did not quite follow you. Do you want the text of the first <p> because it
contains an image?
Or what is the exact criterion for accepting or rejecting a <p>?
require 'hpricot'

doc = Hpricot %q{<p class=posted>
this is fun
<img src="" class="dhans"/>
</p>
<p class=posted>
NO FUN
</p>
<p class=posted>
fun again!
<img src=""/>
</p>
<p class=posted>
NO FUN AT ALL!
</p>
}
You will need Hpricot 0.4.84 because of inner_text. I did not experience
any difficulties installing it, so I can recommend it. If you don't want
to install it, you will have to roll your own inner_text, but I guess that
is not a big problem.
Which once again makes me wish paragraphs = doc/'//p[img]/text()'
worked. This could be doable if you asked Hpricot to provide you with
the REXML document (it's probably out of scope for the intentionally simple
XPath engine Hpricot uses natively), but unfortunately I can't for the
life of me figure out how to make REXML accept the final /text(), even
though the parser claims to support XPath 1.0 with a few exceptions,
and this one is not listed among them.
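For what it's worth, here is a minimal sketch of the REXML route that sidesteps the final /text() step: select the paragraphs with //p[img] and collect their text children in plain Ruby. This uses REXML directly rather than Hpricot, and it assumes the fragment has been rewritten as well-formed XML (quoted attributes, a wrapper <div> that is not in the original post), since REXML will not parse the loose HTML as posted.

```ruby
require "rexml/document"

# Assumption: the fragment is rewritten as well-formed XML (quoted
# attributes, one wrapper element) so that REXML can parse it.
xml = <<~XML
  <div>
    <p class="posted">
      this is fun
      <img src="" class="dhans"/>
    </p>
    <p class="posted">
      NO FUN
    </p>
  </div>
XML

doc = REXML::Document.new(xml)

# Select only the paragraphs that have an <img> child.
paragraphs = REXML::XPath.match(doc, "//p[img]")

# Join each paragraph's own text nodes and trim the surrounding whitespace.
texts = paragraphs.map do |para|
  para.texts.map(&:value).join.strip
end

puts texts
```

The filtering happens in XPath and the text extraction in Ruby, so nothing here depends on how the engine treats text().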
text in xpath should return a text node if present. For example:
(doc/"/html/body/div[1]/*/table[0]/tr[0]/*/b[9]/text")
Currently I am using the search and next_node:
doc.search("/html/body/div[1]/*/table[0]/tr[0]/td/b") { |x|
  @movie_plot = x.next_node.to_s.strip if x.inner_html == "Plot Outline:"
}
And receive:
Author: why
Message:
* lib/hpricot/elements.rb: added support for selecting text
nodes with text(): //p/text(), //p[a]//text(), etc.
* lib/hpricot/traverse.rb: ditto.
* lib/hpricot/tag.rb: the pathname method reports the path
fragment needed to get to this node.
* lib/hpricot/parse.rb: handle possible empty processing instruction.
http://code.whytheluckystiff.net/hpricot/changeset/87
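The search-and-next_node workaround above (find the label element, then take the node that follows it) can be sketched with REXML from the standard distribution as well. This is a rough analogue under stated assumptions, not Hpricot code: the markup and the plot text are made up for illustration, and next_sibling_node is REXML's method, standing in for Hpricot's next_node.

```ruby
require "rexml/document"

# Hypothetical markup, loosely modelled on the "Plot Outline:" label
# from the post; the plot text itself is invented.
xml = "<tr><td><b>Plot Outline:</b> A boy meets a fox. </td></tr>"
doc = REXML::Document.new(xml)

movie_plot = nil
REXML::XPath.each(doc, "//td/b") do |label|
  # Skip every <b> that is not the label we are looking for.
  next unless label.text == "Plot Outline:"

  # The node right after the label holds the value we want.
  sibling = label.next_sibling_node
  movie_plot = sibling.to_s.strip if sibling
end

puts movie_plot
```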
On 12/13/06, David Vallner <david@vallner.net> wrote:
> Which once again makes me wish paragraphs = doc/'//p[img]/text()'
> worked. This could be doable if you asked Hpricot to provide you with
> the REXML document (it's probably out of scope for the intentionally simple
> XPath engine Hpricot uses natively), but unfortunately I can't for the
> life of me figure out how to make REXML accept the final /text(), even
> though the parser claims to support XPath 1.0 with a few exceptions,
> and this one is not listed among them.
Thanks Peter,
Your solution worked. I just wanted to know: where can I find documentation for Hpricot syntax like the one you used in your solution?
Hmm, apart from what can be found on the Hpricot page, I am using:
1) rdoc, ri
2) p SomeHpricotClass.methods.sort
3) my kind-of-decent XPath knowledge
4) source code browsing (you don't have to be a pro; I am a newbie myself, and you can get a surprising amount from there)
5) common sense
6) ruby mailing list
Roughly in this order... A cheatsheet or something would be handy. Maybe there is already one somewhere?
Also, I'd take that in preference to point 3: writing an XPath-ish sort of
query and then hitting a syntax element the implementation happens not to
understand is rather infuriating. (Argh, REXML not supporting text() in a POLS
way, if at all.)
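Point 2 in the list above (listing a class's methods) can be sketched like this. String stands in for whichever Hpricot class you want to explore, so the example runs without the gem; the variable names are just illustrative.

```ruby
# Assumption: String is a stand-in for the class being explored,
# e.g. one of the Hpricot classes.
klass = String

# Instance methods defined directly on the class, sorted for easier reading.
own_instance_methods = klass.instance_methods(false).sort

# Narrow the list down with a pattern when you roughly know the name.
matching = own_instance_methods.grep(/strip/)

p matching
```

Passing false to instance_methods leaves out inherited methods, which keeps the list short enough to scan by eye.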
Sorry, I have been archiving ruby talk at rubytalk@gmail.com since 10/14/04.
Stephen Becker IV
On 12/16/06, David Vallner <david@vallner.net> wrote:
ruby talk wrote:
> Ask:
>
> http://code.whytheluckystiff.net/hpricot/ticket/32
>
> text in xpath should return a text node if present. For example:
> (doc/"/html/body/div[1]/*/table[0]/tr[0]/*/b[9]/text")
>
Well, it's 'text()' not 'text'. Luckily _why noticed.
> * lib/hpricot/elements.rb: added support for selecting text
> nodes with text(): //p/text(), //p[a]//text(), etc.
W00t
Thanks for pointing this out.
David Vallner
PS: Your email address name confuses the heck out of me. Please use
something that doesn't cause a mental namespace clash?