Why does the #content method in Nokogiri not print the full text?

7stud2 · 14 April 2013 16:19

Here is the documentation: http://www.rubydoc.info/gems/nokogiri/frames

Why does the code below not print the full text?

Code:


require 'nokogiri'

html = <<-END
<html>

<head>

<title> A Dirge </title>

</head>

<body><pre>

            Rough wind, that moanest loud
              Grief too sad for song;
            Wild wind, when sullen cloud
              Knells all the night long;
            Sad storm, whose tears are vain,
            Bare woods, whose branches strain,
            Deep caves and dreary main, -
              Wail, for the world's wrong!

</pre></body>

</html>
END

doc = Nokogiri::HTML::DocumentFragment.parse(html)
doc.children.each do |ch|
  p ch.content if ch.text?
end

Output:

"\n\n \n\n "
"\n\n "

Expected output:

        Rough wind, that moanest loud
              Grief too sad for song;
            Wild wind, when sullen cloud
              Knells all the night long;
            Sad storm, whose tears are vain,
            Bare woods, whose branches strain,
            Deep caves and dreary main, -
              Wail, for the world's wrong!

--
Posted via http://www.ruby-forum.com/.

Tamara_Temple1 · 14 April 2013 16:59

If you actually look at the structure of doc, the next-to-last entry
in its children contains children as well, which you need to loop
through. Try this:

(load your code into irb)
require 'pp'
pp doc

and see what the structure is.


On Sun, Apr 14, 2013 at 11:19 AM, Love U Ruby <lists@ruby-forum.com> wrote:

Here is the documentation: http://www.rubydoc.info/gems/nokogiri/frames

Why does the code below not print the full text?

Code:

require 'nokogiri'

html = <<-END
<html>

    <head>

    <title> A Dirge </title>

    <link rel = "schema.DC"
          href = "http://purl.org/DC/elements/1.0/">

    <meta name = "DC.Title"
          content = "A Dirge">

    <meta name = "DC.Creator"
          content = "Shelley, Percy Bysshe">

    <meta name = "DC.Type"
          content = "poem">

    <meta name = "DC.Date"
          content = "1820">

    <meta name = "DC.Format"
          content = "text/html">

    <meta name = "DC.Language"
          content = "en">

    </head>

    <body><pre>

            Rough wind, that moanest loud
              Grief too sad for song;
            Wild wind, when sullen cloud
              Knells all the night long;
            Sad storm, whose tears are vain,
            Bare woods, whose branches strain,
            Deep caves and dreary main, -
              Wail, for the world's wrong!

    </pre></body>

    </html>
END

doc = Nokogiri::HTML::DocumentFragment.parse(html)
doc.children.each do |ch|
  p ch.content if ch.text?
end

Output:

"\n\n \n\n "
"\n\n "

Expected output:

        Rough wind, that moanest loud
              Grief too sad for song;
            Wild wind, when sullen cloud
              Knells all the night long;
            Sad storm, whose tears are vain,
            Bare woods, whose branches strain,
            Deep caves and dreary main, -
              Wail, for the world's wrong!


7stud2 · 14 April 2013 19:41

Finally I got the output I was looking for:

require 'nokogiri'
require 'pp'

html = <<-END
<html>

<head>

<title> A Dirge </title>

</head>

<body><pre>

                Rough wind, that moanest loud
                  Grief too sad for song;
                Wild wind, when sullen cloud
                  Knells all the night long;
                Sad storm, whose tears are vain,
                Bare woods, whose branches strain,
                Deep caves and dreary main, -
                  Wail, for the world's wrong!

</pre></body>

</html>
END

doc = Nokogiri::HTML::DocumentFragment.parse(html)

    doc.children.each do |ch|
      puts ch.child.content if ch.node_name == 'pre'
    end

output:


            Rough wind, that moanest loud
              Grief too sad for song;
            Wild wind, when sullen cloud
              Knells all the night long;
            Sad storm, whose tears are vain,
            Bare woods, whose branches strain,
            Deep caves and dreary main, -
              Wail, for the world's wrong!


11142 · 14 April 2013 20:43

`ch.text?` will only return true when a node is a text node - i.e., it's not a tag. Since the document root contains no text itself apart from whitespace, this just prints the whitespace. Remove the `if ch.text?` part to print the contents of everything (or just use `doc.content`).


On Sun, 14 Apr 2013 18:19:00 +0200, Love U Ruby <lists@ruby-forum.com> wrote:

doc = Nokogiri::HTML::DocumentFragment.parse(html)
doc.children.each do |ch|
p ch.content if ch.text?
end

--
Matma Rex

7stud2 · 20 April 2013 07:26

I'm just looking for guidance on usage: when should I choose each of
the following?

Nokogiri::HTML::Document and Nokogiri::HTML::DocumentFragment

And when should I use the `parse` method of each?


7stud2 · 29 May 2013 20:45

Hi,

I wrote the below code:

require 'nokogiri'
require 'open-uri'

# URI.open is needed on Ruby 3.0+, where Kernel#open no longer opens URLs.
doc = Nokogiri::HTML(URI.open('http://www.homeshop18.com/'))
p doc.css("div#megamenu-sub-nav li span:nth-child(2)").map { |x| x.parent.text.strip }
#=> ["books", "clothing", "footwear", "fashion accessories", "health & beauty",
#    "jewellery", "watches", "mobiles", "gsm mobiles\r\rnew", "upto 62% off\r\rnew",
#    "camera & camcorders", "computers", "electronics", "home & kitchen",
#    "household appliances", "kids & toys", "gift & flowers", "office & stationery"]

But in the array output, I am getting 2 extra items - "gsm
mobiles\r\rnew", "upto 62% off\r\rnew", which I don't expect.

Could anyone tell me where I made the mistake?


Tamara_Temple1 · 14 April 2013 17:03

Follow-up: since you have a complete html document, why treat it as a
fragment? You can call Nokogiri::HTML.parse(html) instead and get the
actual complete document tree with all the proper nesting.


7stud2 · 14 April 2013 17:20

tamouse mailing lists wrote in post #1105601:

(load your code into irb)
require 'pp'
pp doc

and see what the structure is.

Now, I tried

doc = Nokogiri::HTML::DocumentFragment.parse(html)
pp doc
doc.children.each do |ch|
p ch.content if ch.text?
end

output:

children = [
        #(Text "\n\n Rough wind, that moanest loud\n
Grief too sad for song;\n Wild wind, when sullen cloud\n
Knells all the night long;\n Sad storm, whose tears are
vain,\n Bare woods, whose branches strain,\n Deep
caves and dreary main, -\n Wail, for the world's wrong!\n\n
")]
      }),
    #(Text "\n\n ")

--
"\n\n \n\n "
"\n\n "

Where do the middle characters go, the ones that should follow the first
"\n\n \n\n "?


7stud2 · 15 April 2013 08:18

Bartosz Dziewoński wrote in post #1105615:

On Sun, 14 Apr 2013 18:19:00 +0200, Love U Ruby <lists@ruby-forum.com> wrote:

doc = Nokogiri::HTML::DocumentFragment.parse(html)
doc.children.each do |ch|
p ch.content if ch.text?
end

`ch.text?` will only return true when a node is a text node - i.e., it's
not a tag. Since the document root contains no text itself apart from
whitespace, this just prints the whitespace. Remove the `if ch.text?`
part to print contents of everything (or just use `doc.content`).

Thank you very much for your comments.


Tamara_Temple1 · 20 April 2013 08:58

What do you suppose the meaning of "fragment" is, and why would that make a
distinction?


On Apr 20, 2013 2:28 AM, "Love U Ruby" <lists@ruby-forum.com> wrote:

Just looking for a definition of the use: When should I need to think
of what to use from below ?

Nokogiri::HTML::Document and Nokogiri::HTML::DocumentFragment

and when I should think to use `parse` method of each?


7stud2 · 14 April 2013 17:07

tamouse mailing lists wrote in post #1105602:

On Sun, Apr 14, 2013 at 11:59 AM, tamouse mailing lists <tamouse.lists@gmail.com> wrote:

Follow-up: since you have a complete html document, why treat it as a
fragment? You can call Nokogiri::HTML.parse(html) instead and get the
actual complete document tree with all the proper nesting.

I am just learning `Nokogiri` for the first time, so I don't know much
about its uses.

Could you tell me please?

When should I use `Nokogiri::HTML.parse(html)`, and when the other?


Tamara_Temple1 · 14 April 2013 17:22

I see.

You did not actually read what `pp doc` told you, did you?


7stud2 · 20 April 2013 09:05

tamouse mailing lists wrote in post #1106373:

What do you suppose the meaning of "fragment" is, and why would that
make a distinction?

I understand, but I am looking for the ideal use case for each one. In
other words, when must I use `Nokogiri::HTML::Document`, and when the
other?


Tamara_Temple1 · 14 April 2013 17:22

tamouse mailing lists wrote in post #1105602:

Follow-up: since you have a complete html document, why treat it as a
fragment? You can call Nokogiri::HTML.parse(html) instead and get the
actual complete document tree with all the proper nesting.

I am just learning this `Nokogiri` first time. So don't have that much
knowledge about their uses.

Could you tell me please?

No. I will tell you this though. You have *entirely* the wrong
strategy for learning how to be a developer. You have adopted the
strategy of "someone must tell me". You need to adopt the strategy of
"try things out *until* I learn what works". If you get stuck on this
low a level of understanding, you will never progress, and as you have
seen, it just frustrates people whom you continuously run back to with
every single step. You may think you are learning, but you are not at
all learning how to learn, which is the more important step. You are
not learning how to solve problems, especially your own. People are
NOT on this list to teach you. We are not your instructors. We answer
questions out of the goodness of our hearts, but repeated trips to the
well for every sip wears everyone here down. Frankly, it makes me want
to part this list and go elsewhere. It makes it very unenjoyable, and
very unpleasant.

When should I use `Nokogiri::HTML.parse(html)`, and the when the other?

Please compare and contrast the terms "Document" and "Document Fragment"


What do the words "Document" and "Document Fragment" mean to you?


7stud2 · 14 April 2013 17:26

tamouse mailing lists wrote in post #1105606:

I see.

You did not actually read what `pp doc` told you, did you?

I have given the partial output that I got from `pp` here.


7stud2 · 15 April 2013 06:09

tamouse mailing lists wrote in post #1105606:

I see.

You did not actually read what `pp doc` told you, did you?

Thank you for the `pp doc` hint. It helped me a great deal. Just one more
thing: can you suggest other ways I could solve the same problem? I just
want to learn `Nokogiri`. Give me only hints, and I will try to solve the
same assignment as above using them.


Tamara_Temple1 · 20 April 2013 10:03

If you understand the difference, then you have your 'perfect' use-case.


Tamara_Temple1 · 14 April 2013 17:58

You copying it in to a message and you reading it are two entirely
different things.


Tamara_Temple1 · 16 April 2013 03:04

Write your own Mechanize gem.

