Stripping unwanted html

Wild_Al · 6 October 2006 02:13

Hi everyone:

I'm trying to strip HTML from a string, keeping only a few allowed tags.

I have found the following code:

  def strip_tags(html)
    if html.index("<")
      text = ""
      tokenizer = HTML::Tokenizer.new(html)

      while token = tokenizer.next
        node = HTML::Node.parse(nil, 0, 0, token, false)
        # result is only the content of any Text nodes
        text << node.to_s if node.class == HTML::Text
      end
      # strip any comments, and if they have a newline at the end (i.e. a
      # line with only a comment) strip that too
      text.gsub(/<!--(.*?)-->[\n]?/m, "")
    else
      html # already plain text
    end
  end

I'm trying to understand what is going on in this code but cannot find
documentation for HTML::Tokenizer or HTML::Node.parse. Does anyone know
what the parameters of the parse method are for?

In the while loop, how do you access the HTML tag? If I could access
the HTML tags, I could then decide whether to keep each tag or not.

Thanks for reading,
Wild Al

--
Posted via http://www.ruby-forum.com/.

Michael_Moen · 6 October 2006 03:38

Al, I recently needed a parser similar to Perl's HTML::Scrubber. This
is what I came up with; you may find it useful:
http://www.underpantsgnome.com/2006/09/09/using-hpricot-to-scrub-html/

Michael


Paul_Lutus · 6 October 2006 05:50

Wild Al wrote:

/ ...

In the while loop, how do you access the html tag. If I could access
the html tags, I could then decide if I wanted to keep the tag or not.

Why not create your own method? This should give you some ideas:

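#!/usr/bin/ruby -w

data = File.read("/path/page.html")

# Capture the name of every opening tag, with or without attributes.
# Closing tags and comments are not matched.
data.scan(/<(\w+)[^>]*>/) { |tag|
  puts tag
}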

This puts out the tag name extracted from each HTML tag in the page.
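
Building on the scan idea above, here is a minimal sketch of how the same kind of pattern could decide which tags to keep. It uses only core Ruby; the method name keep_allowed_tags and the ALLOWED_TAGS list are hypothetical, chosen for illustration, and a regex like this is not a full HTML parser.

```ruby
# Hypothetical allow-list, for illustration only.
ALLOWED_TAGS = %w[b i em strong p].freeze

# Hypothetical helper: removes every tag whose name is not allow-listed.
# The text between tags is left in place, including the content of
# removed elements.
def keep_allowed_tags(html, allowed = ALLOWED_TAGS)
  # Match opening and closing tags and capture the tag name in $1.
  html.gsub(/<\/?(\w+)[^>]*>/) do |tag|
    allowed.include?($1.downcase) ? tag : ""
  end
end

puts keep_allowed_tags('<p>Hello <span class="x">big</span> <b>world</b></p>')
# prints <p>Hello big <b>world</b></p>
```

Note that kept tags retain their attributes here, and a ">" inside an attribute value would confuse the pattern.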

--
Paul Lutus
http://www.arachnoid.com

eden · 6 October 2006 09:20

Hey, that's RoR's strip_tags method.

I ran into the same issue and went down the same path. Instead of
fretting about the missing docs, I just hacked my way through using IRB.

Here's a modified version of what I came up with; maybe you'll find it
useful:

  def strip_tags_except(html, exceptions = [])
    if html.index("<")
      text = ""
      tokenizer = HTML::Tokenizer.new(html)
      while token = tokenizer.next
        case node = HTML::Node.parse(nil, 0, 0, token, false)
        when HTML::Tag
          text << node.to_s if exceptions.include?(node.name)
        when HTML::Text
          text << node.to_s
        end
      end
      text
    else
      html
    end
  end

The one I had also stripped attributes and closed up dangling tags if
it found any. Have a look at RoR's strip_links for more examples of
HTML::Node/HTML::Tokenizer usage.
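
For anyone without the Rails classes at hand, the shape of that token loop can be approximated with the standard library. The following is only a rough sketch: each_token and strip_tags_except_sketch are hypothetical names, StringScanner stands in for HTML::Tokenizer, and no claim is made that the real tokenizer splits its input the same way.

```ruby
require "strscan"

# Hypothetical stand-in for the tokenizer: yields raw tokens, each one
# either a "<...>" chunk or a run of text between such chunks.
def each_token(html)
  return enum_for(:each_token, html) unless block_given?

  scanner = StringScanner.new(html)
  until scanner.eos?
    # Try a tag first, then a run of text, then fall back to one character
    # (a stray "<" that never closes).
    yield scanner.scan(/<[^>]*>/) || scanner.scan(/[^<]+/) || scanner.getch
  end
end

# Hypothetical counterpart of the method above, built on each_token.
def strip_tags_except_sketch(html, exceptions = [])
  text = String.new
  each_token(html) do |token|
    if token =~ /\A<\/?(\w+)[^>]*>\z/
      # A named tag: keep it only if its name is in the exceptions list.
      text << token if exceptions.include?($1.downcase)
    elsif token.start_with?("<") && token.end_with?(">")
      # Comments, doctypes and similar markup are dropped.
    else
      text << token
    end
  end
  text
end

puts strip_tags_except_sketch("<p>Hi <b>there</b><!-- c --></p>", ["b"])
# prints Hi <b>there</b>
```

Inspecting each_token(html).to_a in IRB is a quick way to see what the loop is working with.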


eden · 6 October 2006 09:23

eden wrote:

Have a look at RoR's strip_links for more examples of
HTML::Node/HTML::Tokenizer usage.

Sorry, I meant sanitize:
http://api.rubyonrails.com/classes/ActionView/Helpers/TextHelper.html#M000516

Wild_Al · 8 October 2006 19:54

I found eden's strip_tags_except method very useful; it is exactly what
I needed. Thanks. To all others: your suggestions helped too, especially
in understanding Ruby. Thanks again...


eden · 9 October 2006 05:36

Glad I could help. One security-related caveat:

The method I posted doesn't strip attributes, so someone may be able to
"hack" your site by putting JavaScript into an attribute of one of the
allowed tags.

You can fix that by changing the last line of the first branch of the
if statement from "text" to "sanitize(text)", e.g.:

  if html.index("<")
    ...
    sanitize(text)
  else
  ...
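
To make the risk concrete, here is a small sketch of an allowed tag carrying script in an attribute, together with one possible way to drop attributes altogether. The helper strip_attributes is hypothetical and is not what sanitize does internally; it only illustrates the idea with core Ruby.

```ruby
# Hypothetical helper: rewrites every tag to its bare name, discarding all
# attributes. A ">" inside an attribute value would confuse this pattern,
# so treat it as an illustration rather than a complete defence.
def strip_attributes(html)
  html.gsub(/<(\/?)(\w+)[^>]*?(\/?)>/) { "<#{$1}#{$2}#{$3}>" }
end

# "b" might be an allowed tag, yet the attribute still carries script.
risky = '<b onmouseover="alert(1)">hi</b>'

puts strip_attributes(risky)
# prints <b>hi</b>
```

Dropping every attribute is blunt (it also removes href from links), so an allow-list of attributes per tag may be preferable where some are needed.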
