Strip tags?

Max_Benjamin · 23 July 2006 06:45

Is there an easy way to strip html tags from strings?
Thanks

--
Posted via http://www.ruby-forum.com/.

Harold_Hausman · 23 July 2006 07:32

How about using Lynx?
http://lynx.isc.org/lynx2.8.5/index.html

hth,
-Harold


Stefan_Scholl1 · 23 July 2006 07:45

A regex isn't always the _best_ way to deal with markup
languages, but for an _easy_ way it's good enough.

$ irb
irb(main):001:0> a = '<strong>This is strong stuff!</strong><br><img src="foo.png" alt="Some foo">'
=> "<strong>This is strong stuff!</strong><br><img src=\"foo.png\" alt=\"Some foo\">"
irb(main):002:0> a.gsub(/<.*?>/, '')
=> "This is strong stuff!"


Daniel_Baird · 23 July 2006 07:51

The problem is, it's not always the _correct_ way.

<div id="weird>id"></div>

But having a > in an attribute is rare; if once-in-a-blue-moon errors are
OK, regexes are a nice, easy solution.

;D
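
To make the failure concrete, here is a small sketch of the simple regex applied to a tag whose attribute value contains a >. The helper name naive_strip and both sample strings are only for illustration.

```ruby
# Naive tag stripper: delete everything from a < up to the nearest >.
def naive_strip(html)
  html.gsub(/<.*?>/, '')
end

plain  = '<strong>This is strong stuff!</strong><br>'
tricky = '<div id="weird>id">hello</div>'

puts naive_strip(plain)   # prints: This is strong stuff!
puts naive_strip(tricky)  # prints: id">hello
```

The second result shows the match ending at the > inside the quoted attribute, so the rest of the tag leaks into the output.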

--
Daniel Baird
http://tiddlyspot.com (free, effortless TiddlyWiki hosting)
http://danielbaird.com (TiddlyW;nks! :: Whiteboard Koala :: Blog :: Things
That Suck)

Andreas_S1 · 23 July 2006 08:28

Daniel Baird wrote:
> the problem is, it's not always the _correct_ way.
> <div id="weird>id"></div>

This is not correct HTML; < and > have to be encoded as entities.

Daniel_Baird · 23 July 2006 10:03

That is true. If the original poster has the luxury of only dealing with
correct HTML, he's a lucky fellow and can kludge up some regexen that will
do the job. Even in a well-coded site, though, it's not unthinkable that you
could forget to do some encoding and end up with angle brackets inside a
textarea or something.

How useful a regex approach is depends on the data. I have used a bit of
regex-type html parsing before and it worked fine, for the data that I was
parsing. Horses for courses.

;D


Christian_Neukirche1 · 23 July 2006 16:25

"Andreas S." <f@andreas-s.net> writes:
> This is not correct HTML; < and > have to be encoded as entities.

It's well-formed XML, so it's fine in XHTML:

$ echo '<bar quux="foo>bar" />' | xmllint -
<?xml version="1.0"?>
<bar quux="foo>bar"/>

However, '<' needs to be escaped:

$ echo '<bar quux="foo<bar" />' | xmllint -
-:1: parser error : Unescaped '<' not allowed in attributes values
<bar quux="foo<bar" />

--
Christian Neukirchen <chneukirchen@gmail.com> http://chneukirchen.org

Max_Benjamin · 23 July 2006 16:17

Thanks for the quick replies.
I should have been more explicit in my question. I want to strip HTML
tags in order to sanitize form input. I'm a bit of a Ruby noob and I
was hoping to find a function similar to PHP's strip_tags, one that
would remove both HTML and Ruby code.
Best


W_James · 24 July 2006 04:20

Christian Neukirchen wrote:
> $ echo '<bar quux="foo>bar" />' | xmllint -
> <?xml version="1.0"?>
> <bar quux="foo>bar"/>

re = %r{
    <
      (?:
        # Any characters but > or " .
        [^>"] +
        |
        # Characters within quotes.
        # Allow escaped quotes.
        "
          (?:
              # Accept any escaped character.
              \\.
              |
              [^"\\] +
          ) *
        "
      ) *
    >
}xm

print DATA.read.gsub( re, '' )

__END__
Some<><"">
<bar quux="\"foo>bar" /> text
to <?xml version="1.0"?>
<bar quux="foo>bar"/> save
for <bar quux="foo<bar" />
<bar quux="\"foo><bar>" />reading.

Mat_Schaffer · 23 July 2006 18:32

For sanitizing input, just escaping might be a better idea because it has less chance of being destructive. If you're on Rails, there's an h() function for this. If you're doing something else, maybe check out how Rails does it and replicate it. There might be something easy that someone on this list knows and I don't.

If you really want to strip them, I'd bet the regexp solution is no less effective than PHP's strip_tags.
-Mat
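
A minimal sketch of the escaping approach, assuming only Ruby's standard library: ERB::Util.html_escape (also callable as ERB::Util.h) turns markup characters into entities instead of removing them. The sample input is made up for illustration.

```ruby
require 'erb'

# Hypothetical form input containing markup.
input = '<b>"x" & y</b>'

# Escape rather than strip: nothing is lost, and a browser shows the text literally.
escaped = ERB::Util.html_escape(input)

puts escaped  # prints: &lt;b&gt;&quot;x&quot; &amp; y&lt;/b&gt;
```

Because nothing is deleted, a stray < in otherwise harmless text survives intact, which is the "less destructive" point made above.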


Christian_Neukirche1 · 25 July 2006 10:47

"William James" <w_a_x_man@yahoo.com> writes:
> print DATA.read.gsub( re, '' )

<foo bar='"quux"' />

Max_Benjamin · 23 July 2006 19:33

Thanks for the help everybody.
Best,
Max


W_James · 25 July 2006 18:30

Christian Neukirchen wrote:
> <foo bar='"quux"' />

re = %r{
    <
      (?:
          [^>"'] +
        |
          "
            (?: \\. | [^\\"] + ) *
          "
        |
          '
            (?: \\. | [^\\'] + ) *
          '
      ) *
    >
}xm

print DATA.read.gsub( re, '' )

__END__
Some<><"">
<bar quux='"foo>bar' /> text
to <?xml version="1.0"?>
<bar quux="foo>bar"/> save
for <bar quux="foo<bar" />
<foo bar='"quux"' /> later
<bar quux="\"foo><bar>" />reading.
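
For anyone who wants to try this without the DATA section, here is a sketch that wraps the same pattern in a method. The names TAG and strip_tags are my own (the latter borrowed from the PHP function mentioned earlier), and the sample strings come from the test data above. It only removes things that look like tags; it is not a sanitizer.

```ruby
# Same idea as the pattern above: a tag is <, then any mix of
# unquoted runs and quoted values, then >.
TAG = %r{
  <
    (?:
        [^>"'] +                    # run of unquoted characters
      |
        " (?: \\. | [^\\"] + ) * "  # double-quoted value
      |
        ' (?: \\. | [^\\'] + ) * '  # single-quoted value
    ) *
  >
}xm

# Hypothetical helper: remove everything the pattern recognises as a tag.
def strip_tags(text)
  text.gsub(TAG, '')
end

puts strip_tags(%q{<bar quux="foo>bar"/> save})     # prints: " save" (without the quotes)
puts strip_tags(%q{<foo bar='"quux"' /> later})     # prints: " later"
puts strip_tags(%q{<bar quux='"foo>bar' /> text})   # prints: " text"
puts strip_tags('Some<><"">')                       # prints: Some
```

The third case is the one that needs the single-quote branch: a lone " inside a single-quoted value would otherwise be read as the start of a double-quoted value that never closes.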
