Noob Question - String Manipulation

Hey, I'm pretty new to Ruby and am trying to absorb the "ruby way" of
things as much as I can. What's the best way to solve this problem:

I have a string that contains HTML formatting. All I want is the plain
text.

--
Posted via http://www.ruby-forum.com/.

I don't know of any libraries offhand that can do this (CGI only has
escape/unescape), but it's fairly simple:

html_string.gsub(/<[^>]+>/, "")

You'll probably want to replace that regex with something better, though.
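
For what it's worth, here is a minimal sketch of that one-liner wrapped in a method. The name strip_tags and the sample string are made up for illustration, and the extra CGI.unescapeHTML call is my own addition so that entities such as &amp; come out as plain characters. It assumes simple, well-behaved markup.

```ruby
require 'cgi'

# Naive tag stripper: delete anything that looks like <...>,
# then decode HTML entities such as &amp; and &lt;.
# strip_tags is a hypothetical helper name, not a library method.
def strip_tags(html)
  CGI.unescapeHTML(html.gsub(/<[^>]+>/, ''))
end

html_string = '<p>Fish &amp; <b>chips</b></p>'
puts strip_tags(html_string)
# prints: Fish & chips
```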

On Sat, 2006-05-06 at 02:28 +0900, Joe Cairns wrote:
> Hey, I'm pretty new to Ruby and am trying to absorb the "ruby way" of
> things as much as I can. What's the best way to solve this problem:
>
> I have a string that contains HTML formatting. All I want is the plain
> text.

There's more to this than meets the eye - it's often best to hand off
the hard stuff to someone else :)

def fetchtext(uri)
  # Shell out to lynx, which renders the page and dumps it as plain text.
  `lynx --dump #{uri}`
end

puts fetchtext('www.google.com')
# =>
# [1]Personalised Home | [2]Sign in
#
# Google
#
# Web [3]Images [4]Groups [5]News [6]Froogle [7]more »
#
# ... [snipped] ...
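
If the HTML is already in a string rather than behind a URI, the same hand-off idea might look like the sketch below. The method name html_to_text and its command parameter are hypothetical, and the default command assumes a lynx binary on the PATH that accepts -stdin, -dump and -nolist. Treat it as a starting point, not a tested recipe.

```ruby
require 'open3'

# Feed an HTML string to an external text renderer on stdin and return
# whatever it prints. The default command is an assumption: it expects a
# lynx binary on the PATH that understands -stdin, -dump and -nolist.
def html_to_text(html, command = %w[lynx -stdin -dump -nolist])
  # Passing the command as separate words means the HTML is sent on stdin
  # and is never interpolated into a shell command line.
  output, status = Open3.capture2(*command, stdin_data: html)
  raise "#{command.first} failed (exit #{status.exitstatus})" unless status.success?
  output
end
```

Calling html_to_text(html_string) should then return the rendered text, provided lynx is installed. Unlike the backtick version, nothing from the input string ends up on a shell command line.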

--
Ross Bamford - rosco@roscopeco.REMOVE.co.uk

I've been using

html_string.gsub(/<.*?>/,"")

for this. But it's always seemed more a "perl way" than a "ruby way".
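
The two regexes in this thread behave the same on simple input but not on everything. A small sketch of one difference (the constant names and sample strings are made up): in Ruby, "." does not match a newline unless the /m flag is given, so the non-greedy version skips a tag that is split across lines.

```ruby
NEGATED_CLASS = /<[^>]+>/  # anything up to the next ">", newlines included
LAZY_DOT      = /<.*?>/    # shortest run of non-newline characters

simple = '<p>Hello <em>world</em></p>'
p simple.gsub(NEGATED_CLASS, '')   # => "Hello world"
p simple.gsub(LAZY_DOT, '')        # => "Hello world"

# A tag broken across two lines.
multiline = "<a\n href=\"x\">link</a>"
p multiline.gsub(NEGATED_CLASS, '')  # => "link"
p multiline.gsub(LAZY_DOT, '')       # => "<a\n href=\"x\">link"
p multiline.gsub(/<.*?>/m, '')       # => "link"
```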

Joseph Michaels wrote:
>> I have a string that contains HTML formatting. All I want is the plain
>> text.
>
> html_string.gsub(/<[^>]+>/, "")
> Replacing that regex with something better, probably.

Stripping tags with gsub breaks down on more complex input, such as
pages that contain source code, or attribute values that contain a ">"
(e.g. <sometag someattr="some>str">).
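
To make the attribute case concrete, here is a small sketch. The constant QUOTE_AWARE_TAG is my own illustration, not something from a library. It skips over quoted attribute values, which fixes this one example, but it is still a regex and will still trip over comments, script blocks and the like.

```ruby
html = '<sometag someattr="some>str">text</sometag>'

# The simple regex stops at the ">" inside the attribute value,
# so part of the tag leaks into the result.
p html.gsub(/<[^>]+>/, '')   # => "str\">text"

# Hypothetical improvement: treat "..." and '...' inside a tag as units.
QUOTE_AWARE_TAG = /<(?:"[^"]*"|'[^']*'|[^'">])*>/
p html.gsub(QUOTE_AWARE_TAG, '')  # => "text"
```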

If you really want to be accurate, I suggest using an XML or HTML
parsing tool, such as Mechanize or RubyfulSoup.

Pistos
