Noob Question - String Manipulation

Joe_Cairns · 5 May 2006 17:28

Hey, I'm pretty new to Ruby and am trying to absorb the "ruby way" of
things as much as I can. What's the best way to solve this problem:

I have a string that contains HTML formatting. All I want is the plain
text.

--
Posted via http://www.ruby-forum.com/.

Joseph_Michaels · 5 May 2006 17:47

I don't know of any libraries offhand that can do this (CGI only has
escape/unescape), but it's fairly simple:

html_string.gsub(/<[^>]+>/, "")

You'll probably want to replace that regex with something more robust, though.
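For illustration, here is that one-liner applied to a small sample. This is a minimal sketch, not a robust solution: the `strip_tags` helper name and the sample string are made up for the example. `CGI.unescapeHTML` is the unescape mentioned above, used here to turn entities such as `&amp;` back into plain characters once the tags are gone.

```ruby
require 'cgi'

# Hypothetical helper name; it only wraps the gsub from the post above.
def strip_tags(html_string)
  html_string.gsub(/<[^>]+>/, "")
end

sample = "<p>Fish &amp; <b>chips</b></p>"   # made-up input

text  = strip_tags(sample)         # tags removed, entities still encoded
plain = CGI.unescapeHTML(text)     # decode &amp; and similar entities

puts text
puts plain
```

Stripping first and unescaping second matters: doing it the other way round would turn an escaped `&lt;b&gt;` in the text into something that looks like a tag and then delete it.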

Ross_Bamford4 · 5 May 2006 18:49

There's more to this than meets the eye. It's often best to hand off
the hard stuff to someone else:

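def fetchtext(uri)
  # Shell out to lynx, which renders the page and dumps it as plain text.
  # The uri is interpolated into a shell command, so only pass trusted input.
  `lynx --dump #{uri}`
end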

puts fetchtext('www.google.com')
# =>
# [1]Personalised Home | [2]Sign in
#
# Google
#
# Web [3]Images [4]Groups [5]News [6]Froogle [7]more »
#
# ... [snipped] ...

--
Ross Bamford - rosco@roscopeco.REMOVE.co.uk

Francisco_Ortiz · 5 May 2006 17:56

I've been using

html_string.gsub(/<.*?>/,"")

for this. But it's always seemed more a "perl way" than a "ruby way".
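The two patterns in this thread are close but not identical, and a quick comparison may help. The sketch below uses made-up method names and a made-up sample in which the opening tag spans two lines; it only relies on standard Ruby regex behaviour (`.` does not match a newline unless the `/m` flag is given).

```ruby
# Made-up input: an opening tag that spans two lines.
MULTILINE_SAMPLE = "<a\n  href=\"x.html\">link</a> text"

def strip_lazy(html)
  html.gsub(/<.*?>/, "")      # "." does not match "\n" by default
end

def strip_lazy_multiline(html)
  html.gsub(/<.*?>/m, "")     # /m lets "." match newlines too
end

def strip_negated(html)
  html.gsub(/<[^>]+>/, "")    # [^>] already matches newlines
end

p strip_lazy(MULTILINE_SAMPLE)            # the multi-line tag survives
p strip_lazy_multiline(MULTILINE_SAMPLE)  # "link text"
p strip_negated(MULTILINE_SAMPLE)         # "link text"
```

One more difference: `/<.*?>/` also matches an empty `<>`, while `/<[^>]+>/` requires at least one character between the brackets.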

Pistos_Christou1 · 5 May 2006 18:15

Joseph Michaels wrote:

> > I have a string that contains HTML formatting. All I want is the plain
> > text.
>
> html_string.gsub(/<[^>]+>/, "")
> Replacing that regex with something better, probably.

A plain gsub breaks down for more complex cases, such as text
containing source code, or problematic attribute strings (e.g. <sometag
someattr="some>str">).
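Both failure modes are easy to reproduce. The snippet below is just a demonstration with made-up inputs; `strip_tags` is a hypothetical name for the gsub quoted above.

```ruby
# Hypothetical helper wrapping the regex from earlier in the thread.
def strip_tags(html_string)
  html_string.gsub(/<[^>]+>/, "")
end

# A ">" inside a quoted attribute value ends the match too early.
tricky_attr = %q{<sometag someattr="some>str">hello</sometag>}
puts strip_tags(tricky_attr)    # prints: str">hello

# Source code with comparison operators is mistaken for a tag.
code_text = "if (a < b && b > c)"
puts strip_tags(code_text)      # prints: if (a  c)
```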

If you really want to be accurate, I suggest using an XML or HTML
parsing tool, such as Mechanize or RubyfulSoup.
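To give a feel for why a parser copes where a single regex does not, here is a toy scanner built on Ruby's standard `StringScanner`. It is only a sketch under stated assumptions: `strip_tags_quoted` is a hypothetical name, it handles quoted attribute values and nothing else (no comments, scripts, entities, or stray `<` in text), and it is not a substitute for a real HTML parsing library.

```ruby
require 'strscan'

# Hypothetical helper: walks the input once and skips everything between
# "<" and the matching ">", treating quoted attribute values as one unit
# so that a ">" inside quotes does not end the tag.
def strip_tags_quoted(html)
  scanner = StringScanner.new(html)
  out = +""
  until scanner.eos?
    if scanner.scan(/</)
      # Inside a tag: consume pieces until the closing ">" (or end of input).
      until scanner.eos? || scanner.scan(/>/)
        # Quoted value, run of ordinary characters, or a lone stray quote.
        scanner.scan(/"[^"]*"|'[^']*'|[^>"']+|["']/)
      end
    else
      # Outside a tag: copy text up to the next "<".
      out << scanner.scan(/[^<]+/)
    end
  end
  out
end

puts strip_tags_quoted(%q{<sometag someattr="some>str">hello</sometag>})
# prints: hello
```

Even this small amount of state tracking goes beyond what the one-line gsub can express, which is the point of reaching for a proper parser.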

Pistos
