Hey, I'm pretty new to Ruby and am trying to absorb the "ruby way" of
things as much as I can. What's the best way to solve this problem:
I have a string that contains html formatting. All I want is the plain
text.
···
--
Posted via http://www.ruby-forum.com/.
I don't know of any libraries offhand that can do this (CGI only has
escape/unescape), but it's fairly simple:
html_string.gsub(/<[^>]+>/, "")
You'd probably want to replace that regex with something better, though.
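For reference, a minimal sketch of the gsub approach wrapped in a method. The name strip_tags is made up for this example, and the CGI.unescapeHTML step is an addition not mentioned above: it turns entities such as &amp; back into plain characters once the tags are gone.

```ruby
require "cgi"

# Hypothetical helper name; not part of any library.
def strip_tags(html)
  text = html.gsub(/<[^>]+>/, "") # drop anything that looks like a tag
  CGI.unescapeHTML(text)          # decode &amp;, &lt;, &gt;, &quot;
end

puts strip_tags("<p>Fish &amp; <b>chips</b></p>")
# prints: Fish & chips
```

This only removes the markup itself; the text inside elements such as script or style would be left in the result.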
···
On Sat, 2006-05-06 at 02:28 +0900, Joe Cairns wrote:
> Hey, I'm pretty new to Ruby and am trying to absorb the "ruby way" of
> things as much as I can. What's the best way to solve this problem:
> I have a string that contains html formatting. All I want is the plain
> text.
There's more to this than meets the eye - it's often best to hand off
the hard stuff to someone else:
def fetchtext(uri)
  `lynx --dump #{uri}`
end
puts fetchtext('www.google.com')
# =>
# [1]Personalised Home | [2]Sign in
#
# Google
#
# Web [3]Images [4]Groups [5]News [6]Froogle [7]more »
#
# ... [snipped] ...
--
Ross Bamford - rosco@roscopeco.REMOVE.co.uk
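One thing to watch with the backtick version is that the URI goes through the shell, so a URI containing shell metacharacters could do unexpected things. A possible variant, assuming lynx is installed and on the PATH: lynx_command is a hypothetical helper, split out only so the argument list can be inspected without running anything, and -nolist (which leaves the numbered link list out of the dump) is optional.

```ruby
# Hypothetical helper: builds the argument list for lynx.
def lynx_command(uri)
  ["lynx", "-dump", "-nolist", uri]
end

def fetchtext(uri)
  # The array form of IO.popen skips the shell, so the URI is passed
  # to lynx as a single argument and is never interpreted as shell syntax.
  IO.popen(lynx_command(uri), &:read)
rescue Errno::ENOENT
  nil # lynx is not installed
end

# Usage (needs lynx and network access):
#   puts fetchtext("http://www.google.com")
```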
I've been using
html_string.gsub(/<.*?>/,"")
for this. But it's always seemed more a "perl way" than a "ruby way".
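One difference between the two regexes worth knowing about: in Ruby, "." does not match a newline unless the /m flag is given, so /<.*?>/ skips tags that span more than one line, while /<[^>]+>/ does not care. A small sketch with a made-up input string:

```ruby
html = "<a\nhref=\"/home\">Home</a>"

lazy_dot    = html.gsub(/<.*?>/, "")    # "." stops at the newline inside the tag
negated_set = html.gsub(/<[^>]+>/, "")  # [^>] matches the newline as well
lazy_dot_m  = html.gsub(/<.*?>/m, "")   # /m lets "." match newlines too

p lazy_dot     # => "<a\nhref=\"/home\">Home"
p negated_set  # => "Home"
p lazy_dot_m   # => "Home"
```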
···
Joseph Michaels wrote:
> > I have a string that contains html formatting. All I want is the plain
> > text.
> html_string.gsub(/<[^>]+>/, "")
> Replacing that regex with something better, probably.
Gsubbing breaks down for more complex test cases, such as input
containing source code or problematic attribute strings (e.g. <sometag
someattr="some>str">).
If you really want to be accurate, I suggest using an XML or HTML
parsing tool, such as Mechanize or RubyfulSoup.
Pistos
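To make the attribute problem concrete, here is a small sketch using a made-up input. The second regex is only an illustration of how far a regex can be pushed (it skips over complete quoted strings inside a tag); it is not a substitute for a parser and still knows nothing about comments, CDATA or script contents.

```ruby
html = '<a title="1 > 0">link</a>'

# The simple regex ends the tag at the first ">", which sits inside the attribute.
naive = html.gsub(/<[^>]+>/, "")

# A tag is "<", then any run of non-quote characters or complete quoted strings, then ">".
quote_aware_tag = /<(?:[^>"']|"[^"]*"|'[^']*')*>/
careful = html.gsub(quote_aware_tag, "")

p naive    # => " 0\">link"
p careful  # => "link"
```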