Turning a non-ASCII character into an XML entity with REXML?

I asked this a little while back but maybe didn't ask the right way, so maybe somebody can help me if I rephrase:

I'm trying to build an RSS feed that takes, in its item descriptions, ISO-8859-1 text. (I'm using REXML for now.) I'd like to be able to take a non-ASCII character and turn it into a usable XML entity. So, for example, "\251" would get turned into "&#169;":

str = "\251 2004 Francis Hwang"
elt = REXML::Element.new( 'elt' )
elt.text = str
elt.to_s
=> "<elt>\251 2004 Francis Hwang</elt>"
# But I want "<elt>&#169; 2004 Francis Hwang</elt>"

Is there some sort of setting I can twiddle in REXML so that I can assign text that includes these sorts of characters, and REXML will know to turn them into entities on output? I know I can do this by hand and then prevent escaping by using the :raw flag, but I'd like to avoid that if possible.
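
For reference, the by-hand route mentioned above might look roughly like the sketch below. It assumes the text has already been converted to numeric character references and contains nothing else that needs escaping; the fourth positional argument to REXML::Text.new is the raw flag. The variable name pre_escaped is only illustrative.

```ruby
require 'rexml/document'

# Text that already carries its numeric character reference.
# Assumption: nothing else in it (such as a bare & or <) needs escaping,
# because raw text is written out as-is.
pre_escaped = "&#169; 2004 Francis Hwang"

elt = REXML::Element.new('elt')
# Arguments: string, respect_whitespace, parent, raw
elt.add(REXML::Text.new(pre_escaped, false, nil, true))

puts elt.to_s
```

The intent is that the reference passes through to the output unescaped, which is exactly the manual step the question hopes to avoid.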

Francis

I think CGI has an escapeHTML method that might do it. Of course, it will also escape < and >, but you could still lift the code from there.

~ pat

On Friday, October 15, 2004, at 08:38 PM, Francis Hwang wrote:

I'm trying to build an RSS feed that takes, in its item descriptions,
ISO-8859-1 text. (I'm using REXML for now.) I'd like to be able to take
a non-ASCII character and turn it into a usable XML entity. So, for
example, "\251" would get turned into "&#169;"

Not exactly what you're asking for, but you could use Iconv to convert
ISO-8859-1 into UTF-8. It should be perfectly legal to include UTF-8
characters directly in XML, without turning them into character entities.
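
As a rough sketch of that conversion: on Ruby 1.9 and later, String#encode can stand in for Iconv, which is no longer part of the standard library as of Ruby 2.0. The variable names here are illustrative only.

```ruby
# Bytes as they would arrive from an ISO-8859-1 source; \xA9 is the
# copyright sign in that encoding (octal \251).
latin1 = "\xA9 2004 Francis Hwang".b.force_encoding(Encoding::ISO_8859_1)

# Transcode to UTF-8 so the character can appear literally in the XML.
utf8 = latin1.encode(Encoding::UTF_8)

puts utf8.encoding          # UTF-8
puts utf8.valid_encoding?   # true
```

The copyright sign becomes the two-byte sequence C2 A9 in the result, and the feed's XML declaration would then need to state UTF-8 (or omit the encoding).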

Alternatively, if it's sufficient to convert characters 160-255 straight
into numeric entity refs (which works because ISO-8859-1 code points map
directly onto the first 256 Unicode code points), then how about

  a = "Copyright \251 2004"
  # In Ruby 1.8, c[0] is the byte value (an Integer) of the matched character
  a.gsub!(/[\240-\377]/) { |c| "&#%d;" % c[0] }

  # => "Copyright &#169; 2004"
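
On Ruby 1.9 and later, c[0] returns a one-character String rather than an Integer, so the snippet above no longer works unchanged. A minimal sketch of the same idea for newer Rubies follows; latin1_to_entities is a hypothetical helper name, and it assumes the input bytes really are ISO-8859-1. Unlike the 160-255 range above, it converts every byte from 128 up.

```ruby
# Hypothetical helper (not part of REXML): turn every non-ASCII byte of
# an ISO-8859-1 string into a decimal numeric character reference.
# Assumption: the input is single-byte ISO-8859-1, so each byte value
# equals the Unicode code point of its character.
def latin1_to_entities(str)
  str.bytes.map { |b| b < 128 ? b.chr : format("&#%d;", b) }.join
end

a = "Copyright \xA9 2004".b
puts latin1_to_entities(a)   # Copyright &#169; 2004
```

Working on bytes sidesteps string-encoding questions entirely, at the cost of being wrong for any multi-byte input such as UTF-8.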

Regards,

Brian.

I just tried; it doesn't do it.

irb(main):004:0> CGI.escapeHTML( "<br>")
=> "&lt;br&gt;"
irb(main):005:0> CGI.escapeHTML( "<br>\251")
=> "&lt;br&gt;\251"

On Oct 16, 2004, at 2:15 AM, Patrick May wrote:

I think CGI has an escapeHTML method that might do it. Of course, it will also escape < and >, but you could still lift the code from there.

~ pat