ascii-text

Die_Optimisten · 8 July 2020 08:31

Hi,

How can I transform (US-ASCII) s='Eintr\xE4ge:' (individual chars!:
\,x,E,4) to correct UTF-8?

thanks
Andrew

Christer_Jansson · 8 July 2020 09:45

The string you start out with is not valid US-ASCII, so I have to make
assumptions about what you start out with here.

If you know (or can fairly safely assume) that your input is valid
ISO8859-1 (which is compatible with Unicode for codepoints < 256), you can
do:

utf8string = s.encode('UTF-8', 'ISO8859-1')

Or use String#encode! to do it in-place.
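
A minimal sketch of that conversion, assuming the input really contains the single byte 0xE4 rather than the four characters \, x, E, 4. The variable name raw is only for illustration, and the canonical encoding name 'ISO-8859-1' is used here:

```ruby
# Assumption: raw holds the actual byte 0xE4, as ISO-8859-1 data would.
raw = "Eintr\xE4ge:".b                          # binary copy of the 9 input bytes

# Interpret the bytes as ISO-8859-1 and transcode them to UTF-8.
utf8string = raw.encode('UTF-8', 'ISO-8859-1')

puts utf8string.encoding         # UTF-8
puts utf8string.valid_encoding?  # true
puts utf8string.bytesize         # 10 (the a-umlaut takes two bytes in UTF-8)
```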

Cheers,

Christer Jansson

Christer Jansson Datakonsult AB
+46 70 88 55 020


Die_Optimisten · 8 July 2020 10:27

Hi,
thanks for trying.
The string is valid US-ASCII: each character of the escape sequence
(\, x, E, 4) is stored in its own byte.

So a similar question would be:
How do I interpret s='text' as if it had been written s="..."?
(That is, turn 'string' into "string" via the variable s, where s already
contains 'string'.)

thanks Andrew
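
To make the distinction in this question concrete, here is a small sketch of how the two literal forms differ in Ruby. The variable d is introduced only for comparison and is not from the original post:

```ruby
s = 'Eintr\xE4ge:'   # single quotes: \x is NOT interpreted as an escape

puts s.length        # 12
puts s.ascii_only?   # true
p s.chars[5, 4]      # ["\\", "x", "E", "4"]

d = "Eintr\xE4ge:"   # double quotes: \xE4 becomes the single byte 0xE4

puts d.bytesize      # 9
puts d.ascii_only?   # false
```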

Christer_Jansson · 8 July 2020 11:14

Ah, sorry, I misread.

Then, again assuming you want to interpret '\xE4' as the codepoint for "ä",

eval(%("#{s}")).encode('UTF-8', 'ISO8859-1')

But remember, eval is not very safe so don't use it if the text can contain
something like, say, '"; system %(sudo rm -rf /); "'.
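
The risk can be illustrated with a harmless stand-in payload. This is only a sketch: the global $injected is a made-up flag, and a real attacker would run something destructive instead of setting it:

```ruby
# Harmless stand-in for a malicious payload: it only sets a global flag.
$injected = false
s = '"; $injected = true; "'

# The interpolated source becomes:  ""; $injected = true; ""
result = eval(%("#{s}"))

p result      # ""
p $injected   # true
```

The closing quote inside s ends the string literal early, so everything after it runs as ordinary Ruby code.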

Cheers,

Christer Jansson


Rob_Biedenharn · 8 July 2020 23:48

Or to very specifically handle \xFF sequences:

s.gsub(/\\x([0-9A-F]{2})/) {|h| h[-2,2].to_i(16).chr }.encode('UTF-8', 'ISO8859-1')
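
Note that the pattern above only matches uppercase hex digits. A possible variant, sketched here under the same assumption that each \xHH value is an ISO-8859-1 code point, also accepts lowercase digits and builds the UTF-8 character directly. The method name unescape_hex_latin1 is hypothetical:

```ruby
# Hypothetical helper; the name is not from the thread.
def unescape_hex_latin1(text)
  # \h matches one hex digit of either case; $1 is the two-digit capture.
  # Integer#chr(Encoding::UTF_8) builds the character with that code point,
  # which equals the ISO-8859-1 character for values below 256.
  text.gsub(/\\x(\h{2})/) { $1.hex.chr(Encoding::UTF_8) }
end

puts unescape_hex_latin1('Eintr\xE4ge:').length   # 9
puts unescape_hex_latin1('caf\xe9').encoding      # UTF-8
```

Because no code is evaluated, input such as '"; system(...); "' passes through unchanged.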


Die_Optimisten · 9 July 2020 09:58

Thanks for both answers: just one line each, but powerful!

Andrew


Christer_Jansson · 9 July 2020 10:02

Just to be clear, I would have chosen Rob's solution if I were you.

Cheers,

Christer Jansson

