ascii-text

Hi,

How can I transform the (US-ASCII) string s='Eintr\xE4ge:' (individual
characters: \, x, E, 4) into correct UTF-8?

thanks
Andrew

The string you start out with is not valid US-ASCII (the byte 0xE4 lies outside
the 7-bit range), so I have to make assumptions about your input here.

If you know (or can fairly safely assume) that your input is valid
ISO8859-1 (whose byte values coincide with the Unicode codepoints below 256),
you can do:

utf8string = s.encode('UTF-8', 'ISO8859-1')

Or use String#encode! to do it in-place.
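
As a minimal sketch of that call, assuming the input really holds the raw byte 0xE4 (building the sample with String#b is only an illustration, not something from the original post):

```ruby
# Assumption: the input contains the single raw byte 0xE4 ("ä" in ISO-8859-1).
# String#b returns a binary (ASCII-8BIT) copy, much like bytes read from a file.
s = "Eintr\xE4ge:".b

# The second argument tells Ruby how to interpret the source bytes.
utf8string = s.encode('UTF-8', 'ISO8859-1')

puts utf8string            # Einträge:
puts utf8string.encoding   # UTF-8
puts utf8string.bytesize   # 10 ("ä" takes two bytes in UTF-8)
```

Passing the source encoding explicitly means the encoding tag currently on the string does not matter; only its bytes do.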

Cheers,

Christer Jansson

Christer Jansson Datakonsult AB
+46 70 88 55 020

On Wed, 8 July 2020 at 10:31, Die Optimisten <inform@die-optimisten.net> wrote:

Unsubscribe: <mailto:ruby-talk-request@ruby-lang.org?subject=unsubscribe>
<http://lists.ruby-lang.org/cgi-bin/mailman/options/ruby-talk>

Hi,
thanks for trying ;)
The string is valid US-ASCII: each part of the escape sequence (\, x, E, 4) is
stored in its own byte.

So a similar question could be:
How can I interpret s='text' as if it had been written as s="..."?
(That is, change 'string' to "string" using the variable s, where s already
contains 'string'.)
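
To make the distinction concrete, here is a small sketch (the variable names are illustrative) comparing the two literals. In single quotes Ruby keeps \xE4 as four ordinary characters; in double quotes it parses the escape into one byte:

```ruby
single = 'Eintr\xE4ge:'   # backslash, x, E, 4 stay as four separate characters
double = "Eintr\xE4ge:"   # the escape becomes the single byte 0xE4

puts single.bytesize      # 12
puts single.ascii_only?   # true
puts double.bytesize      # 9
puts double.bytes[5]      # 228 (0xE4)
```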

thanks Andrew

Ah, sorry, I misread.

Then, again assuming you want to interpret '\xE4' as the codepoint for "ä",

eval(%("#{s}")).encode('UTF-8', 'ISO8859-1')

But remember, eval is not very safe so don't use it if the text can contain
something like, say, '"; system %(sudo rm -rf /); "'.
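
Broken into steps, the one-liner does roughly the following. This is only a sketch with a hard-coded, trusted input; the intermediate variable names are mine and the warning about eval still applies:

```ruby
# Trusted, hard-coded input only: never feed external text to eval.
s = 'Eintr\xE4ge:'                       # 12 plain ASCII characters

literal = %("#{s}")                      # Ruby source text: "Eintr\xE4ge:" (with quotes)
raw     = eval(literal)                  # Ruby parses the escape; raw holds the byte 0xE4
utf8    = raw.encode('UTF-8', 'ISO8859-1')

puts raw.bytesize    # 9
puts utf8            # Einträge:
```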

Cheers,

Christer Jansson


On Wed, 8 July 2020 at 12:27, Die Optimisten <inform@die-optimisten.net> wrote:


Or, to handle specifically the \xHH escape sequences (two uppercase hex digits):

s.gsub(/\\x([0-9A-F]{2})/) {|h| h[-2,2].to_i(16).chr }.encode('UTF-8', 'ISO8859-1')
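
A slightly more tolerant variant of the same idea, as a sketch: the method name unescape_hex is hypothetical, the pattern also accepts lowercase hex digits, and the ISO-8859-1 interpretation of each byte is the same assumption as above.

```ruby
# Hypothetical helper, not from the thread: turns literal \xHH sequences into
# the bytes they name, then reads those bytes as ISO-8859-1.
def unescape_hex(s)
  # Work on a binary copy so each substituted byte is treated as a plain byte.
  s.b.gsub(/\\x(\h{2})/) { $1.hex.chr }.encode('UTF-8', 'ISO8859-1')
end

puts unescape_hex('Eintr\xE4ge:')   # Einträge:
puts unescape_hex('caf\xe9')        # café
```

Unlike the eval approach, this only ever interprets the two hex digits, so nothing in the input can run as code.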

On 8 Jul 2020, at 07:14, Christer Jansson <datakonsult@janssons.org> wrote:


Thanks for both answers: just one line each, but powerful!

Andrew

On 7/9/20 at 1:48 AM, Rob Biedenharn wrote:

Then, again assuming you want to interpret '\xE4' as the codepoint for
"ä",

Just to be clear, I would have chosen Rob's solution if I were you. :)

Cheers,

Christer Jansson


On Thu, 9 July 2020 at 11:58, Die Optimisten <inform@die-optimisten.net> wrote:
