Hi everyone!
I have a problem with Unicode in irb on Windows. I noticed it when
trying to save an attribute of an ActiveRecord model containing an umlaut
(for example "ü") in script/console. If the database connection is
encoded in utf8, everything after the umlaut gets truncated; with the
default encoding I get garbage characters back. It doesn't matter whether
$KCODE is set to UTF8 or NONE: the character count stays the same
(also in plain irb)!
Does anyone have a hint on how to solve this? Of course I could try
things such as Cygwin, but I am trying to find an elegant solution for
Windows users, which could eventually be merged into the next
InstantRails release, if Curt agrees.
Thanks a lot,
Michael
The windows console -- also used by cygwin -- doesn't recognise UTF-8.
(That is, it's not possible to properly display UTF-8 in cmd.exe, at
least so far as I can tell.)
-austin
--
Austin Ziegler * halostatue@gmail.com * http://www.halostatue.ca/
* austin@halostatue.ca * You are in a maze of twisty little passages, all alike. // halo • statue
* austin@zieglers.ca
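For readers following along, the byte-level difference behind the symptoms in the original post can be sketched in Ruby. This is only an illustration, assuming Ruby 1.9 or later (String#encode does not exist in the 1.8 series this thread is about); the variable names are made up for the example.

```ruby
# Minimal sketch, assuming Ruby 1.9+ (String#encode is not available in 1.8).
# Shows how the same character "ü" is stored under three encodings.
u_umlaut = "\u00FC" # "ü", written as an escape so the file encoding does not matter

utf8_bytes = u_umlaut.encode("UTF-8").bytes.to_a        # two bytes in UTF-8
oem_bytes  = u_umlaut.encode("IBM850").bytes.to_a       # one byte in OEM code page 850
ansi_bytes = u_umlaut.encode("Windows-1252").bytes.to_a # one byte in ANSI code page 1252

# Print each byte sequence in hex for comparison.
{ "UTF-8" => utf8_bytes, "IBM850" => oem_bytes, "Windows-1252" => ansi_bytes }.each do |name, bytes|
  puts name + ": " + bytes.map { |b| format("0x%02X", b) }.join(" ")
end
```

A console that reads the two UTF-8 bytes under a single-byte code page will show two unrelated characters, which would be consistent with the "funny characters" described above.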
A DOS console displays characters according to the OEM code page. Here is
an example showing how to properly display a string with 8-bit chars (e.g. characters
with diacritics or accent marks)...
# file: oemCodePage.rb
require 'chilkat'
# (The CkString class is freeware)
myStr = Chilkat::CkString.new()
# A DOS console does NOT display this correctly:
print "é ô à ç\n"
# What we need is the OEM (DOS) code page...
# OEM code pages are listed here:
# http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_81rn.asp
myStr.appendAnsi("é ô à ç\n")
# Emit the string in the character encoding of your choice:
# ibm850 is the OEM code page for Latin1
print myStr.getEnc("ibm850")
# Chilkat supports these:
# us-ascii
# unicode
# unicodefffe
# iso-8859-1
# iso-8859-2
# iso-8859-3
# iso-8859-4
# iso-8859-5
# iso-8859-6
# iso-8859-7
# iso-8859-8
# iso-8859-9
# iso-8859-13
# iso-8859-15
# windows-874
# windows-1250
# windows-1251
# windows-1252
# windows-1253
# windows-1254
# windows-1255
# windows-1256
# windows-1257
# windows-1258
# utf-7
# utf-8
# utf-32
# utf-32be
# shift_jis
# gb2312
# ks_c_5601-1987
# big5
# iso-2022-jp
# iso-2022-kr
# euc-jp
# euc-kr
# macintosh
# x-mac-japanese
# x-mac-chinesetrad
# x-mac-korean
# x-mac-arabic
# x-mac-hebrew
# x-mac-greek
# x-mac-cyrillic
# x-mac-chinesesimp
# x-mac-romanian
# x-mac-ukrainian
# x-mac-thai
# x-mac-ce
# x-mac-icelandic
# x-mac-turkish
# x-mac-croatian
# asmo-708
# dos-720
# dos-862
# ibm037
# ibm437
# ibm500
# ibm737
# ibm775
# ibm850
# ibm852
# ibm855
# ibm857
# ibm00858
# ibm860
# ibm861
# ibm863
# ibm864
# ibm865
# cp866
# ibm869
# ibm870
# cp875
# koi8-r
# koi8-u
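On newer Rubies the same conversion can probably be done without a third-party library. A minimal sketch, assuming Ruby 1.9 or later and a console whose OEM code page is 850; to_oem is a hypothetical helper name, not part of Ruby or Chilkat:

```ruby
# Minimal sketch, assuming Ruby 1.9+ and an OEM code page of 850 (IBM850).
# to_oem is a hypothetical helper, not a Ruby or Chilkat API.
def to_oem(text, code_page = "IBM850")
  # Characters the code page cannot represent are replaced with "?"
  # instead of raising Encoding::UndefinedConversionError.
  text.encode(code_page, :undef => :replace, :replace => "?")
end

accented = "\u00E9 \u00F4 \u00E0 \u00E7" # "é ô à ç"
oem = to_oem(accented)

# Show the converted bytes; these are what a code page 850 console expects.
puts oem.bytes.to_a.map { |b| format("0x%02X", b) }.join(" ")

# Writing the converted string itself is only meaningful on a console that
# really uses this code page; how the bytes reach it depends on the Ruby build.
print oem, "\n"
```

If the console uses a different OEM code page, pass its name as the second argument, for example to_oem(text, "IBM437").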
Ack my bad. I had forgotten: you can specify the UTF-8 codepage (CP_UTF8) with:
chcp 65001
There are some caveats, of course:
http://blogs.msdn.com/michkap/archive/2006/03/06/544251.aspx
-austin
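A script can check which code page the console is using before deciding how to encode its output. The sketch below only parses the text that chcp prints. It assumes the English form "Active code page: 850" (localized Windows versions word this differently, so it just looks for the number), and both helper names are hypothetical.

```ruby
# Minimal sketch: extract the code page number from the output of `chcp`.
# parse_chcp_output and ruby_encoding_name are hypothetical helper names.
def parse_chcp_output(output)
  match = output.match(/(\d+)/) # first run of digits, e.g. "Active code page: 65001"
  match ? match[1].to_i : nil
end

# Map a Windows code page number to an encoding name that Ruby 1.9+ is likely
# to know. 65001 is handled explicitly because it is the UTF-8 code page.
def ruby_encoding_name(code_page)
  code_page == 65001 ? "UTF-8" : "CP#{code_page}"
end

sample = "Active code page: 65001" # stand-in for the real command output
page = parse_chcp_output(sample)
puts ruby_encoding_name(page) if page
```

On Windows the sample string would be replaced by the real output, for example parse_chcp_output(`chcp`).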
Austin Ziegler wrote:
Ack my bad. I had forgotten: you can specify the UTF-8 codepage
(CP_UTF8) with:
chcp 65001
There are some caveats, of course:
http://blogs.msdn.com/michkap/archive/2006/03/06/544251.aspx
Also the good old combo of "mode con codepage select=65001".
lists pretty much all the numbers you can use. (The pain of navigating
to that on the MSDN website.)
Amusingly enough, none of those are even present anymore on WinXP Pro
x64. For yet more hilarity, the console is by default set to the DOS OEM
codepage of the given locale, instead of the newer ANSI ones that are
ISO extensions, which causes great fun when trying to use software
that's ever so smart and autodetects my locale as my preferred language
(Postgres, assorted GNU stuff being too clever by half) instead of using
the OS language version.
And "there are some caveats" is an understatement: the UTF-8 support in
the console is a sham. I couldn't get a trivial C program, using
arbitrary combinations of tchar.h, wchar.h, -DUNICODE, cmd.exe, the
Windows console, a Cygwin and an MSYS rxvt, to do something as daunting
as read random characters that aren't shared between the Latin1 and Latin2
codepages, store them as multibyte internally, and then write them out
to a text file and to the console successfully without one step
breaking. The fact that the whole of CMD broke down in tears after changing
that setting is also worth noting: IIRC, it had problems doing output
redirection to a file and whatnot (I can't play around with this without
setting up a virtual machine with a 32-bit XP). Basically, the Path Less
Annoying is to only use the console for working in your "native"
codepage, and use a non-console tool for everything else.
end # of rant
David Vallner
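The underlying limitation, that a single-byte code page cannot hold characters from both Latin1 and Latin2, can be sketched in Ruby. This assumes Ruby 1.9 or later; encodable? is a hypothetical helper, and the two sample characters are chosen only for illustration.

```ruby
# Minimal sketch, assuming Ruby 1.9+. encodable? is a hypothetical helper.
def encodable?(text, encoding_name)
  text.encode(encoding_name) # raises if a character has no mapping in the target
  true
rescue Encoding::UndefinedConversionError
  false
end

r_caron = "\u0159" # "ř": in ISO-8859-2 (Latin2), not in ISO-8859-1 (Latin1)
o_tilde = "\u00F5" # "õ": in ISO-8859-1 (Latin1), not in ISO-8859-2 (Latin2)

# Report, per encoding, whether each sample character can be represented.
%w[ISO-8859-1 ISO-8859-2 UTF-8].each do |name|
  puts "#{name}: r_caron=#{encodable?(r_caron, name)} o_tilde=#{encodable?(o_tilde, name)}"
end
```

Only the UTF-8 row reports true for both characters, which is why text mixing the two needs a Unicode encoding from input through to output.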
Ack my bad. I had forgotten: you can specify the UTF-8 codepage (CP_UTF8) with:
chcp 65001
Thank you Austin for the nice hint!
The problem is that as soon as I switch the codepage, irb (and also
script/console) stops working: it doesn't even start anymore, it just
quits immediately without an error message.
Michael
That's one of the caveats mentioned: batch files no longer work.
I don't know why. However, if you have Ruby installed in C:\Ruby, you can do:
copy C:\Ruby\bin\irb C:\Ruby\bin\irb.rb
irb.rb
Or:
ruby C:\Ruby\bin\irb
And you'll get a working irb.
-austin