Strange Encoding Behavior

Lui_Core · 17 January 2010 03:55

The encoding of __FILE__ is always the same as Encoding.default_external,
even if there is a magic comment. Sometimes it is necessary to convert
the string into another encoding. Here is some code to demonstrate the
issue:

    #coding: utf-8
    # put the script in a not-pure-ascii path to see the difference
    path = File.expand_path File.dirname __FILE__

    puts RUBY_VERSION + ' ' + RUBY_PLATFORM
      #=> "1.9.1 i386-mingw32" is my ruby version
    puts path.encoding
      #=> "GB2312" on my OS

    # usually this "string.encode to, from" works,
    # but HERE the new string's content bytes seem unchanged
    puts \
      path.encode 'utf-8', Encoding.default_external

    path.force_encoding Encoding.default_external
    puts path.encode 'utf-8'
      # changed at last
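
A minimal sketch for checking which encodings are in play before converting anything. It assumes a Ruby newer than the 1.9.1 used above (the "filesystem" alias may not exist on 1.9.1), and the values printed depend on the OS, locale and Ruby version, so no expected output is shown:

```ruby
# coding: utf-8

# Source encoding of this file; this is what the magic comment controls.
puts __ENCODING__

# Encoding tag Ruby attached to the script path itself.
puts __FILE__.encoding

# Default encoding for data read from outside the program.
puts Encoding.default_external

# Encoding Ruby assumes for file names (assumption: this alias is available).
puts Encoding.find("filesystem")
```

If the second line differs from the first, the path and the string literals in the script carry different encoding tags, which is the situation described in this post.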

--
Posted via http://www.ruby-forum.com/.

Luis_Lavena1 · 17 January 2010 04:25

Ruby 1.9.1, and parts of 1.9.2, still have certain issues with
paths/folders that contain non-ASCII characters:

http://redmine.ruby-lang.org/issues/show/1685

--
Luis Lavena

Robert_K1 · 17 January 2010 19:30

I believe the point you are missing is that String#encode does not change the String but returns a new String with the desired encoding. If you want in-place modification you need to use String#encode!, which does just that.

irb(main):006:0> s="foo"
=> "foo"
irb(main):007:0> s.encoding
=> #<Encoding:UTF-8>
irb(main):008:0> x = s.encode "ASCII"
=> "foo"
irb(main):009:0> s.encoding
=> #<Encoding:UTF-8>
irb(main):010:0> x.encoding
=> #<Encoding:US-ASCII>
irb(main):011:0>
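
The same distinction as a script, with String#encode! added for comparison. This is only a sketch; the variable names are illustrative, and String.new with an encoding: keyword assumes Ruby 2.3 or later:

```ruby
# Build a mutable UTF-8 string explicitly so the result does not depend
# on the source encoding of this file.
s = String.new("foo", encoding: "UTF-8")

x = s.encode("US-ASCII")       # returns a new String, s is untouched
original_encoding = s.encoding

s.encode!("US-ASCII")          # transcodes the receiver itself

puts x.encoding
puts original_encoding
puts s.encoding
```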

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Lui_Core · 17 January 2010 06:13

I think #1685 is a little bit different.
Maybe the following code is a bit clearer:

    # coding: ascii-8bit
    puts Encoding.default_external #=> GBK

    def enc s
      s.encode 'utf-8', Encoding.default_external
    end
    p1 = File.expand_path File.dirname __FILE__
    p2 = p1.dup
    p2.force_encoding p2.encoding # strange, but makes it different

    puts p1 == p2 #=> true
    puts p1.encoding == p2.encoding #=> true
    puts enc(p1) == enc(p2) #=> sometimes false ???

Run in console:

    D:\其他>ruby t.rb
    GBK
    true
    true
    false

Put it in another folder:

    D:\other>ruby t.rb
    GBK
    true
    true
    true
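
For readers unfamiliar with the two calls used in the script: force_encoding only changes the encoding tag, while encode produces new bytes. A small sketch, using the GBK bytes of the folder name 其他 written out by hand so that it does not depend on the console's code page (the variable names are illustrative):

```ruby
# "其他" as raw GBK bytes, tagged as GBK.
gbk = "\xC6\xE4\xCB\xFB".b.force_encoding(Encoding::GBK)

# Relabeling: same four bytes, now claimed to be UTF-8 (which they are not).
relabeled = gbk.dup.force_encoding(Encoding::UTF_8)

# Transcoding: same two characters, different bytes.
transcoded = gbk.encode(Encoding::UTF_8)

puts relabeled.bytes == gbk.bytes
puts relabeled.valid_encoding?
puts transcoded.valid_encoding?
puts transcoded.bytes.map { |b| format("%02X", b) }.join(" ")
```

Under that reading, p2.force_encoding p2.encoding in the script above should be a no-op, which is why the differing result is surprising.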

Lui_Core · 18 January 2010 04:23

I know String#encode doesn't change the original string, but the result
is encoded.

To understand the problem, you should try in a gbk/shift-jis environment
with some Chinese or Japanese path.

The point is:
For some path p1 and p2,
when p1 == p2 and p1.encoding == p2.encoding,
p1.encode('utf-8') == p2.encode('utf-8') is not always true.

To describe it in an "encode!" version:
For some path p1 and p2,
when p1 == p2 and p1.encoding == p2.encoding,
p1.encode!('utf-8')
p2.encode!('utf-8')
p1 == p2 is still not always true
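
One way to narrow this down is to dump everything that is visible about both strings before encoding them. The helper below is hypothetical (describe is not part of Ruby), and the path is built from hand-written GBK bytes rather than from __FILE__, so it only shows what the check looks like; it does not reproduce the 1.9.1 behavior:

```ruby
# Hypothetical helper: collect the observable state of a string.
def describe(str)
  {
    encoding: str.encoding.name,
    valid: str.valid_encoding?,
    bytes: str.bytes.map { |b| format("%02X", b) }.join(" ")
  }
end

# "D:/其他" with the folder name as raw GBK bytes.
p1 = "D:/\xC6\xE4\xCB\xFB".b.force_encoding("GBK")
p2 = p1.dup
p2.force_encoding(p2.encoding)

puts describe(p1) == describe(p2)
puts p1.encode("UTF-8") == p2.encode("UTF-8")
```

If the first comparison holds but the second does not, the difference must come from state that is not visible through bytes, encoding or valid_encoding?, which would point to the interpreter rather than the script.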

Robert_K1 · 18 January 2010 08:35

Apparently I misread your posting, sorry. Is UTF-8 capable of
representing those Japanese or Chinese characters? I believe I
remember Matz saying that UTF-8 is insufficient to properly represent
Japanese characters. If this is the case then I guess all bets are
off and you get undefined behavior. Although it might be desirable to
get the same garbage it may not be worthwhile to ensure this purely
for efficiency reasons.

Kind regards

robert
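
Whether a particular name survives the conversion can be checked with a round trip. A sketch, again assuming the folder name 其他 from the earlier post and using its GBK bytes directly; when a character has no counterpart in the target encoding, String#encode raises Encoding::UndefinedConversionError instead of returning garbage:

```ruby
gbk = "\xC6\xE4\xCB\xFB".b.force_encoding("GBK")   # "其他" in GBK

utf8 = gbk.encode("UTF-8")
back = utf8.encode("GBK")

puts back == gbk                # same characters, same encoding
puts back.bytes == gbk.bytes    # and the same bytes

# A character the target cannot represent raises instead of converting.
raised = false
begin
  "\u20AC".encode("US-ASCII")   # the euro sign has no US-ASCII form
rescue Encoding::UndefinedConversionError => e
  raised = true
  puts "not representable: #{e.class}"
end
```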
