If I read a UTF-8 encoded file and output it to the terminal, I do not
get the same result with ruby 1.8.x and ruby 1.9.
With 1.8 I get "�", which looks correct; with 1.9 I get "é", which looks
wrong, even if I specify the encoding with:
open(__FILE__, "r:UTF-8") do ...
In my web browser on ruby-forum, the symbol you call "correct" above
shows up as invalid, and the "wrong" symbol is a valid one.
Are you in irb, or running code in a .rb file? Are you using "puts" or
are you looking at the string values as returned by irb, after the =>
prompt?
In either case, show your actual code. Beware that things behave
strangely in irb with 1.9. Some of the oddities I noticed in irb are
documented in
from about line 1648.
> What did I misunderstand?
Remember that encodings by themselves don't actually change the sequence
of bytes. If your code is something like this:
open("somefile.txt") do |f|
  while line = f.gets
    puts line
  end
end
and you run it as a .rb script, I would expect it to work the same in
both 1.8 and 1.9. That is, it should read lines and squirt them back out
to stdout unchanged. No transcoding is done. If they appear wrongly, it
would be because the encoding of the file contents is not the same as
the encoding of your terminal.
Furthermore, it makes no difference in 1.9 if you do this:
open("somefile.txt","r:UTF-8") do |f|
  while line = f.gets
    puts line
  end
end
In ruby 1.9, all this means is that the string 'line' will be tagged as
being UTF-8, rather than some encoding picked up from the environment.
However by default, the same sequence of bytes will be squirted out.
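To check that for yourself, here is a minimal sketch for a current MRI Ruby. The helper name read_back, the constant UTF8_BYTES and the temp file are made up for illustration, and it assumes Encoding.default_internal is unset (the default). It writes the UTF-8 bytes of "vérité" to a temporary file, reads them back with two different mode strings, and compares tags and bytes.

```ruby
require "tempfile"

UTF8_BYTES = "v\xC3\xA9rit\xC3\xA9\n".b  # raw UTF-8 bytes for "vérité\n"

# Hypothetical helper: write the bytes to a temp file, read the first line
# back with the given mode string, and return the resulting string.
def read_back(mode)
  Tempfile.create("sample") do |tmp|
    tmp.binmode
    tmp.write(UTF8_BYTES)
    tmp.flush
    File.open(tmp.path, mode) { |f| f.gets }
  end
end

utf8  = read_back("r:UTF-8")
latin = read_back("r:ISO-8859-1")

p utf8.encoding               # the tag follows the mode string...
p latin.encoding
p utf8.bytes == latin.bytes   # ...but the bytes read are identical
puts utf8                     # puts hands the bytes to stdout as they are
```

Whether the last line shows the accents properly then depends only on what the terminal expects, not on the tag.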
However in 1.9 you *can* cause the string to be transcoded, if:
(1) you specify a different internal and external encoding when reading
the data (so it gets transcoded on input); or
(2) you specify an external encoding when writing the data (so it gets
transcoded on output)
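Both cases can be sketched roughly like this on a current MRI Ruby. The helper names roundtrip_read and bytes_written are hypothetical, and ISO-8859-1 is just a convenient second encoding for the demonstration.

```ruby
require "tempfile"

# Hypothetical helper: write +bytes+ to a temp file in binary mode, then
# read the whole file back using +mode+.
def roundtrip_read(bytes, mode)
  Tempfile.create("in") do |tmp|
    tmp.binmode
    tmp.write(bytes)
    tmp.flush
    File.open(tmp.path, mode) { |f| f.read }
  end
end

# Hypothetical helper: write +text+ using +mode+, then return the raw
# bytes that ended up on disk.
def bytes_written(text, mode)
  Tempfile.create("out") do |tmp|
    File.open(tmp.path, mode) { |f| f.write(text) }
    File.binread(tmp.path).bytes
  end
end

latin1_bytes = "\xE9".b   # "é" in ISO-8859-1 is the single byte 0xE9

# (1) external ISO-8859-1, internal UTF-8: transcoded while reading
s = roundtrip_read(latin1_bytes, "r:ISO-8859-1:UTF-8")
p s.encoding   # UTF-8
p s.bytes      # two bytes now: 0xC3, 0xA9

# (2) external ISO-8859-1 on a write handle: transcoded while writing
p bytes_written("\u00E9", "w:ISO-8859-1")   # a single byte: 0xE9
```

In both cases the byte sequence really changes, unlike the plain "r:UTF-8" case, where only the tag is set.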
open(SIGNATURES_FILE, "r:UTF-8") do |file|
  p file.internal_encoding
  file.each do |line|
    p [line.encoding.name, line]
  end
end
open(SIGNATURES_FILE) do |file|
  p file.internal_encoding
  file.each do |line|
    p [line.encoding.name, line]
  end
end
------------------------------------------------------------------------
open(SIGNATURES_FILE) do |file|
  file.each do |line|
    puts line
  end
end
------------------------------------------------------------------------
Run from Terminal:
zsh-% ./essai.rb
--
« Un banquier est toujours en liberté provisoire »
(Henri Poincaré )
...
« Pour ceux qui vont chercher midi à quatorze heures,
la minute de vérité risque de se faire longtemps attendre. »
(Pierre Dac)
Accented chars are correct now. Notice I have to use "puts" instead of
"p" to get the chars; otherwise I get the escaped bytes, as in
"v\303\251rit\303\251".
--
« L'essence même du génie, c'est de mettre en pratique
les idées les plus simples. »
(Charles Peguy)
Use 'puts' instead of 'p' and it may work. That is, I suspect
String#inspect is doing some mangling.
You really should look at your postings in ruby-forum:
Wherever you say ruby 1.9 is giving the 'wrong' output it is correct,
and where you say ruby 1.8 is giving the 'right' output it is wrong. I
have a suspicion that there is a mismatch between the file content and
the terminal.
What if you just type "cat /Users/yt/dev/Signature/signatures.txt" at
the terminal?
> Accented chars are correct now. Notice I have to use "puts" instead of
> "p" to get the chars; otherwise I get the escaped bytes, as in
> "v\303\251rit\303\251".
Yes, String#inspect in ruby 1.8 will mangle all values over 128 into
escaped form. String#inspect in ruby 1.9 behaves differently, and
doesn't always mangle them.
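As a rough illustration on a current MRI Ruby (not 1.8, and not MacRuby), the sketch below shows how the encoding tag changes what String#inspect prints while the bytes stay the same. The helper names as_binary and same_bytes? are made up for the example. Note that current MRI escapes in hex ("\xC3\xA9") where 1.8 used octal ("\303\251"); both denote the same two bytes.

```ruby
# Hypothetical helper: same bytes, but tagged as raw binary data.
def as_binary(str)
  str.dup.force_encoding(Encoding::BINARY)
end

# Hypothetical helper: true when both strings hold the same bytes.
def same_bytes?(a, b)
  a.bytes == b.bytes
end

utf8   = "v\u00E9rit\u00E9"   # "vérité", tagged UTF-8
binary = as_binary(utf8)

p same_bytes?(utf8, binary)   # true: force_encoding never touches the bytes
p binary                      # inspect escapes the high bytes: "v\xC3\xA9rit\xC3\xA9"
p utf8                        # may show the accents, depending on the environment
puts binary                   # puts writes the bytes through unescaped
```

So "p" is a debugging view of the string, and "puts" is the one to use when the text itself should reach the terminal.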
However, I just noticed 'macruby' in your scripts. Are you actually
running MacRuby, or genuine Matz Ruby Interpreter 1.9 ? If it's macruby
all bets are off - I thought it was a completely different interpreter
written from scratch. I have no Mac here to compare behaviour with, and
I have no idea what variation of 1.9 encoding rules MacRuby has
implemented.
In particular, I'm surprised that your program sees strings tagged as
"US-ASCII" rather than "UTF-8" when you explicitly opened the file with
external encoding of UTF-8. This makes me very suspicious of your actual
ruby platform.
Try adding this line to your code to get info about the Ruby platform:
p Object.constants.grep(/RUBY/).map { |n| [n, Object.const_get(n)] }
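Alongside the constants, it may help to print the encoding defaults, since strings read without an explicit encoding are tagged from the environment. This is only a sketch for MRI 1.9 or later; the helper name encoding_report is hypothetical, and MacRuby may not support all of these calls.

```ruby
# Hypothetical helper: collect platform and encoding facts in one hash.
def encoding_report
  {
    "RUBY_VERSION"     => RUBY_VERSION,
    "RUBY_ENGINE"      => (defined?(RUBY_ENGINE) ? RUBY_ENGINE : "unknown"),
    "RUBY_PLATFORM"    => RUBY_PLATFORM,
    "default_external" => Encoding.default_external.name,
    "default_internal" => Encoding.default_internal && Encoding.default_internal.name,
    "locale_charmap"   => Encoding.locale_charmap,
    "LANG"             => ENV["LANG"],
    "LC_ALL"           => ENV["LC_ALL"],
    "LC_CTYPE"         => ENV["LC_CTYPE"]
  }
end

encoding_report.each { |key, value| puts "#{key}: #{value.inspect}" }
```

On MRI, a default_external of US-ASCII usually points at a C or POSIX locale in the shell that launched the script.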
Regards,
Brian.
P.S. For comparison, here's what I get with an oldish ruby pre-1.9.2
under Linux. Try these on your system.
> Wherever you say ruby 1.9 is giving the 'wrong' output it is correct,
> and where you say ruby 1.8 is giving the 'right' output it is wrong. I
> have a suspicion that there is a mismatch between the file content and
> the terminal.
> What if you just type "cat /Users/yt/dev/Signature/signatures.txt" at
> the terminal?
--
« Un banquier est toujours en liberté provisoire »
(Henri Poincaré )
...
--
« Pour ceux qui vont chercher midi à quatorze heures,
la minute de vérité risque de se faire longtemps attendre. »
(Pierre Dac)
> > Accented chars are correct now. Notice I have to use "puts" instead of
> > "p" to get the chars; otherwise I get the escaped bytes, as in
> > "v\303\251rit\303\251".
> Yes, String#inspect in ruby 1.8 will mangle all values over 128 into
> escaped form. String#inspect in ruby 1.9 behaves differently, and
> doesn't always mangle them.
> However, I just noticed 'macruby' in your scripts. Are you actually
> running MacRuby, or genuine Matz Ruby Interpreter 1.9 ? If it's macruby
> all bets are off - I thought it was a completely different interpreter
> written from scratch. I have no Mac here to compare behaviour with, and
> I have no idea what variation of 1.9 encoding rules MacRuby has
> implemented.
> In particular, I'm surprised that your program sees strings tagged as
> "US-ASCII" rather than "UTF-8" when you explicitly opened the file with
> external encoding of UTF-8. This makes me very suspicious of your actual
> ruby platform.
Right now, that is to say using puts in place of p, I get the right
chars.
But those strings are still tagged as "US-ASCII"...
> Try adding this line to your code to get info about the Ruby platform:
> p Object.constants.grep(/RUBY/).map { |n| [n, Object.const_get(n)] }
def get_signatures
  t = "".force_encoding("UTF-8")
  # Also tried: open(SIGNATURES_FILE) do |file|
  open(SIGNATURES_FILE, "r:UTF-8") do |file|
    file.each do |line|
      t += line.force_encoding("UTF-8")
    end
  end
  # Also tried: File.open(SIGNATURES_FILE, "r:UTF-8").each {|l| t += l }
  return t.split(NEEDLE)
end
(Notice I've forced the encoding.)
signatures = get_signatures
c = signatures.count
puts "Nombre de signatures : #{c}"
> Perhaps the version of 1.9.0 which macruby forked from was different in
> this regard though.
Yes, right, I'll ask on the MacRuby list.
In fact, I also have a built-in ruby 1.8.x, but I'd rather use 1.9
because I have to count UTF-8 chars and I know this is built into
ruby 1.9. And because I might design a UI, MacRuby is the better choice,
since it is written on top of Obj-C and Cocoa.
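For the character counting part, a minimal sketch on a current MRI Ruby looks like this (the helper name counts is made up for the example). It also shows why the "US-ASCII" tag matters: with the wrong tag, the character count falls back to the byte count.

```ruby
# Hypothetical helper: character and byte counts for a string.
def counts(str)
  { :chars => str.length, :bytes => str.bytesize }
end

line = "v\u00E9rit\u00E9"   # "vérité", tagged UTF-8

p counts(line)                                  # 6 chars, 8 bytes
p counts(line.dup.force_encoding("US-ASCII"))   # 8 chars, 8 bytes
```

So String#length only counts characters correctly once the strings coming out of the file really are tagged UTF-8.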
···
Brian Candler <b.candler@pobox.com> wrote:
--
« L'essence même du génie, c'est de mettre en pratique
les idées les plus simples. »
(Charles Peguy)
Primitive Classes
The primitive Ruby classes (String, Array, and Hash) have been
re-implemented on top of their Cocoa equivalents (respectively,
NSString, NSArray, and NSDictionary).
As an example, String is no longer a class, but a pointer (alias) to
NSMutableString. All strings in MacRuby are genuine Cocoa strings and
can be passed (without conversion) to underlying C or Objective-C APIs
that expect Cocoa strings.
The whole String interface was re-implemented on top of NSString. This
means that you can call any method of String on any Cocoa string.
Because Cocoa strings can be either mutable or immutable, if you try to
call a method that is supposed to modify its receiver on an immutable
string, a runtime exception will be raised.
Yes, I got the answer from the MacRuby list:
1.9 encodings in trunk have very little support for now, but we
significantly improved them in a branch that might get merged into trunk
in a few days (maybe today). I will post an update here once it's done.