Benoit Daloze wrote:
> But then I come with things like:
> /Users/benoitdaloze/Library/GlestGame/data/lang/espan>̃<ol.lng
>
> (The ~ is separated from the n and then is not ñ). The Regexp is acting like
> it is 2 different characters. How to handle that easily? I tried to change
> the script encoding in MacRoman, but it produced an error of bad encoding
> not matching UTF-8.
I don't know what you mean. If Dir. tells you that the file name is
<e> <s> <p> <a> <n> <~> <o> <l> <.> <l> <n> <g>, is that not the true
filename?
I suggest you try something like this:
puts "Source encoding: #{"".encoding}"
puts "External encoding: #{Encoding.default_external}"
Dir["*.lng"].each do |fn|
  puts "Name: #{fn.inspect}"
  puts "Encoding: #{fn.encoding}"
  puts "Chars: #{fn.chars.to_a.inspect}"
  puts "Codepoints: #{fn.codepoints.to_a.inspect}"
  puts "Bytes: #{fn.bytes.to_a.inspect}"
  puts
end
then post the results for this file here. Then also post what you think
the true filename is.
The true filename is (from the Finder and Terminal):
-rw-r--r--@ 1 benoitdaloze staff 3758 Jul 17 2008 español.lng
So, with the 'ñ'.
I don't know exactly which encoding HFS+ uses for filenames. According to
Wikipedia it is UTF-16, with decomposition:
"names which are also character encoded in UTF-16
<http://en.wikipedia.org/wiki/UTF-16> and normalized to a form very nearly
the same as Unicode Normalization Form D (NFD)
<http://en.wikipedia.org/wiki/Unicode_normalization> (which means that
precomposed characters like é are decomposed in the HFS+ filename and
therefore count as two characters)"
So that's probably an encoding problem in Dir.
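To make the decomposition described in that quote concrete, here is a minimal sketch. It assumes Ruby 2.2 or later, where String#unicode_normalize exists (it was not available in the 1.9.2dev build used in this thread), and the variable names are only illustrative.

```ruby
# Assumption: Ruby >= 2.2, which provides String#unicode_normalize.
precomposed = "espa\u00F1ol.lng"                  # ñ as one codepoint, U+00F1
decomposed  = precomposed.unicode_normalize(:nfd) # n followed by U+0303 (combining tilde)

puts precomposed.codepoints.inspect
puts decomposed.codepoints.inspect
puts "precomposed length: #{precomposed.length}"
puts "decomposed length:  #{decomposed.length}"
```

The decomposed form has one more character than the precomposed one, which is the "count as two characters" effect mentioned in the quote.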
I changed the script a little, to compare with a String hard-coded inside
the script (rn = "español.lng"):
ruby 1.9.2dev (2009-12-11 trunk 26067) [x86_64-darwin10.2.0]
Source encoding: UTF-8
External encoding: UTF-8
Format:
String in the code
filename from Dir
String equality: false
Name:
"español.lng"
"español.lng"
Encoding:
UTF-8
UTF-8
Chars:
["e", "s", "p", "a", "ñ", "o", "l", ".", "l", "n", "g"]
["e", "s", "p", "a", "n", "̃", "o", "l", ".", "l", "n", "g"]
Codepoints:
[101, 115, 112, 97, 241, 111, 108, 46, 108, 110, 103]
[101, 115, 112, 97, 110, 771, 111, 108, 46, 108, 110, 103]
Bytes:
[101, 115, 112, 97, 195, 177, 111, 108, 46, 108, 110, 103]
[101, 115, 112, 97, 110, 204, 131, 111, 108, 46, 108, 110, 103]
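The two strings in this output differ only in normalization form, so one way to compare them is to normalize both sides first. The following is a sketch under assumptions: it needs Ruby 2.2 or later for String#unicode_normalize, and same_name? is a hypothetical helper, not something from the original script.

```ruby
# Assumption: Ruby >= 2.2 (String#unicode_normalize).
from_script = "espa\u00F1ol.lng"   # precomposed, like a literal typed in the script
from_dir    = "espan\u0303ol.lng"  # decomposed, like the name Dir returned above

# Hypothetical helper: compare two names after bringing both to NFC.
def same_name?(a, b)
  a.unicode_normalize(:nfc) == b.unicode_normalize(:nfc)
end

puts "Plain equality:      #{from_script == from_dir}"
puts "Normalized equality: #{same_name?(from_script, from_dir)}"
puts "Bytes from Dir:      #{from_dir.bytes.inspect}"
```

Normalizing both operands, rather than only the name coming from Dir, keeps the comparison independent of how the literal in the script happened to be typed.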
Then you can see whether: (1) Dir. is returning the correct sequence
of bytes for the filename or not; and (2) Dir. is tagging the string
with the correct encoding or not.
(1) Dir seems to return a valid UTF-8 String, yet it is different (!!) from
the UTF-8 String written inside the script.
Looking at the codepoints and bytes, they are quite different...
(2) That's probably the case; let's check by forcing the encoding to
MacRoman:
Or not... it gives crazy results like "espan\xCC\x83ol.lng" or
"espan\u0303ol.lng"
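A likely reason forcing MacRoman does not help: force_encoding only changes the encoding tag on the string and leaves the bytes alone, so the two bytes of the combining tilde are simply reinterpreted as two unrelated MacRoman characters. A small sketch, assuming the filename bytes are exactly those shown in the output above:

```ruby
name = "espan\u0303ol.lng"                      # decomposed UTF-8, as returned by Dir

# force_encoding re-tags the same bytes; nothing is converted.
retagged = name.dup.force_encoding("macRoman")
puts "Same bytes:     #{retagged.bytes == name.bytes}"
puts "New encoding:   #{retagged.encoding}"

# Transcoding the re-tagged string back to UTF-8 yields different text,
# because 0xCC and 0x83 are now treated as two separate characters.
transcoded = retagged.encode("UTF-8")
puts "Equal to name:  #{transcoded == name}"
puts "Length before:  #{name.length}, after: #{transcoded.length}"
```

The bytes were already correct UTF-8, so the mismatch with the hard-coded string comes from normalization rather than from a wrong encoding tag.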
Well, this is beyond my poor knowledge of encodings, I'm afraid.
The most frustrating part is that both strings print the same...
P.S.: I also got filenames with "\r", quite weird, no? ("Target
Application Alias\r", where the "\r" is shown as "?" in the Terminal)
(This is one of the thousands of cases I did *not* document in
string19.rb; I did some of the core methods on String, but of course
every method in every class which either returns a string or accepts a
string argument needs to document how it handles encodings)
> as output of this script (which is then not able to rename any wrong file,
> because tr! seem to not work either on name) :
>
> path = ARGV[0] || "/"
>
> ALLOWED_CHARS = "A-Za-z0-9 %#:$@?!=+~&|'()\\[\\]{}.,\r_-"
>
> Dir["#{File.expand_path(path)}/**/*"].each { |f|
> name = File.basename(f)
> unless name =~ /^[#{ALLOWED_CHARS}]+$/
> puts File.dirname(f) + '/' + name.gsub(/([^#{ALLOWED_CHARS}]+)/,
> ">\\1<")
>
> if name.tr!('éèê', 'e') =~ /^[#{ALLOWED_CHARS}]+$/ # Here it is not
> # complete, it is just a test, but it doesn't work even for 'filéname'
> File.rename(f, File.dirname(f) + '/' + name)
> puts "\trenamed in #{name}"
> break
> end
> end
> }
What error do you get? Is it failing to match the é at all (tr! returns
nil), or is an encoding error raised in tr!, or is an error raised by
File.rename ?
--
Posted via http://www.ruby-forum.com/.
Yes, tr! returns nil on name.tr!('ñ', 'n'), but it works on a String
written inside the script (e.g. "eño".tr!('ñ', 'n')).
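The nil result is consistent with the decomposed name containing no precomposed ñ at all, so tr! has nothing to replace. Below is a hedged sketch of two possible workarounds, assuming Ruby 2.2 or later for String#unicode_normalize. Neither comes from the thread, and the variable names are illustrative.

```ruby
# Assumption: Ruby >= 2.2 (String#unicode_normalize).
name = "espan\u0303ol.lng".dup            # decomposed, like the name from Dir

# No U+00F1 in the string, so tr! changes nothing and returns nil.
p name.tr!("\u00F1", "n")

# Option 1: compose to NFC first, then tr! finds the precomposed ñ.
composed = name.unicode_normalize(:nfc)
p composed.tr!("\u00F1", "n")

# Option 2: decompose to NFD and drop the combining marks (category Mn),
# which strips accents without listing each accented letter.
stripped = name.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
p stripped
```

Option 2 may be the more convenient one for a renaming script like the one quoted above, since it handles é, è, ê and ñ alike. It removes every combining mark, though, which may be more than is wanted for some names.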
2009/12/28 Brian Candler <b.candler@pobox.com>