When I read text with accents from a file under Cygwin, the accented
characters get converted to something like '\352'.
You can then search for these using regexps:
a = "un texte extrêmement énervant"
p splitted_text = a.split(/(?=)/)   # empty lookahead: splits at every position
b = /extr\352mement/                # \352 stands for the byte shown for ê
d = a.match(b)
p d[0]                              # => "extr\352mement"
When I write the result to a file, it appears correctly as "extrêmement".
File.open("t.txt", "w") { |f| f.puts d[0] }
Hope that helps,
Best regards,
Axel
Hi Axel,
thanks for the reply. If I try your code, my characters with accents
don't get translated to numbers, unfortunately. Do you know where these
numbers come from? I looked on the net, but \352 does not seem to be the
octal, hexadecimal or UTF-8 representation of ê. Could you split the
following sentence for me and let me know what the result is:
a="Ils sont très énervé les regexps."
splitted_text=a.split(/\s/)
Not my best French. But if I try this, 'très énervé les' is still one
part, even though I split on the spaces. Maybe it is different for
you, and then I have to look deeper. Thanks for your help. If anybody is
able to split it into 'très', 'énervé', 'les', please let me know!
Kind regards,
Nick
--
Posted via http://www.ruby-forum.com/.
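For what it is worth, \352 does look like an octal escape: 3*64 + 5*8 + 2 = 234 = 0xEA, which is the ISO-8859-1 (and CP1252) byte for ê. A minimal sketch to check this, assuming a current Ruby (1.9 or later) with String#encode, which the Ruby 1.8 of this thread did not have:

```ruby
# "\352" is an octal escape for a single byte.
byte  = "\352".b                   # binary copy, one byte long
code  = byte.unpack("C").first     # numeric value of that byte
octal = code.to_s(8)               # back to the octal digits
hex   = code.to_s(16)              # the same value in hexadecimal

# Interpret the byte as ISO-8859-1 and transcode it to UTF-8.
char = byte.dup.force_encoding("ISO-8859-1").encode("UTF-8")

puts "decimal #{code}, octal #{octal}, hex #{hex}"
puts char == "\u00EA"              # U+00EA is ê
```

So the number is the byte value of ê in the file's encoding, written in octal.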
The odds are your text is not in UTF-8 but in CP1252 or similar.
In that case, with $KCODE = 'u', split indeed won't work right.
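A plausible explanation for the failed split, stated as an assumption rather than something verified on Ruby 1.8: with $KCODE = 'u' the regexp engine reads the bytes as UTF-8, and the Latin-1 bytes for è (0xE8) and é (0xE9) look like the first byte of a three-byte UTF-8 sequence, so a space among the next two bytes can get swallowed as part of a "character". The sketch below uses a current Ruby (which has no $KCODE) only to show that such bytes are not valid UTF-8, and that the split behaves once the bytes are labelled as ISO-8859-1 and transcoded:

```ruby
# The sentence from the thread, written out as ISO-8859-1 bytes.
latin1_bytes = "Ils sont tr\xE8s \xE9nerv\xE9 les regexps.".b

# Labelled as UTF-8, the byte sequence is malformed.
valid_as_utf8 = latin1_bytes.dup.force_encoding("UTF-8").valid_encoding?

# 0xE8 and 0xE9 have the bit pattern 1110xxxx, the UTF-8 marker for
# "first byte of a three-byte sequence".
three_byte_lead = [0xE8, 0xE9].all? { |b| (b & 0xF0) == 0xE0 }

# Label the bytes correctly, transcode, then split on whitespace.
utf8  = latin1_bytes.dup.force_encoding("ISO-8859-1").encode("UTF-8")
words = utf8.split(/\s/)

p valid_as_utf8, three_byte_lead, words
```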
2006/1/31, Nick Snels <nick.snels@gmail.com>:
Indeed, it isn't in UTF-8. It's in ISO-8859-1 (Latin1). The problem here
is that I would like to work in UTF-8, but I have to read in files. And
these files are often (almost always) in ISO-8859-1. And I haven't found
a way of converting these strings to Unicode in Ruby. é and è etc. form
part of ISO-8859-1.
Anyway, I removed $KCODE altogether from config/environment.rb and now it
works. And Axel, I now get the numbers as well. I had added the following
to config/environment.rb:
$KCODE = 'u'
require 'jcode'
to get Gettext to work. So it turns out that if you aren't working fully
in UTF-8, you have to be careful about adding this.
Thanks for pointing me to $KCODE, twice!
Kind regards,
Nick
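On the conversion question: current Ruby versions (1.9 and later, so not the Ruby of this thread) can transcode while reading. A minimal sketch, assuming the file really is ISO-8859-1; the helper name read_latin1_as_utf8 and the temporary file are made up for the example:

```ruby
require "tempfile"

# Hypothetical helper: read an ISO-8859-1 file and return a UTF-8 string.
# "ISO-8859-1:UTF-8" means external encoding ISO-8859-1, internal UTF-8.
def read_latin1_as_utf8(path)
  File.read(path, encoding: "ISO-8859-1:UTF-8")
end

words = Tempfile.create("latin1") do |f|
  f.binmode
  # Write the sentence as raw ISO-8859-1 bytes to simulate the input file.
  f.write("Ils sont tr\xE8s \xE9nerv\xE9 les regexps.".b)
  f.flush
  read_latin1_as_utf8(f.path).split(/\s/)
end

p words
```

Once the text is UTF-8 inside the program, splitting on whitespace should treat 'très', 'énervé' and 'les' as separate words.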
I have to deal with similar problems when processing the infamous German
umlauts ä, ö, ü. My solution has been to convert a string from latin1 or
latin15 to utf8 like this:
utf8_string=latin1_string.unpack("C*").pack("U*")
and the other way round with
latin1_string=utf8_string.unpack("U*").pack("C*")
This has worked so far and does not require changes to the environment.
HTH,
Lars
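The unpack/pack trick relies on ISO-8859-1 byte values being identical to the first 256 Unicode code points. A small sketch of the round trip on a current Ruby, with one caveat that is an addition here and not from the post above: ISO-8859-15 differs from ISO-8859-1 in a few positions (0xA4 is the euro sign there), so for those bytes the trick and a real transcoding disagree. The going-back direction also only makes sense for code points up to 255, since larger ones do not fit in one byte.

```ruby
# ä ö ü as ISO-8859-1 bytes.
latin1 = "\xE4\xF6\xFC".b

# Byte values reused as Unicode code points, then back again.
utf8 = latin1.unpack("C*").pack("U*")
back = utf8.unpack("U*").pack("C*")

puts utf8 == "\u00E4\u00F6\u00FC"   # the umlauts as UTF-8
puts back == latin1                 # round trip restores the bytes

# Caveat: byte 0xA4 in ISO-8859-15 is the euro sign (U+20AC),
# but the pack trick maps it to U+00A4 (the currency sign).
via_pack   = "\xA4".b.unpack("C*").pack("U*")
via_encode = "\xA4".b.force_encoding("ISO-8859-15").encode("UTF-8")

puts via_pack == "\u00A4"
puts via_encode == "\u20AC"
```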
Hi Nikolai,
thanks for the suggestion, I will definitely give Iconv a try. I hope it
doesn't slow things down a lot.
Kind regards,
Nick