Premature end of regular expression with non-ascii chara

Nuralanur · 30 January 2006 23:24

When I read in a text with accents from a file under cygwin, these get
converted to something like '\352'.
You can then search for these using regexps:

a="un texte extrêmement énervant"
p splitted_text=a.split(/(?=)/)
b=/extr\352mement/
d=a.match(b)
p d[0]   # => "extr\352mement"

When I write the result to a file, it appears correctly as "extrêmement".

f=File.new("t.txt",'w')
f.puts d[0]
f.close

Hope that helps,

Best regards,

Axel

Nick_Snels · 31 January 2006 11:08

Hi Axel,

Thanks for the reply. If I try your code, my accented characters
unfortunately don't get translated to numbers. Do you know where these
numbers come from? I looked on the net, but \352 is not the octal,
hexadecimal or UTF-8 representation of ê. Could you split the following
sentence for me and let me know what the result is:

a="Ils sont très énervé les regexps."
splitted_text=a.split(/\s/)

Not my best French. But if I try this, 'très énervé les' is still one
part, even though I split on the spaces. Maybe it is different for
you, and then I have to look deeper. Thanks for your help. If anybody is
able to split it like 'très', 'énervé', 'les', please let me know!

Kind regards,

Nick

--
Posted via http://www.ruby-forum.com/.

Lugovoi_Nikolai · 31 January 2006 11:32

The odds are your text is not in UTF-8 but in CP1252 or something similar.
In that case, with $KCODE = 'u' set, split indeed won't work right.
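
For readers on a newer Ruby, the mismatch can be sketched roughly as below. This assumes Ruby 1.9 or later, where each string carries its own encoding; the sample bytes and variable names are made up for illustration and are not taken from Nick's files.

```ruby
# Assumption: Ruby 1.9+ (per-string encodings). Sample data is illustrative.
raw = "tr\xE8s \xE9nerv\xE9 les".b        # ISO-8859-1 bytes, tagged as binary

# Read as UTF-8, these bytes are malformed ...
as_utf8 = raw.dup.force_encoding("UTF-8")
utf8_ok = as_utf8.valid_encoding?

# ... but tagged as ISO-8859-1 they split on whitespace as expected.
as_latin1 = raw.dup.force_encoding("ISO-8859-1")
words = as_latin1.split(/\s/).map { |w| w.encode("UTF-8") }

# Octal 352 and hex EA are the same byte, which is "ê" in ISO-8859-1.
same_byte = "\352".b == "\xEA".b

p utf8_ok, words, same_byte
```

So the '\352' in Axel's output is simply the single Latin-1 byte for ê printed as an octal escape.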

Nick_Snels · 31 January 2006 12:11

Indeed, it isn't in UTF-8. It's in ISO-8859-1 (Latin1). The problem here
is that I would like to work in UTF-8, but I have to read in files. And
these files are often (almost always) in ISO-8859-1. And I haven't found
a way of converting these strings to Unicode in Ruby. é and è etc. form
part of ISO-8859-1.

Anyway, I removed $KCODE altogether from config/environment.rb and now it
works. And Axel, I now get the numbers as well. In config/environment.rb
I had added:

$KCODE = 'u'
require 'jcode'

to get Gettext to work. So it turns out that if you aren't working fully
in UTF-8, you have to be careful about adding this.

Thanks for pointing me to $KCODE, twice!

Kind regards,

Nick

Lugovoi_Nikolai · 31 January 2006 12:22

Use the Iconv library.
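
A minimal sketch of such a conversion, with one caveat: Iconv was dropped from the standard library in Ruby 2.0, so the code below swaps in String#encode, which plays the same role on Ruby 1.9 and later. On Ruby 1.8 the corresponding call would be Iconv.conv('UTF-8', 'ISO-8859-1', text) after require 'iconv'. The helper name latin1_to_utf8 and the sample text are hypothetical.

```ruby
# Assumption: Ruby 1.9+; String#encode stands in for Iconv here.
# latin1_to_utf8 is a hypothetical helper, not part of any library.
def latin1_to_utf8(bytes)
  # Tag a copy of the bytes as ISO-8859-1, then transcode to UTF-8.
  bytes.dup.force_encoding("ISO-8859-1").encode("UTF-8")
end

latin1 = "extr\xEAmement \xE9nervant".b   # bytes as read from a Latin-1 file
utf8   = latin1_to_utf8(latin1)
parts  = utf8.split(/\s/)                 # now splits on the space as expected

p utf8.encoding, parts
```

Converting once, right after reading the file, keeps the rest of the program in UTF-8.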

Lars_Broecker1 · 1 February 2006 21:46

I have to deal with similar problems when processing the infamous German
umlauts äöü. My solution has been to convert a string from latin1 or
latin15 to utf8 via this:
utf8_string=latin1_string.unpack("C*").pack("U*")

and the other way round with
latin1_string=utf8_string.unpack("U*").pack("C*")

This has worked so far and does not require any changes to the environment.
HTH,
Lars
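
A quick round trip with the two one-liners above, as a sketch with made-up sample bytes. It relies on ISO-8859-1 byte values coinciding with the Unicode code points U+0000 to U+00FF; ISO-8859-15 differs at a few positions (0xA4 is the euro sign there), so those bytes would not come out right this way.

```ruby
# Sample data is illustrative: "äöü" as three ISO-8859-1 bytes.
latin1_string = "\xE4\xF6\xFC".b

# Bytes -> integers -> UTF-8 (each byte value is used as a code point).
utf8_string = latin1_string.unpack("C*").pack("U*")

# UTF-8 -> code points -> single bytes again.
round_trip = utf8_string.unpack("U*").pack("C*")

p utf8_string, round_trip == latin1_string
```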

Nick_Snels · 31 January 2006 13:18

Hi Nikolai,

Thanks for the suggestion, I will definitely give Iconv a try. I hope it
doesn't slow things down a lot.

Kind regards,

Nick
