UTF-8 support - still stuck

Thomas_Luedeke1 · 5 March 2011 17:53

OK, I appreciate the feedback on my last post regarding pattern matching
accented French characters. But I am still not getting anywhere.

I'm running Ruby 1.9.2p0.

Here's the type of pseudo-code I want to use.

====================================

variable = "exagérer"

if variable =~ /érer$/ then
print "the verb was #{variable}"
end

====================================

I've tried using jcode (which is apparently gone), -u extensions, having
the string # coding: UTF-8 at the beginning of the script, etc.

What I really want to do is read in a comprehensive list of verbs (with
various French accented characters), then have a simple I/O where I test
myself on a conjugation. So I need to be able to read, write, and
pattern match the accented characters.

What do I have to do to make this work?

TPL

--
Posted via http://www.ruby-forum.com/.

Marvin_GA_lker · 5 March 2011 19:31

Encode your string as UTF-8 and match it against a UTF-8 regexp.
The simplest way to do this is something like this:

==============================
#Encoding: UTF-8

variable = "exagérer"

puts "The verb was #{variable}" if variable =~ /érer/

Ensure that your editor saves the file in UTF-8 (some don't do this by
default, notably Windows Notepad and SciTE).

If you have the verbs in an external file (which I suppose), and that
file is encoded in UTF-8, you can do (assuming that there is one verb
per line):

=================================
#Encoding: UTF-8

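# State the file's encoding explicitly instead of relying on the locale default.
verbs = File.readlines("verbs.txt", :encoding => "UTF-8")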

puts "The verb was #{verbs.first}" if verbs.first =~ /érer/

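If the file is in another encoding, e.g. Windows-1252, name that encoding
when opening the file and let Ruby transcode the lines to UTF-8, so that
they are compatible with the UTF-8 regexp:

==================================
#Encoding: UTF-8

# "r:Windows-1252:UTF-8" reads Windows-1252 bytes and hands back UTF-8 strings.
verbs = File.open("verbs.txt", "r:Windows-1252:UTF-8"){|f| f.readlines}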

puts "The verb was #{verbs.first}" if verbs.first =~ /érer/

The line saying "#Encoding: UTF-8" is a so-called magic comment that
tells Ruby to treat the content of this file as UTF-8-encoded text.
If you leave it out, Ruby 1.9 assumes your file is encoded in US-ASCII,
which will cause errors as soon as you use characters not defined in
ASCII. As an alternative, you may start Ruby with the -U (capital U)
switch, but I didn't try this.

Read up on String#encode and String#force_encoding if you want to
convert between encodings, or to change the encoding tag of a string
without actually touching the data in it.
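
To make the difference between the two methods concrete, here is a small
sketch. The helper names to_windows_1252 and relabel_as_utf8 are made up
for this illustration; only String#encode and String#force_encoding are
Ruby's own. It assumes the script itself is saved as UTF-8.

```ruby
# encoding: UTF-8

# Returns a copy of text transcoded to Windows-1252 (the bytes change).
def to_windows_1252(text)
  text.encode("Windows-1252")
end

# Returns a copy of data relabelled as UTF-8 (the bytes stay the same).
def relabel_as_utf8(data)
  data.dup.force_encoding("UTF-8")
end

verb   = "exagérer"              # UTF-8 literal: "é" takes two bytes
legacy = to_windows_1252(verb)   # here "é" becomes the single byte 0xE9

puts verb.bytesize               # 9
puts legacy.bytesize             # 8

# Simulate data that arrived with the wrong label (BINARY instead of UTF-8).
mislabelled = verb.dup.force_encoding("BINARY")
fixed       = relabel_as_utf8(mislabelled)

puts fixed.valid_encoding?       # true
puts fixed == verb               # true
```

In short: use encode when the bytes are right for their current label and
you want a different encoding, and force_encoding when the bytes are fine
but the label is wrong.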

Since Ruby 1.9, Ruby has quite good support for encodings other than ASCII.

Just a thought: Is there anything such as Regexp#encode?

Vale,
Marvin

7stud · 5 March 2011 22:33

You can try and troubleshoot the problems you are having by determining
the encoding of every string in your program.

To determine your source code's encoding, i.e. what the literal strings
you type in your program get encoded as, do this:

puts __ENCODING__

To determine a particular string's encoding, e.g. a string you read from
a file, do this:

puts the_str.encoding.name
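
Put together, a troubleshooting script along those lines might look like
the sketch below. describe_encoding is a hypothetical helper, not part of
Ruby; the rest uses only __ENCODING__, Encoding.default_external,
String#encoding, String#valid_encoding? and Regexp#encoding.

```ruby
# encoding: UTF-8

# Builds a one-line report of a string's encoding tag and whether its
# bytes are valid for that encoding.
def describe_encoding(label, str)
  "#{label}: #{str.encoding.name} (valid: #{str.valid_encoding?})"
end

puts "source: #{__ENCODING__}"                 # set by the magic comment
puts "external: #{Encoding.default_external}"  # default for data read from files
puts describe_encoding("literal", "exagérer")
puts "regexp: #{/érer$/.encoding.name}"
```

If the string you read from a file reports a different encoding than the
regexp, or reports valid: false, that is the mismatch to fix.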


7stud · 5 March 2011 23:11

By the way, if you read the strings from a file, it might be easier to
change the encoding of the regex to match the encoding of the strings.
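
As far as I know there is no Regexp#encode, but one way to get the same
effect is to transcode the pattern source and compile it with Regexp.new.
This is only a sketch: regexp_for is a made-up helper, and the
Windows-1252 string stands in for a line read from a file in that
encoding.

```ruby
# encoding: UTF-8

# Builds a regexp whose encoding matches sample by transcoding the
# pattern source (assumed to be UTF-8 text) before compiling it.
def regexp_for(sample, pattern_source)
  Regexp.new(pattern_source.encode(sample.encoding))
end

# Stand-in for a line read from a Windows-1252 file.
line = "exagérer".encode("Windows-1252")

matcher = regexp_for(line, "érer$")
puts matcher.encoding
puts "matched" if line =~ matcher
```

Matching the same line against the literal /érer$/ from a UTF-8 source
file would raise Encoding::CompatibilityError instead, because the regexp
and the string disagree about what the non-ASCII bytes mean.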


Thomas_Luedeke1 · 6 March 2011 07:47

The script is:

========================================

#! /bin/ruby -vU

#Encoding: UTF-8

verb = "appèler"
if ${verb} =~ /èler/ then print "The verb was #{verb}" end

========================================

The error I get is:

ruby note.rb

note.rb:9: invalid multibyte char (UTF-8)

note.rb:9: syntax error, unexpected tIDENTIFIER, expecting $end

verb = "appΦler"


Alexey_Petrushin · 7 March 2011 03:22

Add this line to your ~/.profile.

export RUBYOPT="-Ku -rrubygems"

Sadly, there's no other way to set a global default source encoding in
Ruby 1.9.


Marvin_GA_lker · 5 March 2011 19:52

What I forgot to mention: Some editors put an invisible BOM (Byte Order
Mark) at the beginning of UTF-8 files. That one can cause problems
because the first line is not read properly in that case. So ensure your
editor doesn't write the BOM.
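
If you are not sure whether a file carries a BOM, you can check its first
three bytes. The following is a rough sketch; UTF8_BOM, starts_with_bom?
and strip_bom are names invented for this example, and the sample data is
built in memory rather than read from a real file.

```ruby
# encoding: UTF-8

# The UTF-8 byte order mark is the three bytes EF BB BF.
UTF8_BOM = [0xEF, 0xBB, 0xBF].pack("C*").freeze

# True if the raw bytes in data begin with a UTF-8 BOM.
def starts_with_bom?(data)
  data.dup.force_encoding("BINARY").start_with?(UTF8_BOM)
end

# Returns data as UTF-8 text with any leading BOM removed.
def strip_bom(data)
  bytes = data.dup.force_encoding("BINARY")
  bytes = bytes[UTF8_BOM.bytesize..-1] if bytes.start_with?(UTF8_BOM)
  bytes.force_encoding("UTF-8")
end

# Stand-in for the raw content of a file saved "with BOM".
sample = UTF8_BOM + "appeler\n"

puts starts_with_bom?(sample)
puts strip_bom(sample).inspect
```

For data files (as opposed to the script itself), recent Ruby versions
also accept an open mode such as "r:BOM|UTF-8", which skips a leading BOM
while reading.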

Vale,
Marvin

Marvin_GA_lker · 6 March 2011 08:05

Don't leave a blank line between the shebang line and the magic comment.
The magic comment must either be the very first line, or the second one
if you have a shebang.

Vale,
Marvin

7stud · 8 March 2011 22:59

Thomas Luedeke wrote in post #985708:

This seemed to work in Notepad++, set to UTF-8 and with the BOM
off:

======================

#! /bin/ruby -Kn

#Encoding: UTF-8

verb = "appèler"
if( verb =~ /èler/) then print "The verb was #{verb}" end

======================

I think it was the -Kn flag, although I don't understand what that
changes. I'll look into it. Thanks for all your help!

In Ruby 1.8, there is a variable called $KCODE. If you set it to "UTF-8"
(or just "U"), it makes regular expressions match characters rather
than single bytes. If you set $KCODE to "N" (the default), regular
expressions match single bytes (unless you use the /u flag on your
regular expression).

You can set $KCODE from the command line, e.g. -Ku or -Kn. As far as I
can tell, in Ruby 1.9 $KCODE itself no longer has any effect, and the -K
switches instead choose the default encoding for scripts and external
data.


Brabuhr · 8 March 2011 23:52

Are you absolutely certain that your file is UTF-8 encoded?

$ cat i.rb
#Encoding: UTF-8
verb = "appèler"
puts "The verb was #{verb}" if verb =~ /èler/

$ ruby -v i.rb
ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-darwin10.4.0]
The verb was appèler

$ enca -L none i.rb
Universal transformation format 8 bits; UTF-8

$ iconv -t LATIN1 -f UTF8 < i.rb > l.rb

$ enca -L none l.rb
Unrecognized encoding

$ ruby -v l.rb
ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-darwin10.4.0]
l.rb:2: invalid multibyte char (UTF-8)
l.rb:2: syntax error, unexpected tIDENTIFIER, expecting $end
verb = "app?ler"
^
