Non-english characters

Daniel10 · 17 December 2003 05:01

how do I match non-english alphabetical characters? Such as the german
double-s ? (ß)

db

···

–
A.D. 1844: Samuel Morse invents Morse code. Cryptography export
restrictions prevent the telegraph’s use outside the U.S. and Canada.

Yukihiro_Matsumoto2 · 17 December 2003 05:31

Hi,

···

In message “non-english characters” on 03/12/17, Daniel Bretoi lists@debonair.net writes:

how do I match non-english alphabetical characters? Such as the german
double-s ? (ß)

Which encoding do you wish to use?

						matz.

Daniel10 · 17 December 2003 07:00

I’m not sure, how can I find out what the germans use? and once I know
that part, how do I use it?

db

···

On Wed, Dec 17, 2003 at 02:31:32PM +0900, Yukihiro Matsumoto wrote:

Hi,

In message “non-english characters” on 03/12/17, Daniel Bretoi lists@debonair.net writes:

how do I match non-english alphabetical characters? Such as the german
double-s ? (ß)

Which encoding do you wish to use?

Yukihiro_Matsumoto2 · 17 December 2003 07:05

Hi,

···

In message “Re: non-english characters” on 03/12/17, Daniel Bretoi lists@debonair.net writes:

I’m not sure, how can I find out what the germans use? and once I know
that part, how do I use it?

Ask somebody around you to find out. Then if you’re going to use
Unicode (UTF-8), write your script in UTF-8 and invoke Ruby with -Ku
option. If you use ISO-8859-* or any other single byte encoding, you
don’t have to do anything special.

						matz.
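A minimal sketch of the UTF-8 route, assuming a current Ruby (2.0 or later, where source files default to UTF-8) rather than the 1.8 series discussed in this thread. The sample sentence and variable names are made up for illustration.

```ruby
# Assumes Ruby 2.0+ and a script saved as UTF-8.
text = "Die Straße ist groß"

# Check for the literal character.
has_eszett = text.include?("ß")

# Collect runs of letters, ASCII or not, using the Unicode Letter property.
words = text.scan(/\p{L}+/)

p has_eszett
p words
```

The `\p{L}` property is used here so the pattern does not have to list every accented letter by hand.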

Josef_Jupp_SCHUGT · 18 December 2003 00:30

Hi!

Daniel Bretoi; 2003-12-17, 19:03 UTC:

Hi,

how do I match non-english alphabetical characters? Such as the german
double-s ? (ß)

Which encoding do you wish to use?

I’m not sure, how can I find out what the germans use? and once I
know that part, how do I use it?

For German you can use an awful lot of different encodings. Take a
look at the charsets listed at http://dwd.da.ru/charsets/index.html

Most likely ISO 8859-1, ISO 8859-15, or UTF-8 are used, but ISO 8859-2
is also in use. The ISO charsets have umlauts and ß in identical
positions, so the question reduces to UTF-8 vs. ISO 8859. (The Windows
codepages one would consider are ISO 8859 charsets with additional
characters in the 128…159 region, which is unused by the ISO 8859
charsets.)

Josef ‘Jupp’ SCHUGT
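The point about identical positions can be checked with a short script. This is only a sketch, assuming a current Ruby (2.0+) with its built-in transcoders and a UTF-8 source file; the variable names are illustrative. It also uses the euro sign in Windows-1252 as one example of a character placed in the 128…159 region.

```ruby
# Assumes a UTF-8 source file (the default since Ruby 2.0).
german_chars = %w[ä ö ü Ä Ö Ü ß]
iso_charsets = %w[ISO-8859-1 ISO-8859-2 ISO-8859-15]

german_chars.each do |ch|
  # Byte value of the character in each ISO 8859 variant.
  iso_bytes = iso_charsets.map { |name| ch.encode(name).bytes.first }
  utf8_bytes = ch.bytes
  puts "#{ch}: ISO 8859 #{iso_bytes.inspect}, UTF-8 #{utf8_bytes.inspect}"
end

eszett_iso  = iso_charsets.map { |name| "ß".encode(name).bytes }
eszett_utf8 = "ß".bytes

# Windows-1252 places extra characters such as the euro sign in 128..159.
euro_cp1252 = "€".encode("Windows-1252").bytes
```

In each ISO 8859 variant the character is a single byte, while UTF-8 needs two bytes for it, which is why the choice between the two families matters for matching.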

–
http://oss.erdfunkstelle.de/ruby/ - German comp.lang.ruby-FAQ
http://rubyforge.org/users/jupp/ - Ruby projects at Rubyforge
...................................
Windows are best when they are “unseen” – Chet Noll 27 Oct 2000

M_Mohr · 17 December 2003 08:27

hmm.

regexp works fine for me with unicode. either with “ruby -Ku” on
startup or with the /u as regexp-option.

but with ISO-8859-* (1 or 15 in my case) i don’t get \w to match
accented characters.

no big deal, i’m just curious what i’m doing wrong here. i’m using
ruby-1.8.1 from debian testing.

···

On Wed, Dec 17, 2003 at 04:05:32PM +0900, Yukihiro Matsumoto wrote:

Hi,

In message “Re: non-english characters” on 03/12/17, Daniel Bretoi lists@debonair.net writes:

I’m not sure, how can I find out what the germans use? and once I know
that part, how do I use it?

Ask somebody around you to find out. Then if you’re going to use
Unicode (UTF-8), write your script in UTF-8 and invoke Ruby with -Ku
option. If you use ISO-8859-* or any other single byte encoding, you
don’t have to do anything special.
  					matz.

Yukihiro_Matsumoto2 · 17 December 2003 08:51

Hi,

···

In message “Re: non-english characters” on 03/12/17, messju mohr messju@lammfellpuschen.de writes:

but with ISO-8859-* (1 or 15 in my case) i don’t get \w to match
accented characters.

That’s a restriction: the character class \w is defined as [a-zA-Z0-9_].
This restriction will be removed in Ruby 1.9 by using ISO-8859-*
specific encodings.

						matz.
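For readers on a current Ruby: the documentation still describes `\w` as `[a-zA-Z0-9_]`, while POSIX bracket expressions such as `[[:alpha:]]` are Unicode-aware. A small sketch of the difference, assuming Ruby 2.0+ and a UTF-8 string (the sample word is only an illustration):

```ruby
# Assumes Ruby 2.0+ and a script saved as UTF-8.
word = "Straße"

# \w covers ASCII letters, digits and underscore only, so ß splits the word.
ascii_runs = word.scan(/\w+/)

# [[:alpha:]] also accepts non-ASCII letters in a UTF-8 string.
letter_runs = word.scan(/[[:alpha:]]+/)

p ascii_runs
p letter_runs
```

If word characters beyond ASCII are wanted, `[[:alpha:]]` or `\p{L}` is the safer choice than `\w`.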

Robert · 17 December 2003 09:12

“messju mohr” messju@lammfellpuschen.de wrote in message
news:20031217082712.GE17320@pharao.lammfellpuschen.de…

···

On Wed, Dec 17, 2003 at 04:05:32PM +0900, Yukihiro Matsumoto wrote:

Hi,

In message “Re: non-english characters” on 03/12/17, Daniel Bretoi lists@debonair.net writes:

I’m not sure, how can I find out what the germans use? and once I know
that part, how do I use it?

Ask somebody around you to find out. Then if you’re going to use
Unicode (UTF-8), write your script in UTF-8 and invoke Ruby with -Ku
option. If you use ISO-8859-* or any other single byte encoding, you
don’t have to do anything special.

matz.

hmm.

regexp works fine for me with unicode. either with “ruby -Ku” on
startup or with the /u as regexp-option.

but with ISO-8859-* (1 or 15 in my case) i don’t get \w to match
accented characters.

I guess \w is defined in terms of ASCII - and there you don’t have “ß”, “é”
and similar chars.

Regards

robert

M_Mohr · 17 December 2003 10:40

“messju mohr” messju@lammfellpuschen.de wrote in message
news:20031217082712.GE17320@pharao.lammfellpuschen.de…

Hi,

I’m not sure, how can I find out what the germans use? and once I know
that part, how do I use it?

Ask somebody around you to find out. Then if you’re going to use
Unicode (UTF-8), write your script in UTF-8 and invoke Ruby with -Ku
option. If you use ISO-8859-* or any other single byte encoding, you
don’t have to do anything special.

matz.

hmm.

regexp works fine for me with unicode. either with “ruby -Ku” on
startup or with the /u as regexp-option.

but with ISO-8859-* (1 or 15 in my case) i don’t get \w to match
accented characters.

I guess \w is defined in terms of ASCII - and there you don’t have “ß”, “é”
and similar chars.

yes, it looks like i got confused by the PCRE library which treats \w
according to the current locale. too-many-languages error.


Ara.T.Howard2 · 17 December 2003 17:36

depends on your definition of ‘treats’ and ‘locale’

-bash-2.05b$ cat /etc/redhat-release
Red Hat Enterprise Linux WS release 3 (Taroon)

-bash-2.05b$ perl -v | head -2 # why so much output!

This is perl, v5.8.0 built for i386-linux-thread-multi

-bash-2.05b$ ruby -v
ruby 1.6.8 (2002-12-24) [i386-linux-gnu]

BROKEN “TREATMENT” OF LOCALE

-bash-2.05b$ export LANG=en_US.UTF-8
-bash-2.05b$ echo abc | perl -ne 'print if /[^\s]+/'
-bash-2.05b$ echo abc | ruby -ne 'print if /[^\s]+/'
abc

THIS IS OK

-bash-2.05b$ export LANG=en_US
-bash-2.05b$ echo abc | perl -ne 'print if /[^\s]+/'
abc
-bash-2.05b$ echo abc | ruby -ne 'print if /[^\s]+/'
abc

definitely need to examine output carefully where regexes and locale are in
effect - probably better off using ruby since matz presumably has more
experience with multibyte chars than 'ol larry!

-a
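The Ruby half of the session above can also be written as a script. This is a sketch for a current Ruby (2.0+), not the 1.6.8 used in the transcript: there the regexp is matched against the string's own encoding, and the locale only feeds into the default external encoding used for I/O. The variable names are illustrative.

```ruby
# Assumes Ruby 2.0+; mirrors `echo abc | ruby -ne 'print if /[^\s]+/'`.
line = "abc\n"

# Index of the first run of non-whitespace characters, or nil if none.
matched = line =~ /[^\s]+/
print line if matched

# The literal carries its own encoding; the locale sets the I/O default.
puts "string encoding:  #{line.encoding}"
puts "default external: #{Encoding.default_external}"
```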

···

On Wed, 17 Dec 2003, messju mohr wrote:

yes, it looks like i got confused by the PCRE library which treats \w
according to the current locale. too-many-languages error.

–

ATTN: please update your address books with address below!

===============================================================================

EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
PHONE :: 303.497.6469
ADDRESS :: E/GC2 325 Broadway, Boulder, CO 80305-3328
STP :: http://www.ngdc.noaa.gov/stp/
NGDC :: http://www.ngdc.noaa.gov/
NESDIS :: http://www.nesdis.noaa.gov/
NOAA :: http://www.noaa.gov/
US DOC :: http://www.commerce.gov/

The difference between art and science is that science is what we
understand well enough to explain to a computer.
Art is everything else.
– Donald Knuth, “Discover”

/bin/sh -c 'for l in ruby perl;do $l -e "print \"\x3a\x2d\x29\x0a\"";done'
===============================================================================

M_Mohr · 17 December 2003 18:01

yes, it looks like i got confused by the PCRE library which treats \w
according to the current locale. too-many-languages error.

depends on your definition of ‘treats’ and ‘locale’

-bash-2.05b$ cat /etc/redhat-release
Red Hat Enterprise Linux WS release 3 (Taroon)

-bash-2.05b$ perl -v | head -2 # why so much output!

This is perl, v5.8.0 built for i386-linux-thread-multi

-bash-2.05b$ ruby -v
ruby 1.6.8 (2002-12-24) [i386-linux-gnu]

BROKEN “TREATMENT” OF LOCALE

-bash-2.05b$ export LANG=en_US.UTF-8
-bash-2.05b$ echo abc | perl -ne 'print if /[^\s]+/'
-bash-2.05b$ echo abc | ruby -ne 'print if /[^\s]+/'
abc

THIS IS OK

-bash-2.05b$ export LANG=en_US
-bash-2.05b$ echo abc | perl -ne 'print if /[^\s]+/'
abc
-bash-2.05b$ echo abc | ruby -ne 'print if /[^\s]+/'
abc

definitely need to examine output carefully where regexes and locale are in
effect - probably better off using ruby since matz presumably has more
experience with multibyte chars than 'ol larry!

i was talking about ISO-8859-* character sets and already said that
UTF-8 works for me.
your example works fine for me with
“This is perl, v5.8.2 built for i386-linux-thread-multi” (from
debian unstable).
i meant the PCRE library from
ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/ . it’s meant
to be perl compatible but it is not the actual implementation in
the perl-interpreter, AFAIK.

no need to convince me to use ruby over perl

greetings
messju

