How to match accented letters on Windows

Hi gurus and nubys,

I just noticed that accented letters like èàéòùì (if anyone can even
see them correctly in this message)
are not matched by /[a-z]/ or \w on Windows.

I’ve not tried on *nix with a proper locale set, but I wonder whether,
in any case, there is something special I should do to allow these
special letters to be matched as letters.

(running pragprog ruby 1.8.1 on win xp)

Hi Gabriele,

gabriele renzi wrote:

Hi gurus and nubys,

I just noticed that accented letters like èàéòùì (if anyone can even
see them correctly in this message)
are not matched by /[a-z]/ or \w on Windows.

irb
irb(main):001:0> a = "Wrongly áccèntêd"
=> "Wrongly \240cc\212nt\210d"
irb(main):002:0> a =~ /é/
=> nil
irb(main):003:0> a =~ /è/
=> 11
irb(main):004:0> a =~ /ê/
=> 14
irb(main):005:0>

They are matched, but apparently they’re not part of [a-z].
I think [a-z] is simply mapped to the ASCII (or whatever) character codes,
so only the codes from 'a' through 'z' fall inside the range.
Anyway:

irb(main):002:0> a =~ /[é-ê]/
=> 14
irb(main):003:0>

(Running a 1.8.0 on WinXP)
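
For anyone reading this later: a minimal sketch of the same check on a modern Ruby, assuming Ruby 1.9 or newer and a UTF-8 string (the 1.8 sessions above worked on raw console bytes, so the numbers there differ). The variable names are only for illustration. It suggests that [a-z] is a plain code point range, while POSIX brackets and \p{L} know about letters outside ASCII.

```ruby
# Assumes Ruby 1.9+ with a UTF-8 string; the 1.8 sessions in this thread saw raw bytes.
text = "Wrongly áccèntêd"

# [a-z] covers only code points 97..122, so accented letters fall outside it.
ascii_lower = text.scan(/[a-z]/).join

# POSIX brackets and \p{L} are Unicode-aware in modern Ruby.
posix_alpha = text.scan(/[[:alpha:]]/).join
prop_letter = text.scan(/\p{L}/).join

# \w stays ASCII-only by default, which matches what the original post observed.
word_match = ("é" =~ /\w/)

# Indexes are counted in characters here, not bytes.
grave_index = (text =~ /è/)

puts ascii_lower, posix_alpha, prop_letter, word_match.inspect, grave_index
```

So on such a setup, [[:alpha:]] or \p{L} is probably the closest thing to "match any letter", with no locale involved.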

BTW, you could just as well write "yöûr stríng gòés héré" =~ /[\340-\357]/
(the escapes are octal, so that is 224-239 decimal, assuming ISO-8859-1 bytes).
Using that you’ll get at least the vowels (that I know of).
Note that there are a whole lot of accented letters: check your
character table, somewhere under Start|Programs|Utilities|System…
I translated that from my German menu entries, so YMMV a bit.
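
A hedged sketch of the explicit-range idea on a modern Ruby, assuming Ruby 1.9+ and UTF-8 text, where the range can be written with \u code point escapes instead of byte values. The constant name is made up for this example, and the range only covers the lowercase letters of the Latin-1 block (U+00E0 to U+00FF, skipping the division sign U+00F7), not every accented letter there is.

```ruby
# Assumes Ruby 1.9+ and UTF-8 text. The constant name is hypothetical.
# U+00E0..U+00FF holds the lowercase Latin-1 letters; U+00F7 is the division
# sign, so the range is split around it.
LATIN1_LOWER_LETTERS = /[\u00E0-\u00F6\u00F8-\u00FF]/

sample = "yöûr stríng gòés héré"

hits      = sample.scan(LATIN1_LOWER_LETTERS)  # every matching character
first_hit = (sample =~ LATIN1_LOWER_LETTERS)   # character index of the first one

puts hits.join(" ")
puts first_hit
```

Writing the range as code points avoids having to know which bytes the console or the file encoding uses for each letter.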

I’ve not tried on *nix with a proper locale set, but I wonder whether,
in any case, there is something special I should do to allow these
special letters to be matched as letters.

(running pragprog ruby 1.8.1 on win xp)

Now, that leads me to a question: Should accented letters be matched
by [a-z]? Personally, I am not sure whether they should…

Happy rubying

Stephan

Stephan Kämper wrote:

Now, that leads me to a question: Should accented letters be matched
by [a-z]? Personally, I am not sure whether they should…

Well, if your locale was French then you would expect it to match the
accented characters used in French, but it should ignore the
Icelandic thorn or the Dutch y umlaut thingy.

So the answer is: it depends, and it depends on locale.

I wouldn’t want it set based on a pre-set locale… I don’t think that
would be dynamic enough. What if you need to match characters from more
than one language?

Maybe this should be handled by character classes? What if we could
modify a character class with a country/language code? Something like:

/[[:alpha-es:]]*/.match("mañana").to_s #=> "mañana"

or to match valid characters in any language:

str = "mañana n'êtes"
/[[:alpha-all:] ]*/.match(str).to_s #=> "mañana n'êtes"

This would probably have trouble if the text wasn’t Unicode, though.
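
The [[:alpha-es:]] syntax above is a proposal, not something Ruby has. As a rough sketch of how the idea could be approximated today, assuming Ruby 1.9+ and UTF-8 strings: the LETTERS table and the alpha_for helper below are hypothetical names invented for this example, and the letter lists are illustrative rather than authoritative alphabets. The "any language" case is already covered by the real [[:alpha:]] class.

```ruby
# Hypothetical per-language letter sets; neither LETTERS nor alpha_for is part of Ruby.
LETTERS = {
  es: "a-zñáéíóúü",
  fr: "a-zàâæçéèêëîïôœùûüÿ"
}

# Build a regexp that matches a run of letters for the given language code.
def alpha_for(lang)
  Regexp.new("[#{LETTERS.fetch(lang)}]+")
end

spanish = alpha_for(:es).match("mañana").to_s   # ñ is in the Spanish set
french  = alpha_for(:fr).match("mañana").to_s   # stops before ñ

# Any-language variant, using the built-in Unicode-aware POSIX bracket.
str = "mañana n'êtes"
any_language = /[[:alpha:]' ]*/.match(str).to_s

puts spanish, french, any_language
```

A table like this has to be maintained by hand, which is the main argument for letting the regexp engine or Unicode properties do the work instead.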

–Mark


On Apr 1, 2004, at 6:27 AM, Peter Hickman wrote:

Stephan Kämper wrote:

Now, that leads me to a question: Should accented letters be
matched by [a-z]? Personally, I am not sure whether they should…

Well, if your locale was French then you would expect it to match the
accented characters used in French, but it should ignore the
Icelandic thorn or the Dutch y umlaut thingy.

So the answer is: it depends, and it depends on locale.

Have a look at this document for more info about i18n/m17n in regexps:
http://www.unicode.org/unicode/reports/tr18/

However, neither GNU regex nor Oniguruma supports it fully.
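
As a small illustration of the kind of feature that report describes, here is a sketch using Unicode script properties, assuming a Ruby 1.9+ build where \p{Latin} and \p{Greek} are available and the strings are UTF-8. It only shows one slice of TR18, not full conformance, and the word list is made up for the example.

```ruby
# Assumes Ruby 1.9+ with UTF-8 strings. The sample words are illustrative only.
words = ["mañana", "þorn", "ÿ", "αβγ"]

# Script properties classify characters by writing system rather than by locale.
latin_words = words.select { |w| w =~ /\A\p{Latin}+\z/ }
greek_words = words.select { |w| w =~ /\A\p{Greek}+\z/ }

puts latin_words.inspect
puts greek_words.inspect
```

Note that thorn and y with diaeresis both count as Latin script here, so a script property cannot separate French from Icelandic or Dutch letters; that distinction would still need a per-language list.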


On Fri, 02 Apr 2004 02:16:55 +0900, Mark Hubbart wrote:

On Apr 1, 2004, at 6:27 AM, Peter Hickman wrote:

Stephan Kämper wrote:

Well, if your locale was French then you would expect it to match the
accented characters used in French, but it should ignore the
Icelandic thorn or the Dutch y umlaut thingy.

So the answer is: it depends, and it depends on locale.

I wouldn’t want it set based on a pre-set locale… I don’t think that
would be dynamic enough. What if you need to match characters from more
than one language?

Maybe this should be handled by character classes? What if we could
modify a character class with a country/language code? Something like:


Simon Strandgaard

I like Perl’s solution; take a look at the section 'matching
letters' here:
http://pleac.sourceforge.net/pleac_perl/patternmatching.html


On Fri, 2 Apr 2004 01:16:55 +0900, Mark Hubbart discord@mac.com wrote:

I wouldn’t want it set based on a pre-set locale… I don’t think that
would be dynamic enough. What if you need to match characters from more
than one language?