How to match accented letters on Windows

Mehr_Assaph_Assaph · 6 April 2004 07:30

I just noticed that accented letters like èàéòùì (assuming anyone
can see them correctly in this message)
are not matched by /[a-z]/ or \w on Windows.

I haven’t tried on *nix with a proper locale set, but I wonder whether,
in any case, there is something special I should do to allow these
special letters to be matched as letters.

If you don’t mind altering the string before checking, look at unac (http://www.gnu.org/directory/text/Misc/unac.html).
‘unac’ is a C library and command that removes accents from a string.
In one app I had to match words in different languages (including right-to-left languages). I ended up doing it in reverse: matching everything that wasn’t a space or punctuation mark. I can send it to you if you want (it’s not complete, as I had some constraints on the text).
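
A minimal sketch of both ideas in Ruby. It does not use unac: it assumes Ruby 2.2 or later and swaps in the built-in String#unicode_normalize to strip accents. The method names strip_accents and words are made up for this example, and the punctuation class is only one possible definition of a word boundary.

```ruby
# Sketch only: uses Ruby's built-in Unicode normalization (Ruby 2.2+),
# not the unac library. Method names are illustrative.

# Decompose each accented letter into base letter + combining mark (NFD),
# then delete the combining marks (Unicode category Mn).
def strip_accents(text)
  text.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
end

# "Reverse" matching: a word is any run of characters that is neither
# whitespace nor punctuation.
def words(text)
  text.scan(/[^[:space:][:punct:]]+/)
end

puts strip_accents("èàéòùì")
p words("Passo dopo passo, passò.")
```

Note that stripping turns passò into passo, so it only suits cases where losing that distinction is acceptable.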

HTH,
Assaph

Ps. The regexp engine seems to handle UTF-8 with no problem.
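
For readers on a current Ruby, a hedged sketch of matching accented letters directly. It assumes Ruby 2.4 or later and UTF-8 strings, which is not the 2004-era interpreter discussed in this thread; there, \w and /[a-z]/ stay ASCII-only while \p{L} and [[:alpha:]] are Unicode-aware. The helper names are invented for the example.

```ruby
# Assumes Ruby 2.4+ (for String#match?) and UTF-8 encoded strings.
# The helper names are made up for this example.

# True if the string contains at least one letter in any script.
def unicode_letter?(text)
  text.match?(/\p{L}/)
end

# True if the string contains a character matched by the ASCII-only \w.
def ascii_word_char?(text)
  text.match?(/\w/)
end

# Print, for each accented letter, what each pattern reports.
"èàéòùì".each_char do |ch|
  puts "#{ch}: \\w=#{ascii_word_char?(ch)} \\p{L}=#{unicode_letter?(ch)}"
end
```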

Sam_Roberts · 6 April 2004 13:21

Wrote “Mehr, Assaph (Assaph)” assaph@avaya.com, on Tue, Apr 06, 2004 at 04:30:34PM +0900:

I just noticed that accented letters like èàéòùì (assuming anyone
can see them correctly in this message)
are not matched by /[a-z]/ or \w on Windows.

I haven’t tried on *nix with a proper locale set, but I wonder whether,
in any case, there is something special I should do to allow these
special letters to be matched as letters.

Are you sure you want to do this?

In Latin languages, like French, you can strip the accents, and you have
basically the same pronunciation (though my girlfriend doesn’t agree,
but my pronunciation of her language is so heavily accented anyhow, I
can’t see it mattering). But there are other languages where a
character with an accent is a completely different vowel. I can’t
recall the details, but one of the northern European languages famously
has its two most common vowels (think “e” and “o” in English)
distinguished by an accent. English speakers (myself included!) tend to
strip all accents while reading, assuming they are some kind of fine
detail, just a small modification of the basic sound, but that’s not
true in a number of languages.

Cheers,
Sam

–
Sam Roberts sroberts@certicom.com

Joao_Pedrosa2 · 6 April 2004 13:36

Hi,

Sometimes accented letters are only pronunciation
helpers, but some words are differentiated by them as
well. Take these Portuguese words as an example:
Está → means “is”, as in “here is my cousin”.
Esta → means “this”, as in “this house is for sale”.

Cheers,
Joao

Peter3 · 6 April 2004 14:21

In French, accents make a big difference in pronunciation. The textbook example
is “élève”, where the first “e” is pronounced like the start of the “a”
in “safe”, the second “e” is pronounced like the “ai” in “pair”, and the last
“e” is pronounced like the “i” in “first” (though shorter). Those are
definitely different pronunciations of the same vowel, and each one is
(almost) always pronounced the same way, depending only on the accent.

Accents can also change the meaning of words, the classic examples being “ou”
(or) and “où” (where), or “a” (has) and “à” (at). But in those cases the accent
makes no difference to the pronunciation. I can’t think of an example off the top
of my head where both the pronunciation and the meaning are different.

Peter

Laurent_Julliard4 · 6 April 2004 21:09

On Tue, 6 Apr 2004 22:21:26 +0900, Sam Roberts sroberts@certicom.com wrote:

Are you sure you want to do this?

Well, I can’t, because even if vowels have just one name in my home
language (i.e. a and à are both just ‘a’), they can make a great difference
in meaning. For example,
passo means ‘step’ while passò means ‘passed’.

Anyway, my problem arises from a simple task: counting words in a
file.
I can’t just count every \w+ block, because many of the letters are
not treated as word characters.
I could hardcode the accented letters in my script, but if
a user’s locale settings are different from mine, an accented
value hardcoded in the script would still be different from the one
on the user’s system.

Guillaume_Marcais1 · 6 April 2004 15:23

Not to turn this into a discussion about the French language, but an accent can change
the tense of a verb:

mange: eat
mangé: eaten

Here both the pronunciation and the meaning change.

Guillaume.

Sam_Roberts1 · 7 April 2004 00:35

Quoting surrender_it@rc1.vip.ukl.yahoo.com, on Wed, Apr 07, 2004 at 06:09:21AM +0900:

Anyway, my problem arises from a simple task: counting words in a
file.
I can’t just count every \w+ block, because many of the letters are
not treated as word characters.

How about doing the opposite, counting the number of blocks that
are not space characters?

[^\s]+
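
A minimal sketch of this counting approach in Ruby, assuming UTF-8 input and a current Ruby where \w is ASCII-only by default. The name count_words and the sample sentence are invented for the example.

```ruby
# Assumes UTF-8 input; the method name is illustrative.

# Count whitespace-separated blocks. \S is shorthand for [^\s].
def count_words(text)
  text.scan(/\S+/).size
end

sample = "il passo è più lungo"

# Number of non-space blocks in the sample.
puts count_words(sample)

# For comparison: the blocks an ASCII-only \w+ finds in the same text.
p sample.scan(/\w+/)
```

One trade-off of this pattern is that a stand-alone punctuation token, such as a dash surrounded by spaces, also counts as a word.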

Cheers,
Sam
