I just noticed that accented letters like èàéòùì (assuming anyone can
even see them correctly in this message)
are not matched by /[a-z]/ or \w on Windows.
I’ve not tried on *nix with a proper locale set, but I wonder whether,
in any case, there is something special I should do to allow this kind of
special letter to be matched as a letter.
In one app I had to match words in different languages (including right-to-left languages). I ended up doing it in reverse: matching everything that wasn’t a space or punctuation mark. I can send it to you if you want (it’s not complete, as I had some constraints on the text).
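A minimal sketch of that "reverse" approach, not the code offered above: it assumes a current Ruby with UTF-8 strings, and the helper name `words` is made up for illustration. POSIX bracket classes such as `[[:space:]]` and `[[:punct:]]` are Unicode-aware in Ruby regexps, so nothing has to be said about which characters count as letters.

```ruby
# Treat a "word" as any run of characters that is neither whitespace nor
# punctuation, instead of listing which characters count as letters.
# `words` is a hypothetical helper name, not something from this thread.
def words(text)
  text.scan(/[^[:space:][:punct:]]+/)
end

p words("Città è bella, davvero!")
# => ["Città", "è", "bella", "davvero"]

# Two Hebrew words (written with escapes to keep the source left-to-right).
p words("\u05E9\u05DC\u05D5\u05DD \u05E2\u05D5\u05DC\u05DD").size
# => 2
```

One consequence of this choice: an apostrophe is punctuation, so something like "l'acqua" comes out as two words.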
HTH,
Assaph
Ps. The regexp engine seems to handle UTF-8 with no problem.
Wrote “Mehr, Assaph (Assaph)” assaph@avaya.com, on Tue, Apr 06, 2004 at 04:30:34PM +0900:
I just noticed that accented letters like èàéòùì (assuming anyone can
even see them correctly in this message)
are not matched by /[a-z]/ or \w on Windows.
I’ve not tried on *nix with a proper locale set, but I wonder whether,
in any case, there is something special I should do to allow this kind of
special letter to be matched as a letter.
Are you sure you want to do this?
In Latin languages, like French, you can strip the accents, and you have
basically the same pronunciation (though my girlfriend doesn’t agree,
but my pronunciation of her language is so heavily accented anyhow, I
can’t see it mattering). But there are other languages where a
character with an accent is a completely different vowel. I can’t
recall the details, but one of the northern European languages famously
has its two most common vowels (think “e” and “o” in English)
distinguished by an accent. English speakers (myself included!) tend to
strip all accents while reading, assuming they are some kind of fine
detail, just a small modification of the basic sound, but that’s not
true in a number of languages.
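For anyone who does want to strip accents, here is a rough sketch of one way to do it. It assumes Ruby 2.2 or later (for `String#unicode_normalize`) and UTF-8 strings; `strip_accents` is a made-up helper name. It also shows the cost: two different words can collapse into one.

```ruby
# Decompose to NFD so each accent becomes a separate combining mark,
# then delete the marks (Unicode category Mn, nonspacing marks).
# `strip_accents` is a hypothetical helper name used for illustration.
def strip_accents(text)
  text.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
end

puts strip_accents("élève")                             # => eleve
puts strip_accents("passò") == strip_accents("passo")   # => true
```

Letters that have no decomposition, such as "ø", pass through unchanged, so this only covers accents that Unicode models as a base letter plus a mark.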
Sometimes accented letters are only pronunciation
helpers, but some words are differentiated by them as
well. Take these Portuguese words as an example:
Está → means “is”, as in “here is my cousin”.
Esta → means “this”, as in “this house is for sale”.
In French, accents make a big difference in pronunciation. The textbook
example is “élève”, where the first “e” is pronounced like the start of the “a”
in “safe”, the second “e” is pronounced like the “ai” in “pair”, and the last
“e” is pronounced like the “i” in “first” (though shorter). Those are
definitely different pronunciations of the same vowel, and they are
(almost) always pronounced the same way, depending only on the accent.
Accents can also change the meaning of words, the classic examples being “ou”
(or) and “où” (where), or “a” (has) and “à” (at). But there the accent makes
no difference to the pronunciation. I can’t think of an example off the top of
my head where both the pronunciation and the meaning are different.
Well, I can’t, because even if vowels have just one name in my home
language (i.e. a and à are both just ‘a’), they may carry great differences
in meaning. Say,
passo stands for ‘step’ while passò stands for ‘passed’.
Anyway, my problem arises from a simple task: counting words in a
file.
I can’t just count every \w+ block, because many of the letters are
not treated as word characters.
I could hardcode the accented letters in my script, but if
a user’s locale settings are different from mine, an accented
character hardcoded in the script would still be different from the one
in the user’s text.
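One possible way around both problems, sketched under the assumption of a current Ruby (1.9 or later) and UTF-8 input: Unicode property classes such as `\p{L}` describe "any letter" without hardcoding particular characters, and reading the file with an explicit encoding avoids depending on each user's locale. The names `count_words` and `count_words_in_file` are invented for this sketch.

```ruby
# Count words as runs of Unicode letters, combining marks, and digits.
# \p{L}, \p{M}, and \p{N} are Unicode property classes, so no accented
# character has to be hardcoded in the script.
def count_words(text)
  text.scan(/[\p{L}\p{M}\p{N}]+/).size
end

# Read the file as UTF-8 explicitly, so the result does not depend on
# the locale of whoever runs the script (assumes the file is UTF-8).
def count_words_in_file(path)
  count_words(File.read(path, encoding: "UTF-8"))
end

sample = "Il passo che passò: perché è così?"
puts count_words(sample)        # => 7
puts sample.scan(/\w+/).size    # => 6, since \w only covers ASCII here
```

The second count comes out lower because "è" is skipped entirely while "passò", "perché", and "così" are cut short at the accented letter.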
Not to turn this into a discussion about the French language, but an accent
can change the tense of a verb:
mange: eat
mangé: eaten
There both the pronunciation and the meaning have changed.
Guillaume.
On Tue, 2004-04-06 at 10:21, Peter wrote:
Are you sure you want to do this?
How about doing the opposite, counting the number of blocks that
are not space characters?
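A quick sketch of that idea, assuming Ruby and a made-up helper name `count_blocks`: since `\S` matches anything that is not whitespace, accented letters need no special handling.

```ruby
# Count maximal runs of non-whitespace characters.
# `count_blocks` is a hypothetical name used only for this sketch.
def count_blocks(text)
  text.scan(/\S+/).size
end

puts count_blocks("passo  passò\n\tperché è così")   # => 5
puts count_blocks("a - b")                            # => 3
```

As the second call shows, a free-standing punctuation mark is counted as a block too, so this gives an upper bound rather than an exact word count.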