Newb: Rails character encoding and validation

I'm putting together a basic Rails application and writing my first unit tests for it. It occurred to me that the user 'name' field might need to contain foreign characters (like é, â, ì, ø, etc.), but two problems have popped up. Firstly, I can't dig up a good reference for a suitable regular expression for validating the field.
At the moment, I'm using:
validates_format_of :name, :with => /^[-' a-zA-Z]+$/
but this isn't going to allow the foreign characters, so the test fails.

The second problem is the error message I get when I run the unit test: my test sets the name to José, but the failure message when I run the script shows JosÜ.
It looks like the character encoding of my editor isn't the same as the character encoding that Rails is using.

So, a) any clues as to what is going on? And b) is there a consistent way of dealing with foreign characters for validation purposes?

Many thanks in advance!

Mark.

It is not really viable to validate a name field with a regex if you
are willing to accept Unicode characters. The only reasonable
validation is to check whether the field is empty.

As for the testing, you'll have to configure your editor and your
terminal to use the same encoding to get consistent results. UTF-8 is
the best bet, as it is the encoding the W3C recommends and every modern
browser supports it.
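
To illustrate why both sides have to agree on an encoding, here is a minimal Ruby sketch. It assumes Ruby 1.9 or later, where strings carry an encoding, and the variable names are only illustrative. It takes the UTF-8 bytes of "José" and reinterprets them as ISO-8859-1, which is one way a mismatched editor or terminal produces garbage. The exact garbage depends on which two encodings are involved, so this will not necessarily reproduce the output Mark saw.

```ruby
# "José" written with an escape so the example does not depend on the
# encoding of this source file.
name = "Jos\u00E9"

# In UTF-8 the é occupies two bytes (0xC3 0xA9).
utf8_bytes = name.encode("UTF-8").bytes

# Reinterpret the same bytes as ISO-8859-1 (one byte per character),
# then convert back to UTF-8 so the result can be displayed.
misread = utf8_bytes.pack("C*").force_encoding("ISO-8859-1").encode("UTF-8")

puts name.length     # 4 characters
puts misread.length  # 5 characters: the é was read as two separate characters
puts misread
```

The bytes never change; only the interpretation does, which is why fixing the editor and terminal settings is enough.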

Cheers,

Luciano

On 12/15/06, Mark <user@example.net> wrote:

I'm putting together a basic Rails application and writing my first
unit tests for it. It occurred to me that the user 'name' field might
need to contain foreign characters (like é, â, ì, ø, etc.), but two
problems have popped up. Firstly, I can't dig up a good reference for a
suitable regular expression for validating the field.
At the moment, I'm using:
validates_format_of :name, :with => /^[-' a-zA-Z]+$/
but this isn't going to allow the foreign characters, so the test fails.

The second problem is the error message I get when I run the unit test:
my test sets the name to José, but the failure message when I run the
script shows JosÜ.
It looks like the character encoding of my editor isn't the same as the
character encoding that Rails is using.

so, a) any clues as to what is going on? and b) is there a consistent
way of dealing with foreign characters for validation purposes?

Many thanks in advance!

Mark.

Luciano Ramalho wrote:

It is not really viable to validate a name field with a regex if you
are willing to accept Unicode characters. The only reasonable
validation is to check whether the field is empty.

It is so viable. Just not using [a-zA-Z].

Character classes are your friend.

David Vallner wrote:

Luciano Ramalho wrote:

It is not really viable to validate a name field with a regex if you
are willing to accept Unicode characters. The only reasonable
validation is to check whether the field is empty.

It is so viable. Just not using [a-zA-Z].

Character classes are your friend.

For clarification: I am unsure just how well Ruby's regexp engine
handles Unicode "extended Latin" characters. A trivial test using $KCODE
= 'u', require 'jcode', and iconv failed for me, but that could be me
getting the codepages wrong. My point above is only that nothing
prevents a regexp engine that properly supports Unicode and character
classes from validating non-ASCII text.
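
As a sketch of what "character classes" can look like in an engine with Unicode support: Ruby 1.9 and later (not the 1.8 engine discussed in this thread) accept Unicode property classes such as \p{L} for letters. The pattern and the method name below are hypothetical, and the set of allowed punctuation is an assumption carried over from Mark's original regex, not a recommendation.

```ruby
# Hypothetical pattern: one or more Unicode letters, combining marks,
# spaces, hyphens or apostrophes. \p{L} and \p{M} are Unicode property
# classes; \A and \z anchor the whole string rather than a single line.
NAME_PATTERN = /\A[\p{L}\p{M}' -]+\z/

# Illustrative helper, not part of Rails.
def plausible_name?(name)
  !!(name =~ NAME_PATTERN)
end

puts plausible_name?("Jos\u00E9")     # é counts as a letter
puts plausible_name?("O'Brien-Smith") # apostrophe and hyphen are allowed
puts plausible_name?("R2-D2")         # digits are not in the class
```

In a Rails model the same pattern could be passed as the :with option of validates_format_of, though whether such a check is worth having at all is a separate question.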

David Vallner

You mean, using the Unicode database?

Yes, I know that is possible. My point was that it is not worthwhile
(that's why I wrote "not viable" instead of "impossible"; sorry if I
was not clear: English is not my first language).

Besides all sorts of letter-like characters, ideograms and so on, a
person's name may contain hyphens, apostrophes and who-knows-what
other characters.

Remember Prince's name when he used to be called "the artist formerly
known as Prince"? [1]

I just do not think the effort of trying to validate a name is
"economically viable", except to verify that it contains something
other than blanks.

BTW, what would be a safe way to know whether a Unicode string
contains something other than blanks? Because AFAIK Unicode has many
other blank characters besides the old ASCII ones. Can a Ruby regex
cope with that?

Cheers,

Luciano

[1] Prince (musician) - Wikipedia

On 12/16/06, David Vallner <david@vallner.net> wrote:

It is so viable. Just not using [a-zA-Z].

Character classes are your friend.

Luciano Ramalho wrote:

I just do not think the effort of trying to validate a name is
"economically viable", except to verify that it contains something
other than blanks.

That is true. It doesn't have anything to do with Unicode however, as I
think your post implied.

Speaking of which, I wonder if there's a database record out there at
all containing someone with a retroflex click in his name, and whether
it's recorded as the exclamation mark or as U+01C3 ;P

BTW, what would be a safe way to know whether a Unicode string
contains something other than blanks? Because AFAIK Unicode has many
other blank characters besides the old ASCII ones. Can a Ruby regex
cope with that?

It Should Be Able To.

I think at least Oniguruma can do this sort of "industrial-strength"
processing; no idea about the current engine.
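
For what it's worth, in Ruby 1.9 and later, where an Oniguruma-based engine is built in, the POSIX bracket class [[:space:]] is Unicode-aware for UTF-8 strings, while \s still covers only ASCII whitespace. A minimal sketch of the "something other than blanks" check under that assumption (the method name is hypothetical):

```ruby
# Hypothetical helper: true if the string contains at least one character
# that is not whitespace. [[:space:]] is Unicode-aware for UTF-8 strings,
# whereas \s only covers ASCII whitespace.
def has_non_blank?(text)
  !!(text =~ /[^[:space:]]/)
end

# No-break space, ideographic space, plain space and tab.
only_blanks = "\u00A0\u3000 \t"

puts has_non_blank?(only_blanks)        # only whitespace characters
puts has_non_blank?("\u00A0Jos\u00E9")  # contains letters
puts !!(only_blanks =~ /\S/)            # \S treats U+00A0 as non-space
```

Characters that look blank but are not classified as whitespace by Unicode, such as the zero-width space, would still count as content here.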

Speaking of which, is there an Oniguruma backport (?) for 1.8 that you
could use as an add-on regexp engine? (I think currently you can use it
as a drop-in replacement if you build Ruby from source; I was thinking
of a more orthogonal way of using the Shiny Features, where "orthogonal"
really means "from a binary gem".)

David Vallner