Problem with String encoding when modifying it in C method

Hi, I've added a method "multi_capitalize" to String class. This
method is done in C and basically modifies the string:

  "record-roUTE".multi_capitalize => "Record-Route"

The problem is that after the method execution, the new String has
ASCII-8BIT encoding, while the original string had UTF-8 (using Ruby
1.9.1).

···

--------------------------------------------------------------------------------

hname = "record-rouTE-€"

"record-rouTE-€"

hname.encoding

#<Encoding:UTF-8>

hname2 = hname.multi_capitalize

"Record-Route-\xE2\x82\xAC" <------- !!!

hname2.encoding

#<Encoding:ASCII-8BIT> <------- !!!

hname2.force_encoding("utf-8")

"Record-Route-€"

hname2.encoding

#<Encoding:UTF-8>
--------------------------------------------------------------------------------

What should I add to my C method to maintain the UTF-8 encoding
after the changes to the string?
Could I invoke the C "force_encoding()" function from the C code
before returning the modified string? How to invoke it?

Thanks a lot.

--
Iñaki Baz Castillo
<ibc@aliax.net>

You can call it as (untested):

  rb_funcall(str, rb_intern("force_encoding"), 1, rb_str_new2("utf-8"));

I'm not sure how to make your multi-capitalize method do the right
thing, but maybe reading the source of rb_str_capitalize_bang in
string.c helps.

Best,
Andre
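
For readers following along, here is a small Ruby-level sketch of what that force_encoding call does to the string. It only assumes that the C method returned the right bytes with the wrong tag, which is what the irb session in the question suggests; the variable names are made up for illustration.

```ruby
# "\u20AC" is the euro sign; the escape keeps this snippet ASCII-only.
original = "record-rouTE-\u20AC"          # tagged UTF-8

# Stand-in (assumption) for the string returned by the C method:
# identical bytes, but tagged ASCII-8BIT (binary).
returned = original.dup.force_encoding(Encoding::ASCII_8BIT)

puts returned.encoding                    # the binary tag
puts returned.bytes == original.bytes     # the bytes themselves are untouched

# force_encoding only changes the tag on the receiver and returns the
# receiver itself; no bytes are converted.
result = returned.force_encoding(original.encoding)

puts result.equal?(returned)              # same object, retagged in place
puts result == original                   # equal again once the tags match
```

Since no transcoding happens, this only helps when the bytes are already valid in the target encoding.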

On Sat, 2009-04-04 at 01:39 +0900, Iñaki Baz Castillo wrote:

Could I invoke the C "force_encoding()" function from the C code
before returning the modified string? How to invoke it?

Hi,

On Sat, Apr 4, 2009 at 1:39 AM, Iñaki Baz Castillo <ibc@aliax.net> wrote:

Hi, I've added a method "multi_capitalize" to String class. This
method is done in C and basically modifies the string:

"record-roUTE".multi_capitalize => "Record-Route"

The problem is that after the method execution, the new String has
ASCII-8BIT encoding, while the original string had UTF-8 (using Ruby
1.9.1).

    rb_encoding *enc = rb_enc_get(original_string);

    /* create a new string with the same encoding as the original string */
    return rb_enc_str_new(char_pointer, length, enc);

rb_str_new() makes an ASCII-8BIT string.

Thanks a lot, I will check it.

On Friday 03 April 2009, Andre Nathan wrote:

On Sat, 2009-04-04 at 01:39 +0900, Iñaki Baz Castillo wrote:
> Could I invoke the C "force_encoding()" function from the C code
> before returning the modified string? How to invoke it?

You can call it as (untested):

  rb_funcall(str, rb_intern("force_encoding"), 1, rb_str_new2("utf-8"));

I'm not sure how to make your multi-capitalize method do the right
thing, but maybe reading the source of rb_str_capitalize_bang in
string.c helps.

--
Iñaki Baz Castillo <ibc@aliax.net>

Thanks.

On Saturday 04 April 2009, KUBO Takehiro wrote:

Hi,

On Sat, Apr 4, 2009 at 1:39 AM, Iñaki Baz Castillo <ibc@aliax.net> wrote:
> Hi, I've added a method "multi_capitalize" to String class. This
> method is done in C and basically modifies the string:
>
> "record-roUTE".multi_capitalize => "Record-Route"
>
> The problem is that after the method execution, the new String has
> ASCII-8BIT encoding, while the original string had UTF-8 (using Ruby
> 1.9.1).

    rb_encoding *enc = rb_enc_get(original_string);

    /* create a new string with the same encoding as the original string */
    return rb_enc_str_new(char_pointer, length, enc);

rb_str_new() makes an ASCII-8BIT string.

--
Iñaki Baz Castillo <ibc@aliax.net>

Yes, rb_str_capitalize_bang handles a lot of encoding-related details:

    c = rb_enc_codepoint(s, send, enc);
    if (rb_enc_islower(c, enc)) {
        rb_enc_mbcput(rb_enc_toupper(c, enc), s, enc);
        modify = 1;
    }
    s += rb_enc_codelen(c, enc);

so this is the way :)

Thanks a lot.

--
Iñaki Baz Castillo <ibc@aliax.net>
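
A pure-Ruby reference version can be handy for checking the C method's output, including its encoding. The sketch below assumes that multi_capitalize capitalizes each hyphen-separated part, which is inferred only from the "Record-Route" example in this thread; the name multi_capitalize_reference is hypothetical.

```ruby
# Reference implementation (assumption: every run of non-hyphen
# characters is capitalized independently).
def multi_capitalize_reference(str)
  # gsub builds its result with the receiver's encoding, so the tag of
  # the input string carries over to the output.
  str.gsub(/[^-]+/) { |part| part.capitalize }
end

input  = "record-rouTE-\u20AC"   # "\u20AC" is the euro sign
output = multi_capitalize_reference(input)

puts output
puts output.encoding == input.encoding
```

Comparing both the value and the encoding of the C result against this version would have flagged the ASCII-8BIT result right away.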