Utf8 string with reverse question

Mitch · 15 August 2005 05:52

Hi everyone, long time lurker here. Even longer Ruby user.

I have a minor problem with a utf8 string.

In short I see this behavior:

"Stuhlu".sub(/u/,'ü')
=> "Stühlu"
"Stuhlu".reverse.sub(/u/,'ü').reverse
=> "Stuhl\274\303"
"Stuhlu".reverse.sub(/u/,'ü').split(//).reverse.join
=> "Stuhlü"

The general goal is to sub the final "u" in that word with an umlauted version and not the first. I started irb with -Ku so that I get UTF-8 support in all things Ruby. But the behavior of reverse on the substituted string is really baffling me.

Does anyone know the reason for the weirdness of reverse after the sub? The last version was a hack to get things to just work. Am I missing a Regexp option that would make the final match work? I don't normally look for a final match to substitute on, and reverse seemed the most logical choice for a solution.

Any help would be appreciated!

Thanks,
Mitch

Pit · 15 August 2005 06:11

Mitch Tishmack wrote:

> In short I see this behavior:
>
> "Stuhlu".sub(/u/,'ü')
> => "Stühlu"
> "Stuhlu".reverse.sub(/u/,'ü').reverse
> => "Stuhl\274\303"
> "Stuhlu".reverse.sub(/u/,'ü').split(//).reverse.join
> => "Stuhlü"
>
> The general goal is to sub the final "u" in that word with an umlauted version and not the first.
> ...
> Am I missing a Regexp option that would make the final match work?

Can't help with the reverse behaviour, but if you want to substitute single letters only, the following regexp should work:

"Stuhlu".sub(/u(?=[^u]*$)/,'ü')

Regards,
Pit
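
The lookahead in Pit's regexp only lets a "u" match when nothing but non-"u" characters follow it up to the end of the string, so only the last "u" is replaced. Here is a minimal sketch of that idea next to an index-based alternative. It assumes Ruby 2.0 or later with UTF-8 source, not the Ruby 1.8 setup used in this thread, and `replace_last` is a hypothetical helper name, not part of Ruby's String API.

```ruby
# Sketch only: replace_last is a hypothetical helper, not a built-in method.

word = "Stuhlu"

# Lookahead version: a "u" matches only if every character after it,
# up to the end of the string (\z), is something other than "u".
with_lookahead = word.sub(/u(?=[^u]*\z)/, "ü")

# Index version: find the last occurrence with rindex and rebuild the string.
def replace_last(str, target, replacement)
  idx = str.rindex(target)
  return str.dup if idx.nil?   # nothing to replace
  str[0...idx] + replacement + str[(idx + target.length)..-1]
end

with_rindex = replace_last(word, "u", "ü")

puts with_lookahead
puts with_rindex
```

Both variants avoid reversing the string, so the byte-order problem with reverse never comes up.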

Brian_Schroder1 · 15 August 2005 07:49

> Hi everyone, long time lurker here. Even longer Ruby user.
>
> I have a minor problem with a utf8 string.
>
> In short I see this behavior:
>
> "Stuhlu".sub(/u/,'ü')
> => "Stühlu"
> "Stuhlu".reverse.sub(/u/,'ü').reverse
> => "Stuhl\274\303"

It seems like reverse is acting on the string as a byte array. That
means you are reversing the two-byte character ü = '\303\274' into the
non-character '\274\303' when reversing the string 'ülhutS'.
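
The byte-array explanation can be reproduced by hand. This is a rough sketch that assumes Ruby 2.0 or later with UTF-8 source, where String#reverse is already character-aware, so the Ruby 1.8 behaviour is simulated by reversing the bytes explicitly. The variable names are made up for illustration.

```ruby
# Simulate a byte-wise reverse (what Ruby 1.8's String#reverse did here)
# and compare it with a character-wise reverse.

s = "ülhutS"   # "Stuhlu".reverse with the first "u" replaced by "ü"

# Reverse the raw bytes, then reinterpret the result as UTF-8.
byte_reversed = s.bytes.reverse.pack("C*").force_encoding("UTF-8")

# Reverse character by character instead.
char_reversed = s.each_char.to_a.reverse.join

p s.bytes.first(2)               # the two bytes of "ü": [195, 188], i.e. \303\274 in octal
p byte_reversed.valid_encoding?  # false: the two bytes of "ü" are now in the wrong order
p char_reversed                  # "Stuhlü"
```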

Are you trying to build a German -> pig-türkisch translator?

regards,

Brian

On 15/08/05, Mitch Tishmack <idylls@gmail.com> wrote:

"Stuhlu".reverse.sub(/u/,'ü').split(//).reverse.join
=> "Stuhlü"

The general goal is to sub the final "u" in that word with an
umlauted version and not the first. I started irb with -Ku so that I
get utf8 support in all things ruby. But the behavior of reverse on
the substituted string is really baffling me.

Does anyone know the reason for the weirdness of reverse after the
sub? The last version was a hack to get things to just work. Am I
missing a Regexp option that would make the final match work? I don't
normally look for a final match to substitute on. And reverse seemed
the most logical choice for a solution.

Any help would be appreciated!

Thanks,
Mitch

--
http://ruby.brian-schroeder.de/

Stringed instrument chords: http://chordlist.brian-schroeder.de/

Levin_Alexander1 · 15 August 2005 17:19

You could do this:

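   $KCODE = 'u'   # Ruby 1.8: treat strings and regexps as UTF-8
   class String
     # Reverse character by character instead of byte by byte;
     # the /m flag lets "." match newlines too.
     def reverse; self.scan(/./m).reverse.join end
   end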

"Stuhlü".reverse #=> "ülhutS"

found on <http://redhanded.hobix.com/inspect/closingInOnUnicodeWithJcode.html>

-Levin

Mitch Tishmack <idylls@gmail.com> wrote:

I have a minor problem with a utf8 string.

In short I see this behavior:

"Stuhlu".sub(/u/,'ü')
=> "Stühlu"
"Stuhlu".reverse.sub(/u/,'ü').reverse
=> "Stuhl\274\303"
"Stuhlu".reverse.sub(/u/,'ü').split(//).reverse.join
=> "Stuhlü"

Mitch · 15 August 2005 15:28

> It seems like reverse is acting on the string as a byte array. That
> means you are reversing the two-byte character ü = '\303\274' into the
> non-character '\274\303' when reversing the string 'ülhutS'.

That makes sense, but seems like incorrect behavior for this instance.

> Are you trying to build a German -> pig-türkisch translator?

Not quite :), I just picked a random German word and appended another u
to it for testing. I suppose my example could have been Kuhlstuhl. I
am actually working on a German noun/verb helper; all it will do is
conjugate the verb/noun according to proper grammatical rules, i.e. der
Fisch -> die Fische, der Stuhl -> die Stühle, etc.

I was just worrying about compound nouns where the final noun is what
is conjugated.
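
For the compound-noun case, here is a very rough sketch of umlauting only the last umlautable vowel, so that the final noun of a compound is the part that changes. It assumes Ruby 2.0 or later with UTF-8 source. `UMLAUTS` and `umlaut_last_vowel` are hypothetical names, and the rule is deliberately naive: which nouns take an umlaut and which ending they get is not decided here.

```ruby
# Hypothetical helper: umlaut the last lowercase a/o/u (or "au") in a word.
# Real German plurals need a lexicon; this only shows the "last match" idea.
UMLAUTS = { "au" => "äu", "a" => "ä", "o" => "ö", "u" => "ü" }.freeze

def umlaut_last_vowel(word)
  # Match "au" or a single a/o/u that has no further a/o/u after it.
  word.sub(/(?:au|[aou])(?=[^aou]*\z)/) { |match| UMLAUTS[match] }
end

puts umlaut_last_vowel("Stuhl") + "e"      # Stühle
puts umlaut_last_vowel("Kuhlstuhl") + "e"  # Kuhlstühle
puts umlaut_last_vowel("Haus") + "er"      # Häuser
```

Uppercase initial vowels are ignored in this sketch, and the caller supplies the plural ending.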

Yes I DO have too much time on my hands right now.

Cheers,
Mitch

On 8/15/05, Brian Schröder <ruby.brian@gmail.com> wrote:

Mitch · 15 August 2005 15:30

Aha, I knew there was something I was missing in Regexp. Thanks for
confirming.

I will get back to my program after work today.

Thanks,
Mitch

On 8/15/05, Pit Capitain <pit@capitain.de> wrote:

Mitch Tishmack wrote:
> In short I see this behavior:
>
> "Stuhlu".sub(/u/,'ü')
> => "Stühlu"
> "Stuhlu".reverse.sub(/u/,'ü').reverse
> => "Stuhl\274\303"
> "Stuhlu".reverse.sub(/u/,'ü').split(//).reverse.join
> => "Stuhlü"
>
> The general goal is to sub the final "u" in that word with an umlauted
> version and not the first.
> ...
> Am I missing a
> Regexp option that would make the final match work?

Can't help with the reverse behaviour, but if you want to substitute
single letters only, the following regexp should work:

"Stuhlu".sub(/u(?=[^u]*$)/,'ü')

Regards,
Pit
