String problem

Fresh_Mix · 3 May 2009 08:34

What's wrong?

# irb
irb(main):001:0> xxx = "лошадь"
=> "\320\273\320\276\321\210\320\260\320\264\321\214"
irb(main):002:0> xxx.length
=> 12

···

--
Posted via http://www.ruby-forum.com/.

Tom_Cloyd2 · 3 May 2009 09:26

Fresh Mix wrote:

What wrong?

# irb
irb(main):001:0> xxx = "лошадь"
=> "\320\273\320\276\321\210\320\260\320\264\321\214"
irb(main):002:0> xxx.length
=> 12

I assume you're wondering why each character appears to be represented by two bytes. I believe it's because the encoding is, of necessity, UTF-8 or something very similar. If I recall correctly, this encoding is designed to represent the world's alphabets, etc., rather than merely the limited character set used in western European languages, so characters outside plain ASCII need more than one byte (two each for these Cyrillic letters).

If I don't have this quite right (or right at all), I'm sure I'll be set right by those who know more here (and they are legion!).
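For what it's worth, UTF-8 is a variable-width encoding: a character takes between one and four bytes depending on its code point, and Cyrillic letters happen to take two. A minimal sketch of that, assuming Ruby 1.9 or later (the thread's 1.8.7 has no String#ord or String#bytesize); the constant names and the utf8_width helper are made up for illustration:

```ruby
# Sketch for Ruby 1.9 or later. The \u escapes keep the file itself pure ASCII,
# so the result does not depend on the source file's encoding.
LATIN_A = "a"          # U+0061
CYR_EL  = "\u043b"     # U+043B, the first letter of the word in the question
EURO    = "\u20ac"     # U+20AC
G_CLEF  = "\u{1d11e}"  # U+1D11E, outside the Basic Multilingual Plane

# Hypothetical helper: code point and UTF-8 byte count of a one-character string.
def utf8_width(char)
  [char.ord, char.bytesize]
end

[LATIN_A, CYR_EL, EURO, G_CLEF].each do |char|
  code_point, width = utf8_width(char)
  printf("U+%04X takes %d byte(s) in UTF-8\n", code_point, width)
end
```

The four samples come out at 1, 2, 3 and 4 bytes respectively, which is why six Cyrillic letters add up to 12 bytes.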

t.

···

--

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tom Cloyd, MS MA, LMHC - Private practice Psychotherapist
Bellingham, Washington, U.S.A: (360) 920-1226
<< tc@tomcloyd.com >> (email)
<< TomCloyd.com >> (website) << sleightmind.wordpress.com >> (mental health weblog)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

7stud · 3 May 2009 11:21

Fresh Mix wrote:

# irb
irb(main):001:0> xxx = "лошадь"
=> "\320\273\320\276\321\210\320\260\320\264\321\214"
irb(main):002:0> xxx.length
=> 12

What wrong?

In 1.8.* versions, ruby doesn't recognize unicode, where characters are
represented by multiple bytes. ruby thinks everything in an ascii
character where characters are represented by one byte.

Try this:

xxx = "лошадь"
puts xxx.length

--output:--
12

$KCODE = "u"
require 'jcode'

puts xxx.jlength

--output:--
6

xxx.each_char do |u|
  puts u
end

--output:--
л
о
ш
а
д
ь
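For comparison, a rough equivalent on Ruby 1.9 or later, where neither $KCODE nor jcode is needed. This is only a sketch under that version assumption; WORD is an illustrative name for the word from the question:

```ruby
# Sketch for Ruby 1.9 or later; $KCODE and jcode are not needed there.
# WORD is the six-letter word from the question, written with \u escapes
# so the snippet does not depend on the source file's encoding.
WORD = "\u043b\u043e\u0448\u0430\u0434\u044c"

puts WORD.bytesize   # 12, the byte count that 1.8's String#length reported
puts WORD.length     # 6, the character count that jlength reported

# each_char yields one character at a time, as the jcode version did.
WORD.each_char { |char| puts char }
```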


Robert_K1 · 3 May 2009 10:55

Actually, I don't count myself among them when it comes to encodings in Ruby. But I believe there is one important bit of information missing that's needed to properly answer the OP's question: what Ruby version did you use?

Kind regards

robert

···

On 03.05.2009 11:26, Tom Cloyd wrote:

Fresh Mix wrote:

What wrong?

# irb
irb(main):001:0> xxx = "лошадь"
=> "\320\273\320\276\321\210\320\260\320\264\321\214"
irb(main):002:0> xxx.length
=> 12

I assume you're wondering why each character appears to be represented by two bytes - and I believe it's because the encoding is, of necessity, UTF-8 or something very similar. If I recall correctly, this encoding is designed to be able to represent the world's alphabets, etc., rather than merely the limited character set used in western European languages, and so two bytes must be used to allow for all the possibilities.

If I don't have this quite right (or right at all), I'm sure I'll be set right by those who know more here (and they are legion!).

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

7stud · 3 May 2009 11:23

7stud -- wrote:

In 1.8.* versions, ruby doesn't recognize unicode, where characters are
represented by multiple bytes. ruby thinks everything in an ascii
character where characters are represented by one byte.

Corrections:

In 1.8.* versions, ruby doesn't recognize unicode, where characters [may be]
represented by multiple bytes. ruby thinks everything [is] an ascii
character where characters are represented by one byte.


Fresh_Mix · 3 May 2009 11:01

Robert Klemme wrote:

what Ruby version did you
use?

$ ruby -v
ruby 1.8.7 (2008-08-11 patchlevel 72) [x86_64-linux]


7stud · 3 May 2009 11:27

Robert Klemme wrote:

But I believe there is one important bit of information missing that's
needed to properly answer the OP's question: what Ruby version did you
use?

Why is that relevant? Can unicode be switched off in ruby 1.9?


Yukihiro_Matsumoto2 · 3 May 2009 22:59

Hi,

···

In message "Re: String problem" on Sun, 3 May 2009 20:23:49 +0900, 7stud -- <bbxx789_05ss@yahoo.com> writes:

Corrections:

In 1.8.* versions, ruby doesn't recognize unicode, where characters [may be]
represented by multiple bytes. ruby thinks everything [is] an ascii
character where characters are represented by one byte.

More Corrections:

In 1.8.* versions, the string methods of ruby don't recognize multi-byte
characters. ruby thinks everything is a sequence of bytes. Regular
expressions in 1.8.* recognize UTF-8, EUC-JP, and Shift_JIS. So you
can handle Unicode strings by using regular expressions.
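One way to read the regular-expression suggestion is to let a pattern with the u (UTF-8) flag do the character splitting. The sketch below is an illustration, not code from this message: utf8_chars and SAMPLE are made-up names, and the \u escapes assume Ruby 1.9 or later (on 1.8 the string would be typed directly into a UTF-8 source file).

```ruby
# SAMPLE is the six-letter word from the question, written with \u escapes.
SAMPLE = "\u043b\u043e\u0448\u0430\u0434\u044c"

# Hypothetical helper: split a UTF-8 string into characters with a regex.
# The u flag marks the pattern as UTF-8; m lets the dot match newlines too.
def utf8_chars(str)
  str.scan(/./mu)
end

chars = utf8_chars(SAMPLE)
puts chars.length   # 6 characters rather than 12 bytes
puts chars.first    # the first letter as a whole character
```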

matz.

Robert_K1 · 3 May 2009 12:30

It is relevant because handling of encodings has significantly changed between 1.8 and 1.9, which I believe your other posting demonstrates.
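A small sketch of the difference, assuming Ruby 1.9 or later: every string there carries an encoding, and length counts characters in that encoding. Relabelling the same bytes as binary reproduces the byte-oriented answer 1.8 gives. TEXT and byte_view are illustrative names, not anything from the thread.

```ruby
# TEXT is the six-letter word from the question, written with \u escapes.
TEXT = "\u043b\u043e\u0448\u0430\u0434\u044c"

# Hypothetical helper: a copy of the string tagged as raw bytes.
# The bytes are unchanged; only the encoding label differs.
def byte_view(str)
  str.dup.force_encoding(Encoding::BINARY)
end

puts TEXT.encoding            # UTF-8
puts TEXT.length              # 6 characters
puts byte_view(TEXT).length   # 12, the answer 1.8.7 gave
```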

Cheers

robert

···

On 03.05.2009 13:27, 7stud -- wrote:

Robert Klemme wrote:

But I believe there is one important bit of information missing that's
needed to properly answer the OP's question: what Ruby version did you use?

Why is that relevant? Can unicode be switched off in ruby 1.9?

James_Edward_Gray_II · 4 May 2009 03:13

And those interested in how all that works may find this series on my blog helpful:

http://blog.grayproductions.net/articles/understanding_m17n

James Edward Gray II

···

On May 3, 2009, at 5:59 PM, Yukihiro Matsumoto wrote:

Hi,

In message "Re: String problem" on Sun, 3 May 2009 20:23:49 +0900, 7stud -- <bbxx789_05ss@yahoo.com> writes:

>Corrections:
>
>In 1.8.* versions, ruby doesn't recognize unicode, where characters [may
>be]
>represented by multiple bytes. ruby thinks everything [is] an ascii
>character where characters are represented by one byte.

More Corrections:

In 1.8.* versions, the string methods of ruby don't recognize multi-byte
characters. ruby thinks everything is a sequence of bytes. Regular
expressions in 1.8.* recognize UTF-8, EUC-JP, and Shift_JIS. So you
can handle Unicode strings by using regular expressions.

7stud · 4 May 2009 10:10

Yukihiro Matsumoto wrote:

Regular
expressions in 1.8.* recognize UTF-8, EUC-JP, and Shift_JIS. So you
can handle Unicode strings by using regular expressions.

Too vague.

James Gray wrote:

And those interested in how all that works may find this series on my
blog helpful:

http://blog.grayproductions.net/articles/understanding_m17n

Excellent website. <c-word here>

Here is something that is unclear:


----------
To use the jcode library, set $KCODE and then require the library.
Setting $KCODE first is important, and you will receive a warning if you
require jcode without setting it (as long as you took my advice and
turned ****them*** on)...

---------

In the sentence:

-------
Setting $KCODE first is important, and you will receive a warning if you
require jcode without setting it (as long as you took my advice and
turned them on)...
--------

'it' and 'them' are pronouns, which should refer to nouns. The pronoun
'it' looks like it might refer to 'jcode' when 'it' actually refers to
'$KCODE'. That is pretty easy to sort out.

However, what does 'them' refer to? 'them' should refer to a plural
noun, so if you actually stop and try to sort it out rather than just
dismissing the whole paragraph in confusion, 'them' looks like it must
refer to '$KCODE' and 'jcode'. However, that doesn't make sense because
you don't 'set' jcode--you require jcode.

Apparently, 'them' refers to 'warning', which is not only grammatically
incorrect but it is very hard to make that association. In any case, in
that sentence if you change 'it' and 'them' to $KCODE and 'warnings'
respectively, you will change a confusing and unreadable sentence into a
sentence whose clarity will be unmatched in modern literature:

-----
Setting $KCODE first is important, and you will receive a warning if you
require jcode without setting $KCODE (as long as you took my advice and
turned warnings on with -w)...
______

I'd bet that 90% of the readers of your article stop reading at that
exact spot.

Aldric_Giacomoni1 · 4 May 2009 12:10

7stud -- wrote:

Yukihiro Matsumoto wrote:


Regular
expressions in 1.8.* recognize UTF-8, EUC-JP, and Shift_JIS. So you
can handle Unicode strings by using regular expressions.

Too vague.

James Gray wrote:


And those interested in how all that works may find this series on my
blog helpful:

http://blog.grayproductions.net/articles/understanding_m17n

Excellent website. <c-word here>

Here is something that is unclear:

----------
To use the jcode library, set $KCODE and then require the library.
Setting $KCODE first is important, and you will receive a warning if you
require jcode without setting it (as long as you took my advice and
turned ****them*** on)...

---------

In the sentence:

-------
Setting $KCODE first is important, and you will receive a warning if you
require jcode without setting it (as long as you took my advice and
turned them on)...
--------

'it' and 'them' are pronouns, which should refer to nouns. The pronoun
'it' looks like it might refer to 'jcode' when 'it' actually refers to
'$KCODE'. That is pretty easy to sort out.

However, what does 'them' refer to? 'them' should refer to a plural
noun, so if you actually stop and try to sort it out rather than just
dismissing the whole paragraph in confusion, 'them' looks like it must
refer to '$Kcode' and 'jcode'. However, that doesn't make sense because
you don't 'set' jcode--you require jcode.

Apparently, 'them' refers to 'warning', which is not only grammatically
incorrect but it is very hard to make that association. In any case, in
that sentence if you change 'it' and 'them' to $KCODE and 'warnings'
respectively, you will change a confusing and unreadable sentence into a
sentence whose clarity will be unmatched in modern literature:

-----
Setting $KCODE first is important, and you will receive a warning if you
require jcode without setting $KCODE (as long as you took my advice and
turned warnings on with -w)...
______

I'd bet that 90% of the readers of your article stop reading at that
exact spot.

Alright, so the man who created Ruby doesn't write well enough for you,
and neither does one of the big guys in the community. Maybe you'd like
to offer yourself as proofreader for them, instead of programmer for
others? I liked James' series of articles and was able to read it just
fine... And English is my 4th language.

-- Aldric

James_Edward_Gray_II · 4 May 2009 15:57

I think that's dramatically overstating the problem.

I do really appreciate your feedback though. Obviously I want this content to be as helpful to everyone as possible. I've adjusted the sentence as you recommend. Thanks.

James Edward Gray II

···

On May 4, 2009, at 5:10 AM, 7stud -- wrote:

I'd bet that 90% of the readers of your article stop reading at that
exact spot.

Adam_Gardner · 4 May 2009 16:02

James Gray wrote:

···

On May 4, 2009, at 5:10 AM, 7stud -- wrote:

I'd bet that 90% of the readers of your article stop reading at that
exact spot.

I think that's dramatically overstating the problem.

I do really appreciate your feedback though. Obviously I want this
content to be as helpful to everyone as possible. I've adjusted the
sentence as you recommend. Thanks.

James Edward Gray II

Aw, you think so? I kinda liked the idea that, since I understood it
immediately, I was in the top 10% of all Ruby programmers, themselves
already a smart bunch.

7stud · 4 May 2009 16:50

James Gray wrote:

I've adjusted the
sentence as you recommend. Thanks.

Ahh..pure poetry. I didn't know about the 'u' flag for a regex. Thank
you.

