A plan for another unicode string hack

Austin_Ziegler5 · 14 June 2006 16:17

My proposed change won't disturb anyone's existing code unless you
set $KCODE to be 'u' as well in that code. If you did set $KCODE to
'u' in your previous projects, you don't have to apply this hack
(which hasn't been implemented yet) to that project.

Um. PDF::Writer is a library, and I think that I use both depending on
how the code reads.

Matz has said several times that he will maximize the breakage moving
to Ruby 2.0. If Matz is going to make these changes for Ruby 2.0 (as
implied in Guy Decoux's posting), I think I will just follow along. My
goal is to provide Ruby 2.0 forward-compatible Unicode support until
the move is complete.

Yes, I understand. Making #size and #length return different values
is a mistake. Without referring to documentation, how would you know
which returns the number of characters and which one returns the
number of bytes?

They should always *either* return characters or bytes (preferably
characters) and a separate call should be introduced for the
alternative meaning. One that is explicit in its name to match its
meaning.
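
For readers on a later Ruby (1.9 onward), the split described above can be seen directly: #length and #size both count characters, and the byte count has its own explicitly named method, #bytesize. This is only a small illustration on current Ruby, not the hack under discussion in this thread; the sample string is made up.

```ruby
# "\u00E9" is e-acute, which takes two bytes in UTF-8.
s = "\u00E9t\u00E9"      # three characters, five bytes

char_count = s.length    # counts characters
same_count = s.size      # alias of length, so it also counts characters
byte_count = s.bytesize  # explicitly named byte count

puts "length=#{char_count} size=#{same_count} bytesize=#{byte_count}"
```

Because the byte-oriented call carries "byte" in its name, nobody has to consult the documentation to guess which one returns which.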

-austin

···

On 6/14/06, Dae San Hwang <daesan@gmail.com> wrote:
--
Austin Ziegler * halostatue@gmail.com * http://www.halostatue.ca/
* austin@halostatue.ca * You are in a maze of twisty little passages, all alike. // halo • statue
* austin@zieglers.ca

Anselm_Heaton · 15 June 2006 09:55

Matz has said several times that he will maximize the breakage moving
to Ruby 2.0.

If that is the case, is there a reason why we should continue using String for
bytes then? We could have a ByteBuffer class or something similar. IO
objects would return ByteBuffer; ByteBuffer.to_s would return the string
equivalent in the current encoding, and ByteBuffer.to_s('my encoding') into
the required encoding.
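
A minimal sketch of what such a class might look like, written against a later Ruby (1.9 onward) where the Encoding class exists. Everything here is hypothetical: the ByteBuffer class itself, its constructor taking an array of byte values, and the choice of Encoding.default_external to stand in for "the current encoding". It only relabels the bytes; it does not transcode them.

```ruby
# Hypothetical ByteBuffer: holds raw bytes and only becomes text on request.
class ByteBuffer
  def initialize(bytes)
    @bytes = bytes.dup.freeze   # array of integers in 0..255
  end

  # Number of bytes held; there is no notion of characters here.
  def size
    @bytes.size
  end

  # Interpret the bytes as text in the given encoding (no transcoding).
  # Defaulting to Encoding.default_external is an assumption standing in
  # for "the current encoding".
  def to_s(encoding = Encoding.default_external)
    @bytes.pack("C*").force_encoding(encoding)
  end
end

buf = ByteBuffer.new([0xC3, 0xA9])   # the UTF-8 bytes of e-acute
utf8_text   = buf.to_s("UTF-8")
latin1_text = buf.to_s("ISO-8859-1")

puts buf.size            # bytes in the buffer
puts utf8_text.length    # characters when read as UTF-8
puts latin1_text.length  # characters when read as ISO-8859-1
```

The same two bytes come out as one character or two depending on the encoding asked for, which is the distinction a separate byte container would make explicit.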

Anselm

···

--
------------------------------
Netuxo Ltd
a workers' co-operative
providing low-cost IT solutions
for peace, environmental and social justice groups
and the radical NGO sector

Registered as a company in England and Wales. No 4798478
Registered office: 5 Caledonian Road, London N1 9DY, Britain
------------------------------
email office@netuxo.co.uk
http://www.netuxo.co.uk
------------------------------

Gary_Wright · 14 June 2006 16:24

+1

Gary Wright

···

On Jun 14, 2006, at 12:17 PM, Austin Ziegler wrote:

They should always *either* return characters or bytes (preferably
characters) and a separate call should be introduced for the
alternative meaning. One that is explicit in its name to match its
meaning.

thbar · 15 June 2006 10:10

Hi!

If that is the case, is there a reason why we should continue using String for
bytes then? We could have a ByteBuffer class or something similar. IO
objects would return ByteBuffer; ByteBuffer.to_s would return the string
equivalent in the current encoding, and ByteBuffer.to_s('my encoding') into
the required encoding.

+1, that would be very nice. On some other platforms (like Java or .NET),
the programmer doesn't know about the bytes (only length in terms of chars)
unless he's willing to dig into them (using a specific class).

···

--
Thibaut

Dmitry_Severin · 15 June 2006 10:10

Will you volunteer to go through all the source code of the Ruby core library
(about 350 KSLOC of C and Ruby), inspect, fix and test all the resulting
issues? How long would it take?

···

On 6/15/06, Anselm Heaton <anselm@netuxo.co.uk> wrote:

> Matz has said several times that he will maximize the breakage moving
> to Ruby 2.0.

If that is the case, is there a reason why we should continue using String for
bytes then? We could have a ByteBuffer class or something similar. IO
objects would return ByteBuffer; ByteBuffer.to_s would return the string
equivalent in the current encoding, and ByteBuffer.to_s('my encoding') into
the required encoding.

Michal_hramrach_Such · 15 June 2006 12:10

> My proposed change won't disturb anyone's existing code unless you
> set $KCODE to be 'u' as well in that code. If you did set $KCODE to
> 'u' in your previous projects, you don't have to apply this hack
> (which hasn't been implemented yet) to that project.

Um. PDF::Writer is a library, and I think that I use both depending on
how the code reads.

> Matz has said several times that he will maximize the breakage moving
> to Ruby 2.0. If Matz is going to make these changes for Ruby 2.0 (as
> implied in Guy Decoux's posting), I think I will just follow along. My
> goal is to provide Ruby 2.0 forward-compatible Unicode support until
> the move is complete.

Yes, I understand. Making #size and #length return different values
is a mistake. Without referring to documentation, how would you know
which returns the number of characters and which one returns the
number of bytes?

Well, to me it is quite intuitive that length gives the number of
characters, and size returns the amount of space needed to store the
object.
The problem is that for other objects these would still be equivalent.
But the subject contains the word 'hack', mind you.

They should always *either* return characters or bytes (preferably
characters) and a separate call should be introduced for the
alternative meaning. One that is explicit in its name to match its
meaning.

I think that more descriptive aliases would be welcome as well.
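
As a rough sketch of what such aliases could look like on a later Ruby (1.9 onward, where String#bytesize exists): the names char_length and byte_length below are hypothetical and are not part of Ruby. Reopening String like this is very much in the spirit of a hack.

```ruby
# Hypothetical alias names; neither is defined by Ruby itself.
class String
  alias_method :char_length, :length    # number of characters
  alias_method :byte_length, :bytesize  # number of bytes
end

word = "na\u00EFve"   # the i-diaeresis takes two bytes in UTF-8

puts word.char_length
puts word.byte_length
```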

Thanks

Michal

···

On 6/14/06, Austin Ziegler <halostatue@gmail.com> wrote:

On 6/14/06, Dae San Hwang <daesan@gmail.com> wrote:

Dae_San_Hwang · 15 June 2006 13:36

I see your point now. Then I guess I won't use $KCODE to set the default encoding. That way, only strings with an explicit encoding will exhibit the new behavior.
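
For comparison, later Ruby releases (1.9 onward) ended up attaching an encoding to each string rather than relying on a global switch. The sketch below uses that later behavior to illustrate the opt-in idea; it is not the proposed hack, and the variable names are made up.

```ruby
# The same two bytes, first with no text encoding, then tagged as UTF-8.
raw    = "\u00E9".b                               # binary copy of e-acute
tagged = raw.dup.force_encoding(Encoding::UTF_8)  # explicit encoding on this string only

puts raw.length                  # counted as raw bytes
puts tagged.length               # counted as characters
puts raw.bytes == tagged.bytes   # the underlying bytes are unchanged
```

Only the string that was explicitly tagged changes how it counts; the untagged one keeps its byte-oriented behavior.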

Dae San Hwang
daesan@gmail.com

···

On Jun 15, 2006, at 1:17 AM, Austin Ziegler wrote:

On 6/14/06, Dae San Hwang <daesan@gmail.com> wrote:

My proposed change won't disturb anyone's existing code unless you
set $KCODE to be 'u' as well in that code. If you did set $KCODE to
'u' in your previous projects, you don't have to apply this hack
(which hasn't been implemented yet) to that project.

Um. PDF::Writer is a library, and I think that I use both depending on
how the code reads.

Julian_Julik_Tarkhan · 15 June 2006 11:26

I know it might sound terrible, but if Ruby as a language is to progress and prosper,
this will have to be done (patching string handling), and the sooner the better. The same will have to be done with Python 3000 very soon.

It's good to break bad string handling that was wrong to start with.

···

On 15-jun-2006, at 12:10, Dmitry Severin wrote:

On 6/15/06, Anselm Heaton <anselm@netuxo.co.uk> wrote:

> Matz has said several times that he will maximize the breakage moving
> to Ruby 2.0.

If that is the case, is there a reason why we should continue using String for
bytes then? We could have a ByteBuffer class or something similar. IO
objects would return ByteBuffer; ByteBuffer.to_s would return the string
equivalent in the current encoding, and ByteBuffer.to_s('my encoding') into
the required encoding.

Will you volunteer to go through all the source code of the Ruby core library
(about 350 KSLOC of C and Ruby), inspect, fix and test all the resulting
issues? How long would it take?

--
Julian 'Julik' Tarkhanov
please send all personal mail to
me at julik.nl
