A plan for another unicode string hack

Dave_Howell2 · 14 June 2006 20:46

Yes, I understand. Making #size and #length return different values
is a mistake. Without referring to documentation, how would you know
which returns the number of characters and which one returns the
number of bytes?

I cannot agree. "Length" (to me) unavoidably implies that it's the answer to the question "How LONG is it?" I expect the answer to be "n characters long."

"Size" is the answer to "How BIG is it?" as in "How much space does this thing take up?" and if it's a UTF-8 string, I expect an answer like "1 byte per character + one more byte per character not in the 7-bit ASCII range"

I would never have to look up which is which. Obviously Austin's mileage varies.
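A minimal sketch of the two questions side by side, assuming a Ruby with encoding-aware strings (1.9 or later), where the byte count is spelled String#bytesize. That method name is not part of the proposal in this thread. It also shows that UTF-8 uses between one and four bytes per character, so "one more byte per non-ASCII character" only holds for part of the range.

```ruby
# Sample characters written with \u escapes so the source encoding does not matter.
samples = {
  "A"         => "ASCII letter",
  "\u00E9"    => "e with acute accent",
  "\u20AC"    => "euro sign",
  "\u{1F600}" => "emoji outside the Basic Multilingual Plane"
}

# "How LONG is it?" versus "How BIG is it?" for each sample.
samples.each do |char, label|
  puts "#{label}: #{char.length} character, #{char.bytesize} byte(s)"
end

# Collect just the byte counts, in insertion order.
byte_counts = samples.keys.map(&:bytesize)
```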

···

On Jun 14, 2006, at 9:17, Austin Ziegler wrote:

Austin_Ziegler5 · 14 June 2006 20:51

As much as I like to say that I'm "from Ruby" these days, not everyone
will be. Some languages use string.length(); others use string.size().
I do not think that the proposed distinction is meaningful, and it
presents problems.

-austin

···

On 6/14/06, Dave Howell <groups@grandfenwick.net> wrote:

On Jun 14, 2006, at 9:17, Austin Ziegler wrote:
Yes, I understand. Making #size and #length return different values
is a mistake. Without referring to documentation, how would you know
which returns the number of characters and which one returns the
number of bytes?

I cannot agree. "Length" (to me) unavoidably implies that it's the
answer to the question "How LONG is it?" I expect the answer to be "n
characters long."

"Size" is the answer to "How BIG is it?" as in "How much space does
this thing take up?" and if it's a UTF-8 string, I expect an answer
like "1 byte per character + one more byte per character not in the
7-bit ASCII range"

I would never have to look up which is which. Obviously Austin's
mileage varies.

--
Austin Ziegler * halostatue@gmail.com * http://www.halostatue.ca/
* austin@halostatue.ca * You are in a maze of twisty little passages, all alike. // halo • statue
* austin@zieglers.ca

Joel_VanderWerf1 · 15 June 2006 04:52

Dave Howell wrote:

Yes, I understand. Making #size and #length return different values
is a mistake. Without referring to documentation, how would you know
which returns the number of characters and which one returns the
number of bytes?

I cannot agree. "Length" (to me) unavoidably implies that it's the
answer to the question "How LONG is it?" I expect the answer to be "n
characters long."

"Size" is the answer to "How BIG is it?" as in "How much space does this
thing take up?" and if it's a UTF-8 string, I expect an answer like "1
byte per character + one more byte per character not in the 7-bit ASCII
range"

That's not a bad argument, but Hash#size and Array#size don't behave
that way in ruby.

···

On Jun 14, 2006, at 9:17, Austin Ziegler wrote:

--
vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407

Leslie_Viljoen1 · 15 June 2006 08:08

I agree with Austin on this - the distinction is too vague. I'd leave
length and size the same and make a size_in_bytes method.

Les

···

On 6/15/06, Joel VanderWerf <vjoel@path.berkeley.edu> wrote:

Dave Howell wrote:
>
> On Jun 14, 2006, at 9:17, Austin Ziegler wrote:
>
> Yes, I understand. Making #size and #length return different values
> is a mistake. Without referring to documentation, how would you know
> which returns the number of characters and which one returns the
> number of bytes?
>
> I cannot agree. "Length" (to me) unavoidably implies that it's the
> answer to the question "How LONG is it?" I expect the answer to be "n
> characters long."
>
> "Size" is the answer to "How BIG is it?" as in "How much space does this
> thing take up?" and if it's a UTF-8 string, I expect an answer like "1
> byte per character + one more byte per character not in the 7-bit ASCII
> range"

That's not a bad argument, but Hash#size and Array#size don't behave
that way in ruby.

Yukihiro_Matsumoto2 · 15 June 2006 08:55

Hi,

···

In message "Re: A plan for another unicode string hack" on Thu, 15 Jun 2006 17:08:41 +0900, "Leslie Viljoen" <leslieviljoen@gmail.com> writes:

I agree with Austin on this - the distinction is too vague. I'd leave
length and size the same and make a size_in_bytes method.

On my latest prototype (not checked in anywhere), String#length and
String#size behave the same, and there is String#buffer_size to return
size in bytes. The method name might change in the future.

matz.
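A small sketch of the behaviour described above, assuming Ruby 1.9 or later. In those releases the byte count is available as String#bytesize; the prototype name buffer_size is not used here because it is not a method of released Ruby strings.

```ruby
word = "caf\u00E9"         # four characters; the last one needs two bytes in UTF-8

char_count = word.length   # counts characters
same_count = word.size     # alias of #length, so the same answer
byte_count = word.bytesize # counts bytes in the underlying buffer

puts "length=#{char_count} size=#{same_count} bytesize=#{byte_count}"
```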

Paul_Battley · 15 June 2006 09:26

Actually, this makes a lot of sense. Why would you ever want to know
the actual byte length of a UTF-8 string? It's pretty meaningless for
most string-processing tasks: the main times you would need it would
be in allocation and interfacing with external systems and libraries.
Thus, something like buffer_size maps to real-world usage extremely
well, in my opinion.

Paul.

···

On 15/06/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:

>I agree with Austin on this - the distinction is too vague. I'd leave
>length and size the same and make a size_in_bytes method.

On my latest prototype (not checked in anywhere), String#length and
String#size behave the same, and there is String#buffer_size to return
size in bytes. The method name might change in the future.

Leslie_Viljoen1 · 15 June 2006 09:43

Of course the confusion here is caused by measurement units. Size in
bytes or size in characters? Length and size don't (clearly) indicate
that distinction, and neither does buffer_size. The name should
indicate the unit so that you could immediately see that adding (e.g.)
length_in_characters to length_in_bytes would be an error.

Here's some naming convention insight:

Les

···

On 6/15/06, Paul Battley <pbattley@gmail.com> wrote:

On 15/06/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> >I agree with Austin on this - the distinction is too vague. I'd leave
> >length and size the same and make a size_in_bytes method.
>
> On my latest prototype (not checked in anywhere), String#length and
> String#size behave the same, and there is String#buffer_size to return
> size in bytes. The method name might change in the future.

Actually, this makes a lot of sense. Why would you ever want to know
the actual byte length of a UTF-8 string? It's pretty meaningless for
most string-processing tasks: the main times you would need it would
be in allocation and interfacing with external systems and libraries.
Thus, something like buffer_size maps to real-world usage extremely
well, in my opinion.

Austin_Ziegler5 · 15 June 2006 12:17

I agree. #buffer_size or #byte_size is probably sufficient. #size and
#length should return the number of glyphs in the string.

-austin

···

On 6/15/06, Paul Battley <pbattley@gmail.com> wrote:

On 15/06/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> >I agree with Austin on this - the distinction is too vague. I'd leave
> >length and size the same and make a size_in_bytes method.
>
> On my latest prototype (not checked in anywhere), String#length and
> String#size behave the same, and there is String#buffer_size to return
> size in bytes. The method name might change in the future.
Actually, this makes a lot of sense. Why would you ever want to know
the actual byte length of a UTF-8 string? It's pretty meaningless for
most string-processing tasks: the main times you would need it would
be in allocation and interfacing with external systems and libraries.
Thus, something like buffer_size maps to real-world usage extremely
well, in my opinion.


Patrick_Hurley1 · 15 June 2006 13:13

I have to concur. If we are taking votes, I like the byte_ prefix:
byte_size, byte_length -- and these will remain important even within
pure ruby (e.g. when doing block based IO).

Which actually leads to an important question: how does the proposed
(2.0 or this library) string deal with incomplete trailing data? In
particular, if I am gathering data off a socket and end up with half
of a UTF-8 multi-byte sequence at the end of my buffer, what
will happen? The same thing will also happen at the beginning of strings
(when I continue reading on the socket and the rest of the data arrives).

One solution is to collect data in a non-encoded string, but we still
need a reasonable method to encode "as much as possible" and leave any
trailing bytes available for future byte based operations.

pth
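One possible shape for the "as much as possible" step, as a sketch only. The helper name split_complete_utf8 is hypothetical and is not part of Ruby or of the library discussed in this thread. It assumes Ruby 2.0 or later (String#b, String#byteslice) and only trims an incomplete character at the very end of the buffer.

```ruby
# Hypothetical helper: splits a byte buffer into
# [longest valid UTF-8 prefix, leftover trailing bytes].
def split_complete_utf8(buffer)
  bytes = buffer.b # binary copy, so every offset below is a byte offset
  # An incomplete trailing character is at most 3 bytes long in UTF-8.
  max_cut = [3, bytes.bytesize].min
  0.upto(max_cut) do |cut|
    keep = bytes.bytesize - cut
    head = bytes.byteslice(0, keep).force_encoding(Encoding::UTF_8)
    return [head, bytes.byteslice(keep, cut)] if head.valid_encoding?
  end
  raise ArgumentError, "buffer is not valid UTF-8 even after trimming"
end

message     = "h\u00E9llo \u20AC".b    # 10 bytes; the euro sign takes the last 3
first_read  = message.byteslice(0, 9)  # the read ends in the middle of the euro sign
second_read = message.byteslice(9, 1)  # the final byte arrives later

text, pending  = split_complete_utf8(first_read)
rest, leftover = split_complete_utf8(pending + second_read)
```

Here the pending bytes stay in a binary string until the next read completes the character, which matches the "collect data in a non-encoded string" idea.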

···

On 6/15/06, Austin Ziegler <halostatue@gmail.com> wrote:

On 6/15/06, Paul Battley <pbattley@gmail.com> wrote:
> On 15/06/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> > >I agree with Austin on this - the distinction is too vague. I'd leave
> > >length and size the same and make a size_in_bytes method.
> >
> > On my latest prototype (not checked in anywhere), String#length and
> > String#size behave the same, and there is String#buffer_size to return
> > size in bytes. The method name might change in the future.
> Actually, this makes a lot of sense. Why would you ever want to know
> the actual byte length of a UTF-8 string? It's pretty meaningless for
> most string-processing tasks: the main times you would need it would
> be in allocation and interfacing with external systems and libraries.
> Thus, something like buffer_size maps to real-world usage extremely
> well, in my opinion.

I agree. #buffer_size or #byte_size is probably sufficient. #size and
#length should return the number of glyphs in the string.

-austin

James_Edward_Gray_II · 15 June 2006 13:23

I agree.

James Edward Gray II

···

On Jun 15, 2006, at 8:13 AM, Patrick Hurley wrote:

I have to concur. If we are taking votes, I like the byte_ prefix:
byte_size, byte_length -- and these will remain important even within
pure ruby (e.g. when doing block based IO).

Yukihiro_Matsumoto2 · 15 June 2006 15:31

Hi,

···

In message "Re: A plan for another unicode string hack" on Thu, 15 Jun 2006 22:13:41 +0900, "Patrick Hurley" <phurley@gmail.com> writes:

Which actually leads to an important question: how does the proposed
(2.0 or this library) string deal with incomplete trailing data?

It raises an exception when you touch the broken character. In other
words, if you don't try to extract broken characters, you are free to
hold or concatenate incomplete UTF-8 strings.

matz.
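A sketch of holding and concatenating incomplete UTF-8 data, assuming Ruby 1.9.3 or later. It does not reproduce the exception behaviour of the unreleased prototype described above; it only shows that a cut-off string can be kept around, checked with String#valid_encoding?, and completed later.

```ruby
euro = "\u20AC"              # one character, three bytes in UTF-8
head = euro.byteslice(0, 2)  # first two bytes: an incomplete character
tail = euro.byteslice(2, 1)  # the remaining byte

head_ok   = head.valid_encoding?   # the sequence is cut short
joined    = head + tail            # concatenation of the two pieces
joined_ok = joined.valid_encoding? # the bytes are reunited

puts "head valid? #{head_ok}, joined valid? #{joined_ok}"
```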

Paul_Battley · 15 June 2006 13:59

I don't agree. If you are doing block-based IO with a string, wouldn't
the best way be just to set its encoding to raw/unset? It would then
behave as a byte-based string. There's no need for a whole suite of
byte_-prefixed methods.

Paul.
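A sketch of the "raw" approach, assuming Ruby 2.0 or later, where String#b returns a copy tagged with the binary (ASCII-8BIT) encoding. On such a copy the ordinary character-based methods count bytes, so no byte_-prefixed variants are needed for it.

```ruby
text = "na\u00EFve"  # five characters, six bytes in UTF-8
raw  = text.b        # same bytes, tagged as binary

text_length = text.length  # counted in characters
raw_length  = raw.length   # counted in bytes, because every byte is one "character"

puts "#{text.encoding}: #{text_length}, #{raw.encoding}: #{raw_length}"
```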

···

On 15/06/06, James Edward Gray II <james@grayproductions.net> wrote:

On Jun 15, 2006, at 8:13 AM, Patrick Hurley wrote:

> I have to concur. If we are taking votes, I like the byte_ prefix:
> byte_size, byte_length -- and these will remain important even within
> pure ruby (e.g. when doing block based IO).

I agree.

Julian_Julik_Tarkhan · 15 June 2006 14:25

You do need them because otherwise you have to switch contexts just to access bytes transparently:

  with_some_context do
    byte_length = str.length
  end

What I would love to have though is a reverse of my accessor thingy, so that we have a String#bytes, which bypasses to a byte string, but the methods of the string itself are character-bound.
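A sketch for comparison, assuming Ruby 1.9 or later. Released Ruby does have a String#bytes method, but it yields integer byte values rather than the byte-string proxy wished for here, so this only approximates the idea.

```ruby
str = "\u00E9t\u00E9"             # three characters

char_length = str.length          # character-bound
byte_values = str.bytes.to_a      # integer value of each byte
byte_length = byte_values.length  # byte-bound, with no context switch

puts "chars=#{char_length} bytes=#{byte_length} values=#{byte_values.inspect}"
```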

···

On 15-jun-2006, at 15:59, Paul Battley wrote:

On 15/06/06, James Edward Gray II <james@grayproductions.net> wrote:

On Jun 15, 2006, at 8:13 AM, Patrick Hurley wrote:

> I have to concur. If we are taking votes, I like the byte_ prefix:
> byte_size, byte_length -- and these will remain important even within
> pure ruby (e.g. when doing block based IO).

I agree.

I don't agree. If you are doing block-based IO with a string, wouldn't
the best way be just to set its encoding to raw/unset? It would then
behave as a byte-based string. There's no need for a whole suite of
byte_-prefixed methods.

--
Julian 'Julik' Tarkhanov
please send all personal mail to
me at julik.nl
