Unicode roadmap?

As I indicated in a later post, that's also acceptable.

-austin

···

On 6/28/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:

>That way, I *mark* the strings for which I want Unicode format. The
>encoding pragma makes it hard to do mixed content files.
I'd rather see r"\x89PNG\x0d\x0a\x1a\x0a" (or b"..."), since I expect
binary strings less often. It also removes unnecessary Unicode
expectation from users.

--
Austin Ziegler * halostatue@gmail.com * http://www.halostatue.ca/
               * austin@halostatue.ca * You are in a maze of twisty little passages, all alike. // halo • statue
               * austin@zieglers.ca

Except that @top is guaranteed to not have an encoding -- at least it
damned well better not -- and @top.bytes is redundant in this case. I
see no reason to access #bytes unless I know I'm dealing with a
multibyte String.

You never know if you are, that's the problem. And no, it's NOT redundant. You should just get used
to the fact that _all_ strings might become multibyte.

Worse, why would "Not PNG." be treated as Unicode
under your scheme but "\x89PNG\x0d\x0a\x1a\x0a" not be? I don't think
you're thinking this through.

@top[0, 8] is sufficient when you can guarantee that sizeof(char) ==
sizeof(byte).

You can NEVER guarantee that. N e v e r. More languages and more people use multibyte characters by default than all
ASCII users combined.

It seems a pity, but you still approach multibyte strings as something "special".

On "raw" strings, this is always the case.

The only way to distinguish "raw" strings from multibyte strings is to subclass (which sucks for you as a byte user and for me as a string user).

On all
strings, @top[0, 8] would return the appropriate number of characters
-- not the number of bytes. It just so happens on binary strings that
the number of characters and bytes is exactly the same.

This is a very leaky abstraction: you can never be sure what you will get. What's the problem with having bytes as an accessor?

What I'm arguing is that while the pragma may work for the less-common
encodings, both binary (non-)encoding and Unicode (probably UTF-8) are
going to be common enough that specific literal constructors are
probably a very good idea.

Python proved that to be wrong - both the subclassing part and the literals part.
Having to designate Unicode strings with literals is a bad design decision, and I can only suspect that it has to do with compiler intolerance
and the need to do preprocessing.

···

On 28-jun-2006, at 20:36, Austin Ziegler wrote:

--
Julian 'Julik' Tarkhanov
please send all personal mail to
me at julik.nl

I meant C in this part, sorry.

···

On 28-jun-2006, at 20:46, Julian 'Julik' Tarkhanov wrote:

Having to designate Unicode strings with literals is a bad design decision, and I can only suspect that it has to do with compiler intolerance
and the need to do preprocessing.

--
Julian 'Julik' Tarkhanov
please send all personal mail to
me at julik.nl

Except that @top is guaranteed to not have an encoding -- at least it
damned well better not -- and @top.bytes is redundant in this case. I
see no reason to access #bytes unless I know I'm dealing with a
multibyte String.

You never know if you are, that's the problem. And no, it's NOT
redundant. You should just get used to the fact that _all_ strings
might become multibyte.

How can you continue to be so wrong? All strings will *not* become
multibyte. Matz seems pretty committed to the m17n String, which means
that you're not going to get a Unicode String. This is *good*.

When you're not getting a String that is limited to Unicode, you don't
need a separate ByteArray. This is also good.

Worse, why would "Not PNG." be treated as Unicode under your scheme
but "\x89PNG\x0d\x0a\x1a\x0a" not be? I don't think you're thinking
this through.

@top[0, 8] is sufficient when you can guarantee that sizeof(char) ==
sizeof(byte).

You can NEVER guarantee that. N e v e r. More languages and more
people use multibyte characters by default than all ASCII users
combined.

Again, you are wrong. Horribly so. I *can* guarantee that sizeof(char)
== sizeof(byte) if String#encoding is a single-byte encoding or is "raw"
(or "binary", whichever Matz uses).

It seems a pity, but you still approach multibyte strings as
something "special".

It seems very sad, but you still aren't willing to comprehend what I'm
saying.

On "raw" strings, this is always the case.

The only way to distinguish "raw" strings from multibyte strings is to
subclass (which sucks for you as a byte user and for me as a string
user).

Incorrect. I do not need to have:

  UnicodeString
  BinaryString
  USASCIIString
  ISO88591String

Never have. Never will.

What you're not understanding -- and at this point, I am *really*
thinking that it's willful -- is that I don't consider multibyte strings
"special." I consider *all encodings* special. But I also don't think I
need full *classes* to support them. (I know for a fact that I don't.)
What's special is the encoding, not the string. Any string -- including
a UTF-32 string -- is *merely* a sequence of bytes. The encoding tells
me how large my "characters" are in terms of bytes. The encoding can
tell me more than that, too. This means that an encoding is simply a
*lens* through which that sequence of bytes gains meaning.

Therefore, I can do:

  s = b"Wh\xc3\xa4t f\xc3\xb6\xc3\xb6l\xc3\xafshn\xc3\xabss."
  s.encoding = :utf8
  s # "Whät föölïshnëss."

Gee. No subclass involved.

A substring of a "binary" (unencoded) string is simply the bytes
involved.
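
Extending the earlier hypothetical syntax one step (again, the b"..."
literal and #encoding= are proposals, not real Ruby), the same bytes
simply slice differently depending on which lens is applied:

  s = b"Wh\xc3\xa4t"   # five bytes, no encoding
  s[0, 3]              # binary lens: three bytes, "Wh\xc3"
  s.encoding = :utf8
  s[0, 3]              # UTF-8 lens: three characters, "Whä"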

We're not talking rocket science here. We're talking being smart,
instead of being lemmings who apparently want Ruby to be more like Java.

On all strings, @top[0, 8] would return the appropriate number of
characters -- not the number of bytes. It just so happens on binary
strings that the number of characters and bytes is exactly the same.

This is a very leaky abstraction: you can never be sure what you will
get. What's the problem with having bytes as an accessor?

What's the need, if I *know* that what I'm testing against is going to
be dealt with bytewise? You're expecting programmers to be stupid. I'm
expecting them to be smarter than that. Uninformed, perhaps, but not
stupid.

(And I would know in this case because the ultimate API that calls this
will have been given image data.)

What I'm arguing is that while the pragma may work for the
less-common encodings, both binary (non-)encoding and Unicode
(probably UTF-8) are going to be common enough that specific literal
constructors are probably a very good idea.

Python proved that to be wrong - both the subclassing part and the
literals part.

Python proved squat. Especially since you continue to think that I'm
talking about subclassing. Which I'm not and never have been.

Having to designate Unicode strings with literals is a bad design
decision, and I can only suspect that it has to do with compiler
intolerance and the need to do preprocessing.

Have to nothing. You're simply not willing to understand anything that
doesn't bow to the god of Unicode. This has nothing to do with your
stupid assumptions, here. This has everything to do with being smarter
than you're apparently wanting Ruby to be.

The special literals are convenience items only. Syntax sugar. The real
magic is in the assignment of encodings. And those are *always* special,
whether you want to pretend such or not.

I'm through with trying to argue with you and a few others who aren't
listening and suggesting the same backwards stuff over and over again
without considering that you might be wrong. Contrary to what you might
believe, I *have* looked at a lot of this stuff and have really reached
the point where I consider Unicode-only strings and separate class
hierarchies a waste of everyone's time and energy.

Argue for first-class Unicode support. But you should do so within the
framework which Matz has said he prefers (m17n String and no separate
byte array). Think about API changes that can make this valuable. I
think that Matz *has* settled on the basic data structure, though, and
it's a fight you probably won't win with him. Since, as he pointed out
to Charles Nutter, he's in the percentage of humanity which needs to
deal with non-Unicode more than it needs to deal with Unicode.

-austin

···

On 6/28/06, Julian 'Julik' Tarkhanov <listbox@julik.nl> wrote:

On 28-jun-2006, at 20:36, Austin Ziegler wrote:

--
Austin Ziegler * halostatue@gmail.com * http://www.halostatue.ca/
               * austin@halostatue.ca * You are in a maze of twisty little passages, all alike. // halo • statue
               * austin@zieglers.ca

Byte arrays – memory blocks, whatever – *do* have their uses, although
mainly not for string ops. I know I've used memory blocks a lot, for
image processing or other exotic tasks. But never, far as I can tell,
for strings. However, byte-level ops can be useful on strings. I can
see two uses:

One is for 1-byte encodings. If you know that char==byte, byte-level
ops will speed up processing of the strings, since no second guessing
has to be done.

Another is because sometimes you have to rip multi-byte chars open and
look at their entrails. Say I want to decompose a hangul syllable into
its primary letters. Unless there is a function provided for that –
fat chance considering the lack of interest in Unicode from the BDFL –
I'll have to do my own cooking at the byte-level.
Example:
irb(main):001:0> "한글".length
=> 2
irb(main):002:0> "かたかな".lengthB
=> 12
[assuming utf-8 here of course.]
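
For the hangul case, the decomposition arithmetic itself is fixed by
the Unicode standard (chapter 3). What follows is only a rough sketch
of it; note that even getting the codepoint out of a UTF-8 string in
today's Ruby already means unpacking bytes, which is exactly my point:

  # Decompose a precomposed hangul syllable into its jamo.
  S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
  V_COUNT, T_COUNT = 21, 28

  def decompose_hangul(cp)
    s = cp - S_BASE
    l = L_BASE + s / (V_COUNT * T_COUNT)
    v = V_BASE + (s % (V_COUNT * T_COUNT)) / T_COUNT
    t = T_BASE + s % T_COUNT
    t == T_BASE ? [l, v] : [l, v, t]
  end

  decompose_hangul("한".unpack("U*").first)
  # => [4370, 4449, 4523], i.e. 0x1112, 0x1161, 0x11AB (ᄒ ᅡ ᆫ)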

I don't really care about a memory block [using this term instead of
bytearray so that I don't get classified in any camp :)], but if
Strings go encodings-aware [hooray], we'll need both types of
operations...

However, I think it is a bit psychotic to base the foundations of an
important feature of the language on the whims and needs[?] of a
*small* percentage of the user base. Unicode is an international,
*working* standard, whereas this m17n thing has little to show so far,
both in terms of production and acceptance [who uses m17n outside a
few agitated fellows inside Japan?].

Besides, while some variants of sinograms, aka kanji, and other exotic
chars, may not be in the Unicode project *yet* [including the first
sinogram of my wife's given name, which is not to be found in any
dictionary listing less than 50,000 sinograms; yeah, blame my
father-in-law...], what's in there for CJKV covers day-to-day needs of
most people. Seriously, how many times have you seen transcripts of
bone inscriptions on web sites or e-docs? Or arcane kanji pulled out
of the Morohashi? Or chu nom chars? Or Jurchen script? Sure, some
people do work with this stuff. I studied this stuff, and probably
would have liked a way to input/display them. But how many? And how
many use^H^H^H know of Ruby? Let's not lose focus on who's using
what...

my 0.02€

···

--
Didier

On 6/28/06, Austin Ziegler <halostatue@gmail.com> wrote:

Argue for first-class Unicode support. But you should do so within the
framework which Matz has said he prefers (m17n String and no separate
byte array). Think about API changes that can make this valuable. I
think that Matz *has* settled on the basic data structure, though, and
it's a fight you probably won't win with him. Since, as he pointed out
to Charles Nutter, he's in the percentage of humanity which needs to
deal with non-Unicode more than it needs to deal with Unicode.

-austin

Austin Ziegler wrote:

Again, you are wrong. Horribly so. I *can* guarantee that sizeof(char)
== sizeof(byte) if String#encoding is a single-byte encoding or is "raw"
(or "binary", whichever Matz uses).

I think his point is that for any arbitrary string you cannot guarantee
that it is in a single-byte encoding, and that your code should be written:

  raise "Not PNG." unless (SINGLE_BYTE_ENCODINGS.include?(@top.encoding)
&& @top[0, 8] == "\x89PNG\x0d\x0a\x1a\x0a")

Of course, if you can guarantee that @top is indeed a single byte
encoded BEFORE hitting this line, then the encoding test is not needed
(and I think you assume that).

But in the general case, it just seems easier to write:

  raise "Not PNG." unless @top.byte(0, 8) == "\x89PNG\x0d\x0a\x1a\x0a"

which will work in all cases, without any effort to ensure the
precondition that the encoding is a single-byte encoding.
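
String#byte is hypothetical, of course. A minimal sketch of how such an
accessor could behave, emulated on today's byte-oriented (1.8) String:

  class String
    def byte(offset, length = 1)
      # Slice raw bytes, regardless of any character encoding.
      unpack("C*")[offset, length].pack("C*")
    end
  end

  @top = "\x89PNG\x0d\x0a\x1a\x0a" + "rest of the image data"
  raise "Not PNG." unless @top.byte(0, 8) == "\x89PNG\x0d\x0a\x1a\x0a"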

-- Jim Weirich

···

--
Posted via http://www.ruby-forum.com/.

I'll give a little ground on a few points. Perhaps I had a dream that
adjusted my perspectives a bit.

- String == ByteArray is reasonable if String is considered to be a
"ByteChunker". The byte-chunking logic holds true for both binary string and
encoded string models; what's parametric about it is the size of the chunks.
While I still believe that a general-purpose, high-performance ByteArray
would be useful (perhaps preferable) for many operations, I will concede
that ByteArray + ChunkSizer (encoding) := m17n String. I won't say I'm sold
on the ByteChunker pattern, but I think it will be easier to accept and
discuss m17n Strings from a ByteChunker perspective. It also may be fair to
say that m17n String provides a "view" into the underlying byte array, which
could in the raw case be a wholly-transparent view.
- If the intent is to provide a String that supports all encodings
universally, I will concede that the m17n String is probably the only way to
do it. As far as I know, there's no one character encoding, code page, or
character set that can encompass all other encodings without fail. Unicode
does, despite what detractors may say, make a truly gallant attempt to
achieve that impossible goal, and it deserves the 90+% of humanity that use
Unicode or Unicode-encodable character sets exclusively. But if at the end
of the day Ruby really needs a kitchen-sink approach to character encoding,
Unicode will not fit that requirement.

So, a short glossary:

String == ByteChunker
chunk == character == n bytes in a specific order

The first item above brings out a few discussion points:
1. String provides an interface for managing a collection of ByteChunks. The
sizing and nature of this "chunking" is primarily based on character
encoding logic. I'll refer to String and ByteChunker interchangeably from
here on out.
2. Indexed operations act upon chunks, not bytes. It may be the case that
for some encodings, sizeof(chunk) == sizeof(byte). No assumptions should be
made about chunk size.
3. Altering String semantics from "byte ops always" to "chunk ops always"
also implies that chunked operations should not be generally purposed toward
byte-level operations, since there is no explicit guarantee you'll work with
byte-sized (har har) chunks
4. Therefore it should be mandatory and acceptable under the supposed
ByteChunker contract to provide a minimal set of explicitly byte-sized
operations, since the purpose of chunking is to provide a way of consuming
and digesting bytes. It would not be useful or recommended to completely
hide those raw bytes under any circumstances, since byte-level operations
will always be valid on a ByteChunker in the absence of a more specific
ByteArray type.
5. Byte-sized operations should be STRONGLY ENCOURAGED for byte-level work
over chunk-sized operations due to the changing size and nature of chunks.
This would mean that [0..5] should never be used instead of byte(0..5) for
retrieving the first five bytes in a ByteChunker
6. Methods on other classes whose purpose is to manipulate character data
(chunks) logically should never be assumed to work with byte-sized chunks
only (regex and friends)

A common theme here is that ByteChunker does have a set of logical
semantics, and m17n Strings as planned appear to be ByteChunkers. This seems
like a reasonable abstraction to me, though it does expose implementation
details many of us would prefer to keep hidden (namely that we're chunking
bytes, when a consumer shouldn't need to know what chunks are composed of).
If we can reasonably attempt to define the ByteChunker semantics, we can see
where the holes are.
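
As a toy sketch of that contract (all names hypothetical, 1.8-era byte
semantics, and only a :binary and a :utf8 "encoding" are known), the
chunk/byte split from points 2 through 5 might look like:

  class ByteChunker
    def initialize(bytes, encoding = :binary)
      @bytes, @encoding = bytes, encoding
    end

    # Chunk (character) operations -- width depends on the encoding.
    def chunks
      if @encoding == :utf8
        @bytes.unpack("U*").map { |cp| [cp].pack("U") }
      else
        @bytes.scan(/./mn)   # one-byte chunks
      end
    end

    def [](index, length = 1)
      chunks[index, length].join
    end

    def length
      chunks.size   # number of chunks, not bytes
    end

    # Byte operations -- explicit and encoding-independent (point 4).
    def byte(index, length = 1)
      @bytes[index, length]
    end

    def lengthB
      @bytes.size
    end
  end

  s = ByteChunker.new("\xed\x95\x9c\xea\xb8\x80", :utf8)  # "한글" in UTF-8
  s.length    # => 2 (chunks)
  s.lengthB   # => 6 (bytes)
  s[0]        # => "한" (one chunk, three bytes)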

I could do without a separate ByteArray if the m17n String provided explicit
byte-sized operations. The dual-purposing of String ops for both bytes and
chunks is very worrisome since it's bound to happen that chunk operations
get incorrectly used for byte operations when sizeof(chunk) != sizeof(byte).
For byte-sensitive cases, rather than saying "I know that my String is in
encoding X in which all chunks are byte-sized" it would be far safer (and
better encapsulation of String state) to say "I know I need to work with
bytes all the time". The byte-sized operations then follow. (Granted,
there's no way to force people *ahem Austin* to use the "byte-safe" methods,
but it feels like a really good best-practice to me).

A caveat to all this is that ByteChunker semantics are inherently more
complex than CharacterSequence semantics, and so the proposed m17n String
*is* a more complicated solution than using a single internal encoding. I'm
also not convinced that ByteChunker's semantics are simpler than separate
CharacterSequence and ByteArray semantics, though they may be more
"Ruby." ByteChunker is a more useful general-purpose entity in itself
than CharacterSequence or ByteArray alone.

complexity(ByteChunker) > (complexity(ByteArray) or
complexity(CharacterSequence))
complexity(ByteArray and CharacterSequence) maybe > complexity(ByteChunker)
generality(ByteChunker) > (generality(ByteArray) or
generality(CharacterSequence))

I think it's still a valid question whether there's not a happy medium
somewhere that would make life easiest for the folks using unicode. Rubyists
are fond of saying that Ruby makes easy problems easy and hard problems
possible. I would argue that unicode support should be the "easy problem"
that's easy and that support for incompatible encodings--worldwide--should
be the "hard problem" that's possible. Any plans for m17n that make unicode
harder to work with in Ruby than in comparable languages could prove fatal.

···

--
Charles Oliver Nutter @ headius.blogspot.com
JRuby Developer @ www.jruby.org
Application Architect @ www.ventera.com

Does it even make sense to talk about 'encodings' in the context of binary data?

I suppose you could extend the concept of encoding to capture some sort of
mime type characterization of the data but isn't that a bit beyond what this
thread has been talking about?

I like Austin's idea of an encoding as a 'lens' with respect to the raw data.

It is getting pretty hard to follow this entire discussion in the absence of
some concrete examples of the imagined APIs as well as some sort of taxonomy
of use cases with which to evaluate the APIs. For example:

  - create a copy of a text file when the text encoding is unknown
  - transmit a copy of a text file across a TCP/IP connection when the
    encoding is unknown
  - analyze binary data and guess at its text encoding
  - convert PNG image to a GIF image, in memory, to/from disk
  - input n characters from the keyboard/stdin/tty
  - count the number of words, lines, and characters in a file
       with an explicit encoding
       with an implicit encoding associated with a given locale
       with an implicit encoding associated with the process/thread

and so on.

Gary Wright

···

On Jun 28, 2006, at 4:25 PM, Jim Weirich wrote:

  raise "Not PNG." unless (SINGLE_BYTE_ENCODINGS.include?(@top.encoding)
&& @top[0, 8] == "\x89PNG\x0d\x0a\x1a\x0a")

Of course, if you can guarantee that @top is indeed a single byte
encoded BEFORE hitting this line, then the encoding test is not needed
(and I think you assume that).

Charles O Nutter wrote:

The dual-purposing of String ops for both bytes
and chunks is very worrisome since it's bound to happen
that chunk operations get incorrectly used for byte
operations when sizeof(chunk) != sizeof(byte).

I also have this concern.

Here's a radical idea. Perhaps it is time to deprecate the str[*arg]
operation in favor of str.char(*arg) and str.byte(*arg) style operations,
making it explicit which operation is to be used. It will break a lot
of code, but then, changing the semantics from byte-oriented to
character-oriented operations will probably *silently* break a lot of
code as well. All things considered, I would prefer noisy breaking.

Like I said, it's a radical idea.
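
In the same spirit, a sketch of the explicit pair (char/byte names as
proposed above; the character side assumes UTF-8 and 1.8-era byte
semantics underneath):

  class String
    def char(index, length = 1)
      # Index by character (UTF-8 assumed for this sketch).
      unpack("U*")[index, length].pack("U*")
    end

    def byte(index, length = 1)
      # Index by raw byte, ignoring any encoding.
      unpack("C*")[index, length].pack("C*")
    end
  end

  "한글".char(1)      # => "글"
  "한글".byte(0, 3)   # => "\xed\x95\x9c" (the bytes of "한")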

-- Jim Weirich

···

--
Posted via http://www.ruby-forum.com/.

I'll give a little ground on a few points. Perhaps I had a dream that
adjusted my perspectives a bit.

- String == ByteArray is reasonable if String is considered to be a
  "ByteChunker". [...]

This is essentially how I view Strings. What makes a String special is
NOT the fact that it's a String. It's the encoding associated with the
String. This is an important distinction.

[...] It also may be fair to say that m17n String provides a "view"
into the underlying byte array, which could in the raw case be a
wholly-transparent view.

Precisely. This is why I've been describing encodings as a lens.

- If the intent is to provide a String that supports all encodings
universally, [...] it deserves the 90+% of humanity that use Unicode
or Unicode-encodable character sets exclusively. But if at the end of
the day Ruby really needs a kitchen-sink approach to character
encoding, Unicode will not fit that requirement.

And I believe this to be the case. But I *also* believe that Ruby's
support for Unicode needs to be first-rate. Where I am getting most
frustrated is that few people have understood that -- and even fewer
have understood that first-rate support for Unicode isn't incompatible
with m17n String.

[...]

2. Indexed operations act upon chunks, not bytes. It may be the case
that for some encodings, sizeof(chunk) == sizeof(byte). No
assumptions should be made about chunk size.

Chunk size is variable based on the encoding. Specifically:

  sizeof(char) == sizeof(chunk)

3. Altering String semantics from "byte ops always" to "chunk ops
   always" also implies that chunked operations should not be
   generally purposed toward byte-level operations, since there is no
   explicit guarantee you'll work with byte-sized (har har) chunks

Correct. And in this case, a #bytes accessor may make it possible to
perform byte-level operations *explicitly*. I will grant that much.

[...]

5. Byte-sized operations should be STRONGLY ENCOURAGED for byte-level
   work over chunk-sized operations due to the changing size and
   nature of chunks. This would mean that [0..5] should never be used
   instead of byte(0..5) for retrieving the first five bytes in a
   ByteChunker

For some data, though, this may be irrelevant (the PNG example I gave
earlier).

6. Methods on other classes whose purpose is to manipulate character
   data (chunks) logically should never be assumed to work with
   byte-sized chunks only (regex and friends)

I believe that this is fair.

I could do without a separate ByteArray if the m17n String provided
explicit byte-sized operations. The dual-purposing of String ops for
both bytes and chunks is very worrisome since it's bound to happen
that chunk operations get incorrectly used for byte operations when
sizeof(chunk) != sizeof(byte). For byte-sensitive cases, rather than
saying "I know that my String is in encoding X in which all chunks are
byte-sized" it would be far safer (and better encapsulation of String
state) to say "I know I need to work with bytes all the time". The
byte-sized operations then follow. (Granted, there's no way to force
people *ahem Austin* to use the "byte-safe" methods, but it feels like
a really good best-practice to me).

And that may be the case. But I also think that it's sometimes
unnecessary to use byte-safe methods even when they're available. I
can guarantee that JPEG
data will not be Unicode *per se*. Certain data (EXIF, for example)
inside of the JPEG could theoretically be Unicode, but it will always be
stored in what is *clearly* a binary data area and converted afterwards.

A caveat to all this is that ByteChunker semantics are inherently more
complex than CharacterSequence semantics, and so the proposed m17n
String *is* a more complicated solution than using a single internal
encoding. I'm also not convinced that ByteChunker's semantics are
simpler than separate CharacterSequence and ByteArray semantics, though
they may be more "Ruby." ByteChunker is a more useful
general-purpose entity in itself than CharacterSequence or ByteArray
alone.

Also note that separating CharacterSequence really *only* helps if
there's a single encoding to deal with. Otherwise, the CharacterSequence
has to interpret using ByteChunker-like facilities anyway.

[Edit: (C == complexity; G == generality)]

C(ByteChunker) > (C(ByteArray) or C(CharacterSequence))
C(ByteArray and CharacterSequence) maybe > C(ByteChunker)
G(ByteChunker) > (G(ByteArray) or G(CharacterSequence))

Um. As far as implementation complexity is concerned, I would agree with
your statements here. However, I firmly believe that the *use*
complexity of a ByteChunker is lower than that of ByteArray and
CharSequence.

I think it's still a valid question whether there's not a happy medium
somewhere that would make life easiest for the folks using unicode.
Rubyists are fond of saying that Ruby makes easy problems easy and
hard problems possible. I would argue that unicode support should be
the "easy problem" that's easy and that support for incompatible
encodings--worldwide--should be the "hard problem" that's possible.
Any plans for m17n that make unicode harder to work with in Ruby than
in comparable languages could prove fatal.

Right. The problem is, adding a ByteArray makes a currently easy thing
harder, which is byte manipulation and acquisition. The added complexity
of that is not worth Unicode, IMO. But I do *not* see this as an either-or.
To be perfectly clear, I don't care if the ByteChunker is harder for
matz or someone else to implement if the API available for Unicode- and
ByteArray-semantics is at least as expressive and as powerful as what we
have today.

-austin

···

On 6/28/06, Charles O Nutter <headius@headius.com> wrote:
--
Austin Ziegler * halostatue@gmail.com * http://www.halostatue.ca/
               * austin@halostatue.ca * You are in a maze of twisty little passages, all alike. // halo • statue
               * austin@zieglers.ca

raise "Not PNG." unless (SINGLE_BYTE_ENCODINGS.include?
(@top.encoding) && @top[0, 8] == "\x89PNG\x0d\x0a\x1a\x0a")

Of course, if you can guarantee that @top is indeed a single byte
encoded BEFORE hitting this line, then the encoding test is not
needed (and I think you assume that).

Does it even make sense to talk about 'encodings' in the context of
binary data?

I suppose you could extend the concept of encoding to capture some
sort of mime type characterization of the data but isn't that a bit
beyond what this thread has been talking about?

I like Austin's idea of an encoding as a 'lens' with respect to the
raw data.

It is getting pretty hard to follow this entire discussion in the
absence of some concrete examples of the imagined APIs as well as some
sort of taxonomy of use cases with which to evaluate the APIs. For
example:
  - create a copy of a text file when the text encoding is unknown
  - transmit a copy of a text file across a TCP/IP connection when the
    encoding is unknown

If you don't know the encoding, you must use binary (unencoded) data.
That will be unchanged ... unless we have a ByteArray.

  without ByteArray, without pragma:
    File.open(a, "r") { |b| File.open(c, "w") { |d| d.write b.read } }

  without ByteArray, with pragma:
    File.open(a, "r", encoding: "binary") { |b|
      File.open(c, "w", encoding: "binary") { |d|
        d.write b.read
      }
    }

  with ByteArray:
    File.open(a, "r") { |b|
      File.open(c, "w") { |d|
        d.write_bytes b.read_bytes
      }
    }

  - analyze binary data and guess at its text encoding

This is a "hard" problem and not demonstratable easily; there are
expensive programs out there that do this. The problem is that encodings
like ISO-8859-1 and ISO-8859-5 *mean* different things and you'd have
serious textual analysis to determine which you're looking at. On the
other hand, simple analysis (e.g., determining ISO-8859-* but not which
one, as opposed to UTF-8) may be possible.
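
A rough sketch of that simpler analysis (1.8-era byte semantics; the
UTF-8 shape check is deliberately loose and ignores, e.g., overlong
forms). Valid multibyte UTF-8 rarely occurs by accident, so a
well-formedness test separates UTF-8 from the ISO-8859 family fairly
well:

  def looks_like_utf8?(bytes)
    # Loose well-formedness check over raw bytes.
    bytes =~ /\A(?: [\x00-\x7f]                 # ASCII
                  | [\xc2-\xdf][\x80-\xbf]      # 2-byte sequence
                  | [\xe0-\xef][\x80-\xbf]{2}   # 3-byte sequence
                  | [\xf0-\xf4][\x80-\xbf]{3}   # 4-byte sequence
                  )*\z/xn ? true : false
  end

  looks_like_utf8?("caf\xc3\xa9")   # => true  (UTF-8 "café")
  looks_like_utf8?("caf\xe9")       # => false (likely Latin-1)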

  - convert PNG image to a GIF image, in memory, to/from disk

Use RMagick. :wink:

  - input n characters from the keyboard/stdin/tty

I'm not sure, to be honest. This is unlikely to be cross-platform.

-austin

···

On 6/28/06, gwtmp01@mac.com <gwtmp01@mac.com> wrote:

On Jun 28, 2006, at 4:25 PM, Jim Weirich wrote:

--
Austin Ziegler * halostatue@gmail.com * http://www.halostatue.ca/
               * austin@halostatue.ca * You are in a maze of twisty little passages, all alike. // halo • statue
               * austin@zieglers.ca

I can't say that I *like* it; it's *clean* to say str[0..5], but I
can't say it's a bad idea either.

Just radical.

-austin

···

On 6/28/06, Jim Weirich <jim@weirichhouse.org> wrote:


--
Austin Ziegler * halostatue@gmail.com * http://www.halostatue.ca/
               * austin@halostatue.ca * You are in a maze of twisty little passages, all alike. // halo • statue
               * austin@zieglers.ca

I think people understand what you want. But those of us who've done a lot of i18n work know how hard it is to get things right; for example, the single hardest piece of writing an efficient XML parser is dealing with the character input/output. Those of us who write search engines and have sweated the language-sensitive tokenization details are also paranoid about these problems. We also know that it is *possible* to get things right, if you adopt the limitation that characters are Unicode characters.

Matz is making a strong claim: that he can write a class that will get Unicode right and also handle arbitrary other character sets and encodings, and serve as a byte buffer (it's a floor wax *and* a dessert topping!) and do this all with acceptable correctness and efficiency. This has not previously been done that I know of. If he can pull it off, that's super. It's not unreasonable to worry, though.

I would offer one piece of advice for the m17n implementation: have a unicode/non-unicode mode bit, and in the case that it's Unicode, pick one encoding and stick to it (probably UTF-8, because that's friendlier to C programmers). The reason that this is a good idea is that if you know the encoding, then for certain performance-critical tasks (e.g. regexp) you can do sleazy low-level optimizations that run on the encoding rather than on the chunked chars.

Yes, you'd have to do conversion of all the 8859 and JIS and Big5 and so on going in and out, but if the volume is big enough that you care, there'll be disks involved, and you can transcode way faster than I/O speeds, so the conversion cost will probably not be observable.

Among other things, I want to be able to process XML in Ruby really really fast, and in XML you *know* that it's all Unicode characters; so it would be nice to leave the door open for low-level Unicode-specific optimizations.

  -Tim

···

On Jun 28, 2006, at 2:51 PM, Austin Ziegler wrote:

And I believe this to be the case. But I *also* believe that Ruby's
support for Unicode needs to be first-rate. Where I am getting most
frustrated is that few people have understood that -- and even fewer
have understood that first-rate support for Unicode isn't incompatible
with m17n String.

Hi,

Matz is making a strong claim:
that he can write a class that will get Unicode right and also handle
arbitrary other character sets and encodings, and serve as a byte
buffer (it's a floor wax *and* a dessert topping!) and do this all
with acceptable correctness and efficiency. This has not previously
been done that I know of. If he can pull it off, that's super. It's
not unreasonable to worry, though.

Have you ever heard of a regular expression engine (one of the hardest
parts of text processing to implement) that handles more than 30
different encodings without conversion, _and_ runs faster than PCRE?

If you have, you might be able to believe in the existence of something
that is a floor wax and a dessert topping at the same time.

If you haven't, I can tell you that it is named Oniguruma, the regular
expression engine that comes with Ruby 1.9.

              matz.

···

In message "Re: Unicode roadmap?" on Thu, 29 Jun 2006 08:36:12 +0900, Tim Bray <tbray@textuality.com> writes:

But then you risk that people would lick the floor, which some may
find unacceptable :wink:

Michal

···

On 6/29/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:

If you have, you might be able to believe in the existence of something
that is a floor wax and a dessert topping at the same time.

Have you ever heard of a regular expression engine (one of the hardest
parts of text processing to implement) that handles more than 30
different encodings without conversion, _and_ runs faster than PCRE?

...

If you haven't, I can tell you that it is named Oniguruma, the regular
expression engine that comes with Ruby 1.9.

I'd heard of it but I hadn't tried it until now. Previously I have done quantitative measurement of the performance of Perl vs. Java regex engines (conclusion: Java is faster but perl is safer, see http://www.tbray.org/ongoing/When/200x/2005/11/20/Regex-Promises).

I thought I would compare Oniguruma, so I downloaded it and compiled it and ran some tests and looked at the documentation. (http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt and http://www.geocities.jp/kosako3/oniguruma/doc/API.txt, or is there something better?)

Oniguruma is very clever; support for multiple different regex syntaxes? Wow.

The documentation needs a little work; the example files such as simple.c do not correspond very well (e.g. ONIG_OPTION_DEFAULT).

But I think I must be missing something, because I can't run my test. It is a fast approximate word counter for large volumes of XML. Here is how the regular expression is built in Perl:

my $stag = "<[^/]([^>]*[^/>])?>";
my $etag = "</[^>]*>";
my $empty = "<[^>]*/>";

my $alnum =
     "\\p{L}|" .
     "\\p{N}|" .
     "[\\x{4e00}-\\x{9fa5}]|" .
     "\\x{3007}|" .
     "[\\x{3021}-\\x{3029}]";
my $wordChars =
     "\\p{L}|" .
     "\\p{N}|" .
     "[-._:']|" .
     "\\x{2019}|" .
     "[\\x{4e00}-\\x{9fa5}]|" .
     "\\x{3007}|" .
     "[\\x{3021}-\\x{3029}]";
my $word = "(($alnum)(($wordChars)*($alnum))?)";

my $regex = "($stag)|($etag)|($empty)|$word";

full regex: (<[^/]([^>]*[^/>])?>)|(</[^>]*>)|(<[^>]*/>)|((\p{L}|\p{N}|[\x{4e00}-\x{9fa5}]|\x{3007}|[\x{3021}-\x{3029}])((\p{L}|\p{N}|[-._:']|\x{2019}|[\x{4e00}-\x{9fa5}]|\x{3007}|[\x{3021}-\x{3029}])*(\p{L}|\p{N}|[\x{4e00}-\x{9fa5}]|\x{3007}|[\x{3021}-\x{3029}]))?)

I have a very specific idea of what I mean by "word". \w is nice but it's not what I mean.

As far as I can tell, \p{L} and so on don't work, so I can't do this in Oniguruma. Error message: "ERROR: invalid character property name {L}". So a bit more work is required to support Unicode? (Supporting the properties from Chapter 4 is very important.) Or am I mis-reading the documentation? I did it in C because simple.c was there, would it make a difference if I did it from Ruby 1.9?

   -Tim

···

On Jun 28, 2006, at 8:20 PM, Yukihiro Matsumoto wrote:

FWIW, OniGuruma is the regex engine used by SubEthaEdit – via the
OgreKit [OniGuruma RegEx Kit for Cocoa]. I am not too sure about its
being faster than PCRE – tests in SEE and BBEdit don't show anything
conclusive. One of my pet peeves with OgreKit/SEE is that it treats
the full text as one line by default, making ^...$ useless [and in the
version of SEE I use ^ doesn't work...].

OTOH, the good thing about OgreKit/SEE is that \w+ on 한글日本語dodo will
catch the whole yahzoo, whereas in PCRE/BBEdit only dodo will get
caught. Yet again, \p{L} works in PCRE, which helps refine what one
wants to call a word, as Mr. Bray showed.

···

--
Didier

On 6/30/06, Tim Bray <tbray@textuality.com> wrote:
