[ruby-talk:443098] Converting bytes into characters

Is it possible within Ruby code to read bytes 1 at a time and convert
them into characters such that you can detect when the byte sequence
needs at least 1 more byte to complete the character, as opposed to
detecting that the byte sequence is completely invalid? This is possible
if you have source and destination encodings and use
Encoding::Converter#primitive_convert, but in my case I just want to read
in the source encoding. I don't want to perform a character conversion.
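
To illustrate, when an actual conversion is involved, #primitive_convert
already makes exactly the distinction I'm after. A minimal sketch, using
UTF-16LE as an arbitrary destination just to force a conversion:

  # "\xC3" is the first byte of a 2-byte UTF-8 character (e.g. "\u00E9"):
  ec = Encoding::Converter.new("UTF-8", "UTF-16LE")
  ec.primitive_convert("\xC3".b, String.new, nil, nil, partial_input: true)
  # => :source_buffer_empty  (more input might complete the character)

  # "\xFF" can never begin a valid UTF-8 character:
  ec = Encoding::Converter.new("UTF-8", "UTF-16LE")
  ec.primitive_convert("\xFF".b, String.new, nil, nil)
  # => :invalid_byte_sequence  (no additional byte can help)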

I'm basically trying to reimplement IO#readchar given a stream of bytes
and a target encoding. In C, the implementation of IO#readchar has access
to encoding functions that are unavailable to Ruby code, and it uses them
to tell the difference between an incomplete byte sequence and an invalid
one.
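
Here's a tiny demonstration of the behavior I want to reproduce, using a
pipe (any IO would do):

  r, w = IO.pipe
  r.set_encoding("UTF-8")
  w.write("\xE0\xA0".b)  # only 2 of the character's 3 bytes so far;
                         # calling r.readchar here would block, not fail
  w.write("\x80".b)      # the final byte arrives
  r.readchar             # => "\u0800"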

The closest approach I've found so far is to put the bytes into a string
and call #force_encoding with my desired encoding. The problem is that
there seems to be no way to tell the difference between an incomplete
byte sequence and an invalid byte sequence: #valid_encoding? returns
false in both cases, so I don't know whether adding another byte from the
stream would resolve the issue.
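
For example, with the first 2 bytes of the 3-byte EURO SIGN ("\u20AC"):

  str = "\xE2\x82".b.force_encoding("UTF-8")
  str.valid_encoding?  # => false, even though appending "\xAC" would
                       # complete a perfectly valid character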

The workaround is to load enough extra bytes into the string to rule out
the possibility of an incomplete sequence and then fetch the first
character from the string. You'll get either a valid character or an
invalid one, and assuming you added enough bytes, an invalid character
resulted from an invalid byte sequence rather than an incomplete one.
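
A rough sketch of that workaround (the names are mine, and the hard-coded
limit of 4 bytes is a guess that happens to cover UTF-8):

  def read_char(stream, encoding, max_bytes = 4)
    buffer = "".b
    buffer << stream.readbyte until buffer.bytesize == max_bytes || stream.eof?
    # With enough bytes buffered, an invalid first character implies an
    # invalid sequence rather than an incomplete one. (This glosses over
    # pushing the unconsumed extra bytes back onto the stream.)
    buffer.force_encoding(encoding)[0]
  end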

There are 2 problems with this approach. The first is that the right
number of additional bytes to load in order to rule out an incomplete
sequence depends on the encoding, and that information doesn't appear to
be available from the Encoding objects themselves. The second is that
reading excess bytes in the case of an invalid sequence is inefficient
and may lead to unnecessary blocking while reading from the byte stream.

As an example, consider the following bytes:

[224, 160, 128]

Taken together, the 3 bytes form a valid character in UTF-8 (U+0800).
The following sequence of bytes is invalid, though, and cannot lead to a
valid character no matter what bytes follow:

[224, 128]
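
Checking these with #force_encoding shows the problem (the middle line is
the incomplete 2-byte prefix of the valid sequence):

  [224, 160, 128].pack("C*").force_encoding("UTF-8").valid_encoding?  # => true
  [224, 160].pack("C*").force_encoding("UTF-8").valid_encoding?       # => false
  [224, 128].pack("C*").force_encoding("UTF-8").valid_encoding?       # => false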

After reading in the first 2 bytes of each sequence, how can we know that
the first pair just needs 1 more byte to resolve things while the second
pair is a dead end?

-Jeremy