Unicode

Scratch that! I must've gone cross-eyed!
My C++ code was indeed smaller in file size in UTF-8 than in UTF-16, as I expected!
Interestingly, *nix systems apparently use UTF-32 internally regardless of the source encoding... very interesting.

···

On Sep 29, 2007, at 2:29 PM, James Edward Gray II wrote:

On Sep 29, 2007, at 2:13 PM, John Joyce wrote:

The short version is that UTF-16 is basically wasteful.

That's not always accurate:

$ iconv -f utf-8 -t utf-16 japanese_prose_in_utf8.txt > japanese_prose_in_utf16.txt
Firefly:~/Desktop$ wc japanese_prose_in_utf8.txt
      14 66 5921 japanese_prose_in_utf8.txt
Firefly:~/Desktop$ wc japanese_prose_in_utf16.txt
      16 45 3968 japanese_prose_in_utf16.txt
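
Seen from Ruby itself, the numbers come out the same way. A rough
sketch using the iconv standard library; the sample strings are made
up:

  require 'iconv'   # standard library in Ruby 1.8

  ascii    = "plain ASCII text"
  japanese = "日本語のテキスト"      # UTF-8 source text

  [ascii, japanese].each do |s|
    utf16 = Iconv.conv("UTF-16LE", "UTF-8", s)
    puts "UTF-8: #{s.length} bytes, UTF-16LE: #{utf16.length} bytes"
  end
  # String#length is a byte count in Ruby 1.8: the ASCII line roughly
  # doubles in UTF-16, while the Japanese line shrinks from 3 bytes
  # per character to 2.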

James Edward Gray II

Felipe Contreras wrote:

I didn't find anything new. It's just explaining character sets in a
rather non-specific way. ASCII uses 8 bits, so it can store 256
characters, so it can't store all the characters in the world, so
other character sets are needed (really? I would have never guessed
that). UTF-16 basically stores characters in 2 bytes (which covers
more of the world's characters), while UTF-8 also allows more
characters but doesn't necessarily need 2 bytes: it uses 1, and if the
character is beyond 127 it uses 2 or more. This whole thing can be
extended up to 6 bytes.

So what exactly am I looking for here?

ASCII is a 7-Bit Encoding with 128 characters in the set.

Most PCs these days use an 8-bit byte. I'm no rocket scientist when it comes
to CPU architectures or character encodings, but I would think the machine's
byte or word size would be the most logical choice....

Most of my files are in UTF-8 or ISO 8859-1 (and probably some Windows-1252).
As far as I know, UTF-8 and Latin-1 are compatible in the first 128 characters
because of how widespread ASCII is.
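
To check that overlap concretely, a quick sketch with Ruby 1.8's iconv
standard library (the strings are just examples):

  require 'iconv'

  ascii  = "hello"       # pure ASCII, all bytes below 128
  latin1 = "caf\xE9"     # "cafe" with an e-acute, ISO 8859-1 encoded

  puts Iconv.conv("UTF-8", "ISO-8859-1", ascii)  == ascii   # true: identical bytes
  puts Iconv.conv("UTF-8", "ISO-8859-1", latin1) == latin1  # false: 0xE9 becomes 2 bytes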

Since I may have missed the original message.... What is the problem again?

TerryP.

···


Hmm... you should consider converting it to utf-8 via iconv.
There is a gem for iconv
This will keep your data intact, but you might need to convert it back to utf-16 later.

I believe filenames on Windows are actually UTF-16 (NTFS stores them that way),
and file contents are sometimes written in UTF-16 as well (Notepad's "Unicode" format, for example).

Could be wrong on this...
but test it and see!
Try to open a file with non-ASCII characters in irb and see what happens.
If it fails, no harm done.
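
Something along these lines in irb, say (the filename and its contents
are only an example):

  # Read a file whose name contains non-ASCII characters:
  data = File.read("日本語.txt")
  puts data.length                 # byte count in Ruby 1.8

  # If the contents turn out to be UTF-16, convert them with iconv
  # (plain "UTF-16" uses the BOM, when present, to pick endianness):
  require 'iconv'
  utf8 = Iconv.conv("UTF-8", "UTF-16", data)
  back = Iconv.conv("UTF-16", "UTF-8", utf8)   # convert back later if needed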

···

On Sep 29, 2007, at 9:47 PM, Felipe Contreras wrote:

Hi,

On 9/29/07, John Joyce <dangerwillrobinsondanger@gmail.com> wrote:

Yes but what about stuff already encoded in UTF-16?

That's why I said read up on unicode!
After you read that stuff you'll understand why it's no problem.
I'm not going to explain it. Many people understand it, but might
make mistakes when explaining it.
Read the unicode stuff carefully. It's vital for many things.

The only thing you might run into is BOM or Endian-ness, but it's
doubtful it will be an issue in most cases.

This might get you started.
FAQ - UTF-8, UTF-16, UTF-32 & BOM

Even Joel Spolsky wrote a brief bit on unicode... mostly trumpeting
how programmers need to know it and how few actually do.
The short version is that UTF-16 is basically wasteful. It uses 2
bytes for lower-level code-points (the stuff also known as ASCII
range) where UTF-8 does not.

As you suggested I read the article:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) – Joel on Software

I didn't find anything new. It's just explaining character sets in a
rather non-specific way. ASCII uses 8 bits, so it can store 256
characters, so it can't store all the characters in the world, so
other character sets are needed (really? I would have never guessed
that). UTF-16 basically stores characters in 2 bytes (which covers
more of the world's characters), while UTF-8 also allows more
characters but doesn't necessarily need 2 bytes: it uses 1, and if the
character is beyond 127 it uses 2 or more. This whole thing can be
extended up to 6 bytes.
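
In Ruby terms that variable length looks something like this (a tiny
sketch; the codepoints are arbitrary examples):

  # Bytes needed per codepoint when packed as UTF-8:
  [0x41, 0xE9, 0x65E5, 0x1F600].each do |cp|
    bytes = [cp].pack("U").unpack("C*")
    puts "U+%04X -> %d byte(s)" % [cp, bytes.size]
  end
  # => 1, 2, 3 and 4 bytes respectively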

So what exactly am I looking for here?

You really need to spend an afternoon reading about unicode. It
should be required in any computer science program as part of an
encoding course; Americans in particular are often the ones who know
the least about it....

What is there to know about Unicode? There's a couple of character
sets, use UTF-8, and remember that one character != one byte. Is there
anything else for practical purposes?

I'm sorry if I'm being rude, but I really don't like when people tell
me to read stuff I already know.

My question is still there:

Let's say I want to rename a file "fooobar", and remove the third "o",
but it's UTF-16, and Ruby only supports UTF-8, so I remove the "o" and
of course there will still be a 0x00 in there. That's if the string is
recognized at all.

Why is there no issue with UTF-16 if only UTF-8 is supported?

I don't mind reading some more if I can actually find the answer.

Best regards.

-- Felipe Contreras

Hi,

> >
> > Yes but what about stuff already encoded in UTF-16?
>
> That's why I said read up on unicode!
> After you read that stuff you'll understand why it's no problem.
> I'm not going to explain it. Many people understand it, but might
> make mistakes when explaining it.
> Read the unicode stuff carefully. It's vital for many things.
>
> The only thing you might run into is BOM or Endian-ness, but it's
> doubtful it will be an issue in most cases.
>
> This might get you started.
> FAQ - UTF-8, UTF-16, UTF-32 & BOM
>
>
> Even Joel Spolsky wrote a brief bit on unicode... mostly trumpeting
> how programmers need to know it and how few actually do.
> The short version is that UTF-16 is basically wasteful. It uses 2
> bytes for lower-level code-points (the stuff also known as ASCII
> range) where UTF-8 does not.

As you suggested I read the article:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) – Joel on Software

I didn't find anything new. It's just explaining character sets in a
rather non-specific way. ASCII uses 8 bits, so it can store 256
characters, so it can't store all the characters in the world, so
other character sets are needed (really? I would have never guessed
that). UTF-16 basically stores characters in 2 bytes (which covers
more of the world's characters), while UTF-8 also allows more
characters but doesn't necessarily need 2 bytes: it uses 1, and if the
character is beyond 127 it uses 2 or more. This whole thing can be
extended up to 6 bytes.

So what exactly am I looking for here?

UTF-8 and UTF-16 are pretty much the same. They encode a single
character using one or more units, where these units are 8-bit or
16-bit respectively. The only thing you buy by converting to UTF-16 is
space efficiency for codepoints that need close to 16 bits to
represent (such as Japanese characters), and you take on endianness
issues in exchange. Note that some characters may (and some must) be
composed of multiple codepoints (a base character codepoint plus
additional combining accent codepoint(s)).
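
For example (a small sketch; unpack("U*") lists the codepoints of a
UTF-8 string, and the byte sequences below are just illustrations):

  # "e with acute" as one precomposed codepoint vs. two combining ones:
  precomposed = "\xC3\xA9"      # U+00E9
  combined    = "e\xCC\x81"     # U+0065 followed by U+0301 (combining acute)

  p precomposed.unpack("U*")    # => [233]
  p combined.unpack("U*")       # => [101, 769]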

> You really need to spend an afternoon reading about unicode. It
> should be required in any computer science program as part of an
> encoding course; Americans in particular are often the ones who know
> the least about it....

What is there to know about Unicode? There's a couple of character
sets, use UTF-8, and remember that one character != one byte. Is there
anything else for practical purposes?

I'm sorry if I'm being rude, but I really don't like when people tell
me to read stuff I already know.

My question is still there:

Let's say I want to rename a file "fooobar", and remove the third "o",
but it's UTF-16, and Ruby only supports UTF-8, so I remove the "o" and
of course there will still be a 0x00 in there. That's if the string is
recognized at all.

Why is there no issue with UTF-16 if only UTF-8 is supported?

If you handle UTF-16 as something else, you break it regardless of the
language support. If you know (or have a way to find out) that it's
UTF-16, you can convert it. If there is no way to find out, all
language support is moot.
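
Roughly, with the iconv standard library (a sketch against Ruby 1.8;
the utf16? helper is just one illustrative way to "find out", by
checking for a byte order mark):

  require 'iconv'

  def utf16?(bytes)
    bytes[0, 2] == "\xFF\xFE" || bytes[0, 2] == "\xFE\xFF"
  end

  name_utf16 = Iconv.conv("UTF-16LE", "UTF-8", "fooobar")

  # Editing the UTF-16 bytes as if they were UTF-8/ASCII leaves a stray 0x00:
  broken = name_utf16.sub("o", "")

  # Converting first keeps the string intact:
  name_utf8  = Iconv.conv("UTF-8", "UTF-16LE", name_utf16)
  fixed      = name_utf8.sub("ooo", "oo")             # drop the third "o"
  back_utf16 = Iconv.conv("UTF-16LE", "UTF-8", fixed)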

Thanks
Michal

···

On 30/09/2007, Felipe Contreras <felipe.contreras@gmail.com> wrote:

On 9/29/07, John Joyce <dangerwillrobinsondanger@gmail.com> wrote:

The iconv library is a standard library shipped with Ruby.

James Edward Gray II

···

On Sep 30, 2007, at 12:22 AM, John Joyce wrote:

There is a gem for iconv

Sure enough!
Just got so used to requiring rubygems with nearly everything...

···

On Oct 1, 2007, at 9:08 AM, James Edward Gray II wrote:

On Sep 30, 2007, at 12:22 AM, John Joyce wrote:

There is a gem for iconv

The iconv library is a standard library shipped with Ruby.

James Edward Gray II