R1.9 mixed encoding in file

Hello

I wonder whether it is possible to enforce the encoding of a string in Ruby 1.9.
Let's say I have the following example:

C:\enc>echo p 'test'.encoding > encoding.rb
C:\enc>ruby encoding.rb
#<Encoding:US-ASCII>

That's fine. But what if I would like to have ASCII, UTF-8, or strings in
other encodings in a single file, e.g.

C:\enc>echo p 'zufällige_žluťoučký'.encoding > encoding.rb
C:\enc>ruby encoding.rb
encoding.rb:1: invalid multibyte char (US-ASCII)

I know that for this particular case I could use a directive at the top of the
file, but I would like to see something along these lines:

String.new 'zufällige_žluťoučký', Encoding.CP852

That is, read the content between the quotes as binary and interpret it
according to the specified encoding.

Vit

--
Posted via http://www.ruby-forum.com/.

Hello.

The problem with an idea like this is that before your String is ever created the code to create it must be read (correctly) by Ruby's parser and formed into a proper String literal. That would be impossible to do if String literals could be in any random Encoding.

You have a couple of options though:

* Just set an Encoding like UTF-8 for the source code, enter everything in UTF-8, and transcode it into the needed Encoding. This would make your example something like:

   # encoding: UTF-8
   cp852 = "zufällige_žluťoučký".encode("CP852") # literal in UTF-8

* Have one or more data files the program reads needed String objects from. Those files can be in any Encoding you need and you can specify it to IO operations, so your String objects are returned with that Encoding.
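
A minimal sketch of both options, assuming a Ruby recent enough to have `Tempfile.create`. The temporary file only stands in for a real CP852 data file, and the variable names are made up for illustration:

```ruby
# encoding: UTF-8
require "tempfile"

# Option 1: write the literal in UTF-8 and transcode it.
utf8  = "zufällige_žluťoučký"
cp852 = utf8.encode("CP852")          # new String, bytes converted to CP852

# Option 2: keep the data in a separate file and name its encoding when
# reading it. The temp file here is a stand-in for a real data file.
from_file = Tempfile.create("cp852_data") do |f|
  File.binwrite(f.path, cp852)        # raw CP852 bytes on disk
  File.read(f.path, encoding: "CP852")
end

p cp852.encoding
p from_file.encoding
p from_file.encode("UTF-8") == utf8   # round trip back to the UTF-8 literal
```

Either way the source file itself stays in one encoding; only the resulting String objects carry CP852.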

I hope that helps.

James Edward Gray II

···

On Aug 7, 2009, at 8:49 AM, Vít Ondruch wrote:

You seem to be asking for the ability to have individual string
literals have encoding different from that of the program as a whole.
Why not this:

#encoding: ascii-8bit
'zufällige_žluťoučký'.force_encoding 'cp852'
'some utf8 data'.force_encoding 'utf-8'
'some sjis data'.force_encoding 'sjis'

I am far from an expert on encodings, but in my (admittedly minimalist
and perhaps inadequate) testing, this seems to basically work.

There are going to be holes in this; data in non-ASCII-compatible
encodings in particular may give trouble. However, if the string data
does not contain the bytes 0x27 (ASCII ') or 0x5C (ASCII \), there will
be no problem. Whether that holds for a given encoding and the data you
need to represent in it cannot be said in general, but it is surely
very often the case. If it's the single quote
character that causes the problem, you can switch to a different
delimiter using the %q quote syntax. In extremis, a single-quoted
here document may be called for:

  <<-'end'
    lotsa ' and \ here, but ruby don't care
  end

This form of string has the advantage of having no special characters
at all, and you can choose the sequence of bytes that makes up the
string terminator to be anything you want. (but you do end up with an
extra (ascii) newline at the end...)
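
A small sketch of the two quoting forms mentioned above. The terminator `END_OF_DATA` and the sample text are arbitrary choices for illustration; in the scenario discussed here the body would hold raw CP852 bytes rather than ASCII:

```ruby
# Single-quoted heredoc: no escapes are processed, so ' and \ are just data.
raw = <<-'END_OF_DATA'
it's a \n here, and ruby doesn't care
END_OF_DATA

# %q with a different delimiter also lets a single quote through untouched.
alt = %q{it's a \n here}

p raw.end_with?("\n")                      # the heredoc adds a trailing newline
tagged = raw.chomp.force_encoding("CP852") # drop the newline, then relabel
p tagged.encoding
p alt
```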

Another challenge will be editing this file. There's no editor out
there that could actually display this kind of thing correctly; you'll
have to become proficient at editing it as binary, or at least find an
editor that can tolerate arbitrary binary chars in its ASCII.

···

On 8/7/09, Vít Ondruch <v.ondruch@tiscali.cz> wrote:

Vít Ondruch wrote:

I know that for this particular case I could use directive on top of the
file, but I would like to see something in following manner:

String.new 'zufällige_žluťoučký', Encoding.CP852

It's not pretty, but

   str = "zuf\x84llige_\xA7lu\x9Cou\x9Fk\xEC".force_encoding("CP852")

will probably do the job.
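
To make the difference between relabelling and transcoding concrete, here is a short sketch using the same bytes. The variable names are illustrative, and `.dup` is used only so that no string literal is mutated in place:

```ruby
# encoding: UTF-8
bytes = "zuf\x84llige_\xA7lu\x9Cou\x9Fk\xEC".b  # binary (ASCII-8BIT) copy

labelled = bytes.dup.force_encoding("CP852")    # same bytes, new label
as_utf8  = labelled.encode("UTF-8")             # new bytes, same text

p labelled.bytes == bytes.bytes   # force_encoding did not touch the data
p labelled.valid_encoding?
p as_utf8
```

`force_encoding` only changes how Ruby interprets the bytes, while `encode` produces a converted copy.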

James Gray wrote:

The problem with an idea like this is that before your String is ever
created the code to create it must be read (correctly) by Ruby's
parser and formed into a proper String literal. That would be
impossible to do if String literals could be in any random Encoding.

Yes, I understand that you have to parse the file. However, if I am
right, you still have to read the file as binary while you are looking
for an encoding directive at the top of the file. So from my point of view, it
shouldn't be a big problem to read up to the first quote, assuming the file is
stored in the encoding declared at the top of the file, then read whatever
is between the quotes as binary and decide later how to interpret that
binary data, using the encoding given as the second parameter of the string
constructor.

You have a couple of options though:

* Just set an Encoding like UTF-8 for the source code, enter
everything in UTF-8, and transcode it into the needed Encoding. This
would make your example something like:

   # encoding: UTF-8
   cp852 = "zufällige_žluťoučký".encode("CP852") # literal in UTF-8

* Have one or more data files the program reads needed String objects
from. Those files can be in any Encoding you need and you can specify
it to IO operations, so your String objects are returned with that
Encoding.

Both of your suggestions are valid, of course, but I consider them
far from ideal. They bring far more complexity than desired.

I hope that helps.

James Edward Gray II

Of course my idea could be considered naive, and there might be many
technical issues with the parser, etc. that prevent its implementation.
Nevertheless, it would be a nice feature.

Thank you for your suggestions anyway.

Vit

···

On Aug 7, 2009, at 8:49 AM, Vít Ondruch wrote:

Caleb Clausen wrote:

file, but I would like to see something in following manner:

String.new 'zufällige_žluťoučký', Encoding.CP852

You seem to be asking for the ability to have individual string
literals have encoding different from that of the program as a whole.
Why not this:

#encoding: ascii-8bit
'zufällige_žluťoučký'.force_encoding 'cp852'
'some utf8 data'.force_encoding 'utf-8'
'some sjis data'.force_encoding 'sjis'

Hmmm, that is a good idea!!!

Which leads me to the question: why is the default encoding US-ASCII
instead of ASCII-8BIT?

Another challenge will be editing this file. There's no editor out
there that could actually display this kind of thing correctly; you'll
have to become proficient at editing it as binary, or at least find an
editor that can tolerate arbitrary binary chars in its ASCII.

It's almost the same challenge as editing a single file in an encoding
different from your system encoding, so it's not relevant. On the
contrary, it could be even easier, because in my case I don't
care much about the content; I just need several encodings for testing.

···

On 8/7/09, Vít Ondruch <v.ondruch@tiscali.cz> wrote:

You don't really have to:

$ cat source_encoding.rb
# encoding: UTF-8

output = ""
open(__FILE__, "r:US-ASCII") do |source|
   first_line = source.gets
   if first_line =~ /coding:\s*(\S+)/
     source.set_encoding($1)
   else
     output << first_line
   end
   output << source.read
end
p [output.encoding, output[0...20] + "…"]
$ ruby_dev source_encoding.rb
[#<Encoding:UTF-8>, "\noutput = \"\"\nopen(__…"]

James Edward Gray II

···

On Aug 7, 2009, at 9:47 AM, Vít Ondruch wrote:

James Gray wrote:

···

On Aug 7, 2009, at 9:47 AM, Vít Ondruch wrote:

You don't really have to:

It is disturbing that this approach will fail as soon as the file is
UTF-16 encoded, or has a BOM for UTF-8, etc.

Vit

You are not allowed to set the source encoding to a non-ASCII compatible encoding, if memory serves. That eliminates any issues with encodings like UTF-16. This makes perfect sense as there's no way to reliably support the magic encoding comment unless we can count on being able to read at least that far.

A BOM could be handled similarly to what I showed. You need to open the file in ASCII-8BIT and check the beginning bytes, then you could switch to US-ASCII and finish reading the first line (or through the second if a shebang line is included), then switch encodings again if needed and finish processing.
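
A rough sketch of those steps, simplified to read all the bytes first and label them afterwards. `read_source` and its `default:` parameter are hypothetical names, the magic-comment regex is a loose approximation of what Ruby's parser accepts, and only the UTF-8 BOM is handled:

```ruby
# encoding: UTF-8
require "tempfile"

# Hypothetical helper: returns the file's text tagged with the detected encoding.
def read_source(path, default: "US-ASCII")
  bom     = "\xEF\xBB\xBF".b                 # UTF-8 BOM as raw bytes
  bytes   = File.binread(path)               # ASCII-8BIT, nothing interpreted yet
  had_bom = bytes.start_with?(bom)
  bytes   = bytes.byteslice(bom.bytesize, bytes.bytesize) if had_bom

  # Look for a magic comment on the first line, or the second after a shebang.
  name = nil
  bytes.each_line.first(2).each do |line|
    name ||= line[/\A#.*?coding[:=]\s*([\w.-]+)/, 1]
  end
  name ||= had_bom ? "UTF-8" : default
  bytes.force_encoding(name)                 # relabel; the bytes are unchanged
end

with_bom = Tempfile.create("src") do |f|
  File.binwrite(f.path, "\xEF\xBB\xBF# encoding: UTF-8\np 'ž'\n")
  read_source(f.path)
end

with_shebang = Tempfile.create("src") do |f|
  File.binwrite(f.path, "#!/usr/bin/env ruby\n# coding: cp852\np 'x'\n")
  read_source(f.path)
end

p with_bom.encoding
p with_shebang.encoding
```

Ruby's IO layer also accepts a `BOM|UTF-8` prefix in the mode string (for example `"r:BOM|UTF-8"`), which skips a leading BOM without the manual check.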

James Edward Gray II

···

On Aug 7, 2009, at 10:20 AM, Vít Ondruch wrote:

James Gray wrote:

You are not allowed to set the source encoding to a non-ASCII
compatible encoding, if memory serves.

Where is it documented please?

That eliminates any issues
with encodings like UTF-16. This makes perfect sense as there's no
way to reliably support the magic encoding comment unless we can count
on being able to read at least that far.

It should be said that XML parsers can handle such cases, i.e. when the XML
header is in different encoding than the rest of document.

A BOM could be handled similarly to what I showed. You need to open
the file in ASCII-8BIT and check the beginning bytes, then you could
switch to US-ASCII and finish reading the first line (or through the second
if a shebang line is included), then switch encodings again if needed
and finish processing.

Maybe this technique could be used for reading UTF-16 encoded files, if
needed? However, this is too far from my initial post :)

James Edward Gray II

Vit

···

You are not allowed to set the source encoding to a non-ASCII
compatible encoding, if memory serves.

Where is it documented please?

I'm not sure it's officially documented yet.

Ruby does throw an error in this scenario though:

$ ruby_dev
# encoding: UTF-16BE
ruby_dev: UTF-16BE is not ASCII compatible (ArgumentError)

and:

$ ruby_dev -e 'puts "\uFEFF# encoding: UTF-16BE".encode("UTF-16BE")' | ruby_dev
-:1: invalid multibyte char (UTF-8)

I believe this is the relevant code from Ruby's parser:

static void
parser_set_encode(struct parser_params *parser, const char *name)
{
    int idx = rb_enc_find_index(name);
    rb_encoding *enc;

    if (idx < 0) {
        rb_raise(rb_eArgError, "unknown encoding name: %s", name);
    }
    enc = rb_enc_from_index(idx);
    if (!rb_enc_asciicompat(enc)) {
        rb_raise(rb_eArgError, "%s is not ASCII compatible", rb_enc_name(enc));
    }
    parser->enc = enc;
}
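
The same check can be mirrored from Ruby code. This is only a sketch of the logic, not what the parser itself runs, and `check_source_encoding` is a made-up name:

```ruby
# Hypothetical Ruby-level equivalent of the C check above.
def check_source_encoding(name)
  enc = Encoding.find(name)   # raises ArgumentError for an unknown name
  unless enc.ascii_compatible?
    raise ArgumentError, "#{enc.name} is not ASCII compatible"
  end
  enc
end

p check_source_encoding("UTF-8")

begin
  check_source_encoding("UTF-16BE")
rescue ArgumentError => e
  puts e.message              # prints "UTF-16BE is not ASCII compatible"
end
```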

That eliminates any issues
with encodings like UTF-16. This makes perfect sense as there's no
way to reliably support the magic encoding comment unless we can count
on being able to read at least that far.

It should be said that XML parsers can handle such cases, i.e. when the XML
header is in different encoding than the rest of document.

I doubt we can say that universally. :)

Also, what you said isn't very accurate. For example, "in different encoding than the rest of document" is not a possible occurrence according to the XML 1.1 specification (http://www.w3.org/TR/2006/REC-xml11-20060816/), which states:

"It is a fatal error when an XML processor encounters an entity with an encoding that it is unable to process. It is a fatal error if an XML entity is determined (via default, encoding declaration, or higher-level protocol) to be in a certain encoding but contains byte sequences that are not legal in that encoding."

All XML parsers are required to assume UTF-8 unless told otherwise and to be able to recognize UTF-16 by a required BOM. Beyond that, they are not required to recognize any other encodings, though they may of course. Their encoding declaration can be expressed in ASCII and, since they assume UTF-8 by default, this is similar to what Ruby does. It allows a switch to an ASCII-compatible encoding.

XML processors may do more. For example, they can accept a different encoding from an external source to support things like HTTP headers and MIME types. Ruby doesn't really have access to such sources at execution time, so that option doesn't apply to the case we are discussing. However, XML processors may also recognize other BOMs, and Ruby could do this.

A BOM could be handled similarly to what I showed. You need to open
the file in ASCII-8BIT and check the beginning bytes, then you could
switch to US-ASCII and finish reading the first line (or through the second
if a shebang line is included), then switch encodings again if needed
and finish processing.

Maybe this technique could be used for reading UTF-16 encoded files, if
needed?

Yes, Ruby could recognize BOMs for non-ASCII-compatible encodings to support them. A BOM would be required in this case though, just as it is in an XML processor that doesn't have external information.

Ruby doesn't currently do this, as near as I can tell.

Note that this would not give what you proposed in your initial message: multiple encodings in the same file. Ruby doesn't support that and isn't ever likely to. An XML processor that supports such things is in violation of its specification as I understand it.

Besides, not many text editors that I'm aware of make it super easy to edit in multiple encodings. :)

James Edward Gray II

···

On Aug 7, 2009, at 10:41 AM, Vít Ondruch wrote: