File.new and encoding

Hi,

I'm still quite new to ruby, but I have written a simple code generator. The generator opens some files and combines them into a new one. The resulting file is encoded as iso-8859-1, but it looks like ruby writes a UTF-8 marker to the beginning of the file. Is that possible?

How can I tell ruby which encoding to use when I write to text files?

Any pointers to documentation are welcome; I didn't find anything useful using google.

regards,
Achim

Achim Domma (SyynX Solutions GmbH) wrote:

> Hi,
>
> I'm still quite new to ruby, but have written a simple code generator.
> The generator opens some files and combines them to a new one. The
> resulting file is encoded as iso-8859-1, but it looks like ruby writes
> a UTF-8 marker to the beginning of the file. Is that possible?

What's a UTF-8 marker? I only know the two-byte UTF-16 marker, but AFAIK
there is no marker for UTF-8. Did I miss something?

> How can I tell ruby which encoding to use, if I write to textfiles?
>
> Any pointers to documentation are welcome, but I didn't find
> something useful using google.

Encoding is not an easy issue with ruby. I guess by default it uses the
default encoding of your environment, but you can specify certain
(Japanese) encodings with the command line option -K. HTH
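
For readers on a newer ruby: from 1.9 on, the encoding can be named directly in the mode string given to File.open. The following is only a minimal sketch under that assumption; the helper name write_latin1 is made up for illustration and is not part of ruby.

```ruby
# Hypothetical helper (not a ruby built-in): write `text` to `path`
# encoded as ISO-8859-1.
# Assumes Ruby 1.9 or later. The "w:ISO-8859-1" mode sets the file's
# external encoding, so strings are transcoded as they are written.
def write_latin1(path, text)
  File.open(path, "w:ISO-8859-1") do |f|
    f.write(text)
  end
end
```

Characters that have no ISO-8859-1 representation raise an Encoding::UndefinedConversionError on write, so the input has to fit into that character set.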

Kind regards

    robert

Hi,

At Wed, 30 Nov 2005 00:17:29 +0900,
Robert Klemme wrote in [ruby-talk:167988]:

> I'm still quite new to ruby, but have written a simple code generator.
> The generator opens some files and combines them to a new one. The
> resulting file is encoded as iso-8859-1, but it looks like ruby writes
> a UTF-8 marker to the beginning of the file. Is that possible?

> What's a UTF-8 marker? I only know the two-byte UTF-16 marker, but AFAIK
> there is no marker for UTF-8. Did I miss something?

It would be the UTF-8 encoded BOM, but ruby itself never writes it
automatically.
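
The UTF-8 encoding of the BOM (U+FEFF) is the three bytes EF BB BF, so a file can be checked for it by looking at its first three bytes. A minimal sketch, assuming Ruby 2.0 or later for String#b; the constant and the helper name utf8_bom? are hypothetical:

```ruby
# The byte order mark U+FEFF, encoded as UTF-8, is the bytes EF BB BF.
UTF8_BOM = "\xEF\xBB\xBF".b

# Hypothetical helper: true if the file at `path` starts with a UTF-8 BOM.
# The file is opened in binary mode so no decoding takes place.
def utf8_bom?(path)
  File.open(path, "rb") { |f| f.read(3) } == UTF8_BOM
end
```

For an empty file, read(3) returns nil and the comparison is simply false.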

> How can I tell ruby which encoding to use, if I write to textfiles?

Can't you show the code?

--
Nobu Nakada

nobu@ruby-lang.org wrote:

> It would be the UTF-8 encoded BOM, but ruby itself never writes it
> automatically.

[...]

> Can't you show the code?

While trying to reproduce the problem in a smaller example, I figured out that I'm reading the BOM from one of my source files. Sorry for the confusion. I'm doing something like:

File.open("target", "w") do |target|
  File.open("source", "r") do |source|
    source.each_line do |line|
      # ... some processing ...
      target.write(line)
    end
  end
end

The source file seems to contain the BOM, and it is written to the target. Any hints on how to strip the BOM?

regards,
Achim

> I'm doing something like:
>
> File.open("target", "w") do |target|
>   File.open("source", "r") do |source|
>     source.each_line do |line|
>       # ... some processing ...
>       target.write(line)
>     end
>   end
> end

Have you looked at 'iconv' in the standard library?

http://www.ruby-doc.org/stdlib/libdoc/iconv/rdoc/classes/Iconv.html

Assuming all your input files are ISO-8859-1 and you want your output file in UTF-8, your example might look something like this (untested):

require 'iconv'

File.open("target", "w") do |target|
  # Iconv.open takes the target encoding first, then the source encoding
  Iconv.open('UTF-8', 'ISO-8859-1') do |converter|
    File.open("source", "r") do |source|
      source.each_line do |line|
        # ... some processing ...
        target.write(converter.iconv(line))
      end
    end
    # flush any pending conversion state
    target << converter.iconv(nil)
  end
end

Iconv should deal with BOMs, stripping them out or adding them in where necessary. I'm not sure whether it will complain if it finds a BOM mid-stream (as you open your second and subsequent input files); if so, you could just instantiate a new Iconv to deal with each input.
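
If Iconv is not available, the same idea can be sketched with ruby's built-in I/O encodings instead (a different technique from the Iconv example above). This assumes a ruby recent enough to support the "BOM|UTF-8" read mode, UTF-8 encoded sources, and an ISO-8859-1 target as in the original question; the helper name combine_without_bom is made up for illustration.

```ruby
# Hypothetical helper: append every file in `source_paths` to
# `target_path`, writing ISO-8859-1 and dropping a leading UTF-8 BOM
# from each source.
# "r:BOM|UTF-8" skips a BOM at the start of the file if one is present;
# "w:ISO-8859-1" transcodes each line as it is written.
def combine_without_bom(source_paths, target_path)
  File.open(target_path, "w:ISO-8859-1") do |target|
    source_paths.each do |path|
      File.open(path, "r:BOM|UTF-8") do |source|
        source.each_line do |line|
          # ... some processing ...
          target.write(line)
        end
      end
    end
  end
end
```

Because every source is opened separately, a BOM in the second or a later file is dropped as well and never shows up mid-stream in the target.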

HTH
alex