Windows encoding: IBM437 to UTF-8

Hey y'all,

I used R to convert a pdf to a text document. I'm using Ruby to parse
through the text document and create a CSV document. When I converted the
pdf to text, I also replaced special characters (Ñ, Ó, Í) with their
non-accented equivalents (N, O, I). When I open the text document in
Notepad, everything seems to have been replaced correctly; however, when I
read the text document using Ruby on the command line, the non-accented
equivalents cannot be read.

Ruby on my machine encodes strings as IBM437. ('string'.encoding)

When I click save-as on the text document in notepad, it suggests that the
encoding is ANSI.

What can I do so that Ruby reads the text document properly? This is the
first time I have ever run into an encoding problem and I am really
confused. Any help or sugestións would be awesome.

Cheers,
Ben

I used this code several years back, to help me produce text with a portable format. I omitted everything except the relevant code, a module variable and module method. Maybe it will give you some ideas?

module JageunSpeak

     @@utf8 = Encoding::Converter.new(
         'binary',
         'utf-8',
         { :invalid=>:replace, :undef=>:replace, :replace=>'' }
     )

     def self.convert_utf8 text
         @@utf8.convert text
     end

end

···

On 09/22/2016 10:35 AM, Ben wrote:


James has blogged about the topic in the past - I found it quite
helpful at the time:

Kind regards

robert

···

On Thu, Sep 22, 2016 at 5:35 PM, Ben <bortiz1988@gmail.com> wrote:

--
[guy, jim, charlie].each {|him| remember.him do |as, often| as.you_can
- without end}
http://blog.rubybestpractices.com/

Hi,

What can I do so that Ruby reads the text document properly?

If this means reading text from a file, here is one way to do it.

> copy con test.txt
abc
^Z

> irb
irb(main):001:0> open('test.txt', 'r:ibm437', &:read)
=> "abc\n"
irb(main):002:0>

If we specify just 'r', it means 'r:utf-8' by default.

irb(main):002:0> open('test.txt', 'r', &:read).encoding
=> #<Encoding:UTF-8>

We may be able to use the -E option to change the default.
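Before changing anything, it can help to see which defaults are actually in effect. The following is only a small sketch; the helper name `encoding_defaults` is made up for illustration, and the values it prints depend on the machine and console, so no expected output is shown.

```ruby
# Collect the encodings Ruby falls back on when none is given explicitly.
# `encoding_defaults` is a hypothetical helper name, not part of Ruby.
def encoding_defaults
  {
    external: Encoding.default_external, # assumed for file contents when reading
    internal: Encoding.default_internal, # nil means no automatic transcoding
    locale:   Encoding.find('locale')    # what the environment reports
  }
end

encoding_defaults.each { |name, enc| puts "#{name}: #{enc.inspect}" }
```

The -E option takes the form `ruby -E external:internal script.rb` and sets the first two values for the whole script.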

···

On 2016/09/23 0:35, Ben wrote:

--
Toshi

irb

irb(main):001:0> open('test.txt', 'r:ibm437', &:read)
=> "abc\n"
irb(main):002:0>

Or do the explicit

File.open("x.c", "r", external_encoding: "UTF-8") do |f|
  p f.external_encoding
end

If we specify just 'r', it means 'r:utf-8' by default.

That is not true in the general case. This depends on the environment settings:

$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_ALL=
$ ruby -e 'File.open("x.c"){|f| p f.external_encoding}'
#<Encoding:UTF-8>

$ LANG=en_US.ISO8859-7 locale
LANG=en_US.ISO8859-7
LC_CTYPE="en_US.ISO8859-7"
LC_NUMERIC="en_US.ISO8859-7"
LC_TIME="en_US.ISO8859-7"
LC_COLLATE="en_US.ISO8859-7"
LC_MONETARY="en_US.ISO8859-7"
LC_MESSAGES="en_US.ISO8859-7"
LC_ALL=
$ LANG=en_US.ISO8859-7 !ruby
LANG=en_US.ISO8859-7 ruby -e 'File.open("x.c"){|f| p f.external_encoding}'
#<Encoding:CP850>

We may be able to use -E option to change the default.

If you know the particular encoding of a file, IMO it is best to
explicitly give the encoding on opening of the file rather than
changing it for the whole script.
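As a rough sketch of that advice: the example below assumes the file Notepad calls "ANSI" is Windows-1252, which is common on Western European and US Windows setups but is only a guess for Ben's machine. The helper name `read_as_utf8` and the sample bytes are made up for illustration.

```ruby
require 'tempfile'

# Read a file whose bytes are in a known encoding and hand back UTF-8 text.
# "external:internal" tells Ruby how the bytes are stored and what to
# transcode them to while reading.
def read_as_utf8(path, source_encoding)
  File.read(path, encoding: "#{source_encoding}:UTF-8")
end

# Demo: write raw Windows-1252 bytes (0xD1 is "Ñ", 0xFA is "ú"), then read
# them back with the encoding stated explicitly.
sample_text = Tempfile.create('ansi_sample') do |f|
  f.binmode
  f.write("\xD1and\xFA\n".b)
  f.flush
  read_as_utf8(f.path, 'Windows-1252')
end

puts sample_text.encoding
puts sample_text.valid_encoding?
```

Because the encoding is stated at the call site, the result does not depend on the locale or on any -E setting.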

Kind regards

robert

···

On Thu, Sep 22, 2016 at 11:18 PM, Toshihiko Ichida <dogatana@gmail.com> wrote:

Hi,

If we specify just 'r', it means 'r:utf-8' by default.

That is not true in the general case. This depends on the environment settings:

Yes.

I presumed the OS was Windows as soon as I saw IBM437,
but I may have misunderstood.

If you know the particular encoding of a file, IMO it is best to
explicitly give the encoding on opening of the file rather than
changing it for the whole script.

I fully agree with that and am always doing so to avoid
exceptions like Encoding::UndefinedConversionError and
Encoding::CompatibilityError.
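For anyone who has not met these two exceptions, here is a minimal sketch of how each one arises. The helper `error_class_of` and the sample strings are made up for illustration.

```ruby
# The same letter stored two ways.
utf8_text = "\u00D1"                             # "Ñ" as UTF-8
oem_text  = String.new("\xA5", encoding: 'IBM437') # byte 0xA5 is "Ñ" in IBM437

# Run a block and return the class of the encoding error it raises, if any.
def error_class_of
  yield
  nil
rescue EncodingError => e
  e.class
end

# US-ASCII has no "Ñ", so transcoding to it fails.
undefined = error_class_of { utf8_text.encode('US-ASCII') }

# Joining non-ASCII strings that carry different encodings fails.
incompatible = error_class_of { utf8_text + oem_text }

# Transcoding one side first makes the join work.
fixed = utf8_text + oem_text.encode('UTF-8')

puts undefined
puts incompatible
puts fixed.encoding
```

Tagging strings with the right encoding when they enter the program, as described above, keeps both errors from surfacing later in unrelated code.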

···

On 2016/09/23 16:21, Robert Klemme wrote:

--
Toshi