Windows encoding: IBM437 to UTF-8

Ben6 · 22 September 2016 15:35

Hey y'all,

I used R to convert a PDF to a text document. I'm using Ruby to parse
through the text document and create a CSV document. When I converted the
PDF to text, I also replaced special characters (Ñ, Ó, Í) with their
non-accented equivalents (N, O, I). When I open the text document in
Notepad, it seems everything was replaced correctly; however, when I read
the text document using Ruby on the command line, the non-accented
equivalents cannot be read.

Ruby on my machine encodes strings as IBM437 (according to 'string'.encoding).

When I click Save As on the text document in Notepad, it suggests that the
encoding is ANSI.

What can I do so that Ruby reads the text document properly? This is the
first time I have ever run into an encoding problem and I am really
confused. Any help or sugestións would be awesome.

Cheers,
Ben

RRRoy_BBBean · 22 September 2016 17:57

I used this code several years back, to help me produce text with a portable format. I omitted everything except the relevant code, a module variable and module method. Maybe it will give you some ideas?

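module JageunSpeak

  # Shared converter from raw bytes ('binary') to UTF-8. Bytes that have
  # no UTF-8 mapping are replaced with '', i.e. silently dropped.
  @@utf8 = Encoding::Converter.new(
    'binary',
    'utf-8',
    invalid: :replace, undef: :replace, replace: ''
  )

  # Returns text converted to UTF-8 using the shared converter.
  def self.convert_utf8(text)
    @@utf8.convert(text)
  end

end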
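
To make the effect of those options concrete, here is a minimal sketch. It uses String#encode with the same options rather than Encoding::Converter, and the sample bytes are made up for illustration (they assume the file was saved as Windows-1252, which is what Notepad usually means by ANSI on a Western-European system). It suggests that a binary-to-UTF-8 conversion drops accented characters, while naming the real source encoding keeps them.

```ruby
# Hypothetical sample: "Señor" as an ANSI (Windows-1252) editor would store it.
raw = "Se\xF1or".b # .b tags the bytes as binary (ASCII-8BIT)

# Binary -> UTF-8 with replacement: the non-ASCII byte has no mapping and is dropped.
stripped = raw.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')

# Declaring the (assumed) real source encoding keeps the character instead.
decoded = raw.dup.force_encoding('Windows-1252').encode('UTF-8')

p stripped
p decoded
```

Which of the two is preferable depends on whether the accented characters are still wanted in the output.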

Robert_K1 · 22 September 2016 15:51

James has blogged about the topic in the past - I found it quite
helpful at the time:

Kind regards

robert

--
[guy, jim, charlie].each {|him| remember.him do |as, often| as.you_can
- without end}
http://blog.rubybestpractices.com/

Toshihiko_Ichida · 22 September 2016 21:18

Hi,

What can I do so that Ruby reads the text document properly?

If this means reading text from a file, here is one way to do it.

> copy con test.txt
abc
^Z

> irb
irb(main):001:0> open('test.txt', 'r:ibm437', &:read)
=> "abc\n"
irb(main):002:0>

If we specify just 'r', it means 'r:utf-8' by default.

irb(main):002:0> open('test.txt', 'r', &:read).encoding
=> #<Encoding:UTF-8>

We may be able to use -E option to change the default.
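
The mode string can also carry a second encoding, so the text is transcoded while it is read. The sketch below is only an illustration under assumptions: the temp file name and sample bytes are invented, and it assumes the original file is Windows-1252 (Notepad's ANSI) rather than IBM437. It reads the same bytes both ways to show why the choice of external encoding matters.

```ruby
require 'tmpdir'

# Hypothetical sample file: "AÑO" stored as Windows-1252 bytes.
path = File.join(Dir.tmpdir, "ansi_sample_#{Process.pid}.txt")
File.binwrite(path, "A\xD1O\n".b)

# 'r:external:internal' decodes the bytes as external and returns internal.
text  = File.open(path, 'r:Windows-1252:UTF-8', &:read)
wrong = File.open(path, 'r:IBM437:UTF-8', &:read)
File.delete(path)

p text          # the byte 0xD1 decoded as Windows-1252
p wrong         # the same byte decoded as IBM437 gives a different character
p text.encoding
```

Both reads succeed without an exception, which is why a wrong guess tends to show up as odd characters rather than as an error.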

--
Toshi

Robert_K1 · 23 September 2016 07:21

irb

irb(main):001:0> open('test.txt', 'r:ibm437', &:read)
=> "abc\n"
irb(main):002:0>

Or do the explicit

File.open("x.c", "r", external_encoding: "UTF-8") do |f|
  p f.external_encoding
end

If we specify just 'r', it means 'r:utf-8' by default.

That is not true in the general case. This depends on the environment settings:

$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_ALL=
$ ruby -e 'File.open("x.c"){|f| p f.external_encoding}'
#<Encoding:UTF-8>

$ LANG=en_US.ISO8859-7 locale
LANG=en_US.ISO8859-7
LC_CTYPE="en_US.ISO8859-7"
LC_NUMERIC="en_US.ISO8859-7"
LC_TIME="en_US.ISO8859-7"
LC_COLLATE="en_US.ISO8859-7"
LC_MONETARY="en_US.ISO8859-7"
LC_MESSAGES="en_US.ISO8859-7"
LC_ALL=
$ LANG=en_US.ISO8859-7 !ruby
LANG=en_US.ISO8859-7 ruby -e 'File.open("x.c"){|f| p f.external_encoding}'
#<Encoding:CP850>

We may be able to use -E option to change the default.

If you know the particular encoding of a file, IMO it is best to
explicitly give the encoding on opening of the file rather than
changing it for the whole script.
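
A small sketch of that difference, assuming nothing about the environment it runs in (the temp file name is invented for the example): a plain 'r' picks up whatever Encoding.default_external happens to be, while an explicit external_encoding stays the same everywhere.

```ruby
require 'tmpdir'

# Hypothetical scratch file, only used to inspect how it gets opened.
path = File.join(Dir.tmpdir, "enc_demo_#{Process.pid}.txt")
File.write(path, "abc\n")

# No encoding given: the file inherits the process-wide default.
default_enc = File.open(path, 'r') { |f| f.external_encoding }

# Encoding given per file: independent of locale and of the -E option.
explicit_enc = File.open(path, 'r', external_encoding: 'IBM437') { |f| f.external_encoding }
File.delete(path)

p Encoding.default_external
p default_enc
p explicit_enc
```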

Kind regards

robert

Toshihiko_Ichida · 23 September 2016 10:42

Hi,

If we specify just 'r', it means 'r:utf-8' by default.

That is not true in the general case. This depends on the environment settings:

Yes.

Although I presumed that the OS was Windows as soon as I saw IBM437,
I might have misunderstood.

If you know the particular encoding of a file, IMO it is best to
explicitly give the encoding on opening of the file rather than
changing it for the whole script.

I fully agree with that and always do so to avoid
exceptions like Encoding::UndefinedConversionError and
Encoding::CompatibilityError.
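
For reference, a minimal sketch of how those two exceptions can arise. The strings are made up for the example and Windows-1252 is only an assumed stand-in for an ANSI file.

```ruby
utf8 = "\u00D1"                                # "Ñ" as UTF-8
ansi = "\xD1".b.force_encoding('Windows-1252') # "Ñ" as a single ANSI byte

errors = []

begin
  utf8 + ansi # two non-ASCII strings in different encodings
rescue Encoding::CompatibilityError => e
  errors << e.class
end

begin
  "\xD1".b.encode('UTF-8') # binary byte with no UTF-8 mapping
rescue Encoding::UndefinedConversionError => e
  errors << e.class
end

# Transcoding to a common encoding first avoids both problems.
joined = utf8 + ansi.encode('UTF-8')

p errors
p joined
```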
