Trying to deal with an "invalid multibyte char (UTF-8)" issue

Leam_Hall · 30 March 2015 16:14

I'm trying to slurp through a file with Ruby 1.9.x and there are some
characters in it that choke my script. The character sequence I've found so
far is "â+â", and everything I try to do with it gives something like:

   irb(main):007:0> mystr = ['â+â']
   SyntaxError: (irb):7: invalid multibyte char (UTF-8)
   (irb):7: invalid multibyte char (UTF-8)
   (irb):7: syntax error, unexpected $end

The end goal is to wind up with these things being changed to either ASCII
or UTF-8. Either one should be useable.

Thoughts?

Leam

--
Mind on a Mission <http://leamhall.blogspot.com/>

Hassan_Schroeder · 30 March 2015 17:11

> I'm trying to slurp through a file with Ruby 1.9.x

Why? Ruby 1.9.3 is EOL, no longer supported [1].

> irb(main):007:0> mystr = ['â+â']

> Thoughts?

Aside from the oddness of "mystr" being an array, this works fine
in Ruby 2.2.1. What reason is there not to upgrade?

[1] Support for Ruby 1.9.3 has ended

--
Hassan Schroeder ------------------------ hassan.schroeder@gmail.com

twitter: @hassan
Consulting Availability : Silicon Valley or remote

Lazaro_Armando · 30 March 2015 16:56

You could use

"mystr.scrub"

and write the shebang and encoding comment this way:

#!/usr/bin/env ruby
#encoding: utf-8

The "encoding: utf-8" comment tells Ruby which encoding to use for the source file.
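A minimal sketch of both suggestions together. Note that String#scrub was only added in Ruby 2.1, so on 1.9.x it will not be available; the helper below (scrub_utf8, a hypothetical name, not something from this thread) falls back to a round trip through UTF-16BE in that case. The sample bytes 0xE2 0x2B 0xE2 are an assumption about what the problem sequence looks like on disk.

```ruby
# encoding: utf-8

# Hypothetical helper (not from the thread): replace invalid bytes in a
# string that is supposed to be UTF-8.
def scrub_utf8(str, replacement = '?')
  if str.respond_to?(:scrub)
    # Ruby 2.1 and later
    str.scrub(replacement)
  else
    # Ruby 1.9/2.0 fallback: transcode through UTF-16BE and back, because
    # encoding UTF-8 to UTF-8 is a no-op there and replaces nothing.
    str.encode('UTF-16BE', :invalid => :replace, :replace => replacement)
       .encode('UTF-8')
  end
end

# Assumed sample: the byte 0xE2 on either side of "+", which is not valid UTF-8.
bad = [0xE2, 0x2B, 0xE2].pack('C*').force_encoding('UTF-8')

puts bad.valid_encoding?
clean = scrub_utf8(bad)
puts clean.valid_encoding?
puts clean
```

Scrubbing throws the offending bytes away, so it only makes sense when losing those characters is acceptable.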

abinoam · 30 March 2015 16:30

Hi Leam Hall,

Do you have a line like the one below at the beginning of your file?

#coding: utf-8

Best regards,
Abinoam Jr.

Leam_Hall · 30 March 2015 17:49

This seems to be a common question. In this case, it is so the tool I'm
writing actually gets used. My tool is for non-programmers. They will
generally have neither the time nor inclination to use a tool that requires
them to maintain a separate Ruby version.

My tool supports a tool that embeds Ruby. Thus the only version of Ruby I
can guarantee is available is that packaged version. That lets me lower the
barrier to usage.

So, really, support is secondary to usefulness.

Leam

Besnik_Ruka · 30 March 2015 18:09

You haven't really given enough info to solve this.

What encoding is your source file? How certain are you that it is UTF-8 and
that it has always been UTF-8? It could've been something else that
someone then carelessly converted.

Have you tried using the Rails multibyte methods, if available?
http://api.rubyonrails.org/classes/String.html#method-i-mb_chars

Have you tried using the iconv library to convert to true UTF-8?

You can't test encoding issues in IRB. It's not the same as text coming
from an encoded file.
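One way to look at what the file really contains, instead of pasting characters into IRB, is to read it as raw bytes and ask Ruby whether each line is valid UTF-8. This is only a sketch: the temporary file and its contents (including the stray 0xE2 bytes) are made up for illustration and stand in for the real XML.

```ruby
require 'tempfile'

# Made-up sample data: one clean line and one line with stray 0xE2 bytes.
sample = Tempfile.new('sample')
sample.binmode
sample.write("<name>ok</name>\n")
sample.write([0x3C, 0xE2, 0x2B, 0xE2, 0x3E, 0x0A].pack('C*'))
sample.close

# Read in binary mode so Ruby does not interpret the bytes, then label each
# line as UTF-8 and check whether that label actually holds.
bad_lines = []
File.open(sample.path, 'rb') do |f|
  f.each_line.with_index(1) do |line, number|
    line.force_encoding('UTF-8')
    bad_lines << number unless line.valid_encoding?
  end
end

puts "lines with invalid UTF-8: #{bad_lines.inspect}"
sample.unlink
```

Knowing which lines fail makes it easier to inspect the exact bytes (for example with line.bytes.to_a) before deciding how to convert them.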

Panagiotis_Atmatzidi · 30 March 2015 16:30

Hello,

Is ruby-2.2.1 an option? UTF-8 support has greatly improved and you don’t need to manually add lines to support UTF-8 under ruby-2.x.

Other than that, you should somehow change your environment variables to support UTF-8. A common UTF-8 ‘irb’ issue back in the 1.9 days was ruby-1.9 installed without ‘readline’ support. You could try re-installing ruby with readline support, though I'm not sure it’s going to work.

Panagiotis (atmosx) Atmatzidis

email: atma@convalesco.org
URL: http://www.convalesco.org
GnuPG ID: 0x1A7BFEC5
gpg --keyserver pgp.mit.edu --recv-keys 1A7BFEC5

“There’s something that I’ve learned. I don’t believe in predetermined faith!” - Hitomi Kanzaki

Hassan_Schroeder · 30 March 2015 18:07

But they'll be happy with an unmaintained, potentially insecure Ruby
driving this tool? Well OK then

Good luck with that.

Martin_DeMello · 30 March 2015 18:43

Would bundling your own ruby with the tool work? See
Category: Packaging to Executables - The Ruby Toolbox for
options.

martin

Besnik_Ruka · 30 March 2015 18:10

This is ignorant and not helpful. There are many legacy applications out
there on older tech. Upgrading costs money and time, and that escalates
rapidly depending on the size of the app.

Leam_Hall · 30 March 2015 18:17

Good points! I'm digesting someone else's XML and have logged a ticket with
them. They've acknowledged the issue and in 6-12 months it might get fixed.
Until that time I'm just doing the best I can to get through the interim.

I'm not sure that it is UTF-8 but it's supposed to be. The string I tested
came back as UTF-8, but as you noted it may not be a well done thing.

Had not seen iconv. I'll have to go see if there's a way to let the system assume
whatever for the existing encoding.
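For what it's worth, Ruby 1.9 can do this without iconv (which was deprecated in 1.9.3 and dropped from the standard library in 2.0): String#force_encoding labels bytes with the encoding you assume they are in, and String#encode transcodes them. A rough sketch, where Windows-1252 is purely an assumption about the real source encoding and the sample bytes are made up:

```ruby
# Assumed input: the raw bytes 0xE2 0x2B 0xE2, as a binary read might return them.
raw = [0xE2, 0x2B, 0xE2].pack('C*')

# Label the bytes with the encoding we believe they are in (an assumption,
# Windows-1252 here), then transcode to UTF-8.
utf8 = raw.dup.force_encoding('Windows-1252').encode('UTF-8')

# Optionally flatten to plain ASCII, replacing anything that has no ASCII form.
ascii = utf8.encode('US-ASCII', :undef => :replace, :replace => '?')

puts utf8.valid_encoding?
puts utf8.bytes.to_a.inspect
puts ascii
```

The same assumption can be applied while reading, by opening the file with a mode string such as 'r:Windows-1252:UTF-8'. If the guess about the original encoding is wrong, the result will be valid UTF-8 but the wrong characters.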

Hassan_Schroeder · 30 March 2015 18:20

IMO it's "ignorant" to stick your head in the sand and ignore security
issues.

Upgrading has costs; so does recovering from a preventable security
exploit. And the longer your app goes without upgrading, the greater
the exposure (and ultimately cost to upgrade or replace).

Leam_Hall · 30 March 2015 18:26

That's a dangerous message to preach if you want your community to
continue. If you say "Your version is insecure and you should spend weeks
of man hours to upgrade, as should everyone who uses your product", then
you're likely to wind up with no one using your language because it's not
worth the effort.

Most places I've seen don't want to ignore security issues. However, they
have to produce some sort of product and they have limited resources to do
so. If Language X becomes so insecure that major upgrades are required
because the community quits supporting what everyone is using, then
Language X is used a lot less. Just because it's cool doesn't make it
worthwhile.

Leam

Besnik_Ruka · 30 March 2015 20:16

> I'm not sure that it is UTF-8 but it's supposed to be. The string I tested
> came back as UTF-8, but as you noted it may not be a well done thing.

There's no reliable way to programmatically discover the encoding of a file. There
are tons of internet writeups on this. Unless you know who/where/how it was
typed, the best you can do is guess. If it was typed in the United
States, then it may be UTF-8, Windows-1252, ISO Latin 1, etc. If it was
typed in another country then anything goes.

And that's assuming that the file has remained in the original encoding it
was written in. If it was incorrectly re-encoded along the way, then it's
pretty much a lost cause.

Anyway, you'll just have to fool around with it for a little bit to see if
you can convert it to something that won't break the ruby interpreter. I'm
afraid there are no good answers to your problem.
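The "fooling around" could look something like the sketch below: try a short list of candidate encodings in order and take the first one under which the bytes decode cleanly. The helper names and the candidate list are hypothetical, and a match is still only a guess: ISO-8859-1 accepts any byte at all, so it always "succeeds" and belongs last.

```ruby
# Hypothetical helper: true if the bytes are valid in the named encoding
# and every byte can be mapped to UTF-8.
def decodes_cleanly?(bytes, name)
  copy = bytes.dup.force_encoding(name)
  return false unless copy.valid_encoding?
  copy.encode('UTF-8') # raises if a byte has no mapping in this encoding
  true
rescue EncodingError
  false
end

# Hypothetical helper: the first candidate that decodes cleanly, or nil.
# The order of the candidates matters; the list here is only an example.
def guess_encoding(bytes, candidates = ['UTF-8', 'Windows-1252', 'ISO-8859-1'])
  candidates.find { |name| decodes_cleanly?(bytes, name) }
end

valid_utf8 = [0x63, 0x61, 0x66, 0xC3, 0xA9].pack('C*') # "café" as UTF-8 bytes
suspect    = [0xE2, 0x2B, 0xE2].pack('C*')             # assumed problem bytes

puts guess_encoding(valid_utf8)
puts guess_encoding(suspect)
```

Once a plausible encoding is found, the bytes can be transcoded with force_encoding followed by encode('UTF-8'), and the result checked by eye.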

Bryce_Kerley · 30 March 2015 18:49

Denial isn’t how you fix the cost of updating; making updating routine, well-practiced, or automated is. The reality of shipping software in 2015 is that your environment is constantly changing, whether due to a megacorporation picking a different business strategy, researchers finding flaws in common software, or business partners ceasing to be businesses or partners.

Ruby 2.2.1 isn’t even a big change from 1.9.3: new syntax that doesn’t break existing syntax, new APIs that augment existing ones, and a more efficient runtime.
