Dealing with invalid encoding...

Ken_D_Ambrosio · 4 May 2018 16:28

Hi, all. I've got a file with some things like this:

radio frames^M<83>?<9B>v64
(The "?" is a \x3f)

Needless to say, Ruby barfs all over that. There are also *other* invalid strings in the file. (Thanks, Framemaker.)

Now, I know I can use #scrub to make the file palatable, but what I *really* want to do is to take the "<83>?<9B>", and swap it with a \u2022 (unicode bullet), and then use #scrub on the rest of the invalid stuff. But I can't figure out how to do that; I admit I get out of my depth when dealing with encodings. Ruby 2.3, if that matters.

Thanks for any pointers,

-Ken

Peter_Butler · 4 May 2018 18:21

Looks like a job for regular expressions.

Unsubscribe: <mailto:ruby-talk-request@ruby-lang.org?subject=unsubscribe>
<http://lists.ruby-lang.org/cgi-bin/mailman/options/ruby-talk>

Xeno_Campanoli1 · 4 May 2018 18:55

/#{Regexp.quote(your_string_variable)}/
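
A rough sketch of why the quoting matters for this particular sequence, assuming the data is held as a binary string: the "?" in the middle is a regexp metacharacter, so an unquoted pattern means "optional \x83, then \x9B" rather than the literal three bytes. The variable names and the sample line are made up for illustration, and Regexp.new is used here instead of the interpolated literal, but the idea is the same.

```ruby
# Hypothetical sample: the three bytes 0x83 0x3F 0x9B from the original post,
# kept as a binary (ASCII-8BIT) string so Ruby does not validate them as UTF-8.
bad_sequence = "\x83?\x9B".b
sample       = "radio frames\r\x83?\x9Bv64".b

# Regexp.quote backslash-escapes the "?" so it is matched literally instead
# of being read as the "zero or one" quantifier.
escaped = Regexp.quote(bad_sequence)
pattern = Regexp.new(escaped)

# Byte offset where the full three-byte sequence starts.
quoted_offset = sample =~ pattern

# For contrast: without quoting, the pattern only finds the lone \x9B byte.
unquoted_offset = sample =~ Regexp.new(bad_sequence)

puts escaped.inspect
puts quoted_offset.inspect
puts unquoted_offset.inspect
```

Because both the pattern and the sample are binary, Ruby compares raw bytes and never complains about invalid UTF-8.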

Ryan_Davis1 · 4 May 2018 21:23

“Ruby Barfs”… “can’t figure out how”…

What can’t you figure out? What have you tried?

These details are important to provide when posting to a list like this so people have enough context to help.

That said, you probably want to read the file as BINARY (aka ASCII-8BIT), do your transformations/substitutions with regular expressions & strings (that are also binary) and hopefully do all the ones you need such that the file winds up being valid UTF-8 (or whatever you’re going for).
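
One way that could look, as a sketch only: the temp file, its contents, and the variable names are invented for the example, and "\xFF" stands in for the "other" invalid bytes in the real file. It reads the file as binary, swaps the known byte sequence for the UTF-8 bytes of the bullet, and only then labels the result as UTF-8 and scrubs what is left.

```ruby
require "tempfile"

# Hypothetical input: a tiny stand-in for the FrameMaker export.
# "\xFF" represents some other invalid byte that should simply be scrubbed.
raw_input = "radio frames\r\x83?\x9Bv64\nstray \xFF byte\n".b

text = Tempfile.create("framemaker") do |file|
  file.binmode
  file.write(raw_input)
  file.flush

  # File.binread returns the contents as ASCII-8BIT, with no validation.
  data = File.binread(file.path)

  # Pattern (/n flag) and replacement are both binary, so gsub! works on raw
  # bytes. "\u2022".b is the bullet's three UTF-8 bytes (E2 80 A2).
  data.gsub!(/\x83\?\x9B/n, "\u2022".b)

  # Relabel the bytes as UTF-8 (nothing is converted), then replace whatever
  # is still invalid with U+FFFD, the default replacement for UTF-8 strings.
  data.force_encoding(Encoding::UTF_8).scrub
end

puts text.valid_encoding?
puts text.inspect
```

For a real file, the Tempfile part would go away and File.binread would point at the actual path.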

Marvin_Gulker · 5 May 2018 10:16

> Needless to say, Ruby barfs all over that. There are also *other* invalid
> strings in the file. (Thanks, Framemaker.)

I assume you're running into an invalid encoding exception. As Ryan pointed
out, being more specific helps us to help you.

> Now, I know I can use #scrub to make the file palatable, but what I *really*
> want to do is to take the "<83>?<9B>", and swap it with a \u2022 (unicode
> bullet), and then use #scrub on the rest of the invalid stuff. But I can't
> figure out how to do that; I admit I get out of my depth when dealing with
> encodings. Ruby 2.3, if that matters.

If the problem is that Ruby doesn't let you do replacements on the
string due to the invalid encoding, then you can use #force_encoding
to force the string's encoding to "BINARY" (which means: accept
any bytes), then do the replacement (use hex escapes, including for
the bullet, whose UTF-8 bytes are \xE2\x80\xA2), and then use
#force_encoding again to label it with the original encoding. Note that
#force_encoding only changes the label; it does not convert any bytes.
Then clean the string up with #scrub.
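
Roughly, those steps might look like the sketch below. The sample line and variable names are assumptions for illustration, and "\xFF" stands in for some other invalid byte. A plain string is used as the pattern, so the "?" needs no escaping.

```ruby
# Hypothetical line as it might look after a normal read: tagged UTF-8, but
# containing bytes that are not valid UTF-8.
line = "radio frames\r\x83?\x9Bv64 \xFF".b.force_encoding(Encoding::UTF_8)
original_encoding = line.encoding

# 1. Relabel as binary; the bytes themselves are untouched.
line.force_encoding(Encoding::BINARY)

# 2. Replace the known sequence, using hex escapes for both sides.
#    "\xE2\x80\xA2" is U+2022 (bullet) encoded as UTF-8.
line.gsub!("\x83?\x9B".b, "\xE2\x80\xA2".b)

# 3. Relabel with the original encoding, then scrub what is still invalid
#    (the default replacement for UTF-8 strings is U+FFFD).
cleaned = line.force_encoding(original_encoding).scrub

puts cleaned.valid_encoding?
puts cleaned.inspect
```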

Marvin

--
Blog: https://mg.guelker.eu
PGP/GPG ID: F1D8799FBCC8BC4F
