Dealing with invalid encoding...


(Ken D'Ambrosio) #1

Hi, all. I've got a file with some things like this:

radio frames^M<83>?<9B>v64
(The "?" is a \x3f)

Needless to say, Ruby barfs all over that. There are also *other* invalid strings in the file. (Thanks, Framemaker.)

Now, I know I can use #scrub to make the file palatable, but what I *really* want to do is to take the "<83>?<9B>", and swap it with a \u2022 (unicode bullet), and then use #scrub on the rest of the invalid stuff. But I can't figure out how to do that; I admit I get out of my depth when dealing with encodings. Ruby 2.3, if that matters.

Thanks for any pointers,

-Ken


(Peter Butler) #2

Looks like a job for regular expressions.

Unsubscribe: <mailto:ruby-talk-request@ruby-lang.org?subject=unsubscribe>
<http://lists.ruby-lang.org/cgi-bin/mailman/options/ruby-talk>


(Xeno Campanoli) #3

/#{Regexp.quote(your_string_variable)}/
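A minimal sketch of why the quoting matters here, assuming the three bytes from the original post (0x83, "?", 0x9B) and working on binary strings. The names bad_bytes, pattern, loose and sample are made up for illustration. Without Regexp.quote the "?" acts as a quantifier, so the pattern would also match a lone \x9B.

```ruby
# The byte sequence from the original post: 0x83, "?" (0x3f), 0x9B.
# String#b returns a copy tagged ASCII-8BIT (binary).
bad_bytes = "\x83?\x9B".b

# Regexp.quote escapes the "?" so it only matches a literal question mark.
pattern = Regexp.new(Regexp.quote(bad_bytes))

# Unquoted, the "?" makes the preceding \x83 optional (shown for contrast).
loose = Regexp.new(bad_bytes)

# Hypothetical sample line, also binary so the encodings agree.
sample = "radio frames\r\x83?\x9Bv64".b

puts "quoted pattern found at byte #{sample.index(pattern).inspect}"
puts "loose pattern matches a lone \\x9B: #{!(loose =~ "\x9B".b).nil?}"
puts "quoted pattern matches a lone \\x9B: #{!(pattern =~ "\x9B".b).nil?}"
```

Since the sequence is a fixed string, passing bad_bytes directly to gsub as a String (no Regexp at all) should work just as well.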

···

On Fri, May 4, 2018 at 11:21 AM, Peter Butler <peter@cogico.com> wrote:

Looks like a job for regular expressions.


(Ryan Davis) #4

“Ruby Barfs”… “can’t figure out how”…

What can’t you figure out? What have you tried?

These details are important to provide when posting to a list like this so people have enough context to help.

···

On May 4, 2018, at 09:28, Ken D'Ambrosio <ken@jots.org> wrote:

Now, I know I can use #scrub to make the file palatable, but what I *really* want to do is to take the "<83>?<9B>", and swap it with a \u2022 (unicode bullet), and then use #scrub on the rest of the invalid stuff. But I can't figure out how to do that; I admit I get out of my depth when dealing with encodings. Ruby 2.3, if that matters.

---

That said, you probably want to read the file as BINARY (aka ASCII-8BIT), do your transformations/substitutions with regular expressions & strings (that are also binary) and hopefully do all the ones you need such that the file winds up being valid UTF-8 (or whatever you’re going for).
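A rough sketch of that approach, assuming the target is UTF-8 and that the only sequence needing special treatment is the one from the original post. The helper name clean_file, the constants, and the temporary file standing in for the FrameMaker export are all hypothetical.

```ruby
require "tempfile"

BAD_BYTES = "\x83?\x9B".b  # sequence from the original post, as binary
BULLET    = "\u2022".b     # UTF-8 bytes of the bullet, tagged binary to match

# Hypothetical helper: read raw bytes, patch the known sequence,
# then relabel as UTF-8 and scrub whatever is still invalid.
def clean_file(path)
  data = File.binread(path)             # ASCII-8BIT string, no validation
  data = data.gsub(BAD_BYTES, BULLET)   # byte-for-byte substitution
  data.force_encoding(Encoding::UTF_8)  # relabel only, bytes are unchanged
  data.scrub                            # default replacement is U+FFFD
end

# Demo with a temporary file; the trailing \xFF is extra invalid junk.
Tempfile.create("framemaker") do |f|
  f.binmode
  f.write("radio frames\r\x83?\x9Bv64 \xFF\n".b)
  f.flush
  puts clean_file(f.path).inspect
end
```

The replacement has to be binary too; mixing a UTF-8 bullet into a binary string that contains high bytes raises Encoding::CompatibilityError.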


(Marvin Gülker) #5

Needless to say, Ruby barfs all over that. There are also *other* invalid
strings in the file. (Thanks, Framemaker.)

I assume you run into an invalid encoding exception. As Ryan pointed
out, being more specific helps us to help you.

Now, I know I can use #scrub to make the file palatable, but what I *really*
want to do is to take the "<83>?<9B>", and swap it with a \u2022 (unicode
bullet), and then use #scrub on the rest of the invalid stuff. But I can't
figure out how to do that; I admit I get out of my depth when dealing with
encodings. Ruby 2.3, if that matters.

If the problem is that Ruby doesn't let you do replacements on the
string due to the invalid encoding, then you can use #force_encoding
to tag the string as "BINARY" (which means: accept anything), then do
the replacement (use hex escapes, and make the replacement string
binary as well), and then use #force_encoding again to tag it with
the original encoding. Note that #force_encoding only relabels the
bytes; it does not convert them. Then clean the string up with #scrub.
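The steps above could be sketched roughly like this for a string that is already in memory. The helper name replace_then_scrub and the sample line are assumptions for illustration; the byte sequence and the bullet come from the original post.

```ruby
# Hypothetical helper following the steps described above.
def replace_then_scrub(str, bad, good)
  original = str.encoding
  work = str.dup.force_encoding(Encoding::BINARY)  # relabel, accept any bytes
  work.gsub!(bad.b, good.b)                        # both sides binary
  work.force_encoding(original)                    # relabel back
  work.scrub                                       # replace remaining invalid bytes
end

# UTF-8-tagged string carrying invalid bytes (hypothetical sample).
line = "radio frames\r\x83?\x9Bv64 \xFF"

begin
  line.gsub(/v64/, "")  # regexp work on invalid UTF-8 raises
rescue ArgumentError => e
  puts "gsub failed: #{e.message}"
end

# Hex escapes for the pattern: 0x83, 0x3f ("?"), 0x9B.
fixed = replace_then_scrub(line, "\x83\x3f\x9B", "\u2022")
puts fixed.inspect
puts fixed.valid_encoding?
```

Doing the targeted replacement before #scrub matters: scrubbing first would already have turned \x83 and \x9B into replacement characters.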

Marvin

···

On 04 May 2018 at 12:28 -0400, Ken D'Ambrosio wrote:

--
Blog: https://mg.guelker.eu
PGP/GPG ID: F1D8799FBCC8BC4F