Testing encoding / escaping in Ruby 1.8

bruka · 13 November 2014 19:20

It's Rails 2.3, Ruby 1.8.7

Someone before I came along coded a bunch of character sanitations in our
app.
These had no documentation and no unit tests.

Our app is in English and all our clients use the Latin alphabet.
The sanitation is for the infamous MS Office "smart quotes", that is:
left + right single quote ( ‘ ) ( ’ ) => ASCII single quote &#39; ( ' )
left + right double quote ( “ ) ( ” ) => ASCII quotation mark &#34; ( " )
(http://amp-what.com/unicode/search/quote)

The sanitation code looks like this:

     str.gsub! "\342\200\230", "'"
     str.gsub! "\342\200\231", "'"
     str.gsub! "\342\200\234", '"'
     str.gsub! "\342\200\235", '"'

My knowledge of encodings and such is above elementary, but in this field
that really means nothing.

Question 1.
I don't understand what "\342\200\230" is.
From online research I find it is the "raw ASCII Octal representation".
What does that mean?

Question 2.
A user enters text into a form and submits it. It goes through the
controller, the model, and eventually the DB. At which stage is a Unicode
character converted to the above backslash representation? Can this be
tweaked for Rails 2.3 apps?

Question 3.
How can I write a test for the above sanitations? If I directly paste the
Unicode characters into my UTF8 encoded sanitation_test.rb the assertions
fail. They expect the escape sequences. This feels weird and it means my
assertion must look like this:
assert "Bob\342\200\230s house is red".replace_smart_quotes == "Bob's house is red"

Something about this doesn't feel right. I guess it would be fine if the
answer to 2. is reliable and permanent.

I understand that the answers to these questions may be long and complex. I
will happily read any online resource you send my way that might help me
learn more about the subject.

Thanks,

P.S.
Upgrading to a later Ruby is not an option.

bruka · 14 November 2014 19:07

I spent a bunch of time investigating this and I'm glad I did because
it's really fascinating stuff. I'm going to post my findings for
future Rubyists. Some issues are Ruby 1.8 specific, but others affect
all developers.

The questions:

Question 1.
I don't understand what "\342\200\230" is.

These are octal escapes, one per byte. The three bytes together are the
UTF-8 encoding of the character U+2018 (left single quote). Octal escapes
go way, way, way back and are common in older languages (e.g. C), but not
so much in recent frameworks because of the preference for full Unicode
support. Since Ruby 1.9 the "\uXXXX" Unicode escape is supported, and
is the one to use there. This SO answer has a great write-up about it.
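
To make the octal business concrete, here is a small sketch that takes the literal from the sanitation code apart. The variable names are mine, not from the app. I stuck to unpack because, as far as I know, it behaves the same on 1.8.7 and on later Rubies.

```ruby
# The literal from the sanitation code: three octal escapes, so three bytes.
left_single = "\342\200\230"

bytes      = left_single.unpack('C*')        # each byte as an integer
octal      = bytes.map { |b| b.to_s(8) }     # the same bytes written in base 8
hex        = bytes.map { |b| '%02X' % b }    # and in base 16
codepoints = left_single.unpack('U*')        # the bytes decoded as UTF-8

puts bytes.inspect       # [226, 128, 152]
puts octal.inspect       # ["342", "200", "230"]
puts hex.inspect         # ["E2", "80", "98"]
puts codepoints.inspect  # [8216], which is 0x2018
```

So "\342\200\230" is not some special ASCII form of the quote. It is simply the UTF-8 bytes of U+2018 spelled out in base 8.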

Question 2.
A user enters text into a form and submits it. It goes through the controller, the model, and eventually the DB.
At which stage is a Unicode character converted to the above backslash representation?

The premise of this question is wrong: nothing is ever converted to the
backslash representation. Rails does turn all strings to UTF8, but the
real answer is that a string literal and its octal or hex escaped forms
produce the same bytes, so they are equal and interchangeable.
Example:
apostrophe: '
octal: \47 hex: \x27

"Bob's" == "Bob\47s" # => true
"Bob's" == "Bob\x27s" # => true

To Ruby there is no difference between a UTF8 char and its octal or
hex escape (or, from 1.9 on, its \u Unicode escape). Note that I'm using
double-quoted strings, where escapes are processed. Single-quoted
strings won't work: "Bob's" != 'Bob\47s'
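
Here is the same comparison as a runnable sketch, with the single-quoted case spelled out. The variable names are just for illustration.

```ruby
plain  = "Bob's"
octal  = "Bob\47s"     # \47 is processed: one byte, the apostrophe
hex    = "Bob\x27s"    # \x27 is the same byte written in hex
single = 'Bob\47s'     # single quotes: backslash, 4 and 7 stay literal

puts plain == octal    # true
puts plain == hex      # true
puts plain == single   # false
puts single.length     # 7

# The same holds for the multi-byte smart quote escapes.
puts "\342\200\231" == "\xE2\x80\x99"   # true
```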

See this converter for your own special chars:
http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?mode=char

Question 3.
How can I write a test for the above sanitations?

This was the really tricky part. There are a lot of layers between me
and the Ruby interpreter (operating system, text editor etc.). Again,
Rails converts everything to UTF8, so if I want to test my special
character conversions, I need to make sure everything is in UTF8. It
turns out that because I'm on Windows, my text is not UTF8 but
Windows-1252. Tricky, tricky! To make a test with a pasted string
literal, which will ultimately only work in my environment, I have to
first convert it to UTF8.

Which looks something like this:

require 'iconv'

# Paste your special characters.
# Right side quote ’
raw_string = %(Bob’s house is red.)

# Convert to UTF8.
# Will differ depending on developer/testing environment. Highly brittle.
utf8_string = Iconv.conv('utf-8', 'windows-1252', raw_string)

assert raw_string != utf8_string #=> they "probably" won't be equal

puts utf8_string.inspect #=> "BobΓÇÖs house is red."

# Should match the equivalent octal
assert utf8_string.include?("Bob\342\200\231s")
# Should match the equivalent hex
assert utf8_string.include?("Bob\xE2\x80\x99s")
assert utf8_string.gsub(/\342\200\231/, "<rsquo>") == "Bob<rsquo>s house is red."
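
A less brittle variant, as a sketch: write both the input and the expected value with escapes only, so the test no longer depends on how the editor saved the file. The replace_smart_quotes method below is my own stand-in for the app's String method and is only assumed to do the same four substitutions. The last lines show why the pasted character did not match on my machine.

```ruby
# Hypothetical stand-in for the app's sanitation, written as a plain method.
def replace_smart_quotes(str)
  pairs = [
    ["\342\200\230", "'"],   # U+2018 left single quote
    ["\342\200\231", "'"],   # U+2019 right single quote
    ["\342\200\234", '"'],   # U+201C left double quote
    ["\342\200\235", '"']    # U+201D right double quote
  ]
  result = str
  pairs.each { |smart, plain| result = result.gsub(smart, plain) }
  result
end

# Escapes only: no pasted characters, so the file encoding does not matter.
input    = "Bob\342\200\231s \342\200\234red\342\200\235 house"
expected = "Bob's \"red\" house"
raise "sanitation failed" unless replace_smart_quotes(input) == expected

# Why the pasted quote failed: Windows-1252 stores it as one byte,
# while the sanitation looks for the three UTF-8 bytes.
cp1252_quote = "\222"
utf8_quote   = "\342\200\231"
puts cp1252_quote.unpack('C*').inspect   # [146]
puts utf8_quote.unpack('C*').inspect     # [226, 128, 153]
```

Inside a Test::Unit case the raise line would presumably become assert_equal expected, replace_smart_quotes(input).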

Hope this helps someone else in the future dealing with the past.

More reading:
https://www.ruby-forum.com/topic/128572
one of the early discussions about moving away from octal escapes to
Unicode escapes.

