"200 Millionen Jahre später # 17.39
\n",
"200 Millionen Jahre später # 9.87
3404211707 \n",
"A l'assaut de l'invisible 1977 # 4.91
\n",
"A l'assaut de l'invisible 1990 # 5.18
226603779 \n",
The above 4 lines are data I was attempting to load into an array to
test some code. I was getting what I thought were strange results until
I realized that not all characters were being loaded into the element,
resulting in column alignment problems.
The data above was cut from a file that had been manipulated a dozen
times in Ruby arrays before being written back to a file, so it appears
the default way Ruby handles extended ASCII(?) is fine.
I have two questions:
1) Should I ever have to worry about data scraped from web pages not
being handled correctly by Ruby?
2) How do I flag this data so I can manipulate it properly, that is,
load it into an array or write it to a file?
Tried playing with the following, but even if the code below is correct,
the extended ASCII characters are lost by the time it gets to IRB:
str = String.new
str.encode("US-ASCII")
str = "Millionen Jahre später"
"200 Millionen Jahre später # 17.39
\n",
"200 Millionen Jahre später # 9.87
3404211707 \n",
"A l'assaut de l'invisible 1977 # 4.91
\n",
"A l'assaut de l'invisible 1990 # 5.18
226603779 \n",
The above 4 lines are data I was attempting to load into an array to
test some code. I was getting what I thought were strange results until
I realized not all characters were being loaded into the element
resulting in column alignment problems.
The data above was cut from a file that had been manipulated a dozen
times in ruby arrays before being written to a file. So it appears the
default way ruby handles extended ASCII(?) is fine.
I have two questions
1) Should I ever have to worry about data scraped from web pages not
being handled correctly by Ruby?
That would depend very much on how you scrape the data and whether you
handle things like meta tags correctly.
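For instance, to honor a page's declared charset yourself, a rough
sketch (the file name, the CSS selector and the ISO-8859-1 fallback are
just assumptions; recent Nokogiri versions may also pick up the meta
charset on their own):
require 'nokogiri'
html = File.open("page.html", "rb") { |f| f.read }     # raw bytes, encoding unknown
doc = Nokogiri::HTML(html)
meta = doc.at('meta[http-equiv="Content-Type"]')
charset = meta && meta['content'][/charset=([\w-]+)/i, 1]
html.force_encoding(charset || "ISO-8859-1")            # tag the bytes with the declared charset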
···
On Thu, 2010-11-18 at 01:01 +0900, Don Norcott wrote:
"200 Millionen Jahre später # 17.39
\n",
"200 Millionen Jahre später # 9.87
3404211707 \n",
"A l'assaut de l'invisible 1977 # 4.91
\n",
"A l'assaut de l'invisible 1990 # 5.18
226603779 \n",
The above 4 lines are data I was attempting to load into an array to
test some code. I was getting what I thought were strange results until
I realized not all characters were being loaded into the element
resulting in column alignment problems.
The data above was cut from a file that had been manipulated a dozen
times in ruby arrays before being written to a file. So it appears the
default way ruby handles extended ASCII(?) is fine.
I have two questions
1) Should I ever have to worry about data scraped from web pages not
being handled correctly by Ruby?
Depends on how you read the data from web pages.
2) How do I flag this data so I can manipulate it properly, that is,
load it into an array or write it to a file?
You need to set encodings properly. You can do that when opening the file. Example:
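Something along these lines - a minimal sketch, assuming the data lives
in a file such as movies.txt saved as ISO-8859-1 (both the file name and
the encoding are assumptions):
lines = File.open("movies.txt", "r:ISO-8859-1") { |f| f.readlines }
lines.first.encoding                                  # => #<Encoding:ISO-8859-1>
lines = File.open("movies.txt", "r:ISO-8859-1:UTF-8") { |f| f.readlines }   # or transcode to UTF-8 while reading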
Tried playing with the following, but even if the code below is correct,
the extended ASCII characters are lost by the time it gets to IRB:
str = String.new
str.encode("US-ASCII")
str = "Millionen Jahre später"
This won't work - ever. encode returns a new string rather than changing
the receiver, and you then reassign str to point at a completely
different string anyway, so whatever encoding you asked for is lost.
Also, there is no "ü" in US-ASCII, which is 7-bit!
irb(main):011:0> s="a"
=> "a"
irb(main):012:0> s.encoding
=> #<Encoding:UTF-8>
irb(main):013:0> t = s.encode "ASCII"
=> "a"
irb(main):014:0> t.encoding
=> #<Encoding:US-ASCII>
Now with "ü":
irb(main):015:0> s="ü"
=> "ü"
irb(main):016:0> s.encoding
=> #<Encoding:UTF-8>
irb(main):017:0> t = s.encode "ASCII"
Encoding::UndefinedConversionError: "\xC3\xBC" from UTF-8 to US-ASCII
from (irb):17:in `encode'
from (irb):17
from /usr/local/bin/irb19:12:in `<main>'
I am using Nokogiri (with Mechanize) to scrape the data, and the data I
am concerned with is extracted only from displayable fields: <table
class="result"> .... </table>
The code set/language references I see are
<meta content="text/html; charset=ISO-8859-1" http-equiv="Content-Type">
Which is, I believe, what I am calling extended ASCII (8-bit, 0-255),
and
//<![CDATA[ var awsDomain = 'xxxxxxxx.xxx';
var surveyLink = "sm=93_2fjk6BaUHEqrn2qpdbknQ_3d_d"
var twoLetterISOCode = 'en'; //]]>
The scraped data has never caused a problem within the Ruby program (it
would have been very obvious). Can I safely assume that code sets will
never present a problem for this specific application as long as the
retrieval methods do not change?
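One quick way to build confidence might be to check what the extracted
strings are actually tagged as (a sketch; the URL and selector are
placeholders for whatever the real scrape uses):
require 'mechanize'
agent = Mechanize.new
page  = agent.get("http://example.com/results")           # hypothetical URL
rows  = page.search("table.result tr").map { |tr| tr.text.strip }
p rows.first.encoding          # encoding the scraped string is tagged with
p rows.first.valid_encoding?   # whether its bytes are valid for that tag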
···
=========================
That being said, when I open the file with IO it reports
#<Encoding:IBM437>, which would contain the characters giving problems
(but not their correct representation). That is to say, the IBM437
character for E4 is a graphics character, not the accented 'a' ('ä') in
"später". The graphic is what is also being displayed in the IRB
console.
I have gone through most of the Shades of Gray link and the only thing
that I thought might have been of value is LC_CTYPE, but UTF-8 and
ISO-8859-1 both behave identically in my situation. I have removed
LC_CTYPE since there is no problem with internal data and it might cause
a problem down the line when I have forgotten about it.
Also tried saving the code & data to a file and running the file (ruby
xxx.rb), and it still reports a multibyte error.
Played with the Ruby command-line encoding settings (ruby -E XXX) and
still received errors regardless of which code set I picked - may be
related to LC_CTYPE, as I did not reboot, so it may still be in effect??
The error is
CodeSet.rb:4: invalid multibyte char (US-ASCII)
and US-ASCII is 7-bit. Extended ASCII code sets like ISO-8859-1 and
IBM437 are 8-bit, but I cannot seem to set this.
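If I understand it correctly, ruby -E only sets the default external and
internal encodings for IO; the source encoding of the script itself is
set by a magic comment on its first line (or the -K switch). A minimal
sketch, assuming the .rb file really is saved as ISO-8859-1:
# encoding: ISO-8859-1
data = ["200 Millionen Jahre später   # 17.39"]
puts data.first.encoding     # => ISO-8859-1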
=======================
I can edit the data file externally and read the data into an array
without problems.
So will assume no need to pursue the code set settings at this time.
Will not update unless I have a revelation.
By the way, the recommended link was excellent; I will save the URL as a
resource.
1) Should I ever have to worry about data scraped from web pages not
being handled correctly by Ruby?
In Ruby 1.9, you have to worry about this very much.
Strings in Ruby 1.9 are two-dimensional: they have a sequence of bytes,
and they have an encoding. There are additional 'dimensions' derived
from the string's content - empty?, ascii_only?, valid_encoding?.
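For example, in a script whose source encoding is UTF-8, those
dimensions can be inspected directly (a small sketch):
# encoding: utf-8
s = "später"
s.encoding           # => #<Encoding:UTF-8>
s.bytesize           # => 7  (the "ä" takes two bytes)
s.length             # => 6  characters
s.valid_encoding?    # => true
s.ascii_only?        # => false
t = s.dup.force_encoding("ISO-8859-1")
t.length             # => 7, the same bytes now read as Latin-1 characters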
If your scraper library doesn't document how it chooses the encodings to
tag each string it returns, and doesn't document how it handles invalid
encodings if it comes across them, then you have to test its behaviour
for all the various edge cases.
You never have this issue with ruby 1.8, because a string is just a
string of bytes. Of course, the "garbage in, garbage out" principle
still applies; you just don't choke on the garbage.
2) How do I flag this data so I can manipulate it properly, that is,
load it into an array or write it to a file?
That's a short question with a long answer, and I'm afraid my own
attempt to answer it is incomplete:
If you're reading stuff from a file or a socket yourself, you can
control the process. If you're trusting a third-party library to fetch
data from somewhere, then you have to trust that library to do the right
thing in the situations you're interested in.
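As a rough sketch of what "controlling the process" can look like when
you read the bytes yourself (the host and the assumed ISO-8859-1
encoding are made up):
require 'socket'
sock = TCPSocket.new("example.com", 80)
sock.write("GET / HTTP/1.0\r\nHost: example.com\r\n\r\n")
raw = sock.read                    # socket reads come back tagged ASCII-8BIT (binary)
sock.close
raw.force_encoding("ISO-8859-1")   # tag the bytes with what you believe they are
raw = raw.encode("UTF-8") if raw.valid_encoding?   # then transcode for internal use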
Tried playing with the following, but even if the code below is correct,
the extended ASCII characters are lost by the time it gets to IRB:
irb is not a good predictor of encoding behaviour for Ruby 1.9; you'd be
better off writing standalone .rb scripts that you run.
Note that it's one of the 1.9 language inconsistencies that transcoding
is *not* done on output by default. So if you have read a string from a
file, and carefully tagged it as, say, UTF-8, but your terminal is IBM437,
then
puts my_string
will just squirt the UTF-8 bytes to the terminal and they'll display
wrongly. You can try something like this:
STDOUT.set_encoding "IBM437"
or
STDOUT.set_encoding "locale"