Hi,
I'm trying to scrape links using Mechanize. Sometimes accented characters
(on French pages) come out corrupted once Ruby gets them. To see what I mean,
try this:
    require 'mechanize'

    agent = WWW::Mechanize.new
    # Fetch the French-language index page
    page = agent.get('http://www.agr.gc.ca/cb/index_f.php?s1=n&s2=index&page=2009_07')

    # Print the text of every link on the page
    page.links.each do |a_link|
      puts a_link
    end
Of course, it's only the accents that are entered in plain text (i.e.,
without entities) that have this problem. But in an imperfect world, I can't
always count on accents being entered properly.
Is there anything I can do about this? I've tried using Iconv to convert the
strings to UTF-8, but that just resulted in a different (but still wrong)
character in place of the broken ones.
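
A minimal sketch of why a conversion can swap one wrong character for another, assuming a modern Ruby (1.9 or later), where String#encode takes the place of Iconv. The byte values are illustrative and not taken from the page in question. Converting only helps when the source encoding you name is the one the bytes were actually written in:

```ruby
# "é" as UTF-8 is the two bytes C3 A9; as ISO-8859-1 it is the single byte E9.
utf8_bytes   = "\xC3\xA9".b
latin1_bytes = "\xE9".b

# Right guess: label the bytes with their real encoding, then convert.
good = latin1_bytes.dup.force_encoding("ISO-8859-1").encode("UTF-8")

# Wrong guess: UTF-8 bytes labelled as ISO-8859-1 turn into two characters.
bad = utf8_bytes.dup.force_encoding("ISO-8859-1").encode("UTF-8")

puts good
puts bad
```

The second result is a different string from the first, even though both started out as the same letter, which matches the "different but still wrong" symptom.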
Thanks for any help,
Patrick
What version of nokogiri / mechanize do you have installed? I ran your
code and was able to see the accents.
Most of the time, these encoding issues are due to the server
incorrectly identifying the encoding of the content. Is this content
supposed to be ISO-8859-1?
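
One way to cope with a mislabelled page, sketched under assumptions: decode_body is a hypothetical helper, not part of Mechanize or Nokogiri, and it assumes a modern Ruby (1.9 or later). It trusts the bytes over the declared charset. If they form valid UTF-8 it treats them as UTF-8, and otherwise it falls back to a caller-supplied guess (ISO-8859-1 here). This is a heuristic, not a guarantee:

```ruby
# Hypothetical helper: decode raw response bytes to a UTF-8 string.
# `fallback` is the encoding assumed when the bytes are not valid UTF-8.
def decode_body(raw_bytes, fallback = "ISO-8859-1")
  as_utf8 = raw_bytes.dup.force_encoding("UTF-8")
  return as_utf8 if as_utf8.valid_encoding?

  raw_bytes.dup.force_encoding(fallback).encode("UTF-8")
end

puts decode_body("Qu\xC3\xA9bec".b)  # bytes that are already UTF-8
puts decode_body("Qu\xE9bec".b)      # bytes that are ISO-8859-1
```

Checking validity first avoids converting text that is already UTF-8 a second time.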
···
On Tue, Jul 07, 2009 at 10:18:50PM +0900, Patrick Lajeunesse wrote:
--
Aaron Patterson
http://tenderlovemaking.com/
Thanks Aaron - I thought I was up-to-date, but I guess I wasn't. I did a gem
update and got 0.9.3 - and then it worked fine.
Thanks again,
Patrick
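
To check which versions are installed before and after an update, something along these lines should work, assuming a RubyGems recent enough to provide Gem::Specification.find_all_by_name. The name installed_versions is a hypothetical helper, not something either gem ships:

```ruby
require 'rubygems'

# Hypothetical helper: version strings of every installed copy of a gem,
# or an empty array when the gem is not installed.
def installed_versions(gem_name)
  Gem::Specification.find_all_by_name(gem_name).map { |spec| spec.version.to_s }
end

%w[mechanize nokogiri].each do |name|
  versions = installed_versions(name)
  puts "#{name}: #{versions.empty? ? 'not installed' : versions.join(', ')}"
end
```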
···
On Tue, Jul 7, 2009 at 12:13 PM, Aaron Patterson <aaron@tenderlovemaking.com > wrote: