Beginner Help - Encoding error exporting/writing to CSV

I've looked all over the internet and can't seem to figure out what I
need to do to get past this error. I admit, I've only been playing with
Ruby for a couple of days so bear with me.

I created a script to read the info and first post from a list of forum
threads. I can get all of the data in arrays and export to a CSV
perfectly fine with small batches. However, I know the forum has some
Japanese and other language posts and I can only assume one of those is
what's causing this problem.

After the parsing of my list of HTML pages is done and it starts writing
the data to CSV, about half way I get this:

F:/ruby/lib/ruby/1.9.1/csv.rb:1729:in `join': incompatible character
encodings: UTF-8 and ISO-8859-1 (Encoding::CompatibilityError)
        from F:/ruby/lib/ruby/1.9.1/csv.rb:1729:in `<<'
        from threads.rb:89:in `block (2 levels) in <main>'
        from threads.rb:80:in `each'
        from threads.rb:80:in `block in <main>'
        from F:/OfflineExplorerPortable/ruby/lib/ruby/1.9.1/csv.rb:1354:in `open'
        from threads.rb:77:in `<main>'

I tried a dozen different things I read online about encoding to try and
fix it but they either didn't do anything or threw me other method
errors. It's probably simple but it's beyond me at the moment. Can
anyone give the FNG a little help? I'd appreciate it!

--
Posted via http://www.ruby-forum.com/.

Can you show us the code? It would be easier to help.

All I can recommend now is to add .encode('utf-8') (or
.force_encoding('utf-8')) everywhere you accept external input.
Command line arguments? Downloaded website content? Files read from
disk? Mark their encodings explicitly.
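The two calls do different things, so it matters which one is used. A minimal sketch of the difference, assuming a made-up four-byte string ("café" stored as ISO-8859-1) in place of real downloaded content:

```ruby
# encoding: utf-8
# Illustrative input only: "café" as ISO-8859-1 bytes (é is the single byte 0xE9).
raw = "caf\xE9".dup.force_encoding('ISO-8859-1')

# encode transcodes: the bytes are rewritten so the same characters
# are represented in the target encoding.
transcoded = raw.encode('UTF-8')

# force_encoding only changes the label; the bytes stay exactly as they were.
relabeled = raw.dup.force_encoding('UTF-8')

puts transcoded.bytes.to_a.inspect   # é is now two bytes
puts relabeled.bytes.to_a.inspect    # still the original four bytes
puts transcoded.valid_encoding?
puts relabeled.valid_encoding?       # 0xE9 alone is not valid UTF-8
```

Roughly: use force_encoding when the bytes are already in the encoding you name and only the label is wrong, and encode when the text really is in another encoding and has to be converted.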

-- Matma Rex

This is the error from the above code. It appears after about 1,390 lines
have been written to the output.csv file:

F:/ruby/lib/ruby/1.9.1/csv.rb:1729:in `join': incompatible character
encodings: UTF-8 and ISO-8859-1 (Encoding::CompatibilityError)
        from F:/ruby/lib/ruby/1.9.1/csv.rb:1729:in `<<'
        from threads2.rb:87:in `block (2 levels) in <main>'
        from threads2.rb:78:in `each'
        from threads2.rb:78:in `block in <main>'
        from F:/ruby/lib/ruby/1.9.1/csv.rb:1354:in `open'
        from threads2.rb:75:in `<main>'
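
The traceback points at a `join` inside csv.rb, so the same error can probably be reproduced without CSV at all. A small sketch, using made-up sample strings rather than the real forum data, of why small batches might work while a larger run fails:

```ruby
# encoding: utf-8
# Made-up sample values, not taken from the actual forum pages.
utf8   = "日本語"
latin1 = "caf\xE9".dup.force_encoding('ISO-8859-1')

error = nil
begin
  # Both strings contain non-ASCII bytes and carry different encodings,
  # so Ruby refuses to combine them.
  [utf8, latin1].join(",")
rescue Encoding::CompatibilityError => e
  error = e
end
puts error.class

# An ISO-8859-1 string that happens to be pure ASCII joins without complaint,
# which is why the problem only shows up once a non-ASCII value comes along.
ascii_latin1 = "cafe".dup.force_encoding('ISO-8859-1')
joined = [utf8, ascii_latin1].join(",")
puts joined
```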

Thank you for your response.

Bartosz Dziewoński wrote in post #1074162:

All I can recommend now is to add .encode('utf-8') (or
.force_encoding('utf-8')) everywhere you accept external input.

I've tried that and I must be doing something wrong because I keep
getting undefined method errors when I do.

Can you show us the code? It would be easier to help.

I'll warn you and say that it's not very elegant but it seems to get the
data correctly:

# encoding: utf-8
require 'nokogiri'
require 'open-uri'
require 'csv'

#define arrays
@thread = Array.new
@filename = Array.new
@postid = Array.new
@title = Array.new
@date = Array.new
@filedate = Array.new
@member = Array.new
@memberurl = Array.new
@content = Array.new

#pull in file/URL list
files = CSV.read("files1.csv")
(0..files.length - 1).each do |index|
  #print out current URL or File
  puts files[index][0]
  #save it to array
  @filename << files[index][0]
  #load HTML to Nokogiri
  doc = Nokogiri::HTML(open(files[index][0]))

  #find the first post and ID
  threadurl = doc.css('a[name="1"]').map { |link| link['href'] }
  test = threadurl.to_s
  test = test[2..-3]
  #isolate the main/first post ID
  post_id = test.split("#")[1]
  post_id = post_id[4..-1]

  #make other versions of post ID references if needed
  postmessageid = "post_message_" + post_id
  postmenu = "postmenu_" + post_id
  pstid = "post" + post_id

  #find the Date of first post
  fdate = doc.css('td[class="thead"]')[2].content
  fdate = fdate.strip

  #get member name
  membername = doc.css('a[class="bigusername"]')

  #get post content
  contentid = "div#post_message_" + post_id
  postcontent = doc.css(contentid)

  #write all other arrays
  @title << doc.at_css("title").text[0..-28]
  @filedate << fdate
  @member << membername[0].text
  @memberurl << membername[0]['href']
  @content << postcontent
end

CSV.open("output.csv", "wb:UTF-8") do |row|
  row << ["Thread Title", "Filename - Thread URL", "Date", "Member",
          "Member URL", "Content"]
  #(0..urls.length - 1).each do |index|
  (0..files.length - 1).each do |index|
    row << [
      @title[index],
      @filename[index],
      @filedate[index],
      @member[index],
      @memberurl[index],
      @content[index]]
  end
end

Uh, that's certainly bad. Either you're not using Ruby 1.9 (that would
be weird...), or the objects you are operating on are not the objects
you think they are (you should get information about this in the error
message, e.g. "NoMethodError: undefined method `encode' for
5:Fixnum").

I skimmed the code; possible places where you are not setting the
encoding are "files = CSV.read("files1.csv")" or "doc =
Nokogiri::HTML(open(files[index][0]))". Also, this is different code
from the one that gives the error in the first post (there is no line 77
in it).
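
One way to check whether the objects are what you think they are is to print each value's class and encoding before writing it. A rough sketch, where `content` is a hypothetical stand-in for an array like @content and the sample values are invented:

```ruby
# encoding: utf-8
# Hypothetical sample data standing in for the collected values.
content = [
  "plain ascii",
  "日本語",
  "caf\xE9".dup.force_encoding('ISO-8859-1'),
  12345
]

# Collect [value, index] pairs for anything that is not a valid UTF-8 String.
suspects = content.each_with_index.select do |value, _index|
  !value.is_a?(String) ||
    value.encoding != Encoding::UTF_8 ||
    !value.valid_encoding?
end

suspects.each do |value, index|
  encoding = value.is_a?(String) ? value.encoding.to_s : "not a String"
  puts "index #{index}: #{value.class} (#{encoding})"
end
```

Anything reported as "not a String" would also explain an undefined method error for encode or force_encoding, since those methods are defined on String.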

-- Matma Rex

2012/9/1 Allan A. <lists@ruby-forum.com>:

Bartosz Dziewoński wrote in post #1074162:

All I can recommend now is to add .encode('utf-8') (or
.force_encoding('utf-8')) everywhere you accept external input.

I've tried that and I must be doing something wrong because I keep
getting undefined method errors when I do.

Bartosz Dziewoński wrote in post #1074257:

Uh, that's certainly bad. Either you're not using Ruby 1.9 (that would
be weird...), or the objects you are operating on are not the objects
you think they are (you should get information about this in the error
message, e.g. "NoMethodError: undefined method `encode' for
5:Fixnum").

It would just be like:

postcontent.force_encoding('utf-8')

right? Or do I have to pass the output of that into another
variable/string?

This is what I get when I use the above:

threads.rb:64:in `block in <main>': undefined method `force_encoding'
for #<Nokogiri::XML::NodeSet:0x142b048> (NoMethodError)
        from threads.rb:21:in `each'
        from threads.rb:21:in `<main>'

I skimmed the code; possible places where you are not setting the
encoding are "files = CSV.read("files1.csv")" or "doc =
Nokogiri::HTML(open(files[index][0]))". Also, this is different code
from the one that gives the error in the first post (there is no line 77
in it).

Nope, it's the same code, but files1.csv has about 3000 lines/files
in it. I presume it's failing on the postcontent in one of those in the
array.

When I stick the force_encoding at the end of the Nokogiri call (since
that's where the text is coming out from), I get:

threads.rb:20:in `<main>': undefined method `force_encoding' for
#<Array:0xb3aae0> (NoMethodError)
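
Both errors say the receiver is not a String (a NodeSet in one case, an Array in the other), and force_encoding is only defined on String. A sketch of one possible approach: convert each field to a String first, then transcode it. `FakeFragment` and `to_utf8` are hypothetical names invented for this illustration, and the stand-in class only imitates an object whose to_s returns ISO-8859-1 text; it is not how Nokogiri is known to behave.

```ruby
# encoding: utf-8
# Hypothetical stand-in for a non-String object (e.g. a parsed fragment)
# whose to_s returns text labeled ISO-8859-1.
class FakeFragment
  def to_s
    "caf\xE9".dup.force_encoding('ISO-8859-1')
  end
end

# Hypothetical helper: turn any field into a UTF-8 String.
# to_s handles non-String values (including nil); encode transcodes
# from whatever encoding the resulting String is labeled with.
def to_utf8(value)
  value.to_s.encode('UTF-8')
end

row   = ["日本語", FakeFragment.new, 42, nil]
clean = row.map { |field| to_utf8(field) }

# With every field in the same encoding, the join no longer raises.
line = clean.join(",")
puts line
puts clean.map { |field| field.encoding }.inspect
```

In the script above, that would mean storing a String such as postcontent.text (or postcontent.to_s) in @content rather than the NodeSet itself, and transcoding it before it reaches the CSV row. Note that encode trusts the label on the String, so it only helps if that label matches the actual bytes.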
