Setting encoding of pages in Capybara

Hi all,

Quick encoding question: say I'm trying to grab data from a Japanese page
using Capybara and Rack::Test, and I get badly encoded text in the response.
e.g. running this script:

require 'rubygems'
require 'capybara'
require 'rack/test'
require 'rack/proxy'

Capybara.default_selector = :css

class Japan < Rack::Proxy
  def rewrite_env(env)
    env['HTTP_HOST'] = 'l-tike.com'
    env
  end
end

session = Capybara::Session.new(:rack_test, Japan.new)
session.visit '/pickup/concert_more.html'
puts session.body

You'll see weird characters in the output, and I can't find nodes that
should be there with css/xpath. How do I set the encoding so that Nokogiri
parses the page properly?

--
James Coglan
http://jcoglan.com
+44 (0) 7771512510

Hi,

First, a quick note: this question is probably more appropriate for the
Capybara or Nokogiri mailing lists. You're likely to get a quicker response
from those groups.

It looks like this page claims (in its <meta> Content-Type header) to be
encoded in Shift_JIS, but the page is actually encoded in UTF-8. libxml's
guesses at encoding are not perfect, and in this case it trusts the
misleading header and uses the wrong encoding.
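
To illustrate the kind of damage this causes, here is a minimal plain-Ruby sketch (no Capybara or Nokogiri involved) that reinterprets UTF-8 bytes as Shift_JIS, which is roughly what a parser does when it trusts a wrong charset declaration. The sample string is my own assumption and is not taken from the page in question.

```ruby
# encoding: utf-8

# Hypothetical sample text; any Japanese UTF-8 string behaves similarly.
original = "コンサート"

# Treat the same bytes as Shift_JIS (the declared but wrong encoding),
# then transcode to UTF-8 for display. Byte sequences that are invalid or
# unmappable in Shift_JIS are replaced rather than raising an error.
misread = original.dup.force_encoding(Encoding::Shift_JIS)
garbled = misread.encode(Encoding::UTF_8, invalid: :replace, undef: :replace)

puts garbled               # no longer the original text
puts garbled == original   # false
```

The bytes never change; only the label attached to them does, which is why correcting the declared encoding is enough to fix the output.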

If this page is edited to contain

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

instead of

<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">

then all is well.
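
If you can't edit the page itself, one possible workaround is to rewrite the declaration in the response body before it reaches the parser. The following is only a sketch: `fix_declared_charset` is a hypothetical helper of my own, not part of Capybara, Rack::Test or Nokogiri, and I haven't worked out where best to hook it into the rack_test driver.

```ruby
# Hypothetical helper: replace the charset named in the first <meta> tag
# that declares one. Handles both the http-equiv form and <meta charset="...">.
def fix_declared_charset(html, actual = 'UTF-8')
  html.sub(/(<meta[^>]+charset=["']?)[\w-]+/i) { "#{$1}#{actual}" }
end

page  = '<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">'
fixed = fix_declared_charset(page)
puts fixed

# The corrected string could then be parsed directly. Nokogiri::HTML also
# accepts an explicit encoding as its third argument, for example:
#   Nokogiri::HTML(fixed, nil, 'UTF-8')
```

Passing the encoding explicitly to Nokogiri may make the rewrite unnecessary, but I have not checked how that interacts with the wrong meta tag on this particular page.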

Perhaps someone with more experience than me with non-Western character
sets will have deeper insight into libxml's behavior here?
