Setting encoding of pages in Capybara

James_Coglan2 · 7 September 2010 10:27

Hi all,

Quick encoding question: say I'm trying to grab data from a Japanese page
using Capybara and Rack::Test, and I get badly encoded text in the response.
e.g. running this script:

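require 'rubygems'
require 'capybara'
require 'rack/test'
require 'rack/proxy'

Capybara.default_selector = :css

# Rack app that proxies every request to l-tike.com by rewriting the Host header
class Japan < Rack::Proxy
  def rewrite_env(env)
    env['HTTP_HOST'] = 'l-tike.com'
    env
  end
end

# Drive the proxy through the Rack::Test driver and dump the raw response body
session = Capybara::Session.new(:rack_test, Japan.new)
session.visit '/pickup/concert_more.html'
puts session.body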

You'll see weird characters in the output, and I can't find nodes that
should be there with css/xpath. How do I set the encoding so that Nokogiri
parses the page properly?

--
James Coglan
http://jcoglan.com
+44 (0) 7771512510

Mike_Dalessio1 · 7 September 2010 11:29

Hi,

First, a quick note: this question is probably more appropriate for the
capybara or nokogiri mailing lists. You're likely to get a quicker response
from those groups.

It looks like this page claims (in its header) to be encoded in SHIFT_JIS,
but the page is actually encoded in UTF-8. libxml's guesses at encoding are
not perfect, and in this case the misleading declaration causes it to trust
the header and use the wrong encoding.
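To make the mismatch concrete, here is a minimal sketch in plain Ruby (no Nokogiri or Capybara involved) of what happens when UTF-8 bytes are read as if they were Shift_JIS. The sample string and the `decode_to_utf8` helper are assumptions made up for illustration; they are not taken from the page in question.

```ruby
# Hypothetical sample text standing in for the page body (not from l-tike.com).
utf8_bytes = "コンサート".encode("UTF-8")

# Reinterpret the same bytes under the wrong label, which is roughly what a
# parser does when it trusts an incorrect Shift_JIS declaration.
mislabeled = utf8_bytes.dup.force_encoding("Shift_JIS")

# Hypothetical helper: transcode to UTF-8 for display, substituting "?" for
# byte sequences that are invalid or unmappable in the claimed encoding.
def decode_to_utf8(str)
  str.encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
end

garbled = decode_to_utf8(mislabeled)
correct = decode_to_utf8(utf8_bytes.dup.force_encoding("UTF-8"))

puts garbled  # unrelated characters, not the original text
puts correct  # the original text
```

The bytes never change between the two cases; only the encoding label does, which is why a wrong declaration is enough to produce the weird characters.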

If this page is edited so that its charset declaration says UTF-8 instead of
SHIFT_JIS, then all is well.
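The kind of edit described here, changing the charset the page declares, could be sketched roughly as follows. This assumes the declaration lives in a meta tag; the `declare_charset` helper and the sample markup are hypothetical and are not the actual markup of the page.

```ruby
# Hypothetical helper: rewrite the charset named in an HTML meta declaration.
# Handles both <meta charset="..."> and the http-equiv Content-Type form.
def declare_charset(html, charset)
  html.sub(/(<meta[^>]*charset=["']?)[\w-]+/i) { "#{$1}#{charset}" }
end

# Assumed example markup, standing in for the page's real declaration.
html  = '<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">'
fixed = declare_charset(html, "UTF-8")

puts fixed
```

If you are calling Nokogiri directly rather than going through Capybara, passing the encoding explicitly as the third argument, as in `Nokogiri::HTML(html, nil, 'UTF-8')`, may be a simpler route than rewriting the markup. I have not checked how that interacts with a conflicting declaration on this particular page.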

Perhaps someone with more experience than me using non-western character
sets will have a deeper insight into libxml's behavior here?
