Handling special characters

Hi all.

I am very new to Ruby (5 days old), so my question might sound very
noobish. I am posting it only because I couldn't find a solution.

I am using Ruby to scrape the content of a site.

To be precise I am having problems with the ’ character.

Sample source page :
http://thainews.prd.go.th/newsenglish/previewnews.php?news_id=255108040023

The page is declared as TIS-620 and I use Iconv to convert it to UTF-8;
however, the special quote character makes iconv raise the following error:

/home/....../main.rb:37:in `iconv': "\222s announcement "...
(Iconv::IllegalSequence)

The affected code:

body = story.search("//font[@color=\"#333333\"]").inner_html
body = body.gsub(/<(.|\n)+?>/, "")
body = body.gsub(/’/, "\'")
puts body
body = Iconv.iconv("utf8", "tis-620", body) #<-- this is line 37
puts body

Or try the following in irb:

require 'rubygems'
require 'net/http'
require 'open-uri'
require 'iconv'
story = Hpricot(open('http://thainews.prd.go.th/newsenglish/previewnews.php?news_id=255108040023'))
body = story.search("//font[@color=\"#333333\"]").inner_html
body = body.gsub(/<(.|\n)+?>/, "")
body = body.gsub(/’/, "\'")
puts body

No matter what I put in place of the "’", nothing gets replaced and
iconv still raises the error.

I am looking for pointers on one of the following.

1) How do I replace "’" with "'"?
2) Or, how can I make iconv ignore the "’"?

At first I thought this was an I18n issue, but I guess getting rid of
the special character should be a simple string manipulation that I just
don't get.

--
Posted via http://www.ruby-forum.com/.

Oh, and you would also need to
require 'mechanize'

in irb to reproduce the issue:

require 'rubygems'
require 'net/http'
require 'open-uri'
require 'mechanize'
require 'iconv'
story = Hpricot(open('http://thainews.prd.go.th/newsenglish/previewnews.php?news_id=255108040023'))
body = story.search("//font[@color=\"#333333\"]").inner_html
body = body.gsub(/<(.|\n)+?>/, "")
body = body.gsub(/’/, "\'")
puts body


Hi,

I am looking for pointers on one of the following.

1) How do I replace "’" with "'"?

body = body.gsub(/\222/,"\'")
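
A note on why \222 can match where a pasted ’ does not: \222 is the octal escape for the single byte 0x92, which is what the page contains. If the script is saved as UTF-8, a ’ typed into it is the three bytes E2 80 99, so it never matches that byte. A minimal sketch, assuming Ruby 2.0 or later (for String#b) and a made-up sample string standing in for the page body:

```ruby
# Made-up sample: raw bytes containing the 0x92 quote (not the real page).
raw = "Today\x92s announcement".b      # .b gives a binary (ASCII-8BIT) copy

# A curly quote in a UTF-8 source file is three bytes, not the byte 0x92.
utf8_quote_bytes = "\u2019".bytes

# "\222" (octal) and "\x92" (hex) denote the same single byte.
same_byte = ("\222".b == "\x92".b)

# Replace the raw byte with a plain ASCII apostrophe before any conversion.
cleaned = raw.gsub("\x92".b, "'")

puts utf8_quote_bytes.map { |b| format("%02X", b) }.join(" ")
puts same_byte
puts cleaned
```

Doing the replacement on the raw bytes first means the later conversion never sees the byte that TIS-620 has no place for.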

2) Or, how can I make iconv ignore the "’"?

The ’ character (0x92) is not in the TIS-620 character set, but it is in Windows-874.

Refer to

http://www.microsoft.com/globaldev/reference/sbcs/874.mspx

Try
body = Iconv.iconv("utf-8", "windows-874", body).join

Regards,

Park Heesob
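
For readers on a newer Ruby: Iconv was deprecated in Ruby 1.9.3 and removed from the standard library in 2.0, so the same fix is usually written with String#encode there. A minimal sketch, assuming Ruby 2.0 or later and a made-up byte string in place of the scraped body:

```ruby
# Made-up sample bytes standing in for the scraped body (not the real page):
# 0x92 is the curly apostrophe in Windows-874, 0xA1 is the Thai letter ko kai.
raw = "Today\x92s announcement \xA1".b

# Interpret the bytes as Windows-874 and transcode them to UTF-8.
utf8 = raw.encode("UTF-8", "Windows-874")

puts utf8
```

Unlike Iconv.iconv, which returns an array (hence the .join above), encode returns a single string.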

2008/8/4 Sajal Kayan <sajal@thaindian.com>:

Heesob Park wrote:

The ’ character (0x92) is not in tis-620 but in windows-874 character
set.

Refer to
http://www.microsoft.com/globaldev/reference/sbcs/874.mspx

Try
body = Iconv.iconv("utf-8", "windows-874", body).join

Regards,

Park Heesob

Awesome, works like a charm now. Thanks for the prompt response.

Seems like the source site was declaring the wrong encoding in its HTML headers.

You saved me from going bald :)
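
Since the page says TIS-620 but carries Windows-874 bytes, a scraper can guard against this by mapping the declared label to the encoding it actually decodes with. The WHATWG Encoding Standard that browsers follow treats the label tis-620 as windows-874 in the same way. A minimal sketch, assuming Ruby 2.0 or later; CHARSET_OVERRIDES and decode_body are hypothetical names for illustration, not part of any library:

```ruby
# Hypothetical override table: declared charset label => encoding to decode with.
CHARSET_OVERRIDES = {
  "tis-620" => "Windows-874"
}.freeze

# Hypothetical helper: decode raw page bytes to UTF-8, preferring the
# override table over the charset the page declared.
def decode_body(bytes, declared_charset)
  label = declared_charset.to_s.strip.downcase
  source = CHARSET_OVERRIDES.fetch(label, declared_charset)
  bytes.encode("UTF-8", source)
end

# Made-up sample bytes, not the real page.
puts decode_body("Today\x92s announcement".b, "TIS-620")
```

Windows-874 keeps the Thai letters at the same byte values as TIS-620, so decoding a genuine TIS-620 page this way should give the same text.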
