Text encodings

Hello,

Is there any way to detect a text's encoding?
For example, whether it is in UTF-8, Windows-1251, or something else.

Thank you.

You can't detect one-byte-per-character encodings easily (i.e. without
statistical analysis) but you can easily tell if something's UTF-8 or
not:

class String
  # True if the string's bytes decode cleanly as UTF-8.
  def is_utf8?
    unpack('U*') # raises ArgumentError on malformed UTF-8
    true
  rescue ArgumentError
    false
  end
end

"foo".is_utf8? #=> true
"foo\303".is_utf8? #=> false

Not the most efficient way, necessarily, but probably the easiest.

Paul.
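
For readers on Ruby 1.9 or later, where strings carry an encoding, the same check can be sketched without reopening String. This is a minimal sketch rather than a replacement for the method above; the helper name utf8_bytes? is made up for illustration.

```ruby
# Hypothetical helper: true if the raw bytes are well-formed UTF-8.
# Works on a copy so the caller's string keeps its original encoding tag.
def utf8_bytes?(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

puts utf8_bytes?("foo")
puts utf8_bytes?("foo\xC3".b)
```

Like the unpack version, this only says whether the bytes could be UTF-8. It says nothing about which single-byte encoding the data is in when the check fails.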

On 10/07/06, xTRiM <rtokarev@gmail.com> wrote:

is there any way, to detect text encoding?
For example, is it in utf8, or in win1251, or something else.

Hi,

You can use the guess or guess2 method of the standard library's NKF
module (Ruby 1.8.2 or later) for that. Look up the NKF section at
http://www.ruby-doc.org/stdlib/.

Takashi Sano

In the general case, there's *no safe way* to do this, unless the data is XML or comes with an HTTP header from a reliable server (ha ha ha, I'm sure there must be one somewhere). Probably the best auto-detector is Mark Pilgrim's, but it's in Python: http://chardet.feedparser.org/

  -Tim
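
To make the "no safe way" point concrete, here is a rough sketch of the kind of statistical guess mentioned earlier in the thread for one-byte-per-character encodings. It is not how chardet works. The candidate list, the scoring rule (share of lowercase Cyrillic letters), and both method names are assumptions chosen for illustration, and the result is only ever a guess.

```ruby
# Illustrative candidate list of single-byte Cyrillic encodings (not exhaustive).
CANDIDATES = [Encoding::Windows_1251, Encoding::KOI8_R, Encoding::ISO_8859_5].freeze

# Lowercase Cyrillic letters, U+0430..U+044F plus U+0451.
LOWER_CYRILLIC = /[\u0430-\u044F\u0451]/

# Hypothetical scoring rule: fraction of characters that decode to lowercase
# Cyrillic, assuming ordinary Russian text is mostly lowercase letters.
def cyrillic_score(bytes, encoding)
  text = bytes.dup.force_encoding(encoding)
              .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
  return 0.0 if text.empty?
  text.scan(LOWER_CYRILLIC).length.to_f / text.length
end

# Returns the best-scoring candidate; ties go to the earlier list entry.
def guess_cyrillic_encoding(bytes)
  CANDIDATES.max_by { |enc| cyrillic_score(bytes, enc) }
end

sample = "\u043F\u0440\u0438\u0432\u0435\u0442 \u043C\u0438\u0440".encode('Windows-1251')
puts guess_cyrillic_encoding(sample)
```

Short inputs, mostly uppercase text, or text in a language outside the candidate list will fool a heuristic like this, which is why an explicit declaration of the encoding is preferable whenever one is available.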

On Jul 10, 2006, at 4:47 AM, Takashi Sano wrote:

is there any way, to detect text encoding?
For example, is it in utf8, or in win1251, or something else.

You can use the standard lib NKF's guess or guess2 (ruby 1.8.2 or
later) method for that. Look up the NKF section in
http://www.ruby-doc.org/stdlib/.

Nice pointer, Tim. I'll have to check that out. Incidentally, a quick web
search turned up a Ruby port (I have not evaluated it in any way, though):
http://rubyforge.org/projects/chardet/ by Hui Zheng.
The gem name is "chardet".

Jake
