Replace delimiter in Unicode-encoded file

Is there a way in Ruby to:
- open a file encoded in ucs-2le,
- replace every occurrence of '\t' (X'0009') with ',' (X'002c'),
- and save it back in ucs-2le, without losing any content?

thanks
chris

Well, you _could_ do it with iconv:

$ irb -riconv

data = File.read('test')
# => "a\000b\000c\000\t\000\273\006\t\0001\000"

str = Iconv.iconv('utf-8', 'ucs-2le', data).first
# => "abc\t\332\273\t1"

newstr = str.tr("\t", ',')
# => "abc,\332\273,1"

newdata = Iconv.iconv('ucs-2le', 'utf-8', newstr).first
# => "a\000b\000c\000,\000\273\006,\0001\000"

But that strikes me as unnecessary when you could just do:

newdata = File.read('test').tr("\t", ',')
# => "a\000b\000c\000,\000\273\006,\0001\000"

:wink:

Hope that helps,

···

On Mon, 2006-12-04 at 22:40 +0900, ciapecki wrote:

Is there a way in Ruby to:
- open a file encoded in ucs-2le,
- replace every occurrence of '\t' (X'0009') with ',' (X'002c'),
- and save it back in ucs-2le, without losing any content?

--
Ross Bamford - rosco@roscopeco.REMOVE.co.uk

Ross Bamford wrote:

···

On Mon, 2006-12-04 at 22:40 +0900, ciapecki wrote:

Is there a way in Ruby to:
- open a file encoded in ucs-2le,
- replace every occurrence of '\t' (X'0009') with ',' (X'002c'),
- and save it back in ucs-2le, without losing any content?

But that strikes me as unnecessary when you could just do:

newdata = File.read('test').tr("\t", ',')
# => "a\000b\000c\000,\000\273\006,\0001\000"

Um. Other way around. *Old* data is in UCS-2LE, not in UTF-8, so it's
not ASCII-transparent. Your iconv approach could work if you swapped
around the encoding names, except you'd probably also have to involve a
$KCODE = 'u' and require 'jcode' to avoid clobbering the possible cases
where in UTF8, 0x09 and 0x2c are part of a multibyte sequence.
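
(A concrete illustration of that hazard, not from the original thread: in
UCS-2LE it is the raw bytes that matter, and the byte 0x09 can occur
inside a perfectly ordinary character.)

$ irb

# U+0915 (DEVANAGARI LETTER KA) is the byte pair 0x15 0x09 in UCS-2LE:
data = "\025\011"
# => "\025\t"

# A byte-wise tr hits the 0x09 inside the character and silently
# turns it into the unrelated code point U+2C15:
data.tr("\t", ',')
# => "\025,"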

David Vallner



Thanks, Ross, for the try, but it is not working.
I tried it on:

"\377\376B\001\363\000|\001k\000o\000\t\000k\000s\000i\000\005\001|\001k\000a\000\t\000c\000z\000B\001o\000w\000i\000e\000k\000\r\000\n\000B\001\005\001k\000a\000\t\000\t\000|\001d\000z\001b\000B\001o\000\r\000\n\000"
which is:

łóżko książka człowiek
łąka żdźbło

-> (the same :))

the conversion should be:
łóżko,książka,człowiek
łąka,żdźbło

but with the Iconv attempt I get:
łóżko,książka,człowiek
਍䈀ԁ欁愀ⰀⰀ簀搁稀戁䈀漁ഀഀ

after swapping utf-8 and ucs-2le in both iconv conversions, I get an
error message:
`iconv': "\377\376B\001¾ |☺k\000o\000\t\000k\000"...
(Iconv::IllegalSequence)

Any other suggestions highly appreciated.

Thanks
chris

···

> On Mon, 2006-12-04 at 22:40 +0900, ciapecki wrote:

I think David is confusing the order of the 'from' and 'to' arguments to Iconv.iconv - they go: (to, from, data). My short example was ill-conceived, though - this might be safer:

$ irb -riconv

s = <the string you show above>

s.gsub(/\t\000(?!\000)/, ",\000")
# => "\377\376B\001\363\000|\001k\000o\000,\000k\000s\000i\000\005\001|\001k\000a\000,\000c\000z\000B\001o\000w\000i\000e\000k\000\r\000\n\000B\001\005\001k\000a\000,\000,\000|\001d\000z\001b\000B\001o\000\r\000\n\000"

(This is:

  łóżko,książka,człowiek
  łąka,żdźbło
)

But I'm not totally sure, so you might be better with iconv anyway:

Iconv.iconv('ucs-2le', 'utf-8', Iconv.iconv('utf-8','ucs-2le', s).first.gsub(/\t/u, ',')).first
# => "\377\376B\001\363\000|\001k\000o\000,\000k\000s\000i\000\005\001|\001k\000a\000,\000c\000z\000B\001o\000w\000i\000e\000k\000\r\000\n\000B\001\005\001k\000a\000,\000,\000|\001d\000z\001b\000B\001o\000\r\000\n\000"

(This too is:

  łóżko,książka,człowiek
  łąka,żdźbło
)

Unless I missed something, this seems to work fine here. Does it work for you?
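
(For completeness, a sketch of the whole round trip under the same
assumptions; 'in.txt' and 'out.txt' are hypothetical names. Both the read
and the write must be binary, which turns out to matter later in this
thread:)

require 'iconv'

# Read the UCS-2LE source in binary mode, so nothing gets translated.
data = File.open('in.txt', 'rb') { |f| f.read }

# Iconv.iconv takes (to, from, data): convert to UTF-8, swap the tabs
# for commas, then convert back to UCS-2LE.
utf8  = Iconv.iconv('utf-8', 'ucs-2le', data).first
fixed = Iconv.iconv('ucs-2le', 'utf-8', utf8.gsub(/\t/u, ',')).first

# Write in binary mode ("wb"); a plain "w" lets Windows expand the
# "\n\000" bytes to "\r\n\000" and corrupts the file.
File.open('out.txt', 'wb') { |f| f.write(fixed) }

(On Ruby 1.9 and later, where Iconv is gone, String#encode does the same
job. UCS-2LE is the BMP-only subset of UTF-16LE, so that name works for
this data:)

data  = File.binread('in.txt').force_encoding('UTF-16LE')
fixed = data.encode('UTF-8').tr("\t", ',').encode('UTF-16LE')
File.binwrite('out.txt', fixed)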

···

On Wed, 06 Dec 2006 12:01:37 -0000, ciapecki <ciapecki@gmail.com> wrote:


--
Ross Bamford - rosco@roscopeco.remove.co.uk

Thanks Ross,

I was that stupid and forgot to open the writable file as binary "wb"
(before I had "w" only)

Thanks again for your help
chris

···

On Wed, 06 Dec 2006 12:01:37 -0000, ciapecki <ciapecki@gmail.com> wrote:

Ross Bamford wrote:

I think David is confusing the order of the 'from' and 'to' arguments to
Iconv.iconv - they go: (to, from, data).

/me puts on dunce hat.

Sorry! I recall always using the command-line iconv specifying them in
from,to order, and apparently that burned deeper into my brain pathways
than it should have.

David Vallner

ciapecki wrote:

/ ...

I was that stupid and forgot to open the writable file as binary "wb"
(before I had "w" only)

Don't kick yourself too hard; the error lies with Microsoft trying to golf
its way out of a thicket of its own making. There never should have been
two standard line endings (actually three if you include the Mac), and
there never should have been two path delimiters either, both of which
cause endless headaches for cross-platform coders.

The reason these variations exist is so someone can say, "my software is
different, unique, patentable, now you have to pay me for it." Even if the
differences convey no benefit to the users.

···

--
Paul Lutus
http://www.arachnoid.com

I wouldn't be too quick with that - it gets me every time I use iconv too...

···

On Wed, 06 Dec 2006 21:22:31 -0000, David Vallner <david@vallner.net> wrote:

Ross Bamford wrote:

I think David is confusing the order of the 'from' and 'to' arguments to
Iconv.iconv - they go: (to, from, data).

/me puts on dunce hat.

--
Ross Bamford - rosco@roscopeco.remove.co.uk

Another question following up:
is there a way to find out what encoding a file is in (is it
ucs-2le or utf-8)?
When I open a file in Vim I can check it with :set fileencoding,
so there must be some way to recognize a file's encoding.

Thanks
chris

Paul Lutus <nospam@nosite.zzz> writes:


The reason these variations exist is so someone can say, "my
software is different, unique, patentable, now you have to pay me
for it." Even if the differences convey no benefit to the users.

No, the reason is that CP/M had no tty concept, and consequently no
automatic LF->CRLF translation (and CRLF is required on printers).
Also, forward slashes were used in CP/M as option lead-ins (CP/M, not
having named directories, did not need to use forward slashes for
those).

This legacy is from long before POSIX, in fact, from long before C.

···

--
David Kastrup, Kriemhildstr. 15, 44793 Bochum

Paul Lutus wrote:


Don't kick yourself too hard, the error lies with Microsoft trying to golf
its way out of a thicket of its own making.

Not fair to MS, in this case; they simply copied DR, who had copied DEC. (And the CRLF ending is, arguably, the most faithful to the ASCII design.) It was only as of MS-DOS 2.0 that MS started the long uphill road to kinda-sorta Unix compatibility, and by then it was too late to change, just as it was too late to use "/" as a directory separator.

(To an IBM mainframe programmer, after all, all three line-ending methods look stupid. In mainframes, files are made up of discrete records -- like rows in an SQL database -- and aren't terminated by any byte value at all.)

···

--
John W. Kennedy
"The blind rulers of Logres
Nourished the land on a fallacy of rational virtue."
   -- Charles Williams. "Taliessin through Logres: Prelude"

David Kastrup wrote:


No, the reason is that CP/M had no tty concept, and consequently no
automatic LF->CRLF translation (and CRLF is required on printers).
Also, forward slashes were used in CP/M as option lead-ins (CP/M, not
having named directories, did not need to use forward slashes for
those).

This legacy is from long before POSIX, in fact, from long before C.

Hrm, I also recall once knowing why the separate text / binary file
handling was around. Something to do with some DOS programming
environment and line-oriented text processing that was efficient by a
measure that could only have mattered enough on the hardware of the day
to warrant such a design wart.

I don't think there's any distinction between the file modes on the OS
level anymore, but programming language runtimes interpret the absence
of the 'b' flag as "translate newlines" to only have to internally
support one convention and avoid having to have every text manipulation
routine handle the difference gracefully.
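
(A small sketch of that behaviour, assuming a Ruby on Windows; on Unix
both reads return the same bytes:)

# Write two CRLF-terminated lines in binary mode, so they go out verbatim.
File.open('crlf.txt', 'wb') { |f| f.write("one\r\ntwo\r\n") }

File.open('crlf.txt', 'r')  { |f| f.read }
# => "one\ntwo\n"       (text mode translates \r\n to \n)

File.open('crlf.txt', 'rb') { |f| f.read }
# => "one\r\ntwo\r\n"   (binary mode leaves the bytes alone)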

The blurb about preserving the idiosyncrasies as a business strategy is
hilarious. Also patent nonsense and FUD :wink:

David Vallner

ciapecki wrote:

Another question following up:
is there a way to find out what encoding a file is in (is it
ucs-2le or utf-8)?
When I open a file in Vim I can check it with :set fileencoding,
so there must be some way to recognize a file's encoding.

The fact that you can choose a particular encoding doesn't mean that
encoding is innate to the file. In the case of a Unicode text file without
an identifying header, strictly speaking it is not possible to determine
the encoding -- I mean, apart from a human being using common sense and
text recognition.

···

--
Paul Lutus
http://www.arachnoid.com

I've not used it myself so I can't say it'll definitely do what you need, but NKF from the standard library might get you closer:

  http://ruby-doc.org/stdlib/libdoc/nkf/rdoc/index.html
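
(An untested sketch: NKF.guess returns its best guess at a string's
encoding. NKF is tuned for Japanese codings -- JIS, EUC, Shift_JIS, the
UTF family -- so treat the answer as a hint, and note it won't name
ucs-2le as such:)

$ irb -rnkf

data = File.open('test', 'rb') { |f| f.read }
NKF.guess(data)
# an NKF constant on Ruby 1.8 (e.g. NKF::UTF8), an Encoding object on 1.9+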

···

On Wed, 06 Dec 2006 19:06:51 -0000, ciapecki <ciapecki@gmail.com> wrote:

Another question following up:
is there a way to find out what encoding a file is in (is it
ucs-2le or utf-8)?
When I open a file in Vim I can check it with :set fileencoding,
so there must be some way to recognize a file's encoding.

--
Ross Bamford - rosco@roscopeco.remove.co.uk

David Vallner wrote:

/ ...

The reason these variations exist is so someone can say, "my
software is different, unique, patentable, now you have to pay me
for it." Even if the differences convey no benefit to the users.

No, the reason is that CP/M had no tty concept, and consequently no
automatic LF->CRLF translation (and CRLF is required on printers).

That was the original reason, yes, but it is hard to justify it this far
down the road, with what are ostensibly sophisticated operating systems,
unless the idea is to enshrine a handful of bad choices forever.

Also, forward slashes were used in CP/M as option lead-ins (CP/M, not
having named directories, did not need to use forward slashes for
those).

Actually, yes, I remember this also. There were no hierarchical directory
trees at first (at least not in CP/M), so the ambiguity between command
delimiters and path delimiters didn't come up until it was too late to
settle on something more universal.

Hrm, I also recall once knowing why the separate text / binary file
handling was around. Something to do with some DOS programming
environment and line-oriented text processing that was efficient by a
measure that could only have mattered enough on the hardware of the day
to warrant such a design wart.

I recall the line ending management issue came up because C (and, later,
C++) used Unix line endings internally and therefore converted any text
files on the fly as they were read or written. But, because some files were
binary, not text, it became necessary to tell the file reader/writer
routines whether or not this behavior was desired.

I don't think there's any distinction between the file modes on the OS
level anymore, but programming language runtimes interpret the absence
of the 'b' flag as "translate newlines" to only have to internally
support one convention and avoid having to have every text manipulation
routine handle the difference gracefully.

But only on Windows. The presence or absence of the "b" flag has no effect
on other platforms, which don't convert line endings. Except possibly the
Mac -- I don't know how Macintoshes deal with this, IIRC they have \r as a
line ending.

The blurb about preserving the idiosyncrasies as a business strategy is
hilarious. Also patent nonsense and FUD :wink:

One can't help thinking this is the real motive behind a lot of this stuff.

···

--
Paul Lutus
http://www.arachnoid.com

Paul Lutus schrieb:


Hi Paul,

in Vim, :set fileencoding does not only set the file encoding; run the
way I wrote it, it also shows the current one. So when I open a utf-8
file and enter :set fileencoding I get utf-8, and when I open a ucs-2le
file I get ucs-2le. I do not know how it recognizes them, but the same
thing happens (though not always) in Microsoft Notepad: when you open a
file which is in UTF-8, Notepad marks UTF-8 as the encoding, and when
the file is ucs-2le, it marks Unicode as the encoding.
So there must be something characteristic in those files.

chris

ciapecki wrote:


In a case like that, AFAIK, there is a header present in the Unicode
case, and the editor defaults to utf-8 in the absence of a header. If this
were not true, the program would not be able to distinguish between valid
utf-8 and the content of a Unicode file read one byte at a time.

···

--
Paul Lutus
http://www.arachnoid.com

Paul Lutus wrote:

David Vallner wrote:

/ ...

The reason these variations exist is so someone can say, "my
software is different, unique, patentable, now you have to pay me
for it." Even if the differences convey no benefit to the users.

No, the reason is that CP/M had no tty concept, and consequently no
automatic LF->CRLF translation (and CRLF is required on printers).

That was the original reason, yes, but it is hard to justify it this far
down the road, with what are ostensibly sophisticated operating systems,
unless the idea is to enshrine a handful of bad choices forever.

Backwards binary compatibility for decades is one of the MS hallmarks.
It's entrenched rather than enshrined - first it was necessary as a
business objective to support and interoperate with arbitrary DOS
software on WinNT. I think -everyone- saw an office worker sit at a text
mode FoxPro app long after Win98 was ubiquitous. Keeping the legacy
behaviour as the default was an easier path for vendors developing new
applications that would work with textual output from older versions
(letting them reuse old code), instead of making them handle differences
gracefully. Interoperability with other operating systems was simply
unimportant, either as a goal to achieve or to avoid; *nix occupied a
share of the market that didn't matter when the NT architecture came
to be.

Hrm, I also recall once knowing why the separate text / binary file
handling was around. Something to do with some DOS programming
environment and line-oriented text processing that was efficient by a
measure that could only have mattered enough on the hardware of the day
to warrant such a design wart.

I recall the line ending management issue came up because C (and, later,
C++) used Unix line endings internally and therefore converted any text
files on the fly as they were read or written. But, because some files were
binary, not text, it became necessary to tell the file reader/writer
routines whether or not this behavior was desired.

Which was hilarious fun with buggy web browsers interpreting compressed
archives as text, thoroughly trashing them.

I don't think there's any distinction between the file modes on the OS
level anymore, but programming language runtimes interpret the absence
of the 'b' flag as "translate newlines" to only have to internally
support one convention and avoid having to have every text manipulation
routine handle the difference gracefully.

But only on Windows. The presence or absence of the "b" flag has no effect
on other platforms, which don't convert line endings. Except possibly the
Mac -- I don't know how Macintoshes deal with this, IIRC they have \r as a
line ending.

I'd expect the Mac side to be worse. IIRC, Mac OS Classic had CR as a
line ending, and the OS X Classic subsystem apps still do. OS X uses
the (POSIX-specified, I think) LF. Then there's also the PPC -> Intel
switch, so together you have three combinations of expected line
endings and byte endianness to handle at some level. Someone with more
detailed knowledge about Macsen might know how that's handled.

The blurb about preserving the idiosyncrasies as a business strategy is
hilarious. Also patent nonsense and FUD :wink:

One can't help thinking this is the real motive behind a lot of this stuff.

Well, my phrasing there was wrong; it was indeed a business strategy.
The motivation, however, was compatibility: Windows aims to achieve vendor
lock-in by the range of exclusively available software, not by low-level
technical idiosyncrasies (unless you count emergent consequences). This
includes both providing attractive tools (like .NET being installed via
Automatic Updates and an essential Vista component - a brilliant move
from Redmond, actually) and ensuring existing software keeps working
(read: potentially making more money for the vendor without having to be
updated).

David Vallner

ciapecki wrote:

So there must be something characteristic in those files.

chris

Byte order marks[1]? They're a hack of sorts that you can abuse to
indicate "This file is in Unicode encoding $FOO" in a text-file context.
However, they're a form of in-band signalling, and therefore a potential
Bad Thing depending on what the data will be passing through.
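
(A minimal BOM sniff as a sketch; note that chris's sample string above
begins "\377\376", which is exactly the UCS-2LE/UTF-16LE byte order mark.
Files without a BOM stay ambiguous, as Paul said -- this only reads an
in-band marker when one happens to be present:)

def guess_bom(path)
  head = File.open(path, 'rb') { |f| f.read(4) }.to_s.unpack('C*')
  if    head[0, 3] == [0xEF, 0xBB, 0xBF]       then 'UTF-8 (with BOM)'
  elsif head[0, 4] == [0xFF, 0xFE, 0x00, 0x00] then 'UTF-32LE'
  elsif head[0, 2] == [0xFF, 0xFE]             then 'UTF-16LE / UCS-2LE'
  elsif head[0, 2] == [0xFE, 0xFF]             then 'UTF-16BE / UCS-2BE'
  else  'no BOM -- encoding unknown'
  end
end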

David Vallner

[1]: http://unicode.org/faq/utf_bom.html (FAQ - UTF-8, UTF-16, UTF-32 & BOM)