PDF Writer UTF-8 Support

Brian_Schroder1 · 30 March 2005 22:57

Hello,

I'm having a hard time getting PDF Writer to output my UTF-8 encoded
text correctly. Has anybody around here got some tips for me?

thanks a lot,

Brian

--
Brian Schröder
http://ruby.brian-schroeder.de/

Austin_Ziegler5 · 31 March 2005 04:48

Unfortunately, PDF::Writer needs "help" understanding UTF-8 input, and
I have been focussing on a number of basic feature changes before
making this "easy", as it also affects how each font is handled.

I am hoping to have PDF::Writer 1.0 out -- with documentation on how
to do this at all -- in the next two weeks or so. I apologise for the
inconvenience.

-austin

--
Austin Ziegler * halostatue@gmail.com
* Alternate: austin@halostatue.ca

Brian_Schroder1 · 31 March 2005 07:13

Thanks for your reply, austin,

Is there any possibility to output UTF-8 encoded text right now? I
need no fancy fonts or formatting, just some plain text output at
specific x-y coordinates.

best regards and thanks for the great library,

brian

--
Brian Schröder
http://ruby.brian-schroeder.de/

Austin_Ziegler5 · 31 March 2005 14:11

Yes -- but you have to wade through the font encoding mapping
information for PDF documents right now, and you have to be using a
Unicode-capable font. From the PDF 1.6 Reference:

    Font management is primarily concerned with producing the
    correct appearance of text—that is, the shape and placement of
    glyphs. However, it is sometimes necessary for a PDF application
    to extract the meaning of the text, represented in some standard
    information encoding such as Unicode. In some cases, this
    information can be deduced from the encoding used to represent
    the text in the PDF file. Otherwise, the PDF producer
    application should specify the mapping explicitly by including a
    special object, the ToUnicode CMap.

I have not added support for the /ToUnicode CMap in PDF::Writer, but
it may be possible. However:

    Certain strings contain information that is intended to be
    human-readable, such as text annotations, bookmark names,
    article names, document information, and so forth. Such strings
    are referred to as text strings. Text strings are encoded in
    either PDFDocEncoding or Unicode character encoding.
    PDFDocEncoding is a superset of the ISO Latin 1 encoding and is
    documented in Appendix D. Unicode is described in the Unicode
    Standard by the Unicode Consortium (see the Bibliography).

    For text strings encoded in Unicode, the first two bytes must be
    254 followed by 255. These two bytes represent the Unicode byte
    order marker, U+FEFF, indicating that the string is encoded in
    the UTF-16BE (big-endian) encoding scheme specified in the
    Unicode standard. (This mechanism precludes beginning a string
    using PDFDocEncoding with the two characters thorn ydieresis,
    which is unlikely to be a meaningful beginning of a word or
    phrase). Note: Applications that process PDF files containing
    Unicode text strings should be prepared to handle supplementary
    characters; that is, characters requiring more than two bytes to
    represent.

    An escape sequence may appear anywhere in a Unicode text string
    to indicate the language in which subsequent text is written,
    which is useful when the language cannot be determined from the
    character codes used in the text. The escape sequence consists
    of the following elements, in order:

    1. The Unicode value U+001B (that is, the byte sequence 0
       followed by 27)
    2. A 2-character ISO 639 language code—for example, en for
       English or ja for Japanese
    3. (Optional) A 2-character ISO 3166 country code—for example,
       US for the United States or JP for Japan
    4. The Unicode value U+001B

    The complete list of codes defined by ISO 639 and ISO 3166 can
    be obtained from the International Organization for
    Standardization (see the Bibliography).
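
The escape sequence described above can be sketched in Ruby as follows. This assumes a modern Ruby (2.4 or later) with String#encode; the method names language_escape and tagged_text_string are hypothetical and are not part of PDF::Writer.

```ruby
# U+001B, which brackets the language tag inside a Unicode text string.
ESC = "\u001B"

# Build ESC + language code + optional country code + ESC as a UTF-8 string.
# lang is a 2-letter ISO 639 code, country an optional 2-letter ISO 3166 code.
def language_escape(lang, country = nil)
  raise ArgumentError, "bad language code" unless lang.match?(/\A[a-z]{2}\z/)
  unless country.nil? || country.match?(/\A[A-Z]{2}\z/)
    raise ArgumentError, "bad country code"
  end
  "#{ESC}#{lang}#{country}#{ESC}"
end

# Prefix the text with the language escape, encode the whole thing as
# UTF-16BE, and put the byte order marker (bytes 254, 255) in front.
# The input text is assumed to be UTF-8.
def tagged_text_string(lang, text, country = nil)
  body = (language_escape(lang, country) + text).encode(Encoding::UTF_16BE)
  [254, 255].pack("C*") + body.force_encoding(Encoding::BINARY)
end
```

The result is a binary string; how it gets written into the PDF (for example as a literal or hex string object) is left out here.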

So you can't specify UTF-8, but you can specify UTF-16BE if you
provide the 0xFEFF BOM.
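
A minimal sketch of that conversion, assuming a modern Ruby where String#encode is available (Ruby as of 2005 would have needed Iconv instead). The helper name pdf_text_string is hypothetical and not part of PDF::Writer. The quoted passage covers PDF text strings (annotations, bookmarks, document information), so this does not by itself make page text render in a given font.

```ruby
# Byte order marker required at the start of a Unicode text string:
# the two bytes 254 and 255 (U+FEFF in big-endian order).
UTF16BE_BOM = [254, 255].pack("C*")

# Convert a UTF-8 string into BOM + UTF-16BE bytes.
# Returns a binary (ASCII-8BIT) string; the input is left untouched.
def pdf_text_string(utf8)
  UTF16BE_BOM + utf8.encode(Encoding::UTF_16BE).force_encoding(Encoding::BINARY)
end

# Example: each character of "Schröder" becomes two bytes, and characters
# outside the Basic Multilingual Plane become a four-byte surrogate pair.
encoded = pdf_text_string("Schr\u00F6der")
puts encoded.unpack1("H*")
```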

-austin

--
Austin Ziegler * halostatue@gmail.com
* Alternate: austin@halostatue.ca
