PDF::Writer and Unicode

According to the current manual, PDF documents generated by PDF::Writer can use UTF-16BE, but after a few trials with iconv I can't get my UTF-8 strings to render correctly. Example:

   $KCODE = 'u'

   require 'rubygems'
   require 'pdf/writer'
   require 'iconv'

   str = Iconv.iconv('UTF-16BE', 'UTF-8', 'á ß €')
   pdf = PDF::Writer.new

   # renders á and ß right, but not €
   pdf.text str

   # same output with garbage prepended
   pdf.text "\xfe\xff#{str}"
   pdf.save_as('unicode_test.pdf')

The manual does not document whether any encoding is needed for select_font. I've played around with variations of

   # gives complete garbage
   pdf.select_font 'Times-Roman', :encoding => 'UTF-16BE'

without luck.

TextMate is generating UTF-8 source files for sure. Any ideas?

-- fxn

Xavier Noria wrote:

The manual does not document if any encoding is needed for select_font,
I've played around with variations of

  # gives complete garbage
  pdf.select_font 'Times-Roman', :encoding => 'UTF-16BE'

without luck.

  I'm not familiar with PDF::Writer, but I would be surprised if you
really had all the glyphs for 'UTF-16BE' by default. What is the exact
output? Does it produce a PDF file, or does it simply fail with an
exception or crash?

  If a PDF file is produced (of reasonable size), would you mind posting
it?

  Cheers,

  Vince

--
Vincent Fourmond, PhD student (not for long anymore)
http://vincent.fourmond.neuf.fr/

The manual is incorrect; I have recently figured out how to write
UTF-16 strings, but the current PDF::Writer doesn't do this (and there
are issues that I need to resolve before this will even show up in any
release of PDF::Writer).

-austin

On 2/16/07, Xavier Noria <fxn@hashref.com> wrote:

According to the current manual PDF documents generated by
PDF::Writer can use UTF-16BE, but after a few trials with iconv I
can't get my UTF-8 strings right. Example:

--
Austin Ziegler * halostatue@gmail.com * http://www.halostatue.ca/
               * austin@halostatue.ca * You are in a maze of twisty little passages, all alike. // halo • statue
               * austin@zieglers.ca

Sure, it's just 4KB. This is the PDF generated by

   $KCODE = 'u'

   require 'rubygems'
   require 'pdf/writer'
   require 'iconv'

   str = Iconv.iconv('UTF-16BE', 'UTF-8', 'á ß €')
   pdf = PDF::Writer.new
   pdf.text str
   pdf.text "\xfe\xff#{str}"
   pdf.save_as('unicode_test.pdf')

As you see, the glyph we get wrong in this small test is the euro symbol. This is important to me because not only is my database in UTF-8, fed by an unrestricted UTF-8 frontend (a website), but the application also handles money here and there and needs to be able to output that currency symbol.

-- fxn

unicode_test.pdf (1.03 KB)


Xavier Noria wrote:

As you see, the glyph we get wrong in this small test is the euro
symbol. This is important to me because not only is my database in
UTF-8, fed by an unrestricted UTF-8 frontend (a website), but the
application also handles money here and there and needs to be able to
output that currency symbol.

  Actually, what you see on the screen is the latin1 representation of
your UTF-16BE string (see below). ^@ means chr 0, which seems to be
ignored by PDF viewers, and UTF-16BE has the good taste to map to latin1
for values up to 255. Here is what less unicode_test.pdf shows me (I'm
on a latin1 locale):

BT 36.000 744.440 Td /F1 10.0 Tf 0 Tr (^@á^@ ^@ß^@ ¬) Tj ET
BT 36.000 732.880 Td /F1 10.0 Tf 0 Tr (þÿ^@á^@ ^@ß^@ ¬) Tj ET
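
The byte-level effect described here can be reproduced without PDF::Writer. What follows is only a minimal sketch, assuming a modern Ruby (1.9 or later) where String#encode stands in for the iconv calls used earlier in the thread; the helper names hex_bytes and latin1_view are made up for illustration. It prints the UTF-16BE bytes of the test string and then shows how a reader that treats each byte as latin1 and drops the zero bytes would see them: the euro sign (U+20AC) becomes the two bytes 0x20 and 0xAC, i.e. a space followed by "¬".

```ruby
# encoding: utf-8

# Hypothetical helper: render the bytes of a string as space-separated hex.
def hex_bytes(str)
  str.bytes.map { |b| format("%02X", b) }.join(" ")
end

# Hypothetical helper: reinterpret the raw bytes as latin1 (one byte per
# character) and drop the NUL bytes, roughly what the viewers seem to do.
def latin1_view(raw)
  raw.dup.force_encoding("ISO-8859-1").encode("UTF-8").delete("\u0000")
end

utf16 = "á ß €".encode("UTF-16BE")

puts hex_bytes(utf16)    # 00 E1 00 20 00 DF 00 20 20 AC
puts latin1_view(utf16)  # á and ß survive, the euro turns into " ¬"
```

This is why á and ß look right by accident: their UTF-16BE low bytes happen to equal their latin1 codes, while the euro's do not.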

  Moreover, in this particular case, you are using the built-in
Helvetica font, and I'm pretty sure it doesn't have a glyph for the Euro
symbol. Finally, acroread says that the encoding of the font is 'ansi'.
That is definitely not what you want. Keep in mind that most fonts
(just about everywhere) are defined for a small encoding (ansi/latin1,
or other 8-bit encodings). Unfortunately, I don't think I can help you
further. If you don't rely too much on PDF::Writer yet, you could use
pdfLaTeX as an alternative, although the PDFs it produces will be
significantly bigger (for small files)...
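
Since most fonts are tied to a small 8-bit encoding, one practical check is to ask which characters of a UTF-8 string cannot be expressed in that encoding at all. The following is a rough sketch under stated assumptions: it uses Ruby's built-in String#encode (Ruby 1.9 or later, not the iconv of the original posts), the helper names are hypothetical, and it says nothing about what PDF::Writer itself does with the result.

```ruby
# encoding: utf-8

# Hypothetical helper: true if the character exists in the target encoding.
def representable?(char, encoding)
  char.encode(encoding)
  true
rescue Encoding::UndefinedConversionError
  false
end

# Hypothetical helper: list the distinct characters of text that the
# target 8-bit encoding cannot express. Assumes text is valid UTF-8.
def unrepresentable_chars(text, encoding)
  text.each_char.reject { |ch| representable?(ch, encoding) }.uniq
end

p unrepresentable_chars("á ß €", "ISO-8859-1")    # the euro is reported
p unrepresentable_chars("á ß €", "Windows-1252")  # nothing is reported
```

The euro is missing from latin1 (ISO-8859-1) but present in Windows-1252, so "ansi" and "latin1" are not interchangeable for this particular symbol.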

  Welcome to the nightmare world of fonts and encodings...

  Vince


Vincent Fourmond wrote:

  Welcome to the nightmare world of fonts and encodings...

... and PDF generation in Ruby.

If this helps, you can see me struggling with the same
problem here:

http://groups.google.de/group/comp.lang.ruby/browse_thread/thread/54336c6a932903fe/f0bb48520dac2ba5

I ended up using libharu (http://libharu.sourceforge.net/)

It is cross-platform, FAST and has Ruby bindings (it is a little bit
clumsy to use and the Ruby bindings are missing some functions, but
it is the best I could find)

example:

-----------------------------------------------------------------------
require "hpdf"

pdf = HPDFDoc.new
font = pdf.get_font("Helvetica", "CP1254")

page = pdf.add_page

page.set_size(HPDFDoc::HPDF_PAGE_SIZE_A4, HPDFDoc::HPDF_PAGE_PORTRAIT)
page.set_font_and_size(font, 96)

page.begin_text

page.move_text_pos(100, 700)
page.show_text("\x80")

page.end_text

pdf.save_to_file "c:/temp/test.pdf"
-----------------------------------------------------------------------

With a little love to the wrapper this could be really good...

cheers

Simon

Austin explained the issue. But just to understand that remark: is the Helvetica in the PDF different from the Helvetica I use on my system? The Helvetica on my Mac certainly has the euro symbol.

-- fxn

On Feb 16, 2007, at 2:49 PM, Vincent Fourmond wrote:

  Moreover, in this particular case, you are using the built-in
Helvetica font, and I'm pretty sure it doesn't have a glyph for the Euro
symbol.

Xavier Noria wrote:

Austin explained the issue. But just to understand that remark: is the
Helvetica in the PDF different from the Helvetica I use on my system?
The Helvetica on my Mac certainly has the euro symbol.

  Well... It is a long and complex story. A font is (for the PDF
document) just a correspondence (char) -> (nice drawing + metrics). What
we call Helvetica is in reality a fair number of different fonts, which
cover various symbols that have a Helvetica look & feel... Even if a
font is called Helvetica, you can't be sure that it contains all the
glyphs you're interested in. And that's without even mentioning more
delicate things like fonts with Chinese or Russian characters... I
didn't mean to exaggerate when I wrote 'nightmare' ;-)!

  But, in this particular case, I was wrong ;-)... I checked the PDF
documentation, which specifies char codes for the Euro symbol. The
real problem was that the font encoding wasn't the right one. I manually
tweaked the file until I got the symbol to display. It shows the
problems with encodings and fonts: I spent a long time trying to get
the char \240 displayed as a Euro until I realised the encoding wasn't
quite the right one and \240 meant 'non-breaking space'! I attached the
file just as an example.
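
The \240 confusion is easy to check from Ruby. Below is a small sketch, again assuming Ruby 1.9 or later with String#encode; byte_for is a made-up helper name. It prints the byte that several single-byte encodings assign to the euro and confirms that byte \240 (0xA0) read as latin1 is the non-breaking space. Ruby has no converter for the PDF-specific encodings, so those are not covered; as far as I can tell from the encoding tables in the PDF Reference, \240 is the euro in PDFDocEncoding while WinAnsiEncoding puts it at \200, which matches the "\x80" in the libharu example above.

```ruby
# encoding: utf-8

EURO = "€"

# Hypothetical helper: the single byte an encoding uses for char,
# or nil if the encoding has no such character.
def byte_for(char, encoding)
  char.encode(encoding).bytes.first
rescue Encoding::UndefinedConversionError
  nil
end

%w[Windows-1252 Windows-1254 ISO-8859-15 ISO-8859-1].each do |enc|
  byte = byte_for(EURO, enc)
  label = byte ? format("0x%02X (octal \\%o)", byte, byte) : "not representable"
  puts "#{enc}: #{label}"
end

# Byte 0xA0 (octal 240) interpreted as latin1 is U+00A0, the non-breaking space.
nbsp = [0xA0].pack("C").force_encoding("ISO-8859-1").encode("UTF-8")
puts nbsp == "\u00A0"
```

The same byte value can therefore be a euro, a non-breaking space, or nothing at all, depending on which encoding the font object in the PDF declares.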

  Cheers

  Vince

euro_symbol.pdf (1006 Bytes)
