Dear all,
is there a way of extracting text from a PDF, if the latter
is in some non-European language, such as Arabic or
Chinese?
Under Linux, I have been able to use Ruby in conjunction
with pdftotext for English and other Latin1 encoded texts -
with some problems sometimes for special characters,
but it doesn't seem to work for Unicode ...
Is there a Ruby way to do this?
Thank you!
Best regards,
Axel
Hi,
> is there a way of extracting text from a PDF, if the latter
> is in some non-European language, such as Arabic or
> Chinese?
> Under Linux, I have been able to use Ruby in conjunction
> with pdftotext for English and other Latin1 encoded texts -
> with some problems sometimes for special characters,
> but it doesn't seem to work for Unicode ...
Which version of pdftotext did you use: the one from Xpdf or the one from poppler?
For texts in encodings other than Latin1, you need to install the
character map files.
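For Unicode output, pdftotext also takes an -enc option. A minimal Ruby sketch of calling it, assuming pdftotext is on the PATH and the character map files for the script in question are installed; the helper names pdftotext_command and extract_text are made up for illustration:

```ruby
require "open3"

# Build the argument list for pdftotext.
# "-enc" selects the output encoding; "-" sends the text to stdout.
def pdftotext_command(pdf_path, encoding: "UTF-8")
  ["pdftotext", "-enc", encoding, pdf_path, "-"]
end

# Run pdftotext and return the extracted text tagged as UTF-8.
# Assumes the pdftotext binary is installed (not checked here).
def extract_text(pdf_path)
  output, status = Open3.capture2(*pdftotext_command(pdf_path))
  raise "pdftotext failed for #{pdf_path}" unless status.success?
  output.force_encoding(Encoding::UTF_8)
end

# Usage (hypothetical file name):
#   puts extract_text("arabic.pdf")
```

Passing the arguments as an array avoids shell quoting problems with unusual file names.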
> Is there a Ruby way to do this?
If poppler itself handles your file without problems, you can use Ruby/Poppler:
http://ruby-gnome2.cvs.sourceforge.net/ruby-gnome2/ruby-gnome2/poppler/sample/pdf2text.rb?revision=HEAD&view=markup
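A minimal sketch of that approach (not the linked sample itself), assuming the Ruby/Poppler binding is installed, that Poppler::Document.new accepts a file path, and that pages respond to get_text; method names may differ between binding versions. The helpers pages_to_text and pdf_text are hypothetical names:

```ruby
# Join the text of every page; works on any enumerable of objects
# that respond to get_text (a Poppler::Document yields its pages).
def pages_to_text(pages)
  pages.map { |page| page.get_text }.join("\n")
end

# Open a PDF with Ruby/Poppler and return its text.
# The binding is required lazily, so it is only needed when this runs.
def pdf_text(pdf_path)
  require "poppler"
  pages_to_text(Poppler::Document.new(pdf_path))
end

# Usage (hypothetical file name):
#   puts pdf_text("chinese.pdf")
```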
Thanks,
2006/11/22, Nuralanur@aol.com <Nuralanur@aol.com>:
--
kou