Text extraction from PDF files (non-European languages)...?

Dear all,

is there a way of extracting text from a PDF, if the latter
is in some non-European language, such as Arabic or
Chinese?
Under Linux, I have been able to use Ruby in conjunction
with pdftotext for English and other Latin1 encoded texts -
with some problems sometimes for special characters,
but it doesn't seem to work for Unicode ...

Is there a Ruby way to do this?

Thank you!

Best regards,

Axel

On 11/21/06, Nuralanur@aol.com <Nuralanur@aol.com> wrote:

is there a way of extracting text from a PDF, if the latter
is in some non-European language, such as Arabic or
Chinese?

rpdf2txt (1) _should_ work with Unicode PDF-Documents. If you run into
any problems let me know, I'm happy to tinker with the beast.

http://download.ywesee.com/rpdf2txt/rpdf2txt-1.0.6.tar.bz2
http://raa.ruby-lang.org/project/rpdf2txt/

hth

Hannes

Hi,

is there a way of extracting text from a PDF, if the latter
is in some non-European language, such as Arabic or
Chinese?
Under Linux, I have been able to use Ruby in conjunction
with pdftotext for English and other Latin1 encoded texts -
with some problems sometimes for special characters,
but it doesn't seem to work for Unicode ...

Which version of pdftotext did you use: the one from Xpdf or the one
from poppler? For texts in encodings other than Latin1, you need to
install the matching character map files.
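For reference, here is a minimal sketch of calling pdftotext from Ruby and asking for UTF-8 output. It assumes that a pdftotext binary (Xpdf or poppler) is on the PATH and that the character map files mentioned above are installed. The method names pdftotext_command and extract_text are made up for this example.

```ruby
require "open3"

# Build the argument list for pdftotext.
# "-enc" selects the output encoding; "-" sends the text to stdout.
def pdftotext_command(pdf_path, encoding: "UTF-8")
  ["pdftotext", "-enc", encoding, pdf_path, "-"]
end

# Run pdftotext and return the extracted text as a Ruby string.
# Assumption: pdftotext is installed and can be found on the PATH.
def extract_text(pdf_path, encoding: "UTF-8")
  out, status = Open3.capture2(*pdftotext_command(pdf_path, encoding: encoding))
  raise "pdftotext failed for #{pdf_path}" unless status.success?
  # Tag the raw bytes with the encoding we asked pdftotext to produce.
  out.force_encoding(encoding)
end
```

Something like `puts extract_text("arabic.pdf")` should then print the text, provided the installed pdftotext can map the fonts in that file to Unicode.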

Is there a Ruby way to do this ?

If poppler itself handles your files without problems, you can use Ruby/Poppler:
  http://ruby-gnome2.cvs.sourceforge.net/ruby-gnome2/ruby-gnome2/poppler/sample/pdf2text.rb?revision=HEAD&view=markup
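Roughly, the idea is to open the document and collect the text page by page. The sketch below is not the linked sample itself. It assumes the Ruby/Poppler binding is installed, that Poppler::Document.new accepts a file path (older releases may want a file:// URI instead), and that each page responds to get_text. The method names document_text and pdf_to_text are made up for this example.

```ruby
# Collect the text of every page. `doc` only needs to respond to #each
# and yield objects that respond to #get_text (assumed for Poppler pages).
def document_text(doc)
  texts = []
  doc.each { |page| texts << page.get_text }
  texts.join("\n")
end

# Open a PDF with Ruby/Poppler and return its text.
def pdf_to_text(pdf_path)
  require "poppler" # Ruby/Poppler binding from Ruby-GNOME2, must be installed
  # Assumption: the constructor takes a plain file path here.
  document_text(Poppler::Document.new(pdf_path))
end
```

Keeping the page loop separate from the Poppler call makes it easy to try the loop on stand-in objects before the binding is installed.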

Thanks,

2006/11/22, Nuralanur@aol.com <Nuralanur@aol.com>:
--
kou