Text extraction from PDF files (non-European languages)...?

Nuralanur · 21 November 2006 17:03

Dear all,

is there a way of extracting text from a PDF, if the latter
is in some non-European language, such as Arabic or
Chinese?
Under Linux, I have been able to use Ruby in conjunction
with pdftotext for English and other Latin1 encoded texts -
with some problems sometimes for special characters,
but it doesn't seem to work for Unicode ...

Is there a Ruby way to do this?

Thank you!

Best regards,

Axel

Hannes_Wyss · 21 November 2006 17:15

Axel

On 11/21/06, Nuralanur@aol.com <Nuralanur@aol.com> wrote:

> is there a way of extracting text from a PDF, if the latter
> is in some non-European language, such as Arabic or
> Chinese?

rpdf2txt (1) _should_ work with Unicode PDF documents. If you run into
any problems, let me know; I'm happy to tinker with the beast.

http://download.ywesee.com/rpdf2txt/rpdf2txt-1.0.6.tar.bz2
http://raa.ruby-lang.org/project/rpdf2txt/

hth

Hannes

Kouhei_Sutou1 · 22 November 2006 00:52

Hi,

> is there a way of extracting text from a PDF, if the latter
> is in some non-European language, such as Arabic or
> Chinese?
> Under Linux, I have been able to use Ruby in conjunction
> with pdftotext for English and other Latin1 encoded texts -
> with some problems sometimes for special characters,
> but it doesn't seem to work for Unicode ...

Which pdftotext did you use, the one from Xpdf or the one from poppler?
For texts in encodings other than Latin1, you need to install the
character map files.
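
For illustration, here is a minimal sketch of calling pdftotext from
Ruby and asking for UTF-8 output. It assumes a pdftotext binary on the
PATH that accepts the -enc option and has the character map files for
the language installed (with poppler these usually come from the
poppler-data package). The helper names pdftotext_command and
extract_text, and the file name arabic.pdf, are made up for this
example.

```ruby
require "open3"

# Build the argument list for pdftotext (hypothetical helper).
# "-enc UTF-8" selects the output encoding; "-" writes the text to stdout.
def pdftotext_command(pdf_path)
  ["pdftotext", "-enc", "UTF-8", pdf_path, "-"]
end

# Run pdftotext and return the extracted text as a UTF-8 string.
# Raises if the tool exits with a non-zero status.
def extract_text(pdf_path)
  stdout, stderr, status = Open3.capture3(*pdftotext_command(pdf_path))
  raise "pdftotext failed: #{stderr.strip}" unless status.success?
  stdout.force_encoding(Encoding::UTF_8)
end

# Usage ("arabic.pdf" is a placeholder file name):
#   puts extract_text("arabic.pdf")
```

Passing the arguments as separate strings avoids going through a shell,
so file names with spaces or quotes need no escaping. Whether the
result is readable still depends on the fonts and encodings embedded
in the PDF.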

> Is there a Ruby way to do this?

You can use Ruby/Poppler, provided poppler itself handles your PDF without problems:
http://ruby-gnome2.cvs.sourceforge.net/ruby-gnome2/ruby-gnome2/poppler/sample/pdf2text.rb?revision=HEAD&view=markup

Thanks,

2006/11/22, Nuralanur@aol.com <Nuralanur@aol.com>:
--
kou
