Extract Text from PDF

Hi,

Does anyone know a way to extract plain text from a PDF using Ruby?

Many Thanks,

~ Mark


--
Posted via http://www.ruby-forum.com/.

IIRC there is a project under way to extend PDFWriter with reading capabilities. I don't know the current status of that. HTH

  robert


On 13.04.2007 14:06, Mark Dodwell wrote:

Does anyone know a way to extract plain text from a PDF using Ruby?

Hi,


2007/4/13, Mark Dodwell <seo@mkdynamic.co.uk>:

Does anyone know a way to extract plain text from a PDF using Ruby?

You can use Ruby/Poppler:
  http://ruby-gnome2.sourceforge.jp/hiki.cgi?Ruby%2FPoppler

Here is an example of how to do that:
  CVS Info for project ruby-gnome2

Thanks,
--
kou

Robert Klemme wrote:


On 13.04.2007 14:06, Mark Dodwell wrote:

Does anyone know a way to extract plain text from a PDF using Ruby?

IIRC there is a project under way to extend PDFWriter with reading
capabilities. I don't know the current status of that. HTH

In the meantime, you could use the command-line tools pdf2ps and ps2ascii
(I think they use Ghostscript as a backend), and read the resulting
ASCII file with Ruby in the usual way.
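A rough sketch of that two-step approach, assuming pdf2ps and ps2ascii are installed and on the PATH. The method names and the temporary file name are made up for illustration, and this version captures the output of ps2ascii directly instead of writing an intermediate text file:

```ruby
require "open3"
require "tmpdir"

# Build both command lines as argument arrays, so no shell quoting is needed.
# (Hypothetical helper; the names are illustrative only.)
def ghostscript_commands(pdf_path, ps_path)
  [
    ["pdf2ps", pdf_path, ps_path], # PDF -> PostScript file
    ["ps2ascii", ps_path]          # PostScript -> plain text on stdout
  ]
end

# Convert a PDF to text by running pdf2ps, then ps2ascii.
# Raises if either tool fails; Errno::ENOENT is raised if a tool is missing.
def pdf_to_text_via_ghostscript(pdf_path)
  Dir.mktmpdir do |dir|
    ps_path = File.join(dir, "out.ps")
    to_ps, to_ascii = ghostscript_commands(pdf_path, ps_path)

    output, status = Open3.capture2e(*to_ps)
    raise "pdf2ps failed: #{output}" unless status.success?

    text, status = Open3.capture2(*to_ascii)
    raise "ps2ascii failed for #{pdf_path}" unless status.success?

    text
  end
end
```

Usage would be something like `puts pdf_to_text_via_ghostscript("report.pdf")`, where "report.pdf" is a placeholder file name.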

Regards,

Chris


Robert Klemme wrote:

Does anyone know a way to extract plain text from a PDF using Ruby?

IIRC there is a project under way to extend PDFWriter with reading capabilities. I don't know the current status of that. HTH

    robert

At least on Linux, there is "pdftotext", which is part of the "poppler" package. So you can simply shell out to it if it's installed. If you're more ambitious, you could write an extension to use the underlying libraries in poppler.
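For what it's worth, shelling out to pdftotext could look roughly like the following. This is only a sketch: it assumes pdftotext is on the PATH, the helper names are hypothetical, and it relies on pdftotext writing to standard output when the output file is given as "-":

```ruby
require "open3"

# True if an executable file named pdftotext exists in some PATH directory.
def pdftotext_available?
  ENV.fetch("PATH", "").split(File::PATH_SEPARATOR).any? do |dir|
    candidate = File.join(dir, "pdftotext")
    File.file?(candidate) && File.executable?(candidate)
  end
end

# Argument array for pdftotext; "-" as the output name means stdout.
def pdftotext_command(pdf_path)
  ["pdftotext", pdf_path, "-"]
end

# Return the plain text of a PDF, or raise if the tool is missing or fails.
def extract_text(pdf_path)
  raise "pdftotext not found on PATH" unless pdftotext_available?

  text, status = Open3.capture2(*pdftotext_command(pdf_path))
  raise "pdftotext failed for #{pdf_path}" unless status.success?

  text
end
```

Passing the command as an argument array rather than a single string avoids trouble with spaces or shell metacharacters in file names.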


--
M. Edward (Ed) Borasky, FBG, AB, PTA, PGS, MS, MNLP, NST, ACMC(P)
http://borasky-research.net/

If God had meant for carrots to be eaten cooked, He would have given rabbits fire.

The trouble is that a PDF is not always the same kind of thing. Sometimes there is no text at all in a PDF: it can be all vector art outlines, or even all raster images. There is never a guarantee that you will get any or all of the text that is otherwise human-readable in a PDF. PDF has really become a kitchen-sink format, so it is good to anticipate trouble when parsing PDF files.
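One cheap way to anticipate that trouble is to check whether the extracted text contains anything at all before trusting it. The sketch below assumes the text came from a tool such as pdftotext that separates pages with a form feed character; the method name and the `min_chars` threshold are made up for illustration:

```ruby
# Given text extracted from a PDF, return the 1-based numbers of pages that
# contain fewer than min_chars letters or digits. Such pages are probably
# images or outlined text. Assumes pages are separated by form feeds ("\f").
def pages_without_text(extracted, min_chars: 1)
  pages = extracted.split("\f", -1)
  # Drop the empty remainder after the final form feed, if there is one.
  pages.pop if pages.last && pages.last.strip.empty?

  pages.each_index
       .select { |i| pages[i].scan(/[[:alnum:]]/).size < min_chars }
       .map { |i| i + 1 }
end

sample = "Hello world\f\f  \n\fBye\f"
p pages_without_text(sample) # prints [2, 3]
```

An empty result does not prove the text is complete or correct, only that every page produced some characters.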