Ruby PDF text extractor

Kevin_Olbrich · 13 August 2005 17:01

I notice that Ruby has lots of tools for creating PDF files. Are there any
that let you extract text from a PDF file?

_Kevin

Austin_Ziegler5 · 13 August 2005 17:45

Not yet. PDF::Writer will be refactored a little bit for version 2.0
(coming out later this year) so that it will be three separate
components: PDF::Core (the core objects representing a PDF object in
memory, as well as rendering), PDF::Writer (the writer/layout code),
and PDF::Reader (read a PDF object into an in-memory representation).
Much of the code to do PDF::Core is already in place (it's currently
called PDF::Writer::Object or PDF::Writer::Objects), but there's
nothing explicitly present to represent this.

PDF::Reader will probably be released in early 2006, depending on how
long it takes to refactor the code that already exists, properly
extend it, and get the necessary PDF::Writer code finished.

-austin

--
Austin Ziegler * halostatue@gmail.com
* Alternate: austin@halostatue.ca

Andreas_Schrafl · 16 August 2005 23:53

I once wrote a Ruby PDF text extractor while working at ywesee.

I thought they released it on RubyForge, but I can't find it anymore.
Perhaps if you contact them they can help you.
www.ywesee.com

Greetings
Andy

Kevin_Olbrich · 13 August 2005 17:59

Thanks, I'll keep my eyes open for it.

_Kevin

Martin_DeMello1 · 17 August 2005 10:21

I'd be interested in helping with this.

martin

Austin Ziegler <halostatue@gmail.com> wrote:

PDF::Reader will probably be released in early 2006, depending on how
long it takes to refactor the code that already exists, properly
extend it, and get the necessary PDF::Writer code finished.
