Ruby PDF text extractor

(Kevin Olbrich) #1

I notice that Ruby has lots of tools for creating PDF files. Are there any
that let you extract text from a PDF file?

_Kevin

(Austin Ziegler) #2

Not yet. PDF::Writer will be refactored a little bit for version 2.0
(coming out later this year) so that it will be three separate
components: PDF::Core (the core objects representing a PDF object in
memory, as well as rendering), PDF::Writer (the writer/layout code),
and PDF::Reader (read a PDF object into an in-memory representation).
Much of the code for PDF::Core is already in place (it's currently
called PDF::Writer::Object or PDF::Writer::Objects), but nothing yet
exposes it explicitly as a separate component.
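
As a rough illustration of the three-way split described above, here is a
minimal sketch. It is not PDF::Writer code: every name under SketchPDF is
invented for this example, and it assumes a toy format in which each piece of
text is a single literal string followed by the PDF "Tj" (show text) operator.
Real PDF files wrap such operators in content streams, often compressed, so
an actual reader needs far more than this.

```ruby
# Hypothetical sketch of a core/writer/reader split. None of these names
# come from PDF::Writer; they only illustrate the division of labour.
module SketchPDF
  # "Core": an in-memory object that knows how to render itself.
  class TextObject
    attr_reader :text

    def initialize(text)
      @text = text
    end

    # Render as a PDF-style literal string followed by the Tj operator.
    # Backslashes and parentheses are escaped with a backslash.
    def render
      escaped = text.gsub(/[\\()]/) { |ch| "\\#{ch}" }
      "(#{escaped}) Tj"
    end
  end

  # "Writer": collects core objects and serialises them.
  class Writer
    def initialize
      @objects = []
    end

    def text(string)
      @objects << TextObject.new(string)
      self
    end

    def to_s
      @objects.map(&:render).join("\n")
    end
  end

  # "Reader": parses serialised data back into core objects.
  class Reader
    def self.parse(data)
      data.scan(/\(((?:\\.|[^\\()])*)\)\s*Tj/).map do |(raw)|
        # Undo the backslash escaping applied by TextObject#render.
        TextObject.new(raw.gsub(/\\(.)/) { $1 })
      end
    end
  end
end
```

With this sketch, SketchPDF::Reader.parse(writer.to_s).map(&:text) gives back
the strings that were passed to the writer, which is the "extract text" round
trip in miniature.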

PDF::Reader will probably be released in early 2006, depending on how
long it takes to refactor the code that already exists, properly
extend it, and get the necessary PDF::Writer code finished.

-austin

On 8/13/05, Kevin Olbrich <kevin.olbrich@duke.edu> wrote:

I notice that Ruby has lots of tools for creating PDF files. Are there any
that let you extract text from a PDF file?

--
Austin Ziegler * halostatue@gmail.com
               * Alternate: austin@halostatue.ca

(Andreas Schrafl) #3

I once wrote a Ruby PDF text extractor while working at ywesee.

I thought they released it on RubyForge, but I can't find it anymore.
Perhaps if you contact them they can help you.
www.ywesee.com

Greetings
Andy

(Kevin Olbrich) #4

Thanks, I'll keep my eyes open for it.

_Kevin

(Martin DeMello) #5

I'd be interested in helping with this.

martin

Austin Ziegler <halostatue@gmail.com> wrote:

PDF::Reader will probably be released in early 2006, depending on how
long it takes to refactor the code that already exists, properly
extend it, and get the necessary PDF::Writer code finished.