Parsing PDF files

Arun_Kumar · 22 August 2009 16:33

Hello all,
Does anyone know a good PDF parser that retains formatting
after extracting text? I used PDF::Reader, but when you extract text you
just get a stream of characters that is not at all intelligible. When I
copy a PDF's contents from a PDF reader into the Gedit text editor on Linux,
the formatting is retained. I'm looking for something like that.

Thanks for any help.

regards,
Arun Kumar M S

--

श्री जानकीरघुनाथो विजयते ||

Greg_Brown1 · 22 August 2009 17:03

This doesn't exist in Ruby, unfortunately.

-greg

Greg_Brown1 · 24 August 2009 11:43

Very interesting, thanks for posting this.

-greg

On Mon, Aug 24, 2009 at 6:25 AM, Erik Terpstra<erik@ruby-lang.nl> wrote:

You can use http://pdftohtml.sourceforge.net or use my Ruby wrapper for
this tool:

eterps/pdf-struct on GitHub (PDF::Extractor, a library that provides high-level access to the text objects of a PDF document)
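
For anyone wanting to try the pdftohtml route by hand, here is a rough sketch of the idea, not the wrapper's actual code. It assumes the tool's XML mode emits `<page>` elements containing `<text top="…" left="…">` runs, and it rebuilds lines by grouping runs with nearly the same `top` and padding them to a column derived from `left`. The names `Fragment`, `fragments_by_page` and `layout_lines`, and the `tolerance` and `char_width` defaults, are made up for illustration and would need tuning per document.

```ruby
require "rexml/document"

# One positioned run of text (hypothetical helper type).
Fragment = Struct.new(:top, :left, :text)

# Collect the text of an element, including nested <b>/<i> children.
def inner_text(element)
  element.children.map do |child|
    case child
    when REXML::Element then inner_text(child)
    when REXML::Text    then child.value
    else ""
    end
  end.join
end

# Turn XML of the assumed shape into one array of Fragments per page.
def fragments_by_page(xml)
  doc = REXML::Document.new(xml)
  pages = []
  doc.elements.each("//page") do |page|
    frags = []
    page.elements.each("text") do |t|
      frags << Fragment.new(t.attributes["top"].to_i,
                            t.attributes["left"].to_i,
                            inner_text(t))
    end
    pages << frags
  end
  pages
end

# Group fragments into rows by vertical position, then pad each fragment
# out to a text column computed from its horizontal position.
def layout_lines(fragments, tolerance: 2, char_width: 6)
  rows = []
  fragments.sort_by { |f| [f.top, f.left] }.each do |f|
    if rows.any? && (f.top - rows.last.first.top).abs <= tolerance
      rows.last << f
    else
      rows << [f]
    end
  end

  rows.map do |row|
    line = +""
    row.sort_by(&:left).each do |f|
      column = f.left / char_width
      # Keep at least one space between neighbouring fragments.
      line << " " * [column - line.length, line.empty? ? 0 : 1].max
      line << f.text
    end
    line.rstrip
  end
end

# Usage sketch (needs the pdftohtml binary; the flags are an assumption,
# check them against your version's help output):
#   xml = IO.popen(["pdftohtml", "-xml", "-i", "-stdout", "input.pdf"], &:read)
#   fragments_by_page(xml).each { |frags| puts layout_lines(frags) }
```

This only approximates layout with a fixed character width, so proportional fonts and multi-column pages will still need manual adjustment.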

Arun_Kumar · 22 August 2009 20:10

That's really very sad.

Greg_Brown1 · 22 August 2009 21:11

Looks like you'd better roll up your sleeves.

Arun_Kumar · 23 August 2009 08:10

Yeah, I'm seeing what can be done.
