PDF reader gems

7stud2 · 5 March 2014 17:33

Hi,

I want to parse a PDF document using Ruby. I found one gem:
https://github.com/yob/pdf-reader

Are there any other good gems with a strong API that make parsing PDF
data easier?

Please share your opinions.

--
Posted via http://www.ruby-forum.com/.

Wayne_Brisette · 5 March 2014 17:54

Arup:

I use pdf-reader for several of my scripts. If you're after just reading the PDF, then that's the one I'd stick with. Is there something in particular you don't understand?

Wayne

7stud2 · 5 March 2014 17:59

Wayne Brisette wrote in post #1138922:

Arup:

I use pdf-reader for several of my scripts. If you're after just reading
the PDF, then that's the one I'd stick with. Is there something in
particular you don't understand?

Wayne

Thanks, Wayne. I have a requirement that I need to script. I am new to
this gem: I only found it online today and am still looking through its
examples. I want to collect all the data from the PDF into a CSV file.


7stud2 · 6 March 2014 11:32

Wayne Brisette wrote in post #1138922:

Arup:

I use pdf-reader for several of my scripts. If you're after just reading
the PDF, then that's the one I'd stick with. Is there something in
particular you don't understand?

@Wayne - Does a PDF file internally hold any XML objects for the data it
displays? If so, I could use Nokogiri to parse them.


Wayne_Brisette · 5 March 2014 18:26

I'm not sure I fully understand... You want to read all the data out of a PDF (or just selected data?) and put all that data into a CSV file?

Here's a sample that may be easy for you to understand. This simply goes through a PDF and searches for a preset word or phrase and lists the page on which it was found in the PDF.

require 'pdf/reader'

# @thepdffile, @wordlist, @indv_word and @indv_page are set up elsewhere in my script.
File.open(@thepdffile, "rb") do |io|   # open the file
  reader = PDF::Reader.new(io)         # reader gives access to the contents of the PDF
  @counter = 0
  reader.pages.each do |page|          # a PDF is organised in pages, so go through each page to get the content
    @counter += 1
    page_text = page.text              # all the text on a single page (only text!)

    @wordlist.each do |singleword|     # bunch of stuff specific to my script
      singleword.strip!
      if page_text.include?(singleword)
        @indv_word << singleword
        @indv_page << @counter
      end
    end
  end
end

But hopefully this example helps.
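
Since the stated goal in this thread is a CSV file, here is a minimal sketch of how word/page pairs like the ones collected above might be written out with Ruby's standard csv library. The `words` and `pages` arrays are hypothetical sample data standing in for @indv_word and @indv_page, and the "word"/"page" column names are an assumption, not something from the original script.

```ruby
require 'csv'

# Hypothetical results, shaped like @indv_word and @indv_page in the sample.
words = ["invoice", "total", "invoice"]
pages = [1, 1, 3]

# Build the CSV in memory: a header row, then one row per match.
csv_text = CSV.generate do |csv|
  csv << ["word", "page"]
  words.zip(pages).each { |word, page| csv << [word, page] }
end

puts csv_text
```

To write a file instead of printing, something like File.write("matches.csv", csv_text) should do; the file name is only an example.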


7stud2 · 5 March 2014 18:45

Wayne Brisette wrote in post #1138927:

I'm not sure I fully understand... You want to read all the data out of
a PDF (or just selected data?) and put all that data into a CSV file?

Here's a sample that may be easy for you to understand. This simply goes
through a PDF and searches for a preset word or phrase and lists the
page on which it was found in the PDF.

That would be really helpful. I will start my script tonight. If I have
any trouble understanding it, I will ask here on this list.

I hope you can help me.

Thank you, Wayne Brisette.


7stud2 · 5 March 2014 20:01

Wayne Brisette wrote in post #1138927:

I'm not sure I fully understand... You want to read all the data out of
a PDF (or just selected data?) and put all that data into a CSV file?

Here's a sample that may be easy for you to understand. This simply goes
through a PDF and searches for a preset word or phrase and lists the
page on which it was found in the PDF.

I wrote the code below:

require 'pdf/reader'

File.open("#{__dir__}/a.pdf",'rb') do |io|
  reader = PDF::Reader.new(io)
  reader.pages.each do |page|
    puts page.text
  end
end

It works, but `text` returns the whole page content at once. Can I read
the page line by line?


Wayne_Brisette · 5 March 2014 21:18

I don't see anything that sticks out. You might want to post on the PDF reader group and see what the folks there think.

https://groups.google.com/forum/#!forum/pdf-reader

Wayne


7stud2 · 5 March 2014 21:41

Wayne Brisette wrote in post #1138950:

I don't see anything that sticks out. You might want to post on the PDF
reader group and see what the folks there think.

https://groups.google.com/forum/#!forum/pdf-reader

Would you drop me an email at the address mentioned here?

If you do, I can send you the PDF so that you can look at it and give me
an idea of possible approaches.


Hassan_Schroeder · 6 March 2014 15:37

'page.text' is just a string; you can manipulate it any way you want.
If you want "lines", split the text on line-ending characters.
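
A minimal sketch of that idea, assuming you also want to drop blank lines and surrounding whitespace (that part is a choice, not a requirement). The `page_lines` helper and the sample string are hypothetical; with pdf-reader you would pass in `page.text` instead.

```ruby
# Split one page's text into stripped, non-empty lines.
# Handles both "\n" and "\r\n" endings, since strip removes a trailing "\r".
def page_lines(text)
  text.each_line.map(&:strip).reject(&:empty?)
end

# Hypothetical stand-in for the string returned by page.text.
sample = "Name   Qty\n\n  Apple  3\r\nPear   5\n"

page_lines(sample).each { |line| puts line }
```

Inside the loop from the earlier post this would become `page_lines(page.text).each { |line| ... }`.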

Good luck,

On Wed, Mar 5, 2014 at 12:01 PM, Arup Rakshit <lists@ruby-forum.com> wrote:

File.open("#{__dir__}/a.pdf",'rb') do |io|
  reader = PDF::Reader.new(io)
  reader.pages.each do |page|
    puts page.text
  end
end

It is working. But `text` gives whole page content at a time. Can I read
the page line by line ?

--
Hassan Schroeder ------------------------ hassan.schroeder@gmail.com

twitter: @hassan
