PDF reader gems

Hi,

I want to parse a PDF document using Ruby. I found one gem:
https://github.com/yob/pdf-reader

Are there any other good gems with a strong API that make parsing PDF
data easier?

Please share your opinions.

--
Posted via http://www.ruby-forum.com/.

Arup:

I use pdf-reader for several of my scripts. If you're after just reading the PDF, then that's the one I'd stick with. Is there something in particular you don't understand?

Wayne

Thanks, Wayne. I have a requirement that I need to script. I am new to
this gem and only found it online today, so I am still going through its
examples. I want to collect all the data from the PDF into a CSV file.

@Wayne - Does a PDF file internally hold any XML objects for the data it
displays? If so, I could use Nokogiri to parse it.

I'm not sure I fully understand... You want to read all the data out of a PDF (or just selected data?) and put all that data into a CSV file?

Here's a sample that may be easy for you to understand. This simply goes through a PDF and searches for a preset word or phrase and lists the page on which it was found in the PDF.

require 'pdf/reader'

# @thepdffile, @wordlist, @indv_word and @indv_page are set elsewhere in my script.
File.open(@thepdffile, "rb") do |io|    # open the file in binary mode
  reader = PDF::Reader.new(io)          # reader now gives access to the full contents of the PDF
  @counter = 0
  reader.pages.each do |page|           # a PDF is defined in pages, so go through each page to get the content
    @counter += 1
    page_text = page.text               # all the text on a single page (only text!)

    @wordlist.each do |singleword|      # bunch of stuff specific to my script
      singleword.strip!
      if page_text.include?(singleword)
        @indv_word << singleword        # remember the word that matched
        @indv_page << @counter          # and the page it was found on
      end
    end
  end
end

But hopefully this example helps.
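
Since the stated goal in this thread is a CSV file, here is a minimal sketch of how matches from a loop like the one above could be written out with Ruby's standard `csv` library. The method names `find_words` and `hits_to_csv` are hypothetical, and the page texts are passed in as plain strings (for example from `reader.pages.map(&:text)`), so this is only an illustration under those assumptions, not part of the original script.

```ruby
require 'csv'

# Find which of `words` occur in each page's text.
# `page_texts` is an Array of Strings, one per page
# (assumption: built with something like reader.pages.map(&:text)).
# Returns an Array of [word, page_number] pairs, pages numbered from 1.
def find_words(page_texts, words)
  hits = []
  page_texts.each.with_index(1) do |text, page_number|
    words.each do |word|
      term = word.strip
      hits << [term, page_number] if text.include?(term)
    end
  end
  hits
end

# Turn the [word, page_number] pairs into CSV text with a header row.
def hits_to_csv(hits)
  CSV.generate do |csv|
    csv << %w[word page]
    hits.each { |hit| csv << hit }
  end
end
```

The resulting string can be saved with `File.write("matches.csv", hits_to_csv(hits))`, where `matches.csv` is just an example file name.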

That would really be helpful. I will start my script tonight. If I have
any trouble understanding it, I will ask here on this list.

I hope you can help me :)

Thank you Wayne Brisette.

I wrote the code below:

require 'pdf/reader'

File.open("#{__dir__}/a.pdf",'rb') do |io|
  reader = PDF::Reader.new(io)
  reader.pages.each do |page|
    puts page.text
  end
end

It works, but `text` returns the whole page content at once. Can I read
the page line by line?

I don't see anything that sticks out. You might want to post on the PDF reader group and see what the folks there think.

https://groups.google.com/forum/#!forum/pdf-reader

Wayne

Would you drop me an email at the address mentioned here?

If you do, I can send you the PDF, so that you can take a look and give
me an idea of possible approaches.

'page.text' is just a string; you can manipulate it any way you want.
If you want "lines", split the text on line-ending characters.
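
A minimal sketch of that suggestion, assuming the page text uses ordinary line endings (`\n`, `\r\n`, or `\r`). The helper names `page_lines` and `each_pdf_line` are hypothetical. The PDF part reuses only the `PDF::Reader.new`, `pages`, and `text` calls already shown in this thread, and it loads the gem lazily so the splitting logic can be tried on any string.

```ruby
# Split one page's text into stripped, non-empty lines.
# `text` is any String, such as the value returned by page.text.
def page_lines(text)
  text.split(/\r\n|\r|\n/).map(&:strip).reject(&:empty?)
end

# Hypothetical helper: yields page number, line number and line text
# for every non-empty line in the PDF at `path`.
def each_pdf_line(path)
  require 'pdf/reader' # loaded here so page_lines works without the gem
  File.open(path, 'rb') do |io|
    reader = PDF::Reader.new(io)
    reader.pages.each.with_index(1) do |page, page_number|
      page_lines(page.text).each.with_index(1) do |line, line_number|
        yield page_number, line_number, line
      end
    end
  end
end
```

Usage could look like `each_pdf_line("a.pdf") { |page, n, line| puts "#{page}:#{n} #{line}" }`. Note that stripping each line discards the leading spaces a page may use for column alignment, so drop the `strip` step if that layout matters.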

Good luck,

On Wed, Mar 5, 2014 at 12:01 PM, Arup Rakshit <lists@ruby-forum.com> wrote:

File.open("#{__dir__}/a.pdf",'rb') do |io|
  reader = PDF::Reader.new(io)
  reader.pages.each do |page|
    puts page.text
  end
end

It is working. But `text` gives whole page content at a time. Can I read
the page line by line ?

--
Hassan Schroeder ------------------------ hassan.schroeder@gmail.com

twitter: @hassan