Pdf Parsing Challenge

Felipe_Espinoza · 17 May 2011 21:04

Hi Everyone,

I'm trying to use the pdf-reader gem, but I'm having trouble
understanding how the gem works.

If someone can help me with this, I'll be really grateful.

The Problem:

I have to extract some data from a paper in PDF format. I just need
some data from page 1: the title of the paper, the list of authors,
the authors' universities, their email addresses, the abstract, and
the keywords.

How can I extract this data from this paper?
http://dl.dropbox.com/u/6928078/CLEI_2008_002.pdf

A simple string containing the complete contents of each field
(keywords, abstract, etc.) would be enough.

I don't have to use this gem, but I need one string per field
with this info. How can I do that?
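
For reference, a minimal sketch of getting the text of page 1 with pdf-reader. It assumes pdf-reader 1.0 or later is installed and that the PDF has already been downloaded to a local path; `clean_lines` and `first_page_lines` are hypothetical helper names, not part of the gem.

```ruby
# Sketch only: assumes the pdf-reader gem (1.0 or later) is installed.
# clean_lines and first_page_lines are hypothetical helper names.

# Strip surrounding whitespace from each line and drop the empty ones.
def clean_lines(text)
  text.lines.map(&:strip).reject(&:empty?)
end

# Return the cleaned text lines of page 1 of the PDF at +path+.
def first_page_lines(path)
  require "pdf-reader" # loaded lazily so clean_lines works without the gem
  reader = PDF::Reader.new(path)
  clean_lines(reader.page(1).text)
end
```

Calling `first_page_lines("CLEI_2008_002.pdf")` on a local copy of the paper should give an array of strings that can then be sliced into fields. How well the line order matches the visual layout depends on the PDF.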

--
Posted via http://www.ruby-forum.com/.

Phil · 17 May 2011 21:31

I have to extract some data from a paper in PDF format. I just need
some data from page 1: the title of the paper, the list of authors,
the authors' universities, their email addresses, the abstract, and
the keywords.

How can I extract this data from this paper?
http://dl.dropbox.com/u/6928078/CLEI_2008_002.pdf

Mark the text, copy it.

I don't have to use this gem, but I need one string per field
with this info. How can I do that?

Open a text editor, paste it, and construct the data you need.

Doing the research for how to do what you want, and then writing and
debugging a script that does it, takes longer than just doing it by
hand.

--
Phillip Gawlowski

Though the folk I have met,
(Ah, how soon!) they forget
When I've moved on to some other place,
There may be one or two,
When I've played and passed through,
Who'll remember my song or my face.

Mark_T · 18 May 2011 00:37

Inkscape has a command-line conversion option.
I've only used it on Linux.
It converts one page at a time, though.
It offers more than three output formats, from memory.
It's not exactly a pure Ruby approach, though scripting such a task is
certainly within Ruby's domain.
Your example is still loading here,
so this reply may be completely out of context.

MarkT

I have to extract some data from a paper in PDF format. I just need
some data from page 1: the title of the paper, the list of authors,
the authors' universities, their email addresses, the abstract, and
the keywords.

I _top_ _post_ _so_ _there_

Kouhei_Sutou1 · 18 May 2011 13:23

Hi,

In <b3e54e146d346d393b16b935800076bb@ruby-forum.com>
"Pdf Parsing Challenge" on Wed, 18 May 2011 06:04:19 +0900,

The Problem:

I have to extract some data from a paper in PDF format. I just need
some data from page 1: the title of the paper, the list of authors,
the authors' universities, their email addresses, the abstract, and
the keywords.

How can I extract this data from this paper?
http://dl.dropbox.com/u/6928078/CLEI_2008_002.pdf

A simple string containing the complete contents of each field
(keywords, abstract, etc.) would be enough.

% gem install poppler
% cat extract-data-from-paper.rb
require 'tempfile'
require 'open-uri'
require 'poppler'

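ARGV.each do |url|
  # Download the PDF into a temporary file.
  pdf = Tempfile.new(["extract-data-from-paper", ".pdf"])
  pdf.binmode
  # On Ruby 3.0 and later, open-uri needs URI.open(url) instead of open(url).
  open(url) do |input|
    pdf.write(input.read)
  end
  pdf.close

  # Read the text of the first page and split it into lines.
  document = Poppler::Document.new(pdf.path)
  title_page = document.pages.first
  text = title_page.get_text
  lines = text.lines.to_a
  # In this paper the title spans the first two lines.
  title = lines[0, 2].collect(&:strip).join(" ")
  puts title
  # The author list spans the next two lines.
  authors = lines[2, 2].collect(&:strip).join(" ")
  puts authors
  # ...
end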
% ruby1.9 extract-data-from-paper.rb http://dl.dropbox.com/u/6928078/CLEI_2008_002.pdf
Query Routing Process for Adapted Information Retrieval using Agents
Angela Carrillo-Ramos2, Jérôme Gensel1, Marlène Villanova-Oliver1, Hervé Martin1, and Miguel Torres-Moreno2
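
The remaining fields could be pulled out of the same page text along these lines. This is only a sketch: it assumes the page text contains the literal headings "Abstract" and "Keywords" and that the body starts with "1 Introduction", which may not hold for every paper. `extract_fields` is a hypothetical helper name.

```ruby
# Sketch only: the "Abstract", "Keywords" and "1 Introduction" markers are
# assumptions about the paper's layout; extract_fields is a hypothetical name.
def extract_fields(text)
  # Collapse line breaks and runs of whitespace so each field is one string.
  flat = text.gsub(/\s+/, " ").strip

  # Everything between "Abstract" and "Keywords" (or the end of the text).
  abstract = flat[/\bAbstract\b[.:]?\s*(.*?)(?=\s*\bKeywords\b|\z)/i, 1]

  # Everything between "Keywords" and "1 Introduction" (or the end of the text).
  keywords = flat[/\bKeywords\b[.:]?\s*(.*?)(?=\s*\b1\.?\s+Introduction\b|\z)/i, 1]

  { abstract: abstract, keywords: keywords }
end
```

With the script above, `extract_fields(text)` would return a hash with one string per field, or `nil` for a field whose heading was not found. The matching is case-insensitive, so a stray "abstract" or "keywords" earlier in the text would throw it off.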

Thanks,

--
kou

Johannes_Held · 19 May 2011 08:20

Do you need that for your own application, or do you want to build up a literature database of your own?
For the latter, you could try [Mendeley][1]. It's a tool (web-based and desktop-based) for managing your research literature. It can parse PDFs, and much more.
Once the PDFs are parsed, you can parse the generated BibTeX file …

[1]: http://www.mendeley.com

--
Gruß, Johannes

Felipe_Espinoza · 17 May 2011 21:38

I need to do this automatically. I'll be doing it for a lot of papers
and then loading that data into a database.

Phil · 17 May 2011 21:45

Unless the papers are all (near) identical in layout, this will be
difficult, since PDFs lack semantic information.

Can you instead query a DB for the DOI of the paper (getting the DOI
via the filename, or via the title of the paper, assuming the title is
easy to grab), and use said DOI DB to get the information in a way
that's much easier to process?
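
As a rough sketch of the title-to-DOI lookup, assuming the public Crossref REST API stands in for the "DOI DB" (the thread does not name a service); `crossref_query_url` and `first_doi` are hypothetical helper names.

```ruby
require "json"
require "uri"

# Sketch only: assumes the public Crossref REST API as the DOI database.
# crossref_query_url and first_doi are hypothetical helper names.

# Build a search URL that asks for the single best match for a paper title.
def crossref_query_url(title)
  query = URI.encode_www_form("query.bibliographic" => title, "rows" => 1)
  "https://api.crossref.org/works?#{query}"
end

# Pull the DOI of the first result out of a response body, or nil if none.
def first_doi(json_body)
  items = JSON.parse(json_body).dig("message", "items") || []
  items.first && items.first["DOI"]
end
```

The body could be fetched with something like `Net::HTTP.get(URI(crossref_query_url(title)))` after `require "net/http"`. The best match is not guaranteed to be the right paper, so comparing the returned title against the extracted one is probably worthwhile.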

On Tue, May 17, 2011 at 11:38 PM, Felipe Espinoza <fespinozacast@gmail.com> wrote:

I need to do this automatically, I'll be doing it for a lot of papers
and then take that data to a database

--
Phillip Gawlowski
