Is there a RubyGem available for converting PDF files to XML files?

7stud2 · 5 March 2014 23:10

Hi,

Is there a RubyGem available for converting PDF files to XML
files?

--
Posted via http://www.ruby-forum.com/.

Wayne_Brisette · 5 March 2014 23:46

Look at rubygems.org; there’s at least one gem that converts PDF to HTML, but I’ve not used it.

Wayne

On Mar 5, 2014, at 17:10, Arup Rakshit <lists@ruby-forum.com> wrote:

Hi,

Is there any rubygem available for converting the pdf files to xml
files?


Wayne_Brisette · 6 March 2014 11:28

Arup:

I did install the PDF to HTML gem and have to say it’s pretty impressive! It’s all based on the pdf2htmlEX project:

pdf2htmlEX/src at master · coolwanglu/pdf2htmlEX · GitHub

The gem is basically just a nice Ruby wrapper, so you have to have pdf2htmlEX installed. But it actually opens up a whole new world of possibilities.
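Since the gem only wraps the command-line tool, it can help to see roughly what such a wrapper does. The following is a minimal sketch, not the gem's actual code: the method names `pdf2htmlex_command` and `convert_pdf_to_html` are made up for illustration, and it assumes the `pdf2htmlEX` executable is on your PATH and accepts `--dest-dir` followed by the input PDF and the output file name.

```ruby
require 'open3'

# Build the argument list for the pdf2htmlEX command-line tool.
# Assumption: usage is `pdf2htmlEX --dest-dir DIR input.pdf output.html`.
def pdf2htmlex_command(pdf_path, html_name, dest_dir: '.')
  ['pdf2htmlEX', '--dest-dir', dest_dir, pdf_path, html_name]
end

# Run the conversion and return the path of the generated HTML file.
# Raises ArgumentError if the PDF is missing, Errno::ENOENT if the
# executable is not installed, and RuntimeError if the tool exits non-zero.
def convert_pdf_to_html(pdf_path, html_name, dest_dir: '.')
  raise ArgumentError, "no such file: #{pdf_path}" unless File.file?(pdf_path)

  command = pdf2htmlex_command(pdf_path, html_name, dest_dir: dest_dir)
  _stdout, stderr, status = Open3.capture3(*command)
  raise "pdf2htmlEX failed: #{stderr}" unless status.success?

  File.join(dest_dir, html_name)
end
```

Passing the command as an array (rather than one string) avoids shell quoting problems with file names that contain spaces.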

In combination with something like Nokogiri, you should be able to parse almost all the data you want. However, this means you’ll need to brush up on your CSS and/or XPath to query the generated HTML with Nokogiri.

On Mac OS X, it was pretty easy to install the pdf2htmlEX toolset. For Windows, somebody has already done the compiling for you here: http://soft.rubypdf.com/software/pdf2htmlex-windows-version

Good luck!

FYI, there is a Google Group for the pdf2htmlEX toolset. If you choose to use it, you’re better off asking there rather than on this list for any additional help, since this list is strictly for Ruby-related things.

Wayne

7stud2 · 6 March 2014 11:56

Wayne Brissette wrote in post #1139024:

Arup:

I did install the PDF to HTML gem and have to say its pretty impressive!
Its all based on the pdf2htmlEX project:

pdf2htmlEX/src at master · coolwanglu/pdf2htmlEX · GitHub

Wayne

Thanks for your reply. I was also looking at pdftohtmlr:

github.com/kitplummer/pdftohtmlr/blob/master/lib/pdftohtmlr.rb

# The library has a single method for converting PDF files into HTML. The
# method current takes in the source path, and either/both the user and owner
# passwords set on the source PDF document.  The convert method returns the
# HTML as a string for further manipulation of loading into a Document.
#
# Requires that pdftohtml be installed and on the path
#
# Author:: Kit Plummer (mailto:kitplummer@gmail.com)
# Copyright:: Copyright (c) 2009 Kit Plummer
# License:: MIT

require 'rubygems'
require 'nokogiri'
require 'uri'
require 'open-uri'
require 'tempfile'

module PDFToHTMLR
  
  # Simple local error abstraction

This file has been truncated.

But the issue is that if the PDF has any blank column values, it does not
generate a corresponding tag for those entries, so I couldn't track
which data actually belongs under which column.

I will surely give the gem you linked above a try.
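One possible way around the missing-tag problem is to use each text fragment's horizontal position, assuming the converter reports one (for example, pdftohtml's XML mode emits text elements carrying a `left` attribute). The sketch below is only an illustration under that assumption: the `Fragment` struct, the `COLUMNS` ranges, and the column names are hypothetical and would have to be measured from the actual document.

```ruby
# A text fragment plus its horizontal offset on the page.
# Assumption: the converter output provides this offset.
Fragment = Struct.new(:text, :left)

# Hypothetical column layout: column name => range of left offsets.
COLUMNS = {
  name:  0...200,
  qty:   200...300,
  price: 300...450
}.freeze

# Place each fragment of one table row into the column whose range
# contains its left offset. Columns with no fragment stay nil, so a
# blank cell is still represented in the result.
def row_from_fragments(fragments, columns = COLUMNS)
  row = columns.keys.to_h { |name| [name, nil] }
  fragments.each do |fragment|
    name, _range = columns.find { |_name, range| range.cover?(fragment.left) }
    row[name] = fragment.text if name
  end
  row
end

# A row whose "qty" cell is blank in the PDF.
p row_from_fragments([Fragment.new('Widget', 12), Fragment.new('9.99', 310)])
```

Because every column key is created up front, a blank cell shows up as `nil` instead of silently shifting the remaining values to the left.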

