Is there any RubyGem available for converting PDF files to XML files?

Hi,

Is there any RubyGem available for converting PDF files to XML
files?

--
Posted via http://www.ruby-forum.com/.

Look at rubygems.org; there’s at least one gem that converts PDF to HTML, but I’ve not used it.

Wayne

On Mar 5, 2014, at 17:10, Arup Rakshit <lists@ruby-forum.com> wrote:

Hi,

Is there any rubygem available for converting the pdf files to xml
files?

Arup:

I did install the PDF to HTML gem and have to say it’s pretty impressive! It’s all based on the pdf2htmlEX project (coolwanglu/pdf2htmlEX on GitHub). The gem is basically just a nice Ruby wrapper, so you have to have pdf2htmlEX installed. But this gem actually opens up a whole new world of possibilities.
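Because the gem is only a wrapper, the underlying tool can also be called straight from Ruby. What follows is a rough sketch, not the gem's own API: the helper names pdf2htmlex_command and convert_pdf are made up for illustration, and it assumes the pdf2htmlEX executable is on your PATH and accepts the form "pdf2htmlEX [options] input.pdf [output.html]" with the --dest-dir option.

```ruby
# Hypothetical helper: builds the argument list for the pdf2htmlEX CLI.
# Assumed usage form: pdf2htmlEX [options] <input.pdf> [<output.html>]
def pdf2htmlex_command(pdf_path, html_name, dest_dir: nil)
  cmd = ['pdf2htmlEX']
  # --dest-dir selects the directory the HTML file is written to
  cmd += ['--dest-dir', dest_dir] if dest_dir
  cmd + [pdf_path, html_name]
end

# Hypothetical helper: runs the converter.
# Kernel#system returns true on success, false on a non-zero exit status,
# and nil when the executable could not be run at all (e.g. not installed).
def convert_pdf(pdf_path, html_name, dest_dir: nil)
  system(*pdf2htmlex_command(pdf_path, html_name, dest_dir: dest_dir))
end
```

Passing the arguments as separate strings rather than one shell string avoids quoting problems with file names that contain spaces. A call such as convert_pdf('report.pdf', 'report.html', dest_dir: 'out') would then leave the HTML in the out directory, provided the tool is installed.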

In combination with something like Nokogiri, you should be able to extract almost all the data you want. However, this means you’ll need to brush up on your CSS and/or XPath selectors, since you’ll be parsing the generated HTML with Nokogiri.

On Mac OS X, it was pretty easy to install the pdf2htmlEX toolset. For Windows, somebody has already done the compiling for you here: http://soft.rubypdf.com/software/pdf2htmlex-windows-version

Good luck!

FYI, there is a Google Group for the pdf2htmlEX toolset. If you choose to use it, you’re going to be better off asking any further questions about it there rather than on this list, since this list is strictly for Ruby-related things.

Wayne

Wayne Brissette wrote in post #1139024:

Arup:

I did install the PDF to HTML gem and have to say it's pretty impressive!
It's all based on the pdf2htmlEX project:

pdf2htmlEX/src at master · coolwanglu/pdf2htmlEX · GitHub

Wayne

Thanks for your reply. I was also looking for

But the issue is that if the PDF has any blank column values, it does not
generate any corresponding tags for those entries. Thus I couldn't track
which data actually falls under which column.

I will surely give the gem you linked above a try.
