MS Word files and PDFs

Hello,

I'm fairly new to the Ruby scene.
Is there any library that can read MS Word (.doc) files and extract the pure
text? And what about libraries for PDF files?

Thanks folks,

M.

You could use the win32ole library and read them yourself via OLE.
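
For reference, here is a minimal sketch of the OLE approach. It assumes you are on Windows with Microsoft Word installed; the method name extract_word_text and the optional word argument are made up for illustration (the second argument only exists so that an already running Word instance, or a stand-in object, can be passed in). Treat it as a starting point rather than a finished solution.

```ruby
# Extracts the plain text of a Word document by driving Word itself via OLE.
# Assumptions: Windows, Microsoft Word installed, win32ole available.
# extract_word_text is a hypothetical helper name, not part of any library.
def extract_word_text(path, word = nil)
  owns_word = word.nil?
  if owns_word
    require 'win32ole'                     # Windows-only standard library
    word = WIN32OLE.new('Word.Application')
    word.Visible = false                   # keep the Word window hidden
  end

  # Word generally wants an absolute path.
  doc = word.Documents.Open(File.expand_path(path))
  begin
    doc.Content.Text                       # whole document as one string
  ensure
    doc.Close(0)                           # 0 = wdDoNotSaveChanges
  end
ensure
  # Only shut Word down if this method started it.
  word.Quit if owns_word && word
end
```

Calling extract_word_text('report.doc') should return the document text, with paragraphs separated by carriage returns ("\r"), which is how Word marks paragraph ends.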

···

On 4/22/06, Mateo Barraza <mateo.barraza@gmail.com> wrote:

Hello,

I'm fairly new to the Ruby scene.
Is there any library that can read MS Word (.doc) files and extract the pure
text...what about libs for PDF files?

Thanks folks,

M.

--
Keith Sader
ksader@gmail.com
http://www.saderfamily.org/roller/page/ksader

Hi,

I don't know of an MS Word library that will easily let you extract the
pure text, but the OLE suggestion is a good idea. Another method would be
to save the document as WordprocessingML (XML), if you have Word 2003, and
parse it with either REXML or libxml-ruby (two Ruby XML libraries), or
transform it with XSLT. Once you've got XML, the interesting nodes (if you
really only want text) are the 'w:t' ones.
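
A rough sketch of the REXML route, assuming the file was saved from Word 2003 as WordprocessingML and uses the usual 2003 namespace URI for the w: prefix. The method name wordml_text is made up for illustration, and joining paragraphs with newlines is my own choice:

```ruby
require 'rexml/document'

# Namespace assumed for Word 2003 WordprocessingML documents.
WORDML_NS = { 'w' => 'http://schemas.microsoft.com/office/word/2003/wordml' }.freeze

# Returns the text of every w:t node, one line per w:p paragraph.
# wordml_text is a hypothetical helper name, not part of REXML.
def wordml_text(xml)
  doc = REXML::Document.new(xml)
  paragraphs = REXML::XPath.match(doc, '//w:p', WORDML_NS)
  paragraphs.map do |para|
    # A paragraph is split into runs, so glue the runs' text back together.
    REXML::XPath.match(para, './/w:t', WORDML_NS).map { |t| t.text.to_s }.join
  end.join("\n")
end
```

Something like puts wordml_text(File.read('document.xml')) would then print the plain text. Tables, footnotes and the like are not treated specially here.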

HTH,
Keith

···

On Sun, 23 Apr 2006, Mateo Barraza wrote:

I'm fairly new to the Ruby scene.
Is there any library that can read MS Word (.doc) files and extract the pure
text...what about libs for PDF files?

Thanks for your responses. I also found that the POI Java project was
extended to support Ruby:
http://jakarta.apache.org/poi/poi-ruby.html
Still, I think the win32ole solution is the best for simply
reading the content of the docs...

M

···

On 4/22/06, Keith Fahlgren <keith@oreilly.com> wrote:

On Sun, 23 Apr 2006, Mateo Barraza wrote:
> I'm fairly new to the Ruby scene.
> Is there any library that can read MS Word (.doc) files and extract the pure
> text...what about libs for PDF files?

Hi,

There's not a MS Word library that I know of that will easily allow you
to extract the pure text, but the OLE suggestion is a good idea. Another
method would be to save as WordprocessingML (XML) (if you have Word 2003) and use
either REXML or libxml-ruby (two Ruby XML libraries) to parse it (or XSLT). If you've got XML, the
interesting nodes (if you really only want text) are the 'w:t' ones.

HTH,
Keith

Keith Sader wrote:

You could use the win32ole library and read them yourself via OLE.

Hi,

Could you provide a code snippet?

···

--
Posted via http://www.ruby-forum.com/.

I have found this most useful:

What you want should be hidden in there.

A most valuable read anyway.

Cheers
Robert

···

On Feb 8, 2008 8:18 AM, Rajesh Soni <rajesh.soni@softwarefolks.com> wrote:

Keith Sader wrote:

--
http://ruby-smalltalk.blogspot.com/

---
Whereof one cannot speak, thereof one must be silent.
Ludwig Wittgenstein