Merging two Word documents with Ruby?

I've got a bugger of a problem and I thought I'd toss it out there to
see if anyone can provide any guidance.

I'm working on an application that needs to merge two Microsoft Word
documents. However, the application will definitely run on a Linux
server, so Word won't be installed.

My only thought would be to use the new XML format -- maybe I can find a
way to merge two documents with those files.

Has anyone else had any experience merging Word documents in Ruby (and
Rails)? Any other experience in manipulating Word documents in other
ways?

Denver Mike

--
Posted via http://www.ruby-forum.com/.

Several points:
- What do you mean by "merge"? Word documents have structure, and
interleaving lines or words would appear to make little sense.

- Unless your application and user base are new, you will have many
files NOT in the XML format, in which case you would need to convert
them, and would need Word installed somewhere. Perhaps you could
reconsider your platform choice (to make the problem simpler) or, if
you have no pre-existing documents, reconsider your approach to make
Word unnecessary? Word can read a wide variety of document types
(including HTML), so perhaps this is another way to simplify your
problem.

More details required...
Graham

Denver Mike wrote:

My only thought would be to use the new XML format -- maybe I can find a
way to merge two documents with those files.

The only way I see is to use OpenOffice. There must be a script
somewhere to run OpenOffice in batch convert mode. That way you can
convert the .doc format to ODF. ODF is XML-based, so it should be mergeable.
Microsoft's XML-based format is not in use yet. The first Office
version that will support it is Office 12, which has not been released yet.
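As a rough illustration of that batch step: current OpenOffice/LibreOffice builds can convert from the command line without a GUI. The sketch below only assumes a `soffice` binary on the PATH; the helper names `convert_command` and `convert` and the `converted` output directory are made up for this example.

```ruby
# Build the argument list for a headless OpenOffice/LibreOffice conversion.
# Assumption: a `soffice` binary is on the PATH. Helper names are illustrative.
def convert_command(input_path, format: "odt", outdir: "converted")
  ["soffice", "--headless", "--convert-to", format, "--outdir", outdir, input_path]
end

# Run the conversion and return the path the converted file should have.
def convert(input_path, format: "odt", outdir: "converted")
  ok = system(*convert_command(input_path, format: format, outdir: outdir))
  raise "conversion failed for #{input_path}" unless ok
  File.join(outdir, File.basename(input_path, ".*") + ".#{format}")
end
```

Passing the command as an array to `system` avoids shell quoting problems with file names that contain spaces.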

[snipped]

I'm working on an application that needs to merge two Microsoft Word
documents. However, the application will definitely run on a Linux
server, so Word won't be installed.

There are the POI Ruby bindings, although I've never used them myself and
have no idea how good they are.

  POI Ruby Bindings

If that doesn't work, I'd try wv or catdoc.

* Denver Mike (denvermike@comcast.net) wrote:

Denver Mike

--
Paul Duncan <pabs@pablotron.org> pabs in #ruby-lang (OPN IRC)
http://www.pablotron.org/ OpenPGP Key ID: 0x82C29562

If you have a choice, don't use Word documents. Use the RTF format instead.
RTF files can be opened the same way as Word documents, but are a lot
easier to process.

Lei

Hi guys,

I have a question and hope you can help.

I need to build a utility that merges two MS Word documents. I should
then be able to print the merged document with the options "remove
header" and "remove footer" selected, so that the document prints with
the header/footer removed.

Help, please.

- What do you mean by "Merge"?.. Word documents have structure and the
interleaving of lines or words would appear to make little sense.

Thanks for your thoughts on this Graham. By "merge", I meant appending
one Word document to the end of another, but to make things more
complicated, I need to add text into the headings across the entire
document.

Denver Mike wrote:

Thanks for your thoughts on this Graham. By "merge", I meant appending
one Word document to the end of another, but to make things more
complicated, I need to add text into the headings across the entire
document.

Microsoft Word has something called a master document. Maybe you could
add a master document that includes both files plus the extra headings. This
master document might be simple enough that you can actually reverse
engineer it. (Create one in Word once and just edit the parts you need
to edit with Ruby.)

This can actually be extremely complex, because a named style (such as
'Body', 'Normal', or 'Heading 1') can (and will) have different
properties (fonts, colors, sizes, margins, encoding, etc) in each of
the two documents. You will need to rename every style and style
reference in the second document in order to prevent the two from
colliding.
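To make the renaming step concrete, here is a minimal sketch on a made-up in-memory model rather than a real Word or ODF file. The `Doc` struct, the `doc2_` prefix and both helper names are assumptions for illustration only; a real implementation would have to apply the same idea to the style definitions and style references in the actual file format.

```ruby
# Hypothetical in-memory model, not a real Word/ODF structure:
#   styles     is { "style name" => { property => value } }
#   paragraphs is [{ style: "style name", text: "..." }, ...]
Doc = Struct.new(:styles, :paragraphs)

# Rename every style in a document, together with every reference to it.
def prefix_styles(doc, prefix)
  styles = doc.styles.transform_keys { |name| "#{prefix}#{name}" }
  paragraphs = doc.paragraphs.map { |para| para.merge(style: "#{prefix}#{para[:style]}") }
  Doc.new(styles, paragraphs)
end

# Append the second document to the first. The second document's styles are
# renamed first, so a "Normal" in one cannot overwrite "Normal" in the other.
def append_doc(first, second, prefix: "doc2_")
  renamed = prefix_styles(second, prefix)
  Doc.new(first.styles.merge(renamed.styles), first.paragraphs + renamed.paragraphs)
end
```

With two documents that both define "Normal" differently, the result keeps the first one's "Normal" and carries the second one's as "doc2_Normal", with its paragraphs pointing at the renamed style.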

On 12/21/05, Denver Mike <denvermike@comcast.net> wrote:

> - What do you mean by "Merge"?.. Word documents have structure and the
> interleaving of lines or words would appear to make little sense.

Thanks for your thoughts on this Graham. By "merge", I meant appending
one Word document to the end of another, but to make things more
complicated, I need to add text into the headings across the entire
document.

If your documents are properly structured using styles (which is rare)
and they share the same styles (and I mean the *same* styles), you can
try to use OpenOffice in remote command mode: convert the .doc files into
.odt, parse the XML of both files, merge the XML, and
rebuild an .odt file, perhaps going through OOo again to get a .doc
back. But you will need to ensure that the styles are always converted
into something reliably identifiable.
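The "merge the XML" step could look roughly like the sketch below. An .odt file is a zip archive whose body lives in content.xml; this sketch assumes you have already extracted the two content.xml strings (unzipping and re-zipping are not shown and would need a zip library), that both files declare the usual `office:` prefix, and that the styles really are identical. The function name `append_body` is made up.

```ruby
require "rexml/document"

# Append the body of the second content.xml to the body of the first.
# Assumption: both strings are ODF content.xml documents using the same
# `office:` namespace prefix. Style definitions are NOT merged here.
def append_body(first_xml, second_xml)
  first = REXML::Document.new(first_xml)
  second = REXML::Document.new(second_xml)

  target = REXML::XPath.first(first, "//office:text")
  source = REXML::XPath.first(second, "//office:text")
  raise ArgumentError, "no office:text element found" unless target && source

  # Copy each top-level body element (paragraphs, headings, ...) across.
  source.elements.each { |element| target.add_element(element.deep_clone) }
  first.to_s
end
```

The returned string would then replace content.xml in a copy of the first .odt archive.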

FAO (the UN branch for food and agriculture) uses a template system
(thus forcing a set of styles) to output RTF, which is then
converted into XML for storage. Are your documents existing legacy ones
or is this a new setup? If you're building it all, then you might
seriously consider using OpenOffice all the way.

Does it still need to be a Word document when you're done? An entirely different approach would be to use some kind of Word file display program and make PDFs of the files, then chain the PDFs together. Do the headers by slapping a white block over the existing headers and writing a new header over them.
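A sketch of how the convert-and-chain part might be scripted, assuming two external tools that are not part of Ruby: `soffice` (OpenOffice/LibreOffice) for the Word-to-PDF step and `pdfunite` (from poppler) for joining the PDFs. The function name `pdf_chain_commands` and the `pdf_tmp` directory are made up, and the header overlay is not covered here.

```ruby
# Return the two commands needed to turn several Word files into one PDF.
# Assumptions: `soffice` and `pdfunite` are installed and on the PATH.
def pdf_chain_commands(doc_paths, output_pdf, workdir: "pdf_tmp")
  convert = ["soffice", "--headless", "--convert-to", "pdf", "--outdir", workdir, *doc_paths]
  # soffice writes each PDF into workdir under the input's base name.
  pdfs = doc_paths.map { |path| File.join(workdir, File.basename(path, ".*") + ".pdf") }
  [convert, ["pdfunite", *pdfs, output_pdf]]
end

# Example of running them in order (left commented out, it needs the tools):
# pdf_chain_commands(["a.doc", "b.doc"], "merged.pdf").each do |cmd|
#   system(*cmd) or raise "command failed: #{cmd.join(' ')}"
# end
```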

Personally, my approach would be to abandon the project as just too messy for words. :)

On Dec 21, 2005, at 7:06, Denver Mike wrote:

Thanks for your thoughts on this Graham. By "merge", I meant appending
one Word document to the end of another, but to make things more
complicated, I need to add text into the headings across the entire
document.

OpenOffice.org can do the .doc to pdf conversion. I like your idea very
much, Dave. Maybe PostScript would be easier to fiddle with ex-post.

Probably. If you have a program that lets you overlay one PDF page on another, then your best bet is to output a PDF page with your header in it. (I'd probably use TeX, or maybe script OSX's TextEdit program, and my copy of full Acrobat 4 for the page overlay.) The other alternative would be to create (or have somebody create for you) an .eps with the white box and a line of text in a program like Freehand or Illustrator. If you pop open the .eps file in a text editor, you'll find it not too difficult to programmatically replace the text, although you won't easily be able to duplicate the kerning and other textual adjustments. Have OpenOffice print to a postscript file, then figure out what you can use as a page marker in order to embed the .eps in that file on each page so that it comes after (and thus covers) the original headers, if any. Then feed the modified .ps file into a PDF distiller.

That's what I'd try, I think.
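The text replacement in the .eps could be sketched like this. It assumes the template was saved with a known marker string (here `HEADER_PLACEHOLDER`, a made-up name) inside a PostScript string literal, and it does nothing about kerning or embedding the result on each page.

```ruby
# Escape the characters that are special inside a PostScript (...) string.
def ps_escape(text)
  text.gsub(/[\\()]/) { |ch| "\\#{ch}" }
end

# Replace the marker string in an EPS template with the real header text.
# Assumption: the template contains the literal "(HEADER_PLACEHOLDER)".
def fill_header(eps_template, header_text, placeholder: "HEADER_PLACEHOLDER")
  eps_template.sub("(#{placeholder})") { "(#{ps_escape(header_text)})" }
end

# A tiny hand-written template: a white box with one line of text over it.
TEMPLATE = <<~EPS
  %!PS-Adobe-3.0 EPSF-3.0
  %%BoundingBox: 0 0 612 40
  1 setgray 0 0 612 40 rectfill
  0 setgray /Helvetica findfont 10 scalefont setfont
  36 15 moveto (HEADER_PLACEHOLDER) show
EPS

puts fill_header(TEMPLATE, "Merged report (draft)")
```

The escaping matters because an unbalanced parenthesis or a stray backslash in the header text would otherwise break the PostScript string.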

On Dec 23, 2005, at 11:37, Daniel Calvelo wrote:

OpenOffice.org can do the .doc to pdf conversion. I like your idea very
much, Dave. Maybe PostScript would be easier to fiddle with ex-post.

AbiWord can be used from the command line. See http://www.advogato.org/person/msevior/diary.html?start=65

That might make this possible.
