Copying text from MS Word and wrapping in HTML - help please

7stud2 · 27 June 2012 10:12

Hi,

I'm new to Ruby and was wondering if someone can help me out with this.
I build websites on a daily basis and use HTML all the time, but seem to
receive content people want publishing in Word documents all the time
and have to simply wrap text in paragraph tags, put ul tags around
unordered lists and various other simple and repetitive tasks.

Therefore I gather I could use Ruby to do some of these tedious tasks
for me, but I'm not sure how to do this. I've made a start, but it doesn't
seem to work. The one thing I'm really unsure of is how to recognise where
paragraphs and lists start and end in order to wrap the HTML around them.
Any help would be appreciated.

require 'rubygems'
require 'win32/clipboard'
include Win32

# get the text from the clipboard
text = Clipboard.data

# clean up and wrap in HTML
text = text.gsub(/“|”|£|sometext|• List item 1\s*• List item 2/) do |match|
  case match
  when '“', '”' then '&quot;'
  when '£' then '&pound;'
  when 'sometext' then '<p>sometext</p>'
  else '<ul><li>List item 1</li><li>List item 2</li></ul>' # the two-item list
  end
end

# send it back to the clipboard
Clipboard.set_data(text)

# displays message
print "Cleaning up clipboard..."

Thanks in advance.

--
Posted via http://www.ruby-forum.com/.

7stud2 · 27 June 2012 10:33

Hi,

What's the exact format of the Word files? It might be easier to parse
the original file with all the structure information rather than the
extracted plaintext. For example, .docx files are (zipped) XML and can
easily be processed with the Nokogiri library. I've actually used it to
transform Word documents into simple HTML.

In general, you can recognize paragraphs by looking for empty lines (two
or more newlines with only whitespace in between). The lists can be
converted by starting a new list at the first list entry and closing it at
the last one. This requires more complex logic than your current code uses.
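
A minimal sketch of that approach for plain clipboard text, not a definitive implementation. It assumes blocks are separated by blank lines and that every list item starts with a "•" bullet; the method name text_to_html is made up for illustration.

```ruby
require 'cgi'

# Hypothetical helper: turns pasted plain text into simple HTML.
# Assumptions: blank lines separate blocks, list items start with "•".
def text_to_html(text)
  normalized = text.gsub("\r\n", "\n").strip   # Windows clipboard line endings
  blocks = normalized.split(/\n\s*\n/)         # blank line(s) mark a block boundary

  blocks.map { |block|
    lines = block.lines.map(&:strip).reject(&:empty?)
    if lines.all? { |line| line.start_with?('•') }
      # Open the list at the first entry, close it after the last one
      items = lines.map { |line| "<li>#{CGI.escapeHTML(line.sub(/\A•\s*/, ''))}</li>" }
      "<ul>#{items.join}</ul>"
    else
      # Anything else is treated as one paragraph
      "<p>#{CGI.escapeHTML(lines.join(' '))}</p>"
    end
  }.join("\n")
end

puts text_to_html("Intro text\n\n• List item 1\n• List item 2")
```

A block that mixes ordinary lines and bullet lines is treated as a single paragraph here, so real Word output would probably need extra rules.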

By the way, you don't have to replace special characters by hand. The
Ruby libraries already have methods for this (I don't remember the exact
name, but it's probably in CGI).
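
For reference, the standard library method is CGI.escapeHTML. A small sketch, assuming you also want named entities for curly quotes and the pound sign (CGI.escapeHTML leaves those alone, and in a UTF-8 page they can simply stay as they are). The mapping and the name escape_for_html are illustrative assumptions.

```ruby
require 'cgi'

# Typographic characters that CGI.escapeHTML does not touch.
# This mapping is an illustrative assumption; extend it as needed.
TYPOGRAPHIC_ENTITIES = {
  '“' => '&ldquo;',
  '”' => '&rdquo;',
  '£' => '&pound;'
}.freeze

# Hypothetical helper: escape markup characters first, then swap in the
# named entities (the other order would double-escape the "&").
def escape_for_html(text)
  escaped = CGI.escapeHTML(text)   # handles &, <, > and "
  escaped.gsub(/[“”£]/, TYPOGRAPHIC_ENTITIES)
end

puts escape_for_html('“Fish & chips” for £5 <today>')
```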

7stud2 · 27 June 2012 10:42

Hi,

Thank you for your reply. Yes, the documents generally come through as
Word 2007 or 2010 documents, so with a .docx extension. OK, I see. How
did you do yours, if you don't mind showing bits of your code? That's
good, I didn't know Ruby had its own method for replacing special
characters; I'll look into this.

Thanks for your help.

7stud2 · 27 June 2012 10:54

OK, it's in the attachments.

The DocxTree reads a docx file and builds a tree consisting of Node
objects. Each Node represents a basic text element (paragraphs, headings
etc.). You can then specify rules for how to transform the Node objects
into HTML (or anything else). The transformation works recursively, so
you can build a full HTML fragment by transforming the root node.
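
To illustrate the idea only: the sketch below is not the attached code. The Node class, its fields, and the HTML_RULES hash are hypothetical stand-ins, assuming each node has a type, optional text, and child nodes.

```ruby
require 'cgi'

# Hypothetical stand-in for a tree node (not the attached node.rb).
class Node
  attr_reader :type, :text, :children

  def initialize(type, text = nil, children = [])
    @type = type
    @text = text
    @children = children
  end

  # Transforms the children first, then applies the rule registered for
  # this node's type. Unknown types just pass their children's output through.
  def transform(rules)
    inner = children.map { |child| child.transform(rules) }.join
    rule = rules.fetch(type) { ->(_node, content) { content } }
    rule.call(self, inner)
  end
end

# One rule per node type: each receives the node and its transformed children.
HTML_RULES = {
  document:  ->(_node, inner) { inner },
  heading:   ->(node, _inner) { "<h1>#{CGI.escapeHTML(node.text)}</h1>" },
  paragraph: ->(node, _inner) { "<p>#{CGI.escapeHTML(node.text)}</p>" },
  list:      ->(_node, inner) { "<ul>#{inner}</ul>" },
  list_item: ->(node, _inner) { "<li>#{CGI.escapeHTML(node.text)}</li>" }
}.freeze

tree = Node.new(:document, nil, [
  Node.new(:heading, 'Title'),
  Node.new(:list, nil, [Node.new(:list_item, 'One'), Node.new(:list_item, 'Two')])
])
puts tree.transform(HTML_RULES)
```

Keeping the rules in a separate hash means the same tree could be rendered to another format by passing a different set of rules.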

Attachments:
http://www.ruby-forum.com/attachment/7532/docx_tree.rb
http://www.ruby-forum.com/attachment/7533/node.rb

7stud2 · 27 June 2012 11:09

Thank you very much. So I presume I have to point the location below to
the file location on my computer? Does this mean I need to convert the
Word document into XML format, or can I put the location as, for example,
file.read 'C:\Users\me\Documents\myDocument.docx'?

file.read 'word/document.xml'

Thanks once again for your help, really appreciate it.

7stud2 · 27 June 2012 11:22

You only have to supply the path to the docx file:

tree = DocxTree.new 'C:/Users/me/Documents/myDocument.docx'

The initialize method will unzip the XML file of the document and parse
it with Nokogiri.

7stud2 · 27 June 2012 11:48

Thanks. Sorry if I'm missing the obvious here but looking in the two
Ruby files you supplied I can't see anywhere the line tree =
DocxTree.new? Or does that need adding into the docx_tree.rb file
somewhere?

7stud2 · 27 June 2012 12:08

The files only supply the classes. In order to use them, you have to
create your own script, include docx_tree.rb and then instantiate the
DocxTree class.

Let's say you name your script "main.rb" and put the two files into the
subdirectory "classes". Then main.rb would have to look something like
this:

---------------------
require_relative 'classes/docx_tree.rb' # include the file

# the actual code
my_docx_file = 'C:/Users/me/Documents/myDocument.docx'
docx_tree = DocxTree.new my_docx_file # use the DocxTree class

# do something with docx_tree ...
---------------------

By the way, you need the zip gem and the nokogiri gem for the class to
work. So if you haven't installed them yet, do it before running the
code.

And you'll probably have to customize the classes. For example, there
isn't a method to transform the docx tree directly yet.

7stud2 · 27 June 2012 19:20

I see, thank you very much for your help. Being new to Ruby, think I
need to brush up on my skills before I can take this on as I'm
struggling to get this to work. However hopefully I will be able to get
it up and working in time.

Thanks for all your help.

Graham_Menhennitt1 · 28 June 2012 09:24

I'm not sure that Ruby is the right tool for this task. Why not use LibreOffice to read the Word files and then export them as HTML? I'm sure it's possible to automate this in LO too, but I'll leave that up to somebody else.

Graham

On 27/06/2012 20:12, Adam Holloway wrote:

>I build websites on a daily basis and use HTML all the time, but seem to
>receive content people want publishing in Word documents all the time
>and have to simply wrap text in paragraph tags, put ul tags around
>unordered lists and various other simple and repetitive tasks.
>
>Therefore I gather I could use Ruby to do some of these tedious tasks
>for me, but not sure how to do this.

Michael_Shigorin1 · 28 June 2012 12:02

There are antiword and catdoc out there (but these would rather
handle .doc and not .docx).

On Thu, Jun 28, 2012 at 06:24:11PM +0900, Graham Menhennitt wrote:

>>I build websites on a daily basis and use HTML all the time,
>>but seem to receive content people want publishing in Word
>>documents all the time and have to simply wrap text in
>>paragraph tags, put ul tags around unordered lists and various
>>other simple and repetitive tasks.
>I'm not sure that Ruby is the right tool for this task. Why not
>use LibreOffice to read the Word files and then export them as HTML.

--
---- WBR, Michael Shigorin <mike@altlinux.ru>
------ Linux.Kiev http://www.linux.kiev.ua/
