[ANN] uniforma-0.0.1 - converter for text formats

Victor_Zverok_Shepel · 14 September 2007 01:39

Hi all.

I'm pleased to announce 0.0.1 (aka "early adopters only" release) of my
Uniforma library.

It's here: http://rubyforge.org/projects/uniforma/

== What is it?

Library for parsing "simple text" formats (RD, Textile, Markdown, etc.) and
generating output in various formats (including simple text, html/xml and
more complex ones).

The heart of the library is two DSLs - for defining parsers and generators.

== Why?

1. While preparing the documentation for "one more serious library", I ran into a
dilemma: write it in RD (to auto-generate everything with RDoc)? In Trac's wiki
format (for uploading to a Trac site)? Or in Textile (for uploading once to a
stand-alone site)? So I've decided to write a conversion library/tool.

2. I'm using RedCloth (Textile) for all my work, and while trying to patch it for
my needs, I found it's a mess. I just need separate, clear descriptions of the
"how is it parsed" and "how is it generated" aspects.

3. For my journalism, I need MS Word output (editing text in MS Word is no fun
for me, but the ability to generate it is a must). Now I use a
"Textile=>(RedCloth)=>HTML=>`winword mytext.html`" scheme, which has
several flaws. I want to be able to easily define an MS Word generator (using
win32ole, of course, no hand-made heroism).

== Show. Me. The. Code.

Usage:

puts Uniforma::textile('*some text* "with links":http://google.com.').to_html_string

output:
<html><body>
some text <a href='http://google.com'>with links</a>.
</body></html>

Defining parsers:

---
module Uniforma::Parsers
  class Textile < LineParser
    definition do
      ....
      # how to parse some line
      ....
      line /^h(\d+)\.\s+/ do para(:heading, :level => @_1.to_i) end
      ....
      # how to parse inline formatting:
      inline /__(.+?)__/, :italic
    end
  end
end
---
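
For readers wondering how a `definition` block like this could be wired up, here is a minimal sketch. It is not Uniforma's actual code: `TinyLineParser`, `TinyTextile`, the hash-based nodes and the bracketed inline output are all made up for illustration, and the match is passed to the block as an argument instead of through `@_1`.

```ruby
# Hypothetical sketch of a line-parser DSL; not Uniforma's implementation.
class TinyLineParser
  class << self
    # Rules live in class-level instance variables, one set per subclass.
    def line_rules;   @line_rules   ||= []; end
    def inline_rules; @inline_rules ||= []; end

    # `definition do ... end` just evaluates the block in the class itself,
    # so `line` and `inline` inside it resolve to the methods below.
    def definition(&block)
      class_eval(&block)
    end

    def line(pattern, &handler)
      line_rules << [pattern, handler]
    end

    def inline(pattern, type)
      inline_rules << [pattern, type]
    end
  end

  # Returns one node (a Hash) per input line.
  def parse(text)
    text.each_line.map { |raw| parse_line(raw.chomp) }
  end

  private

  def parse_line(raw)
    self.class.line_rules.each do |pattern, handler|
      match = pattern.match(raw)
      next unless match
      # The first matching rule wins; the rest of the line becomes the text.
      return handler.call(match).merge(:text => apply_inline(match.post_match))
    end
    { :type => :para, :text => apply_inline(raw) }
  end

  # Marks inline spans as "[type:content]" just to make them visible.
  def apply_inline(text)
    self.class.inline_rules.inject(text) do |result, (pattern, type)|
      result.gsub(pattern) { "[#{type}:#{$1}]" }
    end
  end
end

class TinyTextile < TinyLineParser
  definition do
    line(/^h(\d+)\.\s+/) { |m| { :type => :heading, :level => m[1].to_i } }
    inline(/__(.+?)__/, :italic)
  end
end

TinyTextile.new.parse("h2. Title\nsome __italic__ text").each { |node| p node }
```

Keeping the rules in per-class arrays means each format subclass carries its own grammar, which is one plausible way to get the "separate clear description" mentioned above.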

Defining generators

---
module Uniforma::Generators
  class HtmlString < TextGenerator
    definition do
      ...
      # what to place around some "paragraph type"
      around(:heading) {|p| i = p.level; ["<h#{i}>", "</h#{i}>\n"]}
      ...

      # what to place around some "inline markup type"
      around(:italic) {["", ""]}
    end
  end
end
---
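
The generator side can be sketched the same way. Again, this is only an illustration under my own assumptions, not the library's code: `TinyGenerator`, `TinyHtml` and the hash-based nodes are hypothetical names, and only the `around` idea is taken from the example above.

```ruby
# Hypothetical sketch of an `around`-style generator DSL.
class TinyGenerator
  class << self
    def wrappers
      @wrappers ||= {}
    end

    def definition(&block)
      class_eval(&block)
    end

    # Registers a block that returns [opening, closing] for a node type.
    def around(type, &block)
      wrappers[type] = block
    end
  end

  # node is a Hash such as {:type => :heading, :level => 2, :text => "Title"}.
  # Types without a registered wrapper are emitted as plain text.
  def render(node)
    wrapper = self.class.wrappers[node[:type]]
    before, after = wrapper ? wrapper.call(node) : ["", ""]
    "#{before}#{node[:text]}#{after}"
  end
end

class TinyHtml < TinyGenerator
  definition do
    around(:heading) { |p| i = p[:level]; ["<h#{i}>", "</h#{i}>\n"] }
    around(:para)    { |_p| ["<p>", "</p>\n"] }
  end
end

puts TinyHtml.new.render({ :type => :heading, :level => 2, :text => "Title" })
```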

Uniforma is smart enough to allow:
* parsers for non-line-based formats (in fact, it also has one "toy" parser for
HTML, which even works on not-very-complex HTML documents!)
* generators for non-text formats (I'm working on PDF and MS Word generators;
they are not very hard to define with Uniforma)

== Important notes about current release

* This release shamelessly includes the htmlentities library by Paul Battley[1],
without even mentioning it in the license files. This will be fixed ASAP.

* It really is an "early adopters" release: almost no docs, and very, very poor
tests. But it shows the idea and is a base for further work.

* This release includes parsers for Textile, RD and HTML, and generators for
BBcode, RD and HTML. All of them are incomplete but tend to work.

* I'd like to hear opinions on whether the DSLs for parsers/generators look
"right" from the point of view of a) native English speakers and b) real Ruby
ninjas. You can examine my parsers in lib/uniforma/parsers/ and generators in
lib/uniforma/generators/

Again, the library is here: http://rubyforge.org/projects/uniforma/

Thanx.

Zverok.

1:http://rubyforge.org/projects/htmlentities/

Gaspard_Bucher1 · 14 September 2007 07:23

Your work is interesting and sure looks like good ruby to me.

I had to write a similar parser for Zena (to parse textile additions
and the zafu templates) and I thought that doing so many regex
evaluations on the full text (can be long) was too slow (please tell
if this is wrong). I thus chose to use regex anchored left /\A.../ and
eat through the text only once. This has the other advantage that you
enter different modes (tag parameters, comments, raw data, etc) along
the way. It makes it very easy to parse sub languages from within
these modes.
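
To illustrate the single-pass, left-anchored approach, here is a small sketch using Ruby's standard `StringScanner`, whose `scan` only matches at the current position. This is not the Zena parser: the `tokenize` function, its three modes and the toy tag syntax are assumptions made up for the example.

```ruby
require 'strscan'

# Hypothetical single-pass tokenizer with modes (:text, :tag, :comment).
# StringScanner#scan is anchored at the current position, like /\A.../.
def tokenize(text)
  scanner = StringScanner.new(text)
  tokens  = []
  mode    = :text

  until scanner.eos?
    case mode
    when :text
      if scanner.scan(/<!--/)
        mode = :comment
      elsif scanner.scan(/<(\w+)/)
        tokens << [:tag_open, scanner[1]]
        mode = :tag
      elsif (chunk = scanner.scan(/[^<]+/))
        tokens << [:text, chunk]
      else
        tokens << [:text, scanner.getch] # a "<" that starts nothing we know
      end
    when :tag
      if scanner.scan(/\s*(\w+)="([^"]*)"/)
        tokens << [:attr, scanner[1], scanner[2]]
      elsif scanner.scan(/\s*>/)
        mode = :text
      else
        scanner.getch # skip anything unexpected inside the tag
      end
    when :comment
      body = scanner.scan_until(/-->/)
      if body
        tokens << [:comment, body[0...-3]] # drop the closing "-->"
      else
        tokens << [:comment, scanner.rest] # unterminated comment
        scanner.terminate
      end
      mode = :text
    end
  end
  tokens
end

tokenize('a <b id="x">c<!-- hi -->d').each { |token| p token }
```

Every branch either consumes input or ends the scan, so the text is walked exactly once, and each mode only tries the few patterns that make sense in it.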

The parser is a two step operation: 1. parse, 2. render. This might be
overkill for the kind of transformations you need but it is very
interesting because the parsed elements can use some knowledge from
the context when they are rendered.
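
As a rough illustration of why the two steps help, the sketch below renders an already-parsed tree and passes a context hash down, so a section knows how deeply it is nested. The `Node` struct, the node types and the `:depth` key are hypothetical, chosen only for this example.

```ruby
# Hypothetical parsed tree: step 1 (parsing) is assumed to have produced it.
Node = Struct.new(:type, :text, :children)

# Step 2: render, handing context down to the children.
def render(node, context = {})
  case node.type
  when :document
    node.children.map { |child| render(child, context) }.join
  when :section
    depth   = context.fetch(:depth, 0) + 1
    heading = "<h#{depth}>#{node.text}</h#{depth}>\n"
    heading + node.children.map { |child| render(child, context.merge(:depth => depth)) }.join
  when :para
    "<p>#{node.text}</p>\n"
  else
    ""
  end
end

doc = Node.new(:document, nil, [
  Node.new(:section, "Intro", [
    Node.new(:para, "Hello", []),
    Node.new(:section, "Details", [])
  ])
])
puts render(doc)
```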

I intend to do a textile -> Latex transformation so the users can
write zafu templates to generate PDF.

The current implementation of the parser is not as clean as yours but
works. You can have a look at the parser here :
http://dev.zenadmin.org/browser/trunk/lib/parser

Let me know what you think.

Gaspard

Jeff_Barczewski · 14 September 2007 15:53

Victor,

Great idea! It would be great to have one way to read and write all these
formats.

See my embedded comments.

> 3. For my journalism, I need MS Word output (I have no fun to do text
> editing in MS Word, but ability to generate it is a must). Now I use
> "Textile=>(RedCloth)=>HTML=>`winword mytext.html`" scheme, which have
> several flaws. I want be able to easy define MS Word generator (using
> win32ole, of course, no hand-made heroism).

To generate msword docs - it might be easier (and more portable) to simply
write out the new xml form of word rather than using win32ole. I believe it
would still have all the same capabilities but just represented in xml
format.

And of course don't forget to add open office xml to the list.

You might also take a look at deplate for some inspiration, it generates
latex, html, and docbook and can read from a few formats.

If you set up a mailing list, let me know as I would like to follow the
project since this could be very useful as I need to be able to generate
many formats.

Blessings,

Jeff

--
Jeff Barczewski, MasterView core team
Inspired Horizons Ruby on Rails Training and Consultancy
http://inspiredhorizons.com/

ThoML · 17 September 2007 07:50

> Library for parsing "simple text" formats (RD, Textile, Markdown, etc.) and
> generating output in various formats (including simple text, html/xml and
> more complex ones).

I wrote deplate[1], which has similar goals (well, with the exception of
source quality maybe ;-).

The point here is of course that simple formats are easy to parse, so
the question is how simple you mean by "simple".

> I want be able to easy define MS Word generator (using
> win32ole, of course, no hand-made heroism).

If simple is really simple like rdoc-simple, why not simply import HTML?
Although I slightly prefer the way OpenOffice uses HTML files.

If "simple" includes cross references, footnotes, endnotes, headers,
footers, table of contents/tables/figures etc., I think you'll probably
need:

    - a general way to define counters and lists
    - some notion of metadata (like index, footnotes, labels, section
      names etc.)
    - a way to place text at an arbitrary position in the output
      document (e.g. for headers & footers), e.g. to move text to the
      top of the document, after packages are required but before the
      start of the body etc. deplate defines "slots" for this, which
      allow users to place an element at any position they want.
    - in the long (or intermediate) run, you might also think of some
      plugin mechanism (e.g. e-mail obfuscation that may be loaded when
      converting the document without being hard-coded, although this
      could also be done by post-processing the output).
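
To make the "slots" idea a bit more tangible, here is a minimal sketch of an output buffer with named, ordered slots. It is only loosely inspired by the description above and is not how deplate implements it; `SlottedOutput` and the slot names are invented for the example.

```ruby
# Hypothetical output buffer with named slots that are emitted in a fixed order,
# no matter in which order the content was produced.
class SlottedOutput
  def initialize(*slot_names)
    @order = slot_names
    @slots = Hash.new { |hash, name| hash[name] = [] }
  end

  # Appends text to a slot; returns self so calls can be chained.
  def put(slot, text)
    raise ArgumentError, "unknown slot: #{slot}" unless @order.include?(slot)
    @slots[slot] << text
    self
  end

  def to_s
    @order.map { |name| @slots[name].join }.join
  end
end

out = SlottedOutput.new(:prelude, :body, :footer)
out.put(:body, "Some paragraph.\n")
out.put(:prelude, "\\usepackage{graphicx}\n") # found late, but emitted first
out.put(:footer, "The end.\n")
puts out
```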

> * non-line based formats parsers (in fact, it also has one "toy" parser for
> HTML, which even works! on not-very-complex HTML documents)

From a pragmatic point of view, using hpricot and writing and map
classes on its output could be the better strategy.

Anyway, I'm eager to see how this develops.

Cheers,
Thomas.

[1] http://deplate.sf.net

Bira · 16 September 2007 18:35

OOXML's specification is over 6,000 pages long, and full of
idiosyncrasies - I don't know how much of it he'd need for his
documents, but using OLE and Word is probably easier than trying to
build an OOXML-compliant document generator from scratch. Which I guess
is exactly why Microsoft made the spec that long in the first place,
but I digress... :).

On 9/14/07, Jeff Barczewski <jeff.barczewski@gmail.com> wrote:

> To generate msword docs - it might be easier (and more portable) to simply
> write out the new xml form of word rather than using win32ole. I believe it
> would still have all the same capabilities but just represented in xml
> format.

Office Open XML - Wikipedia

--
Bira

http://sinfoniaferida.blogspot.com

Victor_Zverok_Shepel · 17 September 2007 06:37

> > 3. For my journalism, I need MS Word output (I have no fun to do text
> > editing in MS Word, but ability to generate it is a must). Now I use
> > "Textile=>(RedCloth)=>HTML=>`winword mytext.html`" scheme, which have
> > several flaws. I want be able to easy define MS Word generator (using
> > win32ole, of course, no hand-made heroism).
>
> To generate msword docs - it might be easier (and more portable) to simply
> write out the new xml form of word rather than using win32ole. I believe it
> would still have all the same capabilities but just represented in xml
> format.
>
> Office Open XML - Wikipedia
>
> And of course don't forget to add open office xml to the list.

1. My primary goal is to create the "right tool" for others to define their own
parsers and generators.
2. As for complex formats, I'm planning to be pragmatic: a win32ole-based
solution is enough for most cases. If somebody feels it's not enough, he/she
can write the "right" generator him/herself.

> You might also take a look at deplate for some inspiration, it generates
> latex, html, and docbook and can read from a few formats.

Yeah, thanks.

> If you set up a mailing list, let me know as I would like to follow the
> project since this could be very useful as I need to be able to generate
> many formats.

OK, I'll notify you in that case. I think it will be soon.

V.

ThoML · 17 September 2007 08:00

> From a pragmatic point of view, using hpricot and writing and map
> classes on its output could be the better strategy.

BTW, this makes me wonder if you could split your library into a parser and
a formatter. A parser for wiki-like formats that generates something like
what hpricot does would be something I would very much like to see.

Regards,
Thomas.

Victor_Zverok_Shepel · 17 September 2007 17:05

> > Library for parsing "simple text" formats (RD, Textile, Markdown, etc.) and
> > generating output in various formats (including simple text, html/xml and
> > more complex ones).
>
> I wrote deplate[1], which has similar goals (well, with the exception of
> source quality maybe ;-).
>
> The point here is of course that simple formats are easy to parse, so
> the question is how simple do you mean with "simple".

I meant those whose parsers can be defined with some easy, common DSL.

> If "simple" includes cross references, footnotes, endnotes, headers,
> footers, table of contents/tables/figures etc., I think you'll probably
> need:
>
> - a general way to define counters and lists

Yep, something already exists, something more will.

> - some notion of metadata (like index, footnotes, labels, section
>   names etc.)

Yep, planned.

> - make it possible to locate text at some random position in the
>   output document (eg for headers & footers), e.g. move text to the
>   top of the document, after packages are required but before the
>   start of the body etc. deplate defines "slots" for this which
>   allows users to place the element at any position they want.

If I understand you correctly, Uniforma's generators currently do this with a
"hack" (placing the first "heading" paragraph in the generated html <title> tag):

lib\uniforma\generators\html.rb (lines 6-17):

---
pre(:document) do |document|
  title = document.find_first(:heading)
  if title
    %Q{
    <html>
    <head><title>#{title.text}</title></head>
    <body>
    }
  else
    %Q{<html><body>\n}
  end
end
---

So, I'm not thinking about "random" positions, but only about
pre(:document) and post(:document) actions, which have access to the overall
document.

This approach seems natural enough to me.
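
A minimal, self-contained sketch of such pre/post hooks might look like this. It is not the real generator: `Para`, `Doc` and `HookedGenerator` are hypothetical stand-ins, and for brevity every paragraph (the heading included) is rendered as a plain `<p>`.

```ruby
# Hypothetical stand-ins for the document model.
Para = Struct.new(:type, :text)

class Doc
  attr_reader :paragraphs

  def initialize(paragraphs)
    @paragraphs = paragraphs
  end

  def find_first(type)
    paragraphs.find { |para| para.type == type }
  end
end

class HookedGenerator
  class << self
    def hooks
      @hooks ||= { :pre => {}, :post => {} }
    end

    def pre(type, &block)
      hooks[:pre][type] = block
    end

    def post(type, &block)
      hooks[:post][type] = block
    end
  end

  # Both hooks receive the whole document, so they can look ahead.
  pre(:document) do |document|
    title = document.find_first(:heading)
    if title
      "<html><head><title>#{title.text}</title></head><body>\n"
    else
      "<html><body>\n"
    end
  end

  post(:document) { |_document| "</body></html>\n" }

  def generate(document)
    hooks = self.class.hooks
    body  = document.paragraphs.map { |para| "<p>#{para.text}</p>\n" }.join
    hooks[:pre][:document].call(document) + body + hooks[:post][:document].call(document)
  end
end

doc = Doc.new([Para.new(:heading, "Title"), Para.new(:para, "Body")])
puts HookedGenerator.new.generate(doc)
```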

> - on the long (or intermediate-distance) run, you might also think
>   of some plugin-mechanism (e.g. e-mail obfuscation that may be
>   loaded when converting the document without being hard-coded,
>   although this could also be done by post-processing the output).

Yep. I've thought about a syntax like this (a "plug-in" to rewrite some URLs):

Uniforma.textile('mydocument.text').to_html do
  rewrite(:href, %r{http://somesite.com}) do |url|
    url.gsub(/somesite/, 'othersite')
  end
end
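
One way such a block could be collected and applied is sketched below. The proposed `to_html do ... end` syntax does not exist yet, so everything here is an assumption: `RewriteRules` and its `apply` method are hypothetical, and only the `rewrite(:href, pattern) { |url| ... }` shape comes from the snippet above.

```ruby
# Hypothetical collector for rewrite rules given in a block.
class RewriteRules
  def initialize(&block)
    @rules = []
    # instance_eval makes `rewrite` callable without a receiver in the block.
    instance_eval(&block) if block
  end

  def rewrite(attribute, pattern, &transform)
    @rules << [attribute, pattern, transform]
  end

  # Runs every rule registered for this attribute whose pattern matches.
  def apply(attribute, value)
    @rules.inject(value) do |current, (name, pattern, transform)|
      if name == attribute && current =~ pattern
        transform.call(current)
      else
        current
      end
    end
  end
end

rules = RewriteRules.new do
  rewrite(:href, %r{http://somesite\.com}) do |url|
    url.gsub(/somesite/, 'othersite')
  end
end

puts rules.apply(:href, 'http://somesite.com/page')
```

A generator would then call something like `rules.apply(:href, url)` whenever it emits a link; values for other attributes, or URLs that don't match, pass through unchanged.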

> > * non-line based formats parsers (in fact, it also has one "toy" parser for
> > HTML, which even works! on not-very-complex HTML documents)
>
> From a pragmatic point of view, using hpricot and writing and map
> classes on its output could be the better strategy.

From the pragmatic point of view, I already had HTMLSax-related stuff
before I started to work on Uniforma.

I'll think about converting this part to use Hpricot, but later, when the core
things (parsers and generators) work well.

V.

Jeff_Barczewski · 16 September 2007 19:36

Wow, that figures that they would make something too complicated to
implement.

However, I was thinking one could take a simpler approach: create a
document in MS Word with everything you are supporting in the markup and then
save it as open office xml. That should give you an example of what to code
to, but if there are many idiosyncrasies then I guess the OLE way would be
more straightforward. I was hoping that for simple markup there wouldn't be
too much to learn, but I have not read the spec, so maybe it wouldn't be as
easy as I thought.

Boy I am glad that we have open source alternatives to most everything these
days!

On 9/16/07, Bira <u.alberton@gmail.com> wrote:

> On 9/14/07, Jeff Barczewski <jeff.barczewski@gmail.com> wrote:
>
> > To generate msword docs - it might be easier (and more portable) to simply
> > write out the new xml form of word rather than using win32ole. I believe it
> > would still have all the same capabilities but just represented in xml
> > format.
>
> Office Open XML - Wikipedia
>
> OOXML's specification is over 6,000 pages long, and full of
> idiosyncrasies - I don't know how much of it he'd need for his
> documents, but using OLE and Word is probably easier than trying to
> build an OOXML-compliant document generator from scratch. Which I guess
> is exactly why Microsoft made the spec that long in the first place,
> but I digress... :).

--
Jeff Barczewski, MasterView core team
Inspired Horizons Ruby on Rails Training and Consultancy
http://inspiredhorizons.com/

Victor_Zverok_Shepel · 17 September 2007 17:05

> > From a pragmatic point of view, using hpricot and writing and map
> > classes on its output could be the better strategy.
>
> BTW which makes me wonder, if you could your library into a parser and
> a formatter. A parser for wiki-like formats that generates something like
> hpricot does, would be something I would very much like to see.

Internally, it's already there. Any parser creates Uniforma::Dom structures,
which, though, lack handy (Hpricot-like) navigation. I'll think about it.
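
For what it's worth, the kind of navigation meant here can be sketched in a few lines. This is not Uniforma::Dom: `DomNode` is a hypothetical class, and its `search`/`at` methods are only named after the Hpricot methods of the same name, matching on node type instead of CSS or XPath expressions.

```ruby
# Hypothetical tree node with minimal Hpricot-style navigation.
class DomNode
  attr_reader :type, :text, :children

  def initialize(type, text = nil, children = [])
    @type     = type
    @text     = text
    @children = children
  end

  # All descendants of the wanted type, depth-first, in document order.
  def search(wanted)
    children.flat_map do |child|
      (child.type == wanted ? [child] : []) + child.search(wanted)
    end
  end

  # First descendant of the wanted type, or nil.
  def at(wanted)
    search(wanted).first
  end
end

dom = DomNode.new(:document, nil, [
  DomNode.new(:heading, "Title"),
  DomNode.new(:list, nil, [
    DomNode.new(:item, "one"),
    DomNode.new(:item, "two")
  ])
])

puts dom.at(:heading).text
puts dom.search(:item).map(&:text).join(", ")
```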

V.
