[ANN] ClothRed (HTML to Textile)

I'm pleased to announce that I've begun working on a small library to convert HTML into Textile.

Please forgive me that this announcement doesn't yet follow the community's standards; I'm slowly getting there.

For the curious, the website and project on RubyForge have gone online *and* have some content[0].

For the impatient:
ClothRed will be exactly the reverse of RedCloth: it will take any HTML string and convert it into Textile.

As a bonus, ClothRed will strip all HTML that is not being converted into Textile's markup from the text, making it, hopefully, usable for sanitizing HTML.
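
To make the idea concrete, here is a minimal sketch of what "the reverse of RedCloth" could look like. It is not ClothRed's actual code or API (the thread shows neither): the method name html_to_textile and the INLINE_RULES table are invented for illustration, and only a handful of tags are handled, using plain regular expressions and no dependencies.

```ruby
# Hypothetical sketch, NOT ClothRed's real API: a few regex rules that turn
# simple HTML into Textile, followed by a pass that drops every other tag.
INLINE_RULES = {
  %r{<(?:strong|b)>(.*?)</(?:strong|b)>}im => '*\1*',
  %r{<(?:em|i)>(.*?)</(?:em|i)>}im         => '_\1_',
  %r{<a\s+href="([^"]*)"\s*>(.*?)</a>}im   => '"\2":\1',
  %r{<h([1-6])>(.*?)</h\1>}im              => "h\\1. \\2\n\n"
}.freeze

def html_to_textile(html)
  text = html.dup
  # Apply each conversion rule in turn (hashes keep insertion order).
  INLINE_RULES.each { |pattern, textile| text.gsub!(pattern, textile) }
  # Whatever tags are left were not converted, so strip them.
  text.gsub(/<[^>]+>/, "")
end

puts html_to_textile('<p>Hello <b>world</b>, see <a href="http://example.com/">this page</a></p>')
```

Regex-based stripping like this is fine as an illustration, but on its own it would not be enough for real sanitizing: attributes, malformed tags and the text inside script elements would all need separate handling.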

I hope to have an Alpha release out by the end of next month.

Links:
[0] http://clothred.rubyforge.org/

···

--
Phillip "CynicalRyan" Gawlowski
http://cynicalryan.110mb.com/

Rule of Open-Source Programming #5:

A project is never finished.

I'm pleased to announce, that I've begun working on a small library to
convert HTML into Textile.

...

ClothRed will be exactly the reverse of RedCloth: It will grab any HTML
string, and convert it into Textile.

As a bonus, ClothRed will strip all HTML that is not being converted
into Textile's markup from the text, making it, hopefully, usable for
sanitizing HTML.

I hope to have an Alpha release out by the end of next month.

Awesome, Phillip. I really look forward to using this!

Jacob Fugal

···

On 4/10/07, Phillip Gawlowski <cmdjackryan@googlemail.com> wrote:

I'm pleased to announce, that I've begun working on a small library to
convert HTML into Textile.

Please forgive me, that this announcement isn't yet following the
community's standards, but I'm slowly getting there.

For the curious, the website and project on RubyForge have gone online
*and* have some content[0].

For the impatient:
ClothRed will be exactly the reverse of RedCloth: It will grab any HTML
string, and convert it into Textile.

As a bonus, ClothRed will strip all HTML that is not being converted
into Textile's markup from the text, making it, hopefully, usable for
sanitizing HTML.

I hope to have an Alpha release out by the end of next month.

Awesome!

x x x

A bit OT, but I've been dreaming about (and planning, for a long time) a
library that can handle the "greatest common divisor" of all simple text
formats and perform conversions uniformly, like

Textile   <=>         <=> HTML
Markdown  <=>         <=> PDF
Mediawiki <=>   gcd   <=> PS
RDOC      <=>         <=> OpenOffice

There are several projects performing only [some markup]=>html conversions.
There is also Maruku[1], which seems to handle virtually any Markdown=>[rich
format] conversion (and seems to embody some common intermediate format).
And now there is your project.

Isn't it time to do something more generic?

V.

1: maruku.rubyforge.org

···

From: Phillip Gawlowski [mailto:cmdjackryan@googlemail.com]
Sent: Tuesday, April 10, 2007 8:39 PM

Phillip Gawlowski wrote:

ClothRed will be exactly the reverse of RedCloth: It will grab any HTML string, and convert it into Textile.

As a bonus, ClothRed will strip all HTML that is not being converted into Textile's markup from the text, making it, hopefully, usable for sanitizing HTML.

Looks interesting, but I hope there would be a mode to preserve unknown HTML in addition to the "lossy" mode. Sanitizing HTML is good but if you convert the resulting Textile to HTML and it doesn't look like the original, that's not too good IMHO.

Daniel

Victor "Zverok" Shepelev wrote:

A bit OT, but I've been dreaming about (and planning, for a long time) a
library that can handle the "greatest common divisor" of all simple text
formats and perform conversions uniformly, like

Textile   <=>         <=> HTML
Markdown  <=>         <=> PDF
Mediawiki <=>   gcd   <=> PS
RDOC      <=>         <=> OpenOffice

There are several projects performing only [some markup]=>html conversions.
There is also Maruku[1], which seems to handle virtually any Markdown=>[rich
format] conversion (and seems to embody some common intermediate format).
And now there is your project.

Isn't it time to do something more generic?

Well, once there are libraries to do any one of these tasks, you can build a tool chain, similar to DBI, for example.

I know that there's a PDF generator written in Ruby, but I don't know about the other file formats. Creating markup parsers isn't that much of a challenge, so that could be done quite easily.

Once ClothRed is feature-complete in the HTML -> Textile area, I'd be happy to write an API to integrate ClothRed into other tools.

···

--
Phillip "CynicalRyan" Gawlowski
http://cynicalryan.110mb.com/

Rule of Open-Source Programming #33:

Don't waste time on writing test cases and test scripts - your users are
your best testers.

Daniel DeLorme wrote:

Looks interesting, but I hope there would be a mode to preserve unknown HTML in addition to the "lossy" mode. Sanitizing HTML is good but if you convert the resulting Textile to HTML and it doesn't look like the original, that's not too good IMHO.

To do that, there'll probably be two different modes of HTML stripping:
* One "strict": Everything that cannot be parsed by ClothRed will be thrown out.
* One "loose": All HTML that ClothRed cannot convert will be kept, and warnings will be emitted (either to stdout, or stderr, or both).

The latter will not be usable for sanitizing HTML, as "unknown" HTML *should* be treated as malicious (specifically, as there is no "unknown" HTML in the W3C specs).
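
The two modes could be sketched roughly as follows. This is an assumption about how such a switch might look, not ClothRed's implementation: strip_unknown, KNOWN_TAGS, and the mode: and warn_to: parameters are all hypothetical names, and "known" here simply means "tags the converter has a Textile rule for".

```ruby
# Hypothetical sketch of "strict" vs. "loose" stripping; every name here is
# invented for illustration and is not part of ClothRed.
KNOWN_TAGS = %w[b strong i em].freeze

def strip_unknown(html, mode: :strict, warn_to: $stderr)
  html.gsub(%r{</?([a-zA-Z][a-zA-Z0-9]*)[^>]*>}) do |tag|
    name = Regexp.last_match(1).downcase
    # Known tags are left alone for the Textile conversion step.
    next tag if KNOWN_TAGS.include?(name)

    if mode == :loose
      # Loose: keep the tag, but tell the caller about it.
      warn_to.puts "warning: keeping unconvertible <#{name}>"
      tag
    else
      # Strict: throw the tag away (its text content stays).
      ""
    end
  end
end

html = "<b>bold</b> and <blink>blinking</blink>"
puts strip_unknown(html)                # strict mode
puts strip_unknown(html, mode: :loose)  # loose mode, warnings go to stderr
```

Note that the strict mode in this sketch removes only the tags themselves, which is one more reason the loose mode, and regex stripping in general, should not be relied on for sanitizing.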

···

--
Phillip "CynicalRyan" Gawlowski
http://cynicalryan.110mb.com/

Rule of Open-Source Programming #33:

Don't waste time on writing test cases and test scripts - your users are
your best testers.

Victor "Zverok" Shepelev wrote:

A bit OT, but I've been dreaming about (and planning, for a long time) a
library that can handle the "greatest common divisor" of all simple text
formats and perform conversions uniformly, like

Textile   <=>         <=> HTML
Markdown  <=>         <=> PDF
Mediawiki <=>   gcd   <=> PS
RDOC      <=>         <=> OpenOffice

There are several projects performing only [some markup]=>html conversions.
There is also Maruku[1], which seems to handle virtually any Markdown=>[rich
format] conversion (and seems to embody some common intermediate format).
And now there is your project.

Isn't it time to do something more generic?

Well, once there are libraries to do any one of these tasks, you can
build a tool chain, similar to DBI, for example.

I know that there's a PDF generator written in Ruby, but I don't know
about the other file formats. Creating markup parsers isn't that much of
a challenge, so that could be done quite easily.

I'd be happy to, once ClothRed is feature-complete in the HTML ->
Textile area, to write an API to integrate ClothRed into other tools.

My point was to have some intermediate format, with a couple of parsers
TO this format and generators FROM it.

Right now, the authors of all these libraries are solving two problems:
parsing and generating. It would be nice to have one common HTML parser that
could be used either for HTML->Textile or for HTML->Markdown (only the
generators would differ).

From some point of view, we could use Textile as the intermediate format:
your library would be the "parser" and RedCloth the "generator". But this
leaves Markdown and RDoc "out of the game", as long as we have no
Markdown->Textile and similar converters.
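
A toy version of that design, assuming a very small shared tree as the "gcd" format, might look like the sketch below. Nothing here comes from an existing library: Node, PARSERS, GENERATORS and convert are made-up names, and only bold text is supported, so that the parser/generator split stays visible.

```ruby
# Hypothetical sketch: one tiny intermediate tree, one parser per input
# format, one generator per output format. Only bold text is handled.
Node = Struct.new(:type, :text)

# Split src into :text and :bold nodes; pattern must capture the bold text.
def parse_bold(src, pattern)
  nodes = []
  rest = src
  while (m = pattern.match(rest))
    nodes << Node.new(:text, m.pre_match) unless m.pre_match.empty?
    nodes << Node.new(:bold, m[1])
    rest = m.post_match
  end
  nodes << Node.new(:text, rest) unless rest.empty?
  nodes
end

# Turn the node list back into markup, wrapping bold nodes in open/close.
def render(nodes, open, close)
  nodes.map { |n| n.type == :bold ? "#{open}#{n.text}#{close}" : n.text }.join
end

PARSERS = {
  textile:  ->(src) { parse_bold(src, /\*(.+?)\*/) },
  markdown: ->(src) { parse_bold(src, /\*\*(.+?)\*\*/) }
}.freeze

GENERATORS = {
  textile:  ->(nodes) { render(nodes, "*", "*") },
  markdown: ->(nodes) { render(nodes, "**", "**") },
  html:     ->(nodes) { render(nodes, "<strong>", "</strong>") }
}.freeze

# Any parser can be combined with any generator via the shared node list.
def convert(src, from:, to:)
  GENERATORS.fetch(to).call(PARSERS.fetch(from).call(src))
end

puts convert("a **b** c", from: :markdown, to: :textile)
puts convert("a *b* c", from: :textile, to: :html)
```

Adding a new input format means writing one parser, and adding a new output format means writing one generator; no existing entry has to change. The sketch skips HTML escaping and everything beyond bold, which is where the real work would be.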

V.

···

From: Phillip Gawlowski [mailto:cmdjackryan@googlemail.com]
Sent: Wednesday, April 11, 2007 10:26 PM

Seems like XHTML would be the obvious choice for the intermediate format, no?
Unless you want to reinvent that particular wheel.

Gary Wright

···

On Apr 11, 2007, at 4:52 PM, Victor Zverok Shepelev wrote:

My point was, to have some intermediate format, and have couple of parsers
TO this format and generators FROM it.

Victor "Zverok" Shepelev wrote:

Now, authors of all libraries are solving 2 problems - parse & generate. It
should be nice to have one common HTML parser, which could be used either
for HTML->Textile, or for HTML->Markdown (only generators will differ).

Well, ClothRed has to parse HTML to output Textile. It does no more than that. If you plug it into a converter suite, you can use HTML as an intermediary format (RedCloth can parse Textile and Markdown into HTML, so you'd already have a small part of such a converter).

From some point of view, we could use Textile as the intermediate format:
your library would be the "parser" and RedCloth the "generator". But this
leaves Markdown and RDoc "out of the game", as long as we have no
Markdown->Textile and similar converters.

Granted, the scope of my library is limited, but purposefully so, to keep it a) manageable and b) in line with my skills. Once ClothRed is feature-complete, I can add to its functionality, but not sooner, if I can avoid it.

···

--
Phillip "CynicalRyan" Gawlowski
http://cynicalryan.110mb.com/

Rule of Open-Source Programming #8:

Open-Source is not a panacea.

My point was, to have some intermediate format, and have couple of
parsers
TO this format and generators FROM it.

Seems like XHTML would be the obvious choice for the intermediate
format, no?
Unless you want to reinvent that particular wheel.

<OT>
We Russians say "invent the bike". On Russian programming forums my usual
"origin" is "Bikes forever!" :)
</OT>

The question, I think, is like XML vs. YAML/JSON. Just make it simpler.

I mean, for conversions like Markdown <=> Textile, XHTML as intermediate is
slightly too funny.

The overall thought was "conversion of basic logical formatting"; thus, the
intermediate format should only handle basic features (at the level of
Textile-like formats, not bloated XHTML-like ones).

V.

···

From: Gary Wright [mailto:gwtmp01@mac.com]
Sent: Thursday, April 12, 2007 12:10 AM

On Apr 11, 2007, at 4:52 PM, Victor Zverok Shepelev wrote:

Victor "Zverok" Shepelev wrote:
>Now, authors of all libraries are solving 2 problems - parse & generate. It
>should be nice to have one common HTML parser, which could be used either
>for HTML->Textile, or for HTML->Markdown (only generators will differ).

Well, ClothRed has to parse HTML to output Textile. It does no more
than that. If you plug it into a converter suite, you can use HTML as an
intermediary format (RedCloth can parse Textile and Markdown into HTML,
so you'd already have a small part of such a converter).

Are you using Hpricot for your parsing? If so, it should be pretty easy to
do the conversion. If not, why not? (Disclaimer: I've been following the
thread but haven't looked at or even installed/run the code.)

>From some poin of view, we can use Textile as intermediate, your library
>would be "parser", RedCloth would be "generator". But this leaves Markdown,
>Rdoc "off the game", while we have no Markdown->Textile and similar
>convertors.

Granted, the scope of my library is limited, but purposefully so, to
keep it a) manageable, and b) keep it in line with my skills. Once
ClothRed is feature-complete, I can add to its functionality, but not
sooner, if I can avoid it.

I understand where you are with this. At the same time, I have an actual
need to do something very much like this in my own work. I suspect there
are others out there in a similar situation. We're hoping that this will
become useful to us sooner rather than later and that we can avoid rolling
our own.

Phillip "CynicalRyan" Gawlowski

--Greg

···

On Thu, Apr 12, 2007 at 03:56:50PM +0900, Phillip Gawlowski wrote:

Yes, but that is because XHTML is too funny.

Markdown exists to generate XHTML.
Textile exists to generate XHTML.

If you've got reverse translations also, then XHTML is *already* working
as the intermediate. Why do you need yet another format?

Gary Wright

···

On Apr 11, 2007, at 5:26 PM, Victor "Zverok" Shepelev wrote:

I mean, for conversions like Markdown <=> Textile, XHTML as intermediate is
slightly too funny.

Gregory Seidman wrote:

Are you using Hpricot for your parsing? If so, it should be pretty easy to
do the conversion. If not, why not? (Disclaimer: I've been following the
thread but haven't looked at or even installed/run the code.)

No, I'm not. I want to avoid dependencies as much as I can, so that ClothRed can stand on its own as much as possible.

I understand where you are with this. At the same time, I have an actual
need to do something very much like this in my own work. I suspect there
are others out there in a similar situation. We're hoping that this will
become useful to us sooner rather than later and that we can avoid rolling
our own.

Regarding the time frame, I'm trying to make it feature-complete as soon as I can. There isn't much left to do for the core engine, and after that I can pretty it up a bit (with rule sets and the like).

If all goes well, ClothRed will hit the big 1.0.0 at the weekend, as a full HTML to Textile parser. After that, I'm very open to ideas regarding its future.

···

--
Phillip "CynicalRyan" Gawlowski
http://cynicalryan.110mb.com/
http://clothred.rubyforge.org

Rules of Open-Source Programming:

22. Backward compatibility is your worst enemy.

23. Backward compatibility is your users' best friend.

I mean, for conversions like Markdown <=> Textile, XHTML as
intermediate is
slightly too funny.

Yes, but that is because XHTML is too funny.

Markdown exists to generate XHTML.
Textile exists to generate XHTML.

If you've got reverse translations also, then XHTML is *already* working
as the intermediate. Why do you need yet another format?

Maybe you're right. I can only say that for other "rich formats" (like PDF
or OpenOffice), a textile->pdf conversion can be simpler than XHTML->pdf
(thus it breaks the rule of "the single intermediate format" through
chains like Markdown->XHTML->Textile->PDF). What do you think?

V.

···

From: Gary Wright [mailto:gwtmp01@mac.com]
Sent: Thursday, April 12, 2007 12:33 AM

On Apr 11, 2007, at 5:26 PM, Victor "Zverok" Shepelev wrote:

I think that a translator that is designed specifically for X to Y will
always do better than a translator that goes through an intermediate
language. A Russian to Spanish translation is going to be better than
a Russian to English to Spanish translation. The benefit of an
intermediate language is that you don't need n^2 translators, only n.
That doesn't mean that in some special/common cases a direct translation
can't be available and make the better choice.
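
The "n^2 versus n" point can be put into numbers with a small sketch. The exact counts depend on assumptions the thread does not spell out: here every format is converted in both directions, each direction counts as its own translator, and the intermediate format is a separate hub (if the hub is itself one of the formats, such as XHTML, the hub count drops to 2 * (n - 1)). The method names are made up for this example.

```ruby
# Number of one-way translators needed to connect n formats directly:
# one for every ordered (source, target) pair.
def direct_converters(n)
  n * (n - 1)
end

# With a separate intermediate format, each format needs one parser into
# the hub and one generator out of it.
def hub_converters(n)
  2 * n
end

# The eight formats from the diagram earlier in the thread.
formats = %w[Textile Markdown Mediawiki RDoc HTML PDF PS OpenOffice]
n = formats.size
puts "direct: #{direct_converters(n)}, via hub: #{hub_converters(n)}"
```

For those eight formats this gives 56 direct translators against 16 via a hub, so the growth is quadratic versus linear, which is the substance of the "n^2, only n" remark.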

Gary Wright

···

On Apr 11, 2007, at 5:40 PM, Victor "Zverok" Shepelev wrote:

Maybe you're right. I can only say that for other "rich formats" (like PDF
or OpenOffice), a textile->pdf conversion can be simpler than XHTML->pdf
(thus it breaks the rule of "the single intermediate format" through
chains like Markdown->XHTML->Textile->PDF). What do you think?

Valid, but not the same. Human languages leave lots of implicit information that isn't easily machine-parsed, so that comparison is out.
But rich formats don't translate well to formats that lack certain capabilities,
particularly PDF to XHTML or Markdown or almost anything else.
PDF is a pretty broad format. Layouts don't translate so easily. Adobe would love to have such a capability working reliably; InDesign could then produce layouts for print and the web. Not likely.

···

On Apr 12, 2007, at 7:00 AM, Gary Wright wrote:

On Apr 11, 2007, at 5:40 PM, Victor "Zverok" Shepelev wrote:

Maybe you're right. I can only say that for other "rich formats" (like PDF
or OpenOffice), a textile->pdf conversion can be simpler than XHTML->pdf
(thus it breaks the rule of "the single intermediate format" through
chains like Markdown->XHTML->Textile->PDF). What do you think?

I think that a translator that is designed specifically for X to Y will
always do better than a translator that goes through an intermediate
language. A Russian to Spanish translation is going to be better than
a Russian to English to Spanish translation. The benefit of an
intermediate language is that you don't need n^2 translators, only n.
That doesn't mean that in some special/common cases a direct translation
can't be available and make the better choice.

Gary Wright