STEP Structured Text Entry Processor

This was originally a follow-up to my question about YAML documentation, but
grew into a separate topic.

This is not an announcement, but I think it’s about time to get some
feedback on the design. It’s hardly specific to Ruby, but I guess it would
work well in a Ruby context.

“Mauricio Fernández” batsman.geo@yahoo.com wrote in message
news:20030223073420.GA13356@student.ei.uni-stuttgart.de

BTW: what tool(s) did you use to produce the Yaml documentation?

Yaml :-) Take a look at doc/yamlrb.yod: it is a “Yaml document”, to be
processed by Yod (Yaml Ok Documentation). See src/yod.rb:

I kind of figured :-) But it must be post-processed to xsl-fo or something?

I have on and off for a long time been hacking on a simple XML format - like
some of the people behind YAML - and then along came YAML. Meanwhile I changed
focus towards a format that is supposed to be especially suited for
documentation purposes. YAML is also friendly to type - but still not
the best possible for text entry. I hacked something in Ruby but need to
get back to it. The primary motivation is that TeX is too complex and xml-doc
is too cumbersome - and finally there is the need for a plain text format, as
you can’t trust word processors to be around in the long term, and they are
bad for formatting and source control.

Ruby doc format is a similar approach, but not sufficiently advanced in
formatting.

Perhaps I should ask for some help here in getting the format completed?

The design goals are

  • an absolute minimum of escape symbols
  • arbitrarily complex nesting
  • automatic tag-close based on context
  • support for meta-tagging (comments, other languages, notes)
  • headers etc. should not be escaped by = for level 1, == for level 2 etc.,
    because it makes it difficult to move a section.

I see it as a possibility to use a Wiki-like interface for advanced text
formatting purposes - and also for non-text purposes - but here YAML or even
XML might be better.

Currently I haven’t looked much into how to represent lists, a case where
YAML clearly excels.

I’ve written a prelim. spec., but I’m considering changing it a bit. Here
are the main points (it’s simple because that’s the whole point).

I’ve currently got some problems handling paragraph breaks - I don’t want to
type them everywhere, but deducing them can be tricky.

I called it STEP: Structured Text Entry Processor.

Text is text. A blank line is a paragraph break (whatever that means in the
given context). The only escape symbols are curly braces. These form a
command.
Example:

{chapter The first chapter}
Here is text. The next sentence is bolded. {b This is bolded text}. This is
not bold.
{chapter The next chapter}
Here is text in chapter two.
{section a subsection} Text in section. {note needs cleanup}
{chapter Also a chapter}

Clearly tags (called commands) follow ‘{’. These are not predefined in STEP.
STEP provides means to define tag hierarchies, which enable one tag to
automatically close another. STEP also has two kinds of commands: those that
have a header and a body, and those that only have a header:
{b header only}, {chapter header} body {chapter header} body
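To make the brace syntax concrete, here is a minimal Ruby sketch of a parser for nested {name text} commands. It is not the prototype mentioned in this post: the method name parse_step and the [name, children] node shape are made up for illustration, command names are assumed to consist of word characters only, and escapes, bodies and automatic closing are ignored.

```ruby
require "strscan"

# Hypothetical sketch: turn STEP text into a nested list where plain text is
# a String and a command is [name, children]. No escapes, no bodies.
def parse_step(src)
  root = ["root", []]
  stack = [root]
  ss = StringScanner.new(src)
  until ss.eos?
    if ss.scan(/\{(\w*)[ \t]?/)
      # '{' opens a command; the first word is taken as its name.
      node = [ss[1], []]
      stack.last[1] << node
      stack.push(node)
    elsif ss.scan(/\}/)
      raise ArgumentError, "unbalanced '}'" if stack.size == 1
      stack.pop
    else
      # Everything up to the next brace is plain text.
      stack.last[1] << ss.scan(/[^{}]+/)
    end
  end
  raise ArgumentError, "missing '}'" unless stack.size == 1
  root[1]
end
```

Calling parse_step on "This is {b bolded text} here." yields the two surrounding strings with a ["b", ["bolded text"]] node between them; commands nested in a header, such as {chapter {1} Introduction}, simply become child nodes.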

I am actually considering having two different symbols for the two command
styles: {b header only}, [chapter header] body [chapter header] body
But then I would have more symbols to escape.

I am also considering moving the command name outside of ‘{’:
This is b{bolded text} this is not bolded.
chapter{The chapter title} The chapter body section{Text in section}
However, currently the name follows ‘{’ as in: This is {b bolded text}.

Semantics are the most important, but here is the basic syntax:

::= ( | )( | |)*
::= ‘{’ [+ ] ‘}’
::= (SYMBOL except , ‘{’, ‘}’, ‘(’, or ‘)’)*
::= ( | ‘{’ | ‘}’ | ‘(’ | ‘)’ )*
::= [ ‘(’ ‘)’]
::= – reserved for future
, ::= – see below

The only escaped symbols are ‘{’ and ‘}’. ‘\’ is not escaped: If you want to
write ‘{’ you must write ‘\{’, but if you want to write ‘\{’ you write
‘\\{’. ‘\’ only has a special meaning before ‘{’ or ‘}’.
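As a rough illustration of the rule that a backslash only matters directly before a brace, a Ruby sketch could look like this. The method name unescape_step is hypothetical, and the sketch only handles a single backslash in front of ‘{’ or ‘}’; every other backslash is copied through untouched.

```ruby
# Hypothetical sketch: drop a backslash only when it directly precedes a brace.
# Backslashes anywhere else are ordinary characters and stay as they are.
def unescape_step(text)
  text.gsub(/\\([{}])/) { $1 }
end
```

With this, a run of backslashes that is not followed by a brace comes out exactly as typed, which is the point of not treating the backslash as a general escape character.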

Spaces are usually merged into a single command. To have
explicit spaces in front of text or just spaces, use the command with no
name:
The following are multiple spaces { }and the following are multiple
{ spaces followed by text}.

Not shown: There are special commands for handling source code text completely
unescaped using something similar to <<EOInput, and another simpler option
where { } are only required to be balanced.

Parenthesized arguments are reserved for future use. They would allow a
syntax like {font(courier, 10) some text in courier}.

The following is an attempt to clearly define the space syntax. The
and are significant. is stripped and
is only used to separate the command name from the following text. There are
problems - how to deal with space before and after a field if the field
evaluates to nothing, and there are several issues with explicit
paragraph breaks and implicit breaks (like after a chapter title). Therefore,
a higher level syntax must also be used to handle document output and clean
up repeated breaks.

::= |
::= ( | )+
::= SPACE | TAB
::= (CR LF | LF | CR not followed by LF )
::= * [ *]
::= [] ([] )+

UTF-8 symbols are handled directly by the syntax. In fact the format is
perfectly suited for binary encodings as long as ‘{’, ‘}’ are escaped and
space sequences are contained in { }.

Something that I haven’t covered here is how you can define commands as
macros of other commands, and how you can define commands to be subordinate
to other commands for automatic tag closing. While there is a syntax for
doing so, this is something that can be defined outside of the scripting
syntax such that commands like {chapter} and {section} are predefined. The
processor will also accept undefined commands, but in that case they will be
treated as having no body - that is, they stop exactly at ‘}’.
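The automatic tag closing can be sketched in Ruby roughly as follows. This is only an illustration under assumptions: the LEVELS table, the event format and the method name auto_close are invented here, and the real processor may define command hierarchies quite differently.

```ruby
# Hypothetical hierarchy: a smaller number means an outer level.
LEVELS = { "part" => 0, "chapter" => 1, "section" => 2 }.freeze

# events is a list of [:open, name] and [:text, string] pairs.
# Returns the same stream with explicit [:close, name] events added.
def auto_close(events)
  out = []
  stack = []
  events.each do |kind, value|
    if kind == :open
      # A new command closes every open command at the same or a deeper level.
      while stack.any? && LEVELS.fetch(stack.last) >= LEVELS.fetch(value)
        out << [:close, stack.pop]
      end
      stack.push(value)
    end
    out << [kind, value]
  end
  # Whatever is still open at the end of the input is closed in reverse order.
  out << [:close, stack.pop] while stack.any?
  out
end
```

A new chapter therefore closes an open section and the previous chapter, while a section inside a chapter closes nothing.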

Another issue not covered is that commands inside the header or body of
other commands may be treated specially within that context. Thus a command
can act as modifier to the active parent command: e.g. {chapter {1}
Introduction}, here {1} acts as an enumeration command. This is partly why I
haven’t settled for arguments to commands. In fact the entire header text of
a command could be viewed as arguments to certain commands. E.g.
{font courier, 10}
{font {name courier}{size 10}}

STEP is only a syntax and a processor, so a separate layer on top of STEP
would be needed for a particular purpose. One such layer could be a generic
handler for generating XSL-FO, a subset of LaTeX, HTML and DocBook.

{Early-brainstorming}
As I mentioned, I am considering moving the command name outside of the
curly braces, but I haven’t investigated this further yet. I personally tend
to think “bold” and then realize I need to add some delimiters, typically
going back to add the curly brace. Compare this to LISP versus other
languages’ function syntax: (print “foo”) and print(“foo”)

Also, I am considering having a special short command notation for commands
covering a single word:
Only the word b,,only is bolded.
Only the word {b only} is bolded.

Two commas happen infrequently in natural text but are quick to enter and
easy to read. Two commas (or more) would be escaped by {,}, analogous to
escaping spaces.

It could also be used with linebreaks when preceded by a colon:

chapter:,,This is chapter 1
This is the content of chapter 1.

However, I don’t really like too many special cases just to make things
marginally easier. It’s much easier if commands are exactly { } and nothing
else. I might buy into the ,, notation because it’s so much easier to type.

As mentioned I do have some prototype code around - mail if interested. I
also learned that Ruby really needs a lexer tool; it was not as easy to
implement in Ruby as I had expected.

Mikkel

On Sun, Feb 23, 2003 at 08:06:16AM +0900, MikkelFJ wrote:

“MikkelFJ” mikkelfj-anti-spam@bigfoot.com wrote in message
news:3e58e2cb$0$126$edfadb0f@dtext01.news.tele.dk…

{chapter The first chapter}
Here is text. The next sentence is bolded. {b This is bolded text}. This is
not bold.
{chapter The next chapter}
Here is text in chapter two.
{section a subsection} Text in section. {note needs cleanup}
{chapter Also a chapter}

I forgot one important point:
STEP is defined such that there is a strict conversion to XML from STEP. In
fact my prototype generates XML output. There are several ways to do the
translation, but one is easily defined. For example:

The first chapter Here is text. Then next sentence is bolded. This is bolded text. This is not bold. The next chapter There is text in chapter two. a subsection Text in section needs cleanup </chapter Also a chapter
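One easily defined translation is to let every command become an XML element of the same name. The Ruby sketch below assumes exactly that; the [name, children] node shape and the method name to_xml are hypothetical, and the element names are not necessarily those the prototype emits.

```ruby
XML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;" }.freeze

# Hypothetical sketch: plain text is a String, a command is [name, children].
# Each command becomes an element named after the command.
def to_xml(nodes)
  nodes.map do |node|
    if node.is_a?(String)
      # Escape the characters that are special in XML text content.
      node.gsub(/[&<>]/, XML_ESCAPES)
    else
      name, children = node
      "<#{name}>#{to_xml(children)}</#{name}>"
    end
  end.join
end
```

Because STEP commands always nest properly, this mapping always yields well-formed element nesting as long as the command names are valid XML names.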

Mikkel

If TeX and LaTeX are too complex, have you looked at Lout?
http://snark.ptc.spbu.ru/~uwe/lout/lout.html
I’ve not had time to get into it, but it seems to have relatively
few rules. People always want to do more things, so more cases crop
up, which I can see from the elided examples you have run into
already. One thing that irked me about troff, which I otherwise
like, is that I can never remember which commands just affect
following text, and which need arguments on the same line. Also,
having tables and equations outside of troff itself meant that
dealing with them was a pain when you tried after 6 months of not
using them. I don’t hear much about troff nowadays.

One suggestion: get rid of ‘\’ notation. Too many programs use it.
Yes, it is nice to have a consistent standard, but when you pipe a
text string through a couple of these then you end up with \\
which are really bad for the eyes! :-) {brace} {closebrace} could
allow insertion of your special characters.

    Hugh

On Mon, 24 Feb 2003, MikkelFJ wrote:

I have on and off for a long time been hacking on a simple XML format - like
some of the people behind YAML - and then along came YAML. Meanwhile I changed
focus towards a format that is supposed to be especially suited for
documentation purposes. YAML is also friendly to type - but still not
the best possible for text entry. I hacked something in Ruby but need to
get back to it. The primary motivation is that TeX is too complex and xml-doc
is too cumbersome - and finally there is the need for a plain text format, as
you can’t trust word processors to be around in the long term, and they are
bad for formatting and source control.

This was originally a follow-up to my question about YAML
documentation, but grew into a separate topic.

This is not an announcement, but I think it’s about time to
get some feedback on the design. It’s hardly specific to
Ruby, but I guess it would work well in a Ruby context.

Just a quick note: STEP might not be a good name for it, as there is
already a documentation standard called STEP:
BSISG STEP Home Page. It might not matter, but I thought I’d
mention it.

Nathaniel

<:((><

MikkelFJ [mailto:mikkelfj-anti-spam@bigfoot.com] wrote:

RoleModel Software, Inc.
EQUIP VI

Just for comparison, Python is beginning to use a format called
reStructured Text, described here:
http://docutils.sourceforge.net/#restructuredtext

One interesting bit is that it indicates section titles like this:

This Is My Title
================

This Is My Subtitle

This Is My Sub-Subtitle


which a) only provides three levels (if I read it right), b) requires
changing the underline characters to move a section, and c) must be
pretty hard to parse, requiring one line of lookahead.  OTOH, it is more
human-readable.

-- 
Frank Mitchell
frankm@bayarea.net

Please avoid sending me Word or PowerPoint attachments.
See http://www.fsf.org/philosophy/no-word-attachments.html

“Hugh Sasse Staff Elec Eng” hgs@dmu.ac.uk wrote in message
news:Pine.GSO.4.53.0302231919130.23615@neelix…

Thanks for the response.

If TeX and LaTeX are too complex,

It wasn’t entirely correctly formulated: It is not really the complexity
that bothered me, because I could avoid advanced issues. The problem was
that I quickly ran into problems with non-ASCII symbols like the Danish æøå
and even with escaping ASCII symbols frequently used in LaTeX syntax. LaTeX
also seems to have some difficulties with whitespace: /LaTex/ to avoid
consuming space after the command. I wanted a clear start and endpoint of
the command: ‘{’ and ‘}’. If the dd letters in middle are to be bolded, I
simply write
mi{b dd}le.
Putting the command name outside of ‘{’ would require a space before the
command name. That’s why it’s not b{something bold} but {b something bold}.
I forgot that in the original post.

have you looked at Lout?
http://snark.ptc.spbu.ru/~uwe/lout/lout.html
I’ve not had time to get into it, but it seems to have relatively
few rules.

I didn’t know Lout. Thanks for the pointer.

Lout seems to address many of the same issues that I do with STEP. However,
it mixes the text entry syntax with document layout. I do not want to handle
document formatting - I want to write a simple syntax that translates into
existing formatting languages like DocBook, XSL-FO or LaTeX (or use it for
something non-formatting like assigning bug reports). Of course commands
will need to be defined on top of STEP - you could easily steal DocBook’s
commands for example. In the end each backend processor provides a native
set of commands, and additional macro definitions provide easier use of the
same commands - not unlike LaTeX. The difference is that the higher level
macros are delivered to the backend, so macro resolution is optional and can
be replaced by a different backend using these commands directly - e.g. you
can define {section} in terms of simple HTML commands, or you could
programmatically process it - this is where Ruby would be great.
You could say I’m aiming at the same thing as Yaml, except the focus is on
entering text, not generic data structures. Thus you can plug all kinds of
back-ends into the STEP processor.

People always want to do more things, so more cases crop
up, which I can see from the elided examples you have run into
already.

True, but I try to address the syntax and structure part and leave the other
problems to those that have already addressed them, like XSL-FO. My problems
essentially relate to 3 issues: 1) automatically closing tags, 2)
whitespace and 3) escaping text. These are very fundamental problems that
must be solved. Anything else can be solved within the STEP syntax by defining
appropriate commands. Perhaps not always entirely elegant, but often more
elegant than the alternatives.

Problems like where paragraphs begin and end really are outside the scope
of STEP. I have made it possible to clearly identify the location of breaks,
but in the end a higher level processor can strip them and require explicit
paragraph breaks, like DocBook, or use sensible rules to interpret the
breaks produced by STEP. I also made it possible to have exact control over
whitespace. The choice to collapse spaces to a single word-break is for XML
compatibility and also for user-friendliness - all space information could be
passed along with a break token.

It’s important to remember that STEP is a structured text entry format and
the logic for processing that information. It is not in itself a document
formatter. Its purpose is to make it as simple as possible to enter text
information with as much metadata as possible.

For example, I could just require paragraphs to be explicit commands, like
in Xml-Doc. It would be less convenient though. Also, the complexity of
spaces should be compared to Ruby’s syntax: Ruby’s syntax is pretty
involved - but in return it is intuitive to the user. It’s about
balancing day to day usability against the problem of explaining the syntax
and its special cases.

One thing that irked me about troff, which I otherwise
like, is that I can never remember which commands just affect
following text, and which need arguments on the same line.

Yes, this is really my main concern as well. This is why I consider using
two escape syntaxes: “[section header] body” and “{b a bolded text}”, but
then it is often implicit by the command and it may be more difficult to
remember and understand why there are two syntaxes. I would appreciate some
more feedback on how to deal with this. I have considered several options,
but the current solution seems to be the most practical. You don’t want to
track end tags 200 pages down a document to close the {part I} tag. You’d
just write a new {part II} tag or a {postscript} tag.

One thing I didn’t mention was that you can explicitly close the body of a
command:

{chapter Ch. 1} text in ch.1 {section My section} body text {/section}. More
text in ch.1 but outside the section. {chapter Ch. 2} …

having tables and equations outside of troff itself meant that
dealing with them was a pain when you tried after 6 months of not
using them. I don’t hear much about troff nowadays.

I have never worked with troff.
I expect equations in STEP would initially require a LaTeX target and use
something like {LaTex … a raw latex equation syntax here}. Then you could
use tools like TeX4ht to get MathML. This is really outside the scope of
STEP itself, but an issue for a higher level processor (such as XSL-FO
output or a Wiki Engine). Of course you could define an equation syntax in
STEP which would be converted to LaTeX or MathML.

One suggestion: get rid of ‘\’ notation. Too many programs use it.
Yes, it is nice to have a consistent standard, but when you pipe a
text string through a couple of these then you end up with \\
which are really bad for the eyes! :-) {brace} {closebrace} could
allow insertion of your special characters.

I totally agree - but this is also exactly what I did. The only escape
characters are ‘{’ and ‘}’. However, I have two choices to escape the escape:
either doubling them like {{ and }}, which would give me the same problem as
with \\ and even worse problems related to nesting, or use a different
notation.
I chose ‘\{’ for that purpose. ‘\’ has no special meaning otherwise: If you
write “\\\\” you mean to write four backslashes and if you write “\{\” you
mean to write “{\”.
If I want to escape source code with balanced braces the syntax is
{|if(x < 2) { printf(“hello”) }; |} (in this particular case space is
significant after the bar). If the curly braces are not balanced, you write
an end marker: {eot| if(x < 2) printf(“left curly brace: {”); |eot}, where
eot is arbitrarily chosen.
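The two raw-text forms could be recognized with a single regular expression, sketched here in Ruby. This is an assumption-laden simplification, not the prototype’s code: RAW_BLOCK and raw_blocks are hypothetical names, the end marker is assumed to consist of word characters, and the marker-less form is simply read up to the first “|}” rather than by checking that the braces balance.

```ruby
# Hypothetical sketch: "{marker| raw text |marker}", where marker may be empty.
# The backreference \1 requires the closing marker to equal the opening one.
RAW_BLOCK = /\{(\w*)\|(.*?)\|\1\}/m

# Returns the raw bodies found in src, exactly as typed (spaces included).
def raw_blocks(src)
  src.scan(RAW_BLOCK).map { |_marker, body| body }
end
```

Since the body is taken verbatim, an unbalanced ‘{’ inside an end-marked block needs no escaping at all.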

The important thing is that there are no traps. Unless you use ‘{’ you can
type anything. You won’t accidentally type a command you are not aware of.

Mikkel

“Hugh Sasse Staff Elec Eng” hgs@dmu.ac.uk wrote in message
news:Pine.GSO.4.53.0302231919130.23615@neelix…

One suggestion: get rid of ‘\’ notation. Too many programs use it.
Yes, it is nice to have a consistent standard, but when you pipe a
text string through a couple of these then you end up with \\
which are really bad for the eyes! :-) {brace} {closebrace} could
allow insertion of your special characters.

OK, I think I got this wrong in my previous answer - I thought you meant
STEP would end up with \\, but you mean any external processor, like we
have seen with regular expressions in strings. I still don’t think it is as
much of an issue because STEP doesn’t escape ‘\’ itself, but you do have a
valid point. Personally I would grow tired of typing it, but as an
alternative it would be good. It could also be {(} {)}. Completely avoiding
any other escape than {} is appealing because there is less to explain. It’s
easy to say “{” => “{(}” and “}” => “{)}”. And there’s no “but what about ‘\’?”.

In fact, in my current prototype, parsing ‘\{’ keeps creeping in as a bug. So
implementation-wise it’s also easier and cleaner. I think I’ll buy it -
preferring {(} and {)} for typability and readability. But suggestions are
welcome.
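The {(} and {)} mapping is simple enough to sketch in a few lines of Ruby. The method names escape_braces and unescape_braces are made up for illustration, and the sketch looks at literal text only, not at real commands.

```ruby
BRACE_ESCAPES = { "{" => "{(}", "}" => "{)}" }.freeze

# Hypothetical sketch: replace literal braces in plain text by {(} and {)}.
def escape_braces(text)
  text.gsub(/[{}]/, BRACE_ESCAPES)
end

# Reverse mapping: {(} becomes "{" and {)} becomes "}".
def unescape_braces(text)
  text.gsub(/\{([()])\}/) { $1 == "(" ? "{" : "}" }
end
```

The round trip works for any input, since the escaped text contains a brace only as part of {(} or {)}.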

File.new(“foo”).each {openbrace} |line| {br}
p line{br}
{closebrace}

File.new(“foo”).each {(} |line|{br}
p line{br}
{)}

But cleaner would be

{yburuby|
File.new(“foo”).each { |line|
p line
}
|yburuby}

… if you ever wondered what palindromes are good for :wink:

Mikkel

Sorry for following up my own post – and mentioning one of the dreaded
‘P’ words – but …

Frank Mitchell wrote:

One interesting bit is that [reStructured Text] indicates section titles
[using an underscore line]
which a) only provides three levels (if I read it right),

… which I didn’t:

There may be any number of levels of section titles, 

and

Rather than imposing a fixed number and order of section title 
adornment styles, the order enforced will be the order as 
encountered. 

from the full spec at
http://docutils.sourceforge.net/spec/rst/reStructuredText.html


Frank Mitchell
frankm@bayarea.net

Please avoid sending me Word or PowerPoint attachments.
See http://www.fsf.org/philosophy/no-word-attachments.html