Phrogz wrote:
For example, the
following snippet is a perfectly well-formed and valid HTML document,
but none of the regexps posted in this thread so far are able to
correctly parse it:
<HTML/
<HEAD/
<TITLE/>/
<P/>
Wow. I was all fired up to call you out on this, and ask you what
insane cocaine you were smoking when you made this claim.
Well, keep in mind that this is a very contrived, extreme, exaggerated
example that you will never find in the wild, simply because not only
the regexps in this thread but also the browsers cannot parse it --
although I heard rumors that Emacs/w3 actually supports some of the
features used by that snippet. I just wanted to demonstrate that
there are a lot of weird things in HTML that are much better left to
the people that write HTML parsers rather than writing the same
incomplete HTML regexps over and over and over and over again.
I was a web developer for many many years and standards were very,
very important to me. I thought I knew the specs.
The above example mainly draws upon one simple fact: the HTML
designers decided to make HTML an application of SGML without actually
having a *beep*ing clue about SGML, thus creating some "interesting"
interactions with SGML's parsing rules. And who can blame them? The
reason they created HTML in the first place, was that SGML is so
mind-bogglingly complex that *nobody* has a *beep*ing clue!
So, you can read all the W3C specs you want, but what makes HTML so
weird isn't actually in there; it's buried somewhere in the thousands
of pages of ISO SGML specs.
And then I ran that by validator.w3.org along with an HTML 4.01 strict
DTD, and - to my utter shock and surprise and horror - it turns out
you were correct.
Well, let's see what actually happens. We start out with this:
<html>
<head>
<title>&gt;</title>
</head>
<body>
<p>&gt;</p>
</body>
</html>
First, SGML is case-insensitive and HTML inherits that property. This
already fools about 99% of all HTML regexps that you can find on the
web:
<HTML>
<HEAD>
<TITLE>&gt;</TITLE>
</HEAD>
<BODY>
<P>&gt;</P>
</BODY>
</HTML>
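To make the case-sensitivity point concrete, here is a minimal Python sketch. The pattern and the sample string are assumptions made up for illustration; they are not any of the regexps posted in this thread.

```python
import re

# A typical "extract the title" pattern, written the way most
# hand-rolled HTML regexps are: lowercase tag names only.
naive = re.compile(r"<title>(.*?)</title>", re.DOTALL)

# The same pattern with case-insensitive matching switched on.
tolerant = re.compile(r"<title>(.*?)</title>", re.DOTALL | re.IGNORECASE)

# Hypothetical sample document using uppercase tag names.
doc = "<HTML>\n<HEAD>\n<TITLE>&gt;</TITLE>\n</HEAD>\n</HTML>"

naive_match = naive.search(doc)        # finds nothing
tolerant_match = tolerant.search(doc)  # finds the TITLE element

print(naive_match)
print(tolerant_match.group(1))
```

The flag only fixes this one step, of course; the later transformations defeat the tolerant pattern as well.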
We don't need to escape closing/right angle brackets (>), only
opening/left ones (<):
<HTML>
<HEAD>
<TITLE>></TITLE>
</HEAD>
<BODY>
<P>></P>
</BODY>
</HTML>
Next, we use a feature that HTML inherited from SGML (without anybody
noticing), called Null End Tags (NET), which allows you, basically, to
DRY out (in Rails speak) the end tags. If you close the start tag
with a slash instead of an angle bracket, you can replace the end tag
with another slash, so
<tag>some content</tag>
becomes
<tag/some content/
That looks like this:
<HTML/
<HEAD/
<TITLE/>/
/
<BODY/
<P/>/
/
/
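The single-element rewrite described above (<tag>some content</tag> versus <tag/some content/) can be sketched in Python as follows. This is a toy under stated assumptions: the function name expand_net is made up, it only handles a flat element whose content contains neither "/" nor "<", and it makes no attempt at the nested form. A real SGML parser does this with the DTD in hand.

```python
import re

# "<name/" opens the element, the next "/" closes it.
# Assumption: content may not contain "/" or "<" (flat elements only).
NET_ELEMENT = re.compile(r"<(?P<tag>[A-Za-z][A-Za-z0-9]*)/(?P<content>[^/<]*)/")

def expand_net(markup: str) -> str:
    """Rewrite flat Null End Tag elements into ordinary start/end tags."""
    return NET_ELEMENT.sub(
        lambda m: f"<{m['tag']}>{m['content']}</{m['tag']}>",
        markup,
    )

print(expand_net("<tag/some content/"))
print(expand_net("<TITLE/>/"))
```

Note that in the second call the ">" is the element's content, not part of a tag, which is exactly what trips up a human reader.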
Quite weird, huh? But we are not done yet! End tags are optional if
they can be inferred from the context (and if the DTD specifically
allows this). So, for example, since BODY cannot occur inside of
HEAD, the opening BODY tag implies a closing HEAD tag:
<HTML/
<HEAD/
<TITLE/>/
<BODY/
<P/>
And one last step: actually, not only are end tags optional, you can
even lose the tags entirely if they can be inferred. P can only occur
inside a BODY, so the BODY can be inferred from P and we can get rid
of it:
<HTML/
<HEAD/
<TITLE/>/
<P/>
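The two inference steps (implied end tags and implied start tags) can be imitated with a small stack-based Python sketch. The PARENT table and the function infer_tags are hypothetical and heavily simplified: each element gets exactly one allowed parent, and text content is ignored. A real SGML parser derives all of this from the DTD's content models.

```python
# Hypothetical, heavily simplified content model: element -> required parent.
PARENT = {"HEAD": "HTML", "BODY": "HTML", "TITLE": "HEAD", "P": "BODY"}

def infer_tags(start_tags):
    """Make omitted start and end tags explicit for a list of start tags."""
    events, stack = [], []
    for tag in start_tags:
        # Ancestors this tag needs, outermost first, ending in the tag itself.
        chain = [tag]
        while chain[0] in PARENT:
            chain.insert(0, PARENT[chain[0]])
        # How many currently open elements are valid ancestors of the tag?
        common = 0
        while (common < len(stack) and common < len(chain) - 1
               and stack[common] == chain[common]):
            common += 1
        # Everything else that is still open gets an implied end tag.
        while len(stack) > common:
            events.append(f"</{stack.pop()}>")
        # Missing ancestors get an implied start tag, then the tag itself opens.
        for name in chain[common:]:
            events.append(f"<{name}>")
            stack.append(name)
    # End of document closes whatever is still open.
    while stack:
        events.append(f"</{stack.pop()}>")
    return events

print(" ".join(infer_tags(["HTML", "HEAD", "TITLE", "P"])))
```

Fed only the four start tags that survive in the final snippet, the sketch puts back the BODY element and all the end tags.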
Thanks for sharing.
My pleasure. BTW: this is not as useless as it might first seem.
It's actually quite important to know that the W3C Validator uses an
SGML parser to validate your documents, because that means it's
worthless for
a) XHTML, because XHTML is an application of XML, not SGML and
b) HTML, too, because browsers don't parse HTML as SGML, they parse
it as Tag Soup. (To be more precise: if the validator tells you
your HTML is invalid, then you know it's broken; however, if it
tells you it's valid, that doesn't necessarily mean it'll
actually work in a browser.)
XHTML is much better validated with an XML Schema Validator such as
Christoph Schneegans' Schema Validator at <http://Schneegans.de/sv/>
or the Validome validator.
It's crucial to remember that the W3C Validator and the browser parse
HTML quite differently and that neither of those necessarily has
anything to do with how *you* might actually parse it (-; I once
found a cute little snippet on a website that I unfortunately can no
longer locate, that demonstrated this quite nicely. That snippet had
a little typo in it that fooled the human reader, the W3C Validator
and the browser into reading that exact same snippet in three
radically different ways, although what was *really* meant was
actually a *fourth* thing.
Just one quick example: HTML allows you to leave out the quotation
marks around attribute contents. So,
<A HREF=search.html>Search</A>
is perfectly fine, however
<A HREF=http://google.com/>Search</A>
isn't, because as we now know, the first slash of the double slash
closes the start tag and the second one gets interpreted as a Null
End Tag, so the above snippet would actually be parsed as something
like the following:
<A HREF="http:"></A>google.com/>Search</A>
And the validator will complain about an extra closing </A> tag, while
the browser will quietly fix that up to mean
<A HREF="http://google.com/">Search</A>
which is obviously what was intended. However, if you don't know
about Null End Tags you can stare at the Validator's Error Message:
Line X, Column Y: end tag for element "A" which is not open
for hours and still not realize that your problem has nothing to do
with an extra end tag, Line X or Column Y but that you are actually
missing some quotation marks somewhere else in your document.
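Here is a rough Python sketch of how that SGML-style reading comes about. It is an illustration only, not how the validator is implemented: the function name sgml_style_reading is made up, and the set of characters allowed in an unquoted attribute value (letters, digits, ".", "-", "_", ":") is an assumption.

```python
import re

# Characters assumed to be allowed in an unquoted attribute value.
UNQUOTED_VALUE = r"[A-Za-z0-9._:-]+"

# A start tag with one unquoted attribute, closed by "/" instead of ">".
NET_START_TAG = re.compile(
    rf"<(?P<tag>[A-Za-z][A-Za-z0-9]*)\s+(?P<attr>[A-Za-z]+)=(?P<value>{UNQUOTED_VALUE})/"
)

def sgml_style_reading(markup: str) -> str:
    """Show how a Null End Tag aware parser would split the markup."""
    m = NET_START_TAG.match(markup)
    if m is None:
        return markup  # no slash-closed start tag, nothing surprising happens
    rest = markup[m.end():]
    # Everything up to the next "/" is the element's content.
    content, slash, tail = rest.partition("/")
    if not slash:
        return markup
    return f'<{m["tag"]} {m["attr"]}="{m["value"]}">{content}</{m["tag"]}>{tail}'

print(sgml_style_reading("<A HREF=http://google.com/>Search</A>"))
print(sgml_style_reading("<A HREF=search.html>Search</A>"))
```

The unquoted value stops at the first character outside the allowed set, so the URL is cut off after "http:" and the two slashes are consumed as tag syntax. The second input contains no slash at that position and comes back unchanged.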
BTW: the W3C gave up on SGML long ago and developed XML as a much
simpler subset of SGML and XHTML as an application of XML. Now, the
WHAT-WG followed by basically giving up any pretenses that HTML5 was
actually an application of SGML; rather it is a language in its own
right, totally separate from both XML and SGML. And now we know why!
One last goodie: you can actually specify an alternate root element in
the DOCTYPE declaration:
<!DOCTYPE p PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<P/
Although I have no friggin' clue how a browser is actually supposed
to display this.
Anyway, that concludes today's off-topic SGML rant, let's now get back
to our regularly scheduled Smalltalk and Lisp threads, please (-;
jwm
On May 15, 10:18 pm, Jörg W Mittag <Joerg.Mit...@Web.De> wrote: