REXML error reporting (XHTML validation)

I've implemented a simple XHTML validation class based on REXML and
YAML, and it works like a charm except for invalid XML: when there is
something like a loose unescaped '<' character, it just raises
ParseException with no obvious reference to the guilty character. Is
it possible to get more useful info out of REXML, or should I use some
other XML validator?

Sanitize class (54 lines total):

http://savannah.nongnu.org/cgi-bin/viewcvs/samizdat/samizdat/samizdat/sanitize.rb?rev=1.99

YAML file with allowed XHTML tags and attributes:

http://savannah.nongnu.org/cgi-bin/viewcvs/samizdat/samizdat/xhtml.yaml?rev=1.99

Dmitri Borodaenko wrote:

> I've implemented a simple XHTML validation class based on REXML and
> YAML, and it works like a charm except for invalid XML: when there is
> something like a loose unescaped '<' character, it just raises
> ParseException with no obvious reference to the guilty character. Is
> it possible to get more useful info out of REXML, or should I use some
> other XML validator?

This is quite nice. I'm poking around, looking to see how best to recover validation error info.

Two comments:

I added logging to my copy so that I could see what was being clobbered during sanitization. Might be worth including this by default.

I see that 'script' elements are deleted, as the yaml file makes no mention of that element.

Thanks for the nice work,

James

REXML saves information about the state of the parse stream, including line
numbers and the reason for the exception. This is admittedly pretty weak,
because it is not obvious what constitutes a "line" in an XML stream. Still,
REXML tries to be good about saving parse state; some of it is captured in
the Source class and repeated in the ParseException. I'd recommend looking
at the Source class to see if any of its methods help you.

The main problem right now is that REXML uses '<' as the line separator, as it
is the only reasonable way to parse open-ended streams.

The short version: I'm struggling with how to improve REXML's error
reporting while keeping it reasonably efficient, and I haven't come up with
a good solution yet.
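
A minimal sketch of reading that parse state back out of the exception, for anyone who wants to try it. It assumes the `line` and `position` readers on REXML::ParseException behave as in current REXML releases; `xml_error_report` is a hypothetical helper name, not part of REXML or of the Sanitize class. Given the '<' line-separator caveat above, treat the numbers as hints rather than exact coordinates.

```ruby
require 'rexml/document'

# Hypothetical helper: returns nil for well-formed input, otherwise a hash
# with whatever location hints REXML attached to the exception.
def xml_error_report(text)
  REXML::Document.new(text)
  nil
rescue REXML::ParseException => e
  {
    # The first line of the message is usually the short reason; the rest
    # is REXML's own dump of line, position and unconsumed input.
    reason:   e.message.lines.first.to_s.strip,
    line:     e.line,      # may be nil or imprecise
    position: e.position   # may be nil or imprecise
  }
end

report = xml_error_report("<p>1 < 2</p>")
puts "Invalid XML: #{report[:reason]}" if report
```

Showing only the short reason keeps the raw Ruby dump away from end users while still saying something more specific than "parse error".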

···

On Monday 08 November 2004 10:40, Dmitri Borodaenko wrote:

> I've implemented a simple XHTML validation class based on REXML and
> YAML, and it works like a charm except for invalid XML: when there is
> something like a loose unescaped '<' character, it just raises
> ParseException with no obvious reference to the guilty character. Is
> it possible to get more useful info out of REXML, or should I use some
> other XML validator?

--
### SER
### Deutsch|Esperanto|Francaise|Linux|XML|Java|Ruby|Aikido
### http://www.germane-software.com/~ser jabber.com:ser ICQ:83578737
### GPG: http://www.germane-software.com/~ser/Security/ser_public.gpg

> I added logging to my copy so that I could see what was being clobbered
> during sanitization. Might be worth including this by default.

Err, I can't throw Ruby dumps on unsuspecting Wiki users: my problem
is not just to find the cause, but also to report it nicely.

> I see that 'script' elements are deleted, as the yaml file makes no
> mention of that element.

Right, that was on purpose.

Btw, I've noticed that this script doesn't completely filter out things like:

<IMG width="0" height="0" style="bac\kground:
ur\l(javascript:alert('boop'));" />

...although it cripples it a bit by escaping quotes. I don't want to
remove "style" attributes. Is there any easy way around parsing CSS?
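
One possible way to avoid a full CSS parser, sketched here under stated assumptions: accept a style value only if every declaration is a plain `property: value` pair built from a deliberately small character set, and reject everything else. `safe_style?` and `SAFE_DECLARATION` are hypothetical names, not part of the Sanitize class, and the permitted character set is an assumption that would need tuning.

```ruby
# Only lowercase property names and values made of letters, digits and a
# few punctuation characters are accepted. Backslashes, parentheses, quotes
# and colons inside the value are all rejected.
SAFE_DECLARATION = /\A\s*[a-z-]+\s*:\s*[a-zA-Z0-9#%., -]+\s*\z/

# True only if the value contains at least one declaration and every
# declaration matches the conservative pattern above.
def safe_style?(value)
  declarations = value.split(';').map(&:strip).reject(&:empty?)
  !declarations.empty? && declarations.all? { |d| d.match?(SAFE_DECLARATION) }
end

safe_style?("color: #fff; margin-left: 2em")
safe_style?("bac\\kground:\nur\\l(javascript:alert('boop'));")
```

The first call passes and the second is rejected because of the backslash in the property name. The trade-off is that legitimate values such as `url(...)` backgrounds or quoted font names are rejected too.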

···

On Tue, 9 Nov 2004 02:47:11 +0900, James Britt <jamesunderbarb@neurogami.com> wrote:

--
Dmitry Borodaenko

Dmitri Borodaenko wrote:

>> I added logging to my copy so that I could see what was being clobbered
>> during sanitization. Might be worth including this by default.

> Err, I can't throw Ruby dumps on unsuspecting Wiki users: my problem
> is not just to find the cause, but also to report it nicely.

>> I see that 'script' elements are deleted, as the yaml file makes no
>> mention of that element.

> Right, that was on purpose.

Ah, I see. I thought of this as the start of a general-purpose lib that might then be used by some more specific application.

A suggestion (motivated by self-interest): arrange for the code to allow all proper XHTML by default, with the option of passing in a set of elements and/or attributes that are disallowed at validation time.

For example, if you decide to disallow style or class attributes, you could pass this information in when calling sanitize.

Perhaps sanitize could take an optional hash parameter:
   sanitize(html, filter = {})

and disallowed elements/attributes could be specified as, perhaps:

  'script' => '',               # no script element at all
  'img'    => 'usemap, height', # allow images, but
                                # no usemap or height attributes
  '*'      => 'style, class'    # no class or style on any element

Just a thought; it's easy to make suggestions when you're not writing the code ;)

This way, you need not keep editing the base yaml file when adjusting what to sanitize.
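
A minimal sketch of how such an overlay might work, leaving the base table untouched. The shape of `BASE` (element name mapped to an array of allowed attributes) and the name `overlay_filter` are assumptions made for illustration; the real xhtml.yaml may be organised differently.

```ruby
# Hypothetical "complete" whitelist: element name => allowed attributes.
BASE = {
  'img'    => %w[src alt usemap height width style class],
  'p'      => %w[style class],
  'script' => %w[type src]
}.freeze

# Returns a new whitelist with the filter applied; the base is not modified.
# An empty string removes the element, a comma-separated list removes
# attributes, and the '*' key applies to every element.
def overlay_filter(base, filter)
  global = filter.fetch('*', '').split(/\s*,\s*/)
  base.each_with_object({}) do |(element, attrs), result|
    banned = filter[element]
    next if banned == ''                   # drop the element entirely
    local = banned.to_s.split(/\s*,\s*/)   # nil means no per-element bans
    result[element] = attrs - global - local
  end
end

filter = {
  'script' => '',
  'img'    => 'usemap, height',
  '*'      => 'style, class'
}
allowed = overlay_filter(BASE, filter)
# allowed is now {"img" => ["src", "alt", "width"], "p" => []}
```

Because the overlay returns a fresh hash, rolling back to the most tolerant setting is just a matter of calling it with an empty filter.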

James

···

On Tue, 9 Nov 2004 02:47:11 +0900, James Britt <jamesunderbarb@neurogami.com> wrote:

> >> I see that 'script' elements are deleted, as the yaml file makes no
> >> mention of that element.
> > Right, that was on purpose.
> Ah, I see. I thought of this as the start of a general-purpose lib that
> might then be used by some more specific application.

Well, it is going to be general-purpose -- if someone is going to use
it. As with Samizdat's RDF layer, I don't want to implement features
that no-one needs, and since I'm the only one currently using it, I do
only the stuff I need.

> A suggestion (motivated by self-interest): arrange for the code to allow
> all proper XHTML by default, with the option of passing in a set of
> elements and/or attributes that are disallowed at validation time.
>
> (...)
>
> This way, you need not keep editing the base yaml file when adjusting
> what to sanitize.

I don't see the point: the reason I've put it all into a YAML file is
that it's easier to edit it there, rather than in your source code.
Or, if you want to do it programmatically, all you have to do is:

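# Reopen the class to expose the table loaded from the YAML file
class Sanitize
  attr_reader :xhtml
end
...
# Remove 'style' from the '_common' entry at run time
Sanitize.instance.xhtml['_common'].delete('style')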
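
For readers without the Samizdat source at hand, here is a self-contained sketch of the same technique on a stand-in class. `DemoSanitize` and its hard-coded table are hypothetical; the assumptions are that the real class is a Singleton, that it keeps the parsed YAML in `@xhtml`, and that '_common' holds a list of attribute names.

```ruby
require 'singleton'

# Stand-in for the real Sanitize class, which loads its table from xhtml.yaml.
class DemoSanitize
  include Singleton

  def initialize
    @xhtml = { '_common' => %w[class style title], 'img' => %w[src alt] }
  end
end

# Reopen the class to add a reader for the table.
class DemoSanitize
  attr_reader :xhtml
end

# Adjust the shared instance in place, with no extra API needed.
DemoSanitize.instance.xhtml['_common'].delete('style')
```

Note that this mutates the single shared instance, so the change affects every later caller in the same process.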

···

On Thu, 11 Nov 2004 06:22:42 +0900, James Britt <jamesunderbarb@neurogami.com> wrote:

--
Dmitry Borodaenko

Dmitri Borodaenko wrote:

...

> I don't see the point: the reason I've put it all into a YAML file is
> that it's easier to edit it there, rather than in your source code.

I'm thinking that editing that YAML file directly means that as you add or remove things you have to check that you're dealing with all the valid XHTML items you might want. Once you delete something, how do you know, sometime later, that it is an option that could be restored, other than perhaps by looking through a DTD? Having a base that is always complete, with edits overlaid, makes it easier to roll back to the most tolerant sanitization.

How about keeping different versions of the yaml file? And once again,
you don't have to overload the API for something you can do directly,
the way I've shown.

···

On Thu, 11 Nov 2004 23:40:09 +0900, James Britt <jamesunderbarb@neurogami.com> wrote:

> > I don't see the point: the reason I've put it all into a YAML file is
> > that it's easier to edit it there, rather than in your source code.
> I'm thinking that editing that YAML file directly means that as you add
> or remove things you have to check that you're dealing with all the
> valid XHTML items you might want. Once you delete something, how do you
> know, sometime later, that it is an option that could be restored, other
> than perhaps by looking through a DTD? Having a base that is always
> complete, with edits overlaid, makes it easier to roll back to the most
> tolerant sanitization.

--
Dmitry Borodaenko