ANN: rexml 2.3.5 && 2.2.3

Tobias Reif wrote:

It must be escaped everywhere if it’s part of that sequence but not part
of the ending delimiter of a CData section.

And where does the spec say

not if it appears elsewhere in the document.

The structure of the ‘rule’ sentence dictates that they are speaking
strictly of CData sections.

This is an important point, because even non-validating parsers must
report well-formedness errors – if you were right, then REXML should raise
an exception when it encounters an unquoted ‘>’ in a text node. However, as
an illustration of why an unquoted ‘>’ may validly exist in XML documents,
consider the OASIS test ‘xmlconf/xmltest/valid/sa/103.xml’:

    <!DOCTYPE doc [
    <!ELEMENT doc (#PCDATA)>
    ]>
    <doc>&#60;doc></doc>
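As a quick sanity check of the point above, REXML itself accepts such a document without complaint. This is just an illustrative snippet (with the DOCTYPE omitted for brevity), not part of the OASIS suite:

```ruby
require "rexml/document"

# The test document contains an unescaped '>' in character data. A
# conforming parser must accept it: only the sequence ']]>' requires
# the '>' to be escaped.
doc = REXML::Document.new("<doc>&#60;doc></doc>")
puts doc.root.text   # the character reference is resolved: "<doc>"

# A bare '>' with no character reference in sight is equally well-formed:
REXML::Document.new("<doc>a > b</doc>")
```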
···

… “The greatest security hazard is a false sense of security.”
<|> – anon
/|\
/|

This /is/ very useful. It doesn’t solve the basic problem where the schema
choice used in the software is beyond the control of the developer, but the
software itself is very useful.

···

bbense+comp.lang.ruby.Jun.10.02@telemark.stanford.edu wrote:

    • I don’t know if this is much help or not, but there is a tool
      that will convert most of the other major schema schemes[1] into
      RELAX NG. This would seem to solve some of the compatibility
      problems. The tool is available at

http://wwws.sun.com/software/xml/developers/relaxngconverter/

… Our users will know fear and cower before our software! Ship it!
<|> Ship it and let them flee like the dogs they are!
/|\
/|

Hmm. Ok. What I meant was that I sort of feel obligated to provide a
mechanism by which you can validate XML documents with REXML. The
/efficient/ way to do this is to report validation errors while parsing;
the extensible way is to parse the entire document and then validate it.
Of course, this means that hooking validation into the streaming parsers
will be more difficult.

Are there validation schemes that require the entire document be in memory
before validation can occur? How many schemes allow for validation while
token parsing? I’m just wondering how much more efficient it is to
validate while parsing, if the parser is otherwise only concerned with
balanced angle brackets and nested tags.

Validation issues consume a large part of the XML spec. Personally, I’d be
happy to ignore validation altogether – but for REXML to solve most
people’s needs, it’ll need some sort of validation mechanism.

Why do you say that? My admittedly distorted view of things suggests that
validation is desirable in only a small number of cases.

Maybe it should pass back the results of validation, and perhaps have a
means for updating the REXML document based on post-validation processing
(e.g., default attributes & values provided by the PSVI).

Ya lost me.

I mean that something like a DTD can define implied attributes and values,
so a document instance can omit them; the validation process adds them. My
DTD can specify that element ‘foo’ always has to have a ‘bar’ attribute,
and if foo shows up in the instance document without bar, then the parser
must behave as if it were there, with some default value. Not sure how
REXML would do this, though simply adding it to the instance doc would
perhaps be easiest. Then there’s the Post Schema Validation Infoset, which
takes this idea further.
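If REXML were to take the “simply adding it to the instance doc” route, it might look roughly like this sketch of a post-parse pass. The defaults table here is hand-written for illustration; a real implementation would build it from the `<!ATTLIST>` declarations in the DTD:

```ruby
require "rexml/document"

# Hand-written table of DTD-style attribute defaults, purely illustrative:
# element name => { attribute name => default value }
DEFAULTS = { "foo" => { "bar" => "some-default" } }

def apply_defaults(doc, defaults = DEFAULTS)
  defaults.each do |element_name, attrs|
    REXML::XPath.each(doc, "//#{element_name}") do |el|
      attrs.each do |name, value|
        # behave as if the attribute were there, with its default value
        el.add_attribute(name, value) unless el.attributes[name]
      end
    end
  end
  doc
end

doc = REXML::Document.new("<root><foo/><foo bar='set'/></root>")
apply_defaults(doc)
```

After the pass, the first `<foo>` carries the default while the second keeps its explicit value.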

For me, this boils down to: “Of the users of REXML, what is the schema
language that they’re most likely going to /need/?”

Well, for me, it’s none.

I’m just here to provide solutions, not dictate how
you work.

Still, given a hammer, everything starts to look like a nail.

W3C Schemas look a bit like a solution in search of a problem. Perhaps too
many database people thinking, “well, if we ever get around to using XML,
we’re going to need certain things …” Or vendors realizing that they’ll
sell more tools if it gets too hard to roll your own.

James

Sean Russell wrote:

    <doc>&#60;doc></doc>

I know that that’s well-formed.

The compatibility thing confused me, plus there were misunderstandings
in the thread; so anyway, the spec says it all.

Tobi

···


http://www.pinkjuice.com/

james@rubyxml.com wrote:

Are there validation schemes that require the entire document be in memory
before validation can occur? How many schemes allow for validation while

I don’t think so, but consider how you’re going to do validation outside of
the parser. You’re either going to: (a) parse the document in tree mode
and then validate the tree (requiring the entire document to be in memory),
or (b) provide some sort of stream validator that validates and then
propagates the events to the end listeners. (b) sounds like more work to
me.
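Option (b) might look roughly like the following sketch: a listener that sits between REXML’s stream parser and the real listener, checking each start-tag against a schema object before forwarding the event. The schema interface (`allows_child?`) is invented here for illustration; it is not a REXML or RELAX NG API:

```ruby
require "rexml/document"
require "rexml/streamlistener"

# A stream validator that wraps the "end listener": events are checked
# against a (hypothetical) schema object, then propagated downstream.
class ValidatingListener
  include REXML::StreamListener

  def initialize(schema, downstream)
    @schema = schema
    @downstream = downstream
    @stack = []   # open elements, so we know each tag's parent
  end

  def tag_start(name, attrs)
    parent = @stack.last
    if parent && !@schema.allows_child?(parent, name)
      raise "validation error: <#{name}> not allowed inside <#{parent}>"
    end
    @stack.push(name)
    @downstream.tag_start(name, attrs)
  end

  def tag_end(name)
    @stack.pop
    @downstream.tag_end(name)
  end

  def text(value)
    @downstream.text(value)
  end
end
```

Driving it with `REXML::Document.parse_stream(xml, ValidatingListener.new(schema, listener))` keeps the whole document out of memory, at the cost of the extra plumbing described above.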

token parsing? I’m just wondering how much more efficient it is to
validate while parsing, if the parser is otherwise only concerned with
balanced angle brackets and nested tags.

Well, you’re doing the same amount of work either way, but if you do the
validation while you’re parsing, you only have to walk the tree once:
O(n) vs. O(2n).

Why do you say that? My admittedly distorted view of things suggests that
validation is desirable in only a small number of cases.

Maybe because you tend to look at XML from a document point of view, rather
than a data point of view (I’m making some bold assumptions with that
statement :-). When XML is being used as a process communication mechanism
– say, via RPC or some other logic-to-logic communication – validation is
/extremely/ important. Validation can either be done in the logic, or in
the processing, or both. When I’m concerned about accuracy of the data
(say, in a balance transfer between bank accounts), I tend to believe that
you can’t have enough validation.

I mean that something like a DTD can define implied attributes and values,
so a document instance can omit them; the validation process adds them.

For DTDs. RelaxNG, as I understand it, specifically omits implied
attributes. This was done because /with/ implied attributes, the behavior
of an application differs depending on whether the document was validated
or not – if it wasn’t validated, it doesn’t have the implied attributes.
I agree with RelaxNG WRT this point.

Still, given a hammer, everything starts to look like a nail.

Given Ruby, every problem looks like a game. ;-)

W3C Schemas look a bit like a solution in search of a problem. Perhaps
too many database people thinking, “well, if we ever get around to using
XML, we’re going to need certain things …” Or vendors realizing that
they’ll sell more tools if it gets too hard to roll your own.

Yes, definitely. Actually, W3C XML Schema looks to me like a bad case of
ivory-tower-itis, designed by people who never have used, and probably
never will use, the darned thing.

···

… “They that can give up essential liberty to obtain a little
<|> temporary safety deserve neither liberty nor safety.”
/|\ – Benjamin Franklin
/|

Sean Russell wrote:

Maybe because you tend to look at XML from a document point of view, rather
than a data point of view (I’m making some bold assumptions with that
statement :-). When XML is being used as a process communication mechanism
– say, via RPC or some other logic-to-logic communication – validation is
/extremely/ important.

When working with XML documents, this is often true as well. Especially
when stuff gets processed, for example DocBook to FO etc.

I tend to believe that
you can’t have enough validation.

I agree.

Tobi

···


http://www.pinkjuice.com/

Hmm, this is an interesting point of view expressed here by both you and
Sean. I’m not sure I understand what is meant by “extremely important” and
“can’t have enough” – especially what this means in the context of RPC/SOAP
use.

···

On 6/10/02 1:22 PM, “Tobias Reif” tobiasreif@pinkjuice.com wrote:

Sean Russell wrote:

Maybe because you tend to look at XML from a document point of view, rather
than a data point of view (I’m making some bold assumptions with that
statement :-). When XML is being used as a process communication mechanism
– say, via RPC or some other logic-to-logic communication – validation is
/extremely/ important.

When working with XML documents, this is often true as well. Especially
when stuff gets processed, for example DocBook to FO etc.

I tend to believe that
you can’t have enough validation.

I agree.

Tobi

I tend to believe that
you can’t have enough validation.

I agree.

What I’ve found, though, is that the cost of incessant validation is often
much greater than the cost of recovering from an exception when bad data or
markup is allowed too far along a processing path. It is, in some ways,
similar to static types in programming languages.

If you’re passing along financial information, then perhaps obsessive
validation is a good thing (though even then the validation should focus on
amounts and account information, i.e. the data, not the markup). If you’re
processing less-critical XML-RPC requests, then it’s arguably simpler to
let bad requests simply fail somewhere along the processing path. For most
of what I’ve worked on, so long as the XML was well-formed and had some
minimum amount of data, it was worth trying to do something with it.
Regular expressions served as the most practical form of validation;
anything remotely like a schema was overkill.
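That style of lightweight checking might look like the following sketch: well-formedness comes free from the parser, and a couple of regular expressions cover the fields that actually matter. The element names and patterns here are invented for illustration:

```ruby
require "rexml/document"

# Minimal validation: accept any well-formed request whose 'amount' and
# 'account' fields match the expected shapes; reject everything else.
def minimally_valid?(xml)
  doc = REXML::Document.new(xml)   # raises on malformed markup
  amount  = doc.root.elements["amount"]
  account = doc.root.elements["account"]
  !!(amount  && amount.text  =~ /\A\d+\.\d{2}\z/ &&
     account && account.text =~ /\A[A-Z]{2}\d{8}\z/)
rescue REXML::ParseException
  false
end
```

A request that fails either check, or that isn’t well-formed at all, is simply rejected up front instead of failing halfway down the processing path.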

Larry Wall once said that systems should be generous in what they accept,
and strict in what they emit.
http://www.oreilly.com/catalog/opensources/book/larry.html

It’s a rather broad statement, but I’ve taken it to mean that the burden of
correctness is on the sender. The receiver need only do enough to balance
the cost of a possible problem against the cost of problem prevention. One
system I worked on received XML requests for hotel reservations. Now, the
code could have (a) checked the request against a schema, and then (b)
moved on to extracting the data and processing it. Or it could just do (b).
If the data or markup was bad, then just doing (b) meant an exception was
raised after some wasted processing (i.e., the work done until the bad data
or markup was encountered was fruitless).

However, 99.999% of the time the request was fine. So, had we done
“proper” validation, far more wasted processing would have occurred.

Preventing a problem may be more of a problem than the problem you’re
trying to prevent.

James

james@rubyxml.com wrote:

What I’ve found, though, is that the cost of incessant validation is
often much greater than the cost of recovering from an exception when bad
data or markup is allowed too far along a processing path. It is, in some
ways, similar to static types in programming languages.

This may be true in your domain, but there are often circumstances where you
must attempt to prove validity, and the requirement for that proof is
beyond your control.

The restrictions tend to be greater when you’re working in a large,
heterogeneous organization. It is easier to prove validity via a static,
well-defined validation mechanism than programmatically.

Larry Wall once said that systems should be generous in what they accept,
and strict in what they emit.
Open Sources: Voices from the Open Source Revolution

With all due respect to Larry, this is simply not a responsible philosophy
in many problem domains. Financial transactions aren’t the only place
where you want to be able to wholly reject badly formatted data rather
than accept it and make assumptions. Medical data, financial transactions,
military targeting systems (well, that last one is arguable ;-) – any
system where the precision of the information is critical enough that
processing incorrect information is worse than not processing any data at
all is a candidate for heavy validation.

“Oops. Sorry about that. There was a missing tag in the order,
so when the wrecking crew saw ‘Hudson St.’, they assumed it was NORTH
Hudson St., and your house looked kinda shoddy to them anyway…”

It’s a rather broad statement, but I’ve taken it to mean that the burden of
correctness is on the sender. The receiver need only do enough to

Again, in many systems, /everybody/ better be on the same sheet of music.
If you’re going to remove a kidney, you’d better be sure you got the right
person.

However, 99.999% of the time the request was fine. So, had we done
“proper” validation, far more wasted processing would have occurred.

I’m of the opinion that there is a lot of wasted processing power sitting
out there. If it gets used doing useless validation but catches even a
rare error, then fine by me. Remember, .001% of a year is 8 hours… how
much would you mind having a day’s pay taken out of your paycheck?

I understand what you’re saying, and I agree with you. In most cases, I
don’t do validation myself. In most cases, the consequences of a mistake
are not significant. However, there are many, many applications where you
not only /should/ be extra careful, but /must/.

In response to Bob’s question…

Sean. I’m not sure I understand what is meant by “extremely important” and
“can’t have enough” – especially what this means in the context of

… that is what I mean by “extremely important”. WRT XML-RPC, I don’t
know. XML-RPC is just a specification for how processes communicate. If
the information they’re communicating is part of a critical system, I’d
guess that the system might be a candidate for validation. If it just
means that your spiffy, animated instant-messaging love letter doesn’t get
to Lucy down the hall, it probably isn’t that important.

···

… “It’s not that I’m afraid to die. I just don’t want to be there when
<|> it happens.”
/|\ – Woody Allen
/|

Larry Wall once said that systems should be generous in what they accept,
and strict in what they emit.
Open Sources: Voices from the Open Source Revolution

With all due respect to Larry, this is simply not a responsible philosophy
in many problem domains. […]

“Oops. Sorry about that. There was a missing tag in the order,
so when the wrecking crew saw ‘Hudson St.’, they assumed it was NORTH
Hudson St., and your house looked kinda shoddy to them anyway…”

While I totally agree with Sean, I think Sean missed James’s point.

I would enjoy an XML parser and validator that allowed me to parse an XML
document that failed to validate, so that I could try to figure out
(programmatically) what caused the failure and present the human user with
an intelligible error message – not just a simple “Sorry, your XML
document failed to validate,” but rather something like “The required node
inside such-and-such a node was missing.”

I don’t know if this was really James’s point after all, but I guess
this is what I’d want to say …

Remember, .001% of a year is 8 hours… how much would you mind having
a day’s pay taken out of your paycheck?

Ouch! Stop that! Who are you, the IRS?! :-)

– Dossy

···

On 2002.06.12, Sean Russell ser@germane-software.com wrote:

james@rubyxml.com wrote:


Dossy Shiobara mail: dossy@panoptic.com
Panoptic Computer Network web: http://www.panoptic.com/
“He realized the fastest way to change is to laugh at your own
folly – then you can let go and quickly move on.” (p. 70)

Dossy wrote:

While I totally agree with Sean, I think Sean missed James’s point.

I would enjoy an XML parser and XML validator that allowed me to parse
an XML document that failed to validate so that I could try and figure

Hm. Yes, there’s definitely been some miscommunication. I never implied
that REXML would force users to validate XML. In fact, I know of no XML
parser that has this requirement. If you don’t supply a schema, you don’t
get your document validated.

What I was saying was that I believe REXML should have some inherent
validation mechanism for people who need validation.

“Sorry, your XML document failed to validate.” but rather show things
like “Node in the node for was missing.”

Well, that’s validation, right? How the errors are handled is a question of
design philosophy. If I can make the validator smart enough to continue in
the case of an error, I certainly will.
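A validator that “continues in the case of an error” might be sketched like this: it collects messages instead of raising on the first failure, so the caller can report them all at once. The rule checked here (a required attribute per element name) is a made-up example, not a real schema language:

```ruby
require "rexml/document"

# Walk the document, recording every violation rather than stopping at the
# first one. 'required' maps element names to an attribute they must carry.
def validation_errors(doc, required)
  errors = []
  required.each do |element_name, attr|
    REXML::XPath.each(doc, "//#{element_name}") do |el|
      unless el.attributes[attr]
        errors << "attribute '#{attr}' missing on <#{element_name}>"
      end
    end
  end
  errors
end

doc = REXML::Document.new("<root><foo/><foo bar='1'/><foo/></root>")
validation_errors(doc, "foo" => "bar")
# collects one message per <foo> lacking a 'bar' attribute
```

An empty result means the document passed; a non-empty one gives the human user the intelligible error list Dossy asked for.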

Remember, .001% of a year is 8 hours… how much would you mind having
a day’s pay taken out of your paycheck?

Ouch! Stop that! Who are you, the IRS?! :-)

I wish. Well, no – I don’t wish that I was the IRS – they don’t get to
keep the money.

···

… The Tick: “Spoon!”
<|> Neo: “There is no spoon.”
/|\ – anonymous
/|