Bill Kelly wrote:
From: "Paul Lutus" <nospam@nosite.zzz>
def parse_html(data, tag)
  return data.scan(%r{<#{tag}\s*.*?>(.*?)</#{tag}>}im).flatten
end
Are `<' and `>' characters legal inside quoted attribute values?
I don't think so. I think they have to be escaped, like most things in HTML
syntax.
E.g. <img alt="a>b" src="inequality.gif">
Also, is the closing tag allowed to have whitespace between the
tag name and the ending bracket?
E.g. </body >
Not syntactically correct, but the question might be "will it happen?" In
which case the answer is "probably".
The latter would be trivial to accommodate with a \s* obviously;
Yep.
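For illustration, here is a minimal sketch of that change. The name parse_html_ws is made up for this example; the only difference from parse_html is the extra \s* before the closing bracket of the end tag.

```ruby
# Hypothetical variant of parse_html: tolerates whitespace between the
# tag name and the ">" of the end tag, e.g. "</body >".
def parse_html_ws(data, tag)
  data.scan(%r{<#{tag}\s*.*?>(.*?)</#{tag}\s*>}im).flatten
end

puts parse_html_ws("<body>hello</body >", "body").inspect
```

Everything else about the original pattern, including its limitations, is unchanged.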
but the former would be a shade trickier (though certainly still
possible with a regexp.)
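A rough sketch of what such a regexp might look like, assuming attribute values are either double-quoted or single-quoted and that the quotes are balanced. The name parse_html_quoted is hypothetical, and this is one possible approach rather than a complete HTML parser.

```ruby
# Hypothetical variant of parse_html: inside the start tag, a quoted string
# is consumed as a unit, so a ">" between quotes does not end the tag.
def parse_html_quoted(data, tag)
  name = Regexp.escape(tag)
  # Either a "..." string, a '...' string, or any single character that is
  # neither a quote nor ">".
  attrs = %q{(?:"[^"]*"|'[^']*'|[^'">])*}
  data.scan(%r{<#{name}\b#{attrs}>(.*?)</#{name}\s*>}im).flatten
end

puts parse_html_quoted('<p title="a>b">text</p>', "p").inspect
```

With the original pattern, the same input would end the start tag at the ">" inside the title value, so the captured text would begin with the remainder of the attribute.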
I don't think that one needs to be addressed. Besides being strange, it isn't
syntactically correct. I know when I create relatively free-form
attributes like the content of "title," I always escape the HTML tag
delimiters. I am reasonably sure it is a requirement.
If we allowed bare "<" and ">" between quotes in attributes, we would have
to scan the tags character by character to be sure to have a valid parse.
In nearly all cases involving delimiters like quotes and any relaxed,
permissive syntax, you end up scanning with a state machine.
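As a minimal sketch of that kind of scan, assuming only two states (inside a quoted value or outside one) and no escape sequences within quotes: the helper name tag_end_index is invented for this example.

```ruby
# Hypothetical helper: returns the index of the ">" that closes the tag
# starting at data[start], or nil if no unquoted ">" is found.
def tag_end_index(data, start = 0)
  quote = nil                       # nil outside quotes, else the open quote char
  (start...data.length).each do |i|
    ch = data[i]
    if quote
      quote = nil if ch == quote    # matching quote leaves the quoted state
    elsif ch == '"' || ch == "'"
      quote = ch                    # enter the quoted state
    elsif ch == ">"
      return i                      # an unquoted ">" closes the tag
    end
  end
  nil
end

tag = '<img alt="a>b" src="inequality.gif">'
puts tag_end_index(tag) == tag.length - 1
```

The ">" inside the alt value is skipped because the scanner is in the quoted state when it sees it.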
There's a lot of foul, cruel, and bad-tempered HTML out there
in the wild.
Yeah, and I wrote some of it personally, or it was written with my editor
Arachnophilia.
Depending on the needs of the Original Poster,
death could await a simplistic HTML lexer with nasty big pointy
teeth.
Yes, as I have said. 
TIM: I warned you! But did you listen to me? Oh, no, you knew it
all, didn't you? Oh, it's just a harmless little markup language,
isn't it? Well, it's always the same, I always--
ARTHUR: Oh, shut up!
TIM: --But do they listen to me?--
ARTHUR: Right!
TIM: -Oh, no--
KNIGHTS: Charge!
Not at all fair to a helpless attack-rabbit. 
--
Paul Lutus
http://www.arachnoid.com