Victor "Zverok" Shepelev wrote:
From: Dmitry Borodaenko [mailto:angdraug@gmail.com]
Sent: Thursday, November 30, 2006 4:21 PM
My task is: I have an HTML fragment with no guarantees about its
correctness, except that no tag is cut off in the middle:
(...)
Can it be done with Hpricot? Or any other options?
Tried HTMLTidy[0]?
I haven't really tried it, but I had thought about it.
The problem is that I need something really "small, smart and simple", not
"huge and almighty" (as Tidy seems to be).
Not "huge and almighty" but "small, smart and simple" ... I believe that's
my cue.
Have you considered writing your own miniature library? Maybe, a library
consisting of 20 lines of Ruby instructions (regulars: note the absence of
a certain trigger word)?
Why not express the problem to be solved more explicitly and clearly?
And ... were the HTML pages written by humans or a machine? I ask because
machine-generated HTML tends to be more syntactically reliable.
If I can have a sufficiently clear statement of the problem to be solved,
I can suggest a solution -- or post one.
On re-reading your first post in this thread, I venture to say that the
pages are sufficiently disorganized that an ad hoc solution is the best
approach overall, one in which various regular expression filters are used
to extract essential page data, and the pages can then be reconstructed
using stricter HTML or XHTML syntax.
So, let's write some cod ... oops, I mean let's write a small library.
OK, here's the model of what I'm doing: a small app that interacts with
dictionaries like Wikipedia:
* the user inputs something like "w matz"
* the software downloads the first lines of Matz - Wikipedia
(the first one or two meaningful paragraphs) and displays them.
What to download and show is set by simple templates (regexps for
now, but maybe something XPath-like later).
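
The template idea above could be sketched roughly like this. It is only an illustration, not code from the thread: the TEMPLATES table, the "w" paragraph regexp, and the helper names parse_command and first_paragraphs are all assumptions, and the page is passed in as a string so nothing is downloaded.

```ruby
# Hypothetical template table (an assumption, not from the thread): the key is
# the command prefix the user types, the value is a regexp capturing one
# paragraph body, tolerating a missing </p>.
TEMPLATES = {
  'w' => /<p\b[^>]*>(.*?)(?=<\/p\s*>|<p\b|\z)/mi
}.freeze

# Split input such as "w matz" into the template key and the search term.
def parse_command(input)
  key, term = input.strip.split(/\s+/, 2)
  [key, term]
end

# Return the first `count` non-blank paragraph bodies found by the template.
def first_paragraphs(page_html, key, count = 2)
  page_html.scan(TEMPLATES.fetch(key))
           .flatten
           .map(&:strip)
           .reject(&:empty?)
           .first(count)
end
```

With a regexp per dictionary, adding a new source means adding one entry to the table. A regexp this loose will also pick up paragraphs from sidebars or navigation, so a real template would probably need to be stricter.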
Now we have some part of the page and need to delete all tables, images, and so on,
strip all "non-content" tags (everything but p, ul, ol, li, b, i...),
and end up with "consistent" HTML to show.
That is the task definition.
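
A minimal sketch of that cleanup using only regexps and a stack of open tags. The ALLOWED list holds just the six tags named above (the "..." is left open), and clean_fragment is a hypothetical helper name. The sketch assumes "consistent" means "every kept tag is balanced". It drops attributes on kept tags and does not handle implicitly closed <p> or <li>.

```ruby
# Assumed whitelist: only the tags named in the task definition.
ALLOWED = %w[p ul ol li b i].freeze

# Hypothetical helper: delete tables and images, strip non-whitelisted tags
# (keeping their text), and balance the whitelisted tags that remain.
def clean_fragment(html)
  # Drop whole tables, including ones that are never closed, and all images.
  html = html.gsub(/<table\b.*?(<\/table\s*>|\z)/mi, '')
  html = html.gsub(/<img\b[^>]*>/i, '')

  open_tags = []
  out = html.gsub(/<(\/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>/) do
    closing = !$1.empty?
    name = $2.downcase
    next '' unless ALLOWED.include?(name)   # strip the tag, keep its text

    if closing
      idx = open_tags.rindex(name)
      next '' if idx.nil?                   # stray closing tag: drop it
      # Close everything opened after the matching tag, innermost first.
      open_tags.slice!(idx..-1).reverse.map { |t| "</#{t}>" }.join
    else
      open_tags.push(name)
      "<#{name}>"                           # attributes are dropped
    end
  end

  # Close whatever is still open at the end of the fragment.
  out + open_tags.reverse.map { |t| "</#{t}>" }.join
end
```

For example, clean_fragment('<i>a<b>b</i>c') gives '<i>a<b>b</b></i>c'. The stack is what keeps this small: no parse tree is built, only a list of the tags currently open.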
The task may vary between dictionaries. For example, with some
dictionaries tables must not be deleted, but "normalized":
"<td>text1<td>text2" => "<table><tr><td>text1<td>text2</table>"
Or even XHTMLish "<table><tr><td>text1</td><td>text2</td></tr></table>"
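
The XHTML-ish variant can be sketched the same way. normalize_table is a hypothetical helper name, and the sketch assumes the fragment holds a single, non-nested table made of <tr> and <td> only (no <th>, no attributes worth keeping).

```ruby
# Hypothetical helper: rebuild loose <tr>/<td> markup as a fully closed table.
def normalize_table(fragment)
  # Remove table wrappers and closing tags, so that only the <tr> and <td>
  # openers remain to act as separators.
  body = fragment.gsub(/<\/?table\b[^>]*>|<\/t[rd]\s*>/i, '')

  rows = body.split(/<tr\b[^>]*>/i).reject { |row| row.strip.empty? }
  html_rows = rows.map do |row|
    cells = row.split(/<td\b[^>]*>/i)
    cells.shift                      # text before the first <td> is not a cell
    '<tr>' + cells.map { |cell| "<td>#{cell.strip}</td>" }.join + '</tr>'
  end

  "<table>#{html_rows.join}</table>"
end
```

Called on "<td>text1<td>text2", this returns "<table><tr><td>text1</td><td>text2</td></tr></table>".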
--
Paul Lutus
http://www.arachnoid.com
V.
From: Paul Lutus [mailto:nospam@nosite.zzz]
Sent: Thursday, November 30, 2006 8:20 PM
On 11/30/06, Victor Zverok Shepelev <vshepelev@imho.com.ua> wrote: