HTML -> list of sentences? (semi-impossible task)

Hello, all.

Here’s an idea I’m toying with. Suggestions
are welcome.

I want to take an HTML document (reasonably
well-formed, but not guaranteed) and remove
all the tags from it…

…and get a list of the sentences in the
document.

There are, of course, several things that make
this difficult:

  • need to distinguish between end-of-sentence
    and embedded punctuation, including both
    abbreviations and textual references to
    Ruby methods such as eof? and split!
  • need to treat sentence fragments as sentences
  • need to ignore blocks of code
  • etc.

My current approach is to start with htmlsplit
from the RAA. This is fairly simplistic, but
at least it doesn’t have any dependencies.

Not sure whether to do it in two steps or not:

  1. Convert to text
  2. Process

Might be just as easy to do it in one step if
I knew what I was doing.

Also not sure what is the best tool/library for
this job.

Comments welcome.

Hal

···


Hal Fulton
hal9000@hypermetrics.com

Here’s an idea I’m toying with. Suggestions
are welcome.

I want to take an HTML document (reasonably
well-formed, but not guaranteed) and remove
all the tags from it…

…and get a list of the sentences in the
document.

Attached is a 30 mins. hack of mine that does something like that.
The scanning part is really a kludge but I’ve been using it w/
acceptable results in a proxy that adds hints to webpages on the fly.

There are, of course, several things that make
this difficult:

  • need to distinguish between end-of-sentence
    and embedded punctuation, including both
    abbreviations and textual references to
    Ruby methods such as eof? and split!

The only way I can think of to do that is having a list of methods and
abbreviations to ignore.
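
The ignore-list idea can be sketched roughly like this. It is only a minimal illustration, not the attached scanhtml.rb: the names default_ignore_list and split_sentences and the contents of the list are made up for the example, and a real list would need to be far longer.

```ruby
# Hypothetical ignore list: a few abbreviations plus Ruby method names
# that end in sentence punctuation. A real list would be much longer.
def default_ignore_list
  %w[i.e. e.g. etc. vs. Mr. Dr. eof? split! empty? nil? chomp! gsub!]
end

# Split plain text into sentences, each returned as an array of words.
# A word closes a sentence when it ends in ".", "!" or "?" (optionally
# followed by a closing quote or bracket) and is not on the ignore list.
def split_sentences(text, ignore = default_ignore_list)
  sentences = []
  current = []
  text.split.each do |word|
    current << word
    next if ignore.include?(word)
    if word =~ /[.!?]["')\]]*\z/
      sentences << current
      current = []
    end
  end
  # Whatever is left over is a fragment; keep it as a sentence too.
  sentences << current unless current.empty?
  sentences
end
```

One known weakness of this sketch: an ignored word that really does end a sentence (say, a trailing "etc.") will not close it.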

  • need to treat sentence fragments as sentences
  • need to ignore blocks of code

Kind of doable if you have a dictionary (/usr/share/dict/words should
be enough). For each candidate sentence, you count how many of its words
are in the dictionary and keep it if the percentage is above some threshold.
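
A rough sketch of that dictionary test, under a few assumptions: the method names (load_dictionary, known_ratio, prose?) are invented for the example, the 0.6 threshold is an arbitrary starting point that would need tuning, and words are compared lowercase with punctuation stripped.

```ruby
require "set"

# Load a word list (one word per line) into a Set of lowercase words.
def load_dictionary(path = "/usr/share/dict/words")
  Set.new(File.foreach(path).map { |line| line.strip.downcase })
end

# Fraction of the words in a candidate sentence found in the dictionary.
# Punctuation is stripped first, so "methods." is looked up as "methods".
def known_ratio(words, dictionary)
  cleaned = words.map { |w| w.downcase.gsub(/[^a-z']/, "") }.reject(&:empty?)
  return 0.0 if cleaned.empty?
  cleaned.count { |w| dictionary.include?(w) }.to_f / cleaned.size
end

# Treat the candidate as prose when enough of its words are known.
# The default threshold is a guess, not a measured value.
def prose?(words, dictionary, threshold = 0.6)
  known_ratio(words, dictionary) >= threshold
end
```

With the system word list loaded once via load_dictionary, each candidate from the splitter can be passed through prose? and dropped when it looks like code.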

  • etc.

My current approach is to start with htmlsplit
from the RAA. This is fairly simplistic, but
at least it doesn’t have any dependencies.

Not sure whether to do it in two steps or not:

  1. Convert to text
  2. Process

Might be just as easy to do it in one step if
I knew what I was doing.

IMHO it can be done in one pass.
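
For illustration only (this is not the attached scanhtml.rb), a one-pass version might look something like the sketch below. It assumes code lives inside "pre" elements, uses a hypothetical method name scan_sentences, and leaves out the ignore list for abbreviations and method names to stay short.

```ruby
require "strscan"

# Single pass over the HTML: skip tags, skip all text inside <pre>,
# and close a sentence whenever a text word ends in ".", "!" or "?".
def scan_sentences(html)
  scanner = StringScanner.new(html)
  sentences = []
  current = []
  in_pre = false
  until scanner.eos?
    if scanner.scan(/<\s*(\/?)\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>/)
      # Opening <pre> switches skipping on, closing </pre> switches it off.
      in_pre = scanner[1].empty? if scanner[2].downcase == "pre"
    elsif scanner.scan(/<[^>]*>?/)
      # Comment, doctype, or malformed tag: ignore it.
    elsif (chunk = scanner.scan(/[^<]+/))
      next if in_pre
      chunk.split.each do |word|
        current << word
        if word =~ /[.!?]\z/
          sentences << current
          current = []
        end
      end
    end
  end
  # Keep any trailing fragment as a sentence of its own.
  sentences << current unless current.empty?
  sentences
end
```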

bloom.c (3.83 KB)

extconf.rb (40 Bytes)

scanhtml.rb (2.67 KB)

···

On Thu, Jun 12, 2003 at 11:38:13AM +0900, Hal E. Fulton wrote:


Given http://www.rubygarden.org/ruby?ClassMethodsTutorial, my hack
returns

[“This”, “is”, “simply”, “an”, “extract”, “from”, “a”, “post”, “to”, “ruby-talk”, “by”, “DavidBlack”, “on”, “the”, “topic”, “of”, “class”, “methods.”]
[“It”, “is”, “stored”, “here”, “in”, “the”, “hope”, “that”, “it”, “will”, “be”, “useful!”]
[“It”, “actually”, “goes”, “beyond”, “the”, “surface”, “of”, “class”, “methods”, “to”, “describe”, “the”, “nature”, “of”, “classes”, “as”, “objects”, “so”, “is”, “interesting”, “reading”, “for”, “anyone”, “progressing”, “to”, “intermediate-level”, “Ruby.”]
[“See”, “also”, “ClassMethods”, “for”, “an”, “overview”, “of”, “the”, “options”, “available”, “in”, “Ruby”, “for”, “creating”, “class”, “methods”, “and”, “SingletonTutorial”, “for”, “a”, “detailed”, “explanation”, “of”, “singleton”, “methods.”]
[“Every”, “object”, “responds”, “to”, “certain”, “messages”, “i.e.”, “can”, “call”, “methods”, “with”, “certain”, “names”, “.”]
[“Usually”, “those”, “methods”, “are”, “the”, “instance”, “methods”, “defined”, “by”, “the”, “object’s”, “class”, “However”, “it’s”, “also”, “possible”, “to”, “add”, “methods”, “to”, “individual”, “objects”, “Now”, “c”, “will”, “respond”, “to”, “speak”, “–”, “but”, “other”, “instances”, “of”, “class”, “C”, “will”, “not”, “This”, “means”, “that”, “speak”, “is”, “a”, “singleton”, “method”, “of”, “c.”]
[“now”, “look”, “at”, “this”, “Notice”, “the”, “similarity”, “between”, “the”, “syntax”, “involved”, “in”, “creating”, “a”, “new”, “singleton”, “method”, “for”, “c”, “and”, “creating”, “a”, “class”, “method”, “of”, “class”, “D”, “In”, “fact”, “these”, “are”, “essentially”, “the”, “same”, “thing.”]
[“In”, “both”, “cases”, “what’s”, “happening”, “is”, “that”, “a”, “singleton”, “method”, “is”, “being”, “added”, “to”, “a”, “particular”, “object.”]
[“It”, “just”, “happens”, “to”, “be”, “that”, “in”, “the”, “second”, “case”, “the”, “object”, “getting”, “the”, “new”, “method”, “is”, “a”, “Class”, “object”, “as”, “opposed”, “to”, “a”, “String”, “an”, “Array”, “an”, “instance”, “of”, “MyClass”, “?”]
[“So”, “now”, “D”, “responds”, “to”, “greet”, “just”, “as”, “c”, “responds”, “to”, “speak”, “.”]
[“In”, “other”, “words”, “the”, “term”, “class”, “method”, “is”, “just”, “a”, “special”, “term”, “for”, “something”, “which”, “you”, “can”, “do”, “with”, “any”, “mutable”, “object”, “namely”, “add”, “a”, “singleton”, “method”, “to”, “it.”]
[“It”, “has”, “a”, “special”, “name”, “because”, “in”, “actual”, “program”, “design”, “class”, “methods”, “have”, “a”, “special”, “role”, “to”, “play.”]
[“But”, “what”, “they”, “are”, “at”, “heart”, “is”, “singleton”, “methods”, “defined”, “on”, “objects”, “where”, “those”, “objects”, “happen”, “to”, “be”, “instances”, “of”, “a”, “class”, “called”, “Class.”]
[“The”, “use”, “of”, “uppercase”, “names”, “constants”, “for”, “classes”, “can”, “obscure”, “the”, “fact”, “that”, “classes”, “are”, “just”, “objects.”]
[“Also”, “the”, “usual”, “style”, “is”, “to”, “put”, “class”, “method”, “definitions”, “inside”, “the”, “class”, “definition”, “which”, “makes”, “it”, “look”, “like”, “they”, “have”, “some”, “special”, “status.”]
[“But”, “look”, “at”, “this”, “etc.”]
[“You”, “can”, “see”, “that”, “some”, “of”, “the”, “special”, “treatment”, “of”, “classes”, “–”, “constants”, “as”, “names”, “the”, “separate”, “notion”, “of”, “class”, “method”, “for”, “their”, “singleton”, “methods”, “–”, “is”, “just”, “that”, “special”, “treatment.”]
[“Underneath”, “a”, “class”, “is”, “indeed”, “an”, “object.”]
[“CategoryDocumentation”, “CategoryTutorial”, “HomePage”, “RecentChanges”, “Preferences”, “RubyGarden”, “Edit”, “text”, “of”, “this”, “page”, “View”, “other”, “revisions”, “Last”, “edited”, “May”, “am”, “diff”, “Search”]

Note that “i.e.” was recognized :-)
However, several problems are yet to be solved:

  • how to get rid of meaningful lone words? (last line)
  • what happens to things like
    As seen here:
    CODE
    bla bla bla.
  • etc

However, solving that would turn the 30-minute hack into a 1-hour kludge;
better to leave it this way :-)


Running Debian GNU/Linux Sid (unstable)
batsman dot geo at yahoo dot com

Because I don’t need to worry about finances I can ignore Microsoft
and take over the (computing) world from the grassroots.
– Linus Torvalds

It depends on what you mean by “sentence”, 'ey? Do you mean natural
language (English? Rumanian? Urdu? Hakka? Thai? Japanese?), or
artificial formalisms like programming languages (Perl, Ruby, FORTH)?

But someone went to a lot of trouble to carve up their perceptions of
reality (heh) into procrustean HTML, so you may as well begin there.
Determine the major syntactical units (TABLE, DIV, P, HR, PRE, TT, H1,
etc.). Recursing, determine what is a “sentence” on semantic,
idiomatic (BR, B, U), or at least grammatical (カ、ネー、ニ、ヘ、。。。), grounds.
Collect these purely formal “sentences” and send the list to
post-processing (possibly human inspection) to be vetted and refined
(e.g., does your system account for utterances which are meaningful but
grammatically abbreviated, like “What up?” (MTV argot used by
advertisers to slide nickels out of pockets) or “Annta desu” (kids
choosing sides for oni in Osaka)?)

If you have access to a page’s CSS, your hints about what the author(s)
intended are much expanded. Maybe not so impossible after all? This
does not seem like a difficult task to me, but maybe I haven’t
appreciated the context from which the question is posed? Does the
solution have to be extremely general, or is it a one-shot?

David

···

On Wednesday, June 11, 2003, at 09:38 PM, Hal E. Fulton wrote:

[snippage]


David C. Oshel mailto:dcoshel@mac.com
Cedar Rapids, Iowa http://homepage.mac.com/dcoshel
``I think most pleasantly in metaphors, and smoking brings metaphors to
mind." - Augustus Srb, in Alexei Panshin’s Star Well

[snippage]

Not sure whether to do it in two steps or not:

  1. Convert to text
  2. Process

I would parse into a tree, process there, then strip tags. The reason
being, ruby code and other nongrammatical entities are likely to be
offset by tags –

, , , things like that.  Not always,
but it’s a useful heuristic.

It’s not a trivial task – I’ve done a lot of natural-language work for
the Wiki that I run (its markup is one of the least code-like of any
wiki). How good you need the results to be is a big deciding factor in
how to implement it, for sure. Natural language parsing is a big CPU
cruncher.

···

On Wed, 2003-06-11 at 20:38, Hal E. Fulton wrote:

[snippage]

Attached is a 30 mins. hack of mine that does something like that.
The scanning part is really a kludge but I’ve been using it w/
acceptable results in a proxy that adds hints to webpages on the fly.

The world is full of kludges. One more won’t hurt.

Given http://www.rubygarden.org/ruby?ClassMethodsTutorial, my hack
returns

[snippage]

Great. I will look into your source.

Thanks,
Hal

···

----- Original Message -----
From: “Mauricio Fernández” batsman.geo@yahoo.com
To: “ruby-talk ML” ruby-talk@ruby-lang.org
Sent: Thursday, June 12, 2003 1:43 AM
Subject: Re: HTML → list of sentences? (semi-impossible task)

It depends on what you mean by “sentence”, 'ey? Do you mean natural
language (English? Rumanian? Urdu? Hakka? Thai? Japanese?), or
artificial formalisms like programming languages (Perl, Ruby, FORTH)?

In this case, English sentences. Not as in formal grammars, or as
in prison sentences. Not that those two are so different.

But someone went to a lot of trouble to carve up their perceptions of
reality (heh) into procrustean HTML, so you may as well begin there.
Determine the major syntactical units (TABLE, DIV, P, HR, PRE, TT, H1,
etc.). Recursing, determine what is a “sentence” on semantic,
idiomatic (BR, B, U), or at least grammatical (カ、ネー、ニ、
ヘ、。。。), grounds.
Collect these purely formal “sentences” and send the list to
post-processing (possibly human inspection) to be vetted and refined
(e.g., does your system account for utterances which are meaningful but
grammatically abbreviated, like “What up?” (MTV argot used by
advertisers to slide nickels out of pockets) or “Annta desu” (kids
choosing sides for oni in Osaka). )

I think even that is perhaps too much intelligence. I don’t want to
build in knowledge about nouns and verbs.

If you have access to a page’s CSS, your hints about what the author(s)
intended are much expanded. Maybe not so impossible after all? This
does not seem like a difficult task to me, but maybe I haven’t
appreciated the context from which the question is posed?

My parents sometimes quote a comedian from before I was born: “Easy for
you, difficult for me.”

Does the
solution have to be extremely general, or is it a one-shot?

Ehh, somewhat general in the sense of several chapters. But very
one-shot in that I’m looking at one particular document, and it’s
about Ruby. ;-)

I think the replies I’ve got are fairly promising along with my
own dirty hack from last night.

Cheers,
Hal

···

----- Original Message -----
From: “Dave Oshel” dcoshel@vcmails.com
To: “ruby-talk ML” ruby-talk@ruby-lang.org
Sent: Thursday, June 12, 2003 9:29 AM
Subject: Re: HTML → list of sentences? (semi-impossible task)

[snippage]

Yes, in this case, large code fragments are always set off by “pre”
tags. That does simplify.

As I said, I’m not interested in true natural-language parsing.
Something “mostly” accurate is good enough.
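
If the code really is always inside "pre" tags, the convert-to-text step could be as small as the sketch below. This is only a rough, regex-based illustration (not htmlsplit and not a real HTML parser); the name html_to_text and the short entity table are assumptions made for the example.

```ruby
# Convert HTML to plain text, dropping <pre> blocks entirely.
# Regex-based and "mostly" accurate: it does not handle comments that
# contain ">", nested <pre>, or entities beyond the few listed here.
def html_to_text(html)
  entities = { "&lt;" => "<", "&gt;" => ">", "&quot;" => '"',
               "&nbsp;" => " ", "&amp;" => "&" }
  text = html.gsub(%r{<pre\b[^>]*>.*?</pre\s*>}mi, " ") # drop code blocks
  text = text.gsub(/<[^>]*>/, " ")                       # strip remaining tags
  text = text.gsub(/&(?:lt|gt|quot|nbsp|amp);/) { |e| entities[e] }
  text.gsub(/\s+/, " ").strip                            # collapse whitespace
end
```

The result is one long string that can then be handed to whatever sentence splitter is used in the second step. Replacing every tag with a space keeps words in adjacent elements apart, at the cost of splitting a word that has inline markup in the middle of it.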

Thanks,
Hal

···

----- Original Message -----
From: “Aredridel” aredridel@nbtsc.org
To: “ruby-talk ML” ruby-talk@ruby-lang.org
Sent: Thursday, June 12, 2003 11:16 AM
Subject: Re: HTML → list of sentences? (semi-impossible task)

[snippage]