HTML -> list of sentences? (semi-impossible task)

HAL_9000 · 12 June 2003 02:38

Hello, all.

Here’s an idea I’m toying with. Suggestions
are welcome.

I want to take an HTML document (reasonably
well-formed, but not guaranteed) and remove
all the tags from it…

…and get a list of the sentences in the
document.

There are, of course, several things that make
this difficult:

need to distinguish between end-of-sentence
and embedded punctuation, including both
abbreviations and textual references to
Ruby methods such as eof? and split!

need to treat sentence fragments as sentences

need to ignore blocks of code

etc.

My current approach is to start with htmlsplit
from the RAA. This is fairly simplistic, but
at least it doesn’t have any dependencies.

Not sure whether to do it in two steps or not:

Convert to text
Process

Might be just as easy to do it in one step if
I knew what I was doing.

Also not sure what is the best tool/library for
this job.

Comments welcome.

Hal

–
Hal Fulton
hal9000@hypermetrics.com

Mauricio_Fernndez · 12 June 2003 06:43

Here’s an idea I’m toying with. Suggestions
are welcome.

I want to take an HTML document (reasonably
well-formed, but not guaranteed) and remove
all the tags from it…

…and get a list of the sentences in the
document.

Attached is a 30-minute hack of mine that does something like that.
The scanning part is really a kludge, but I’ve been using it with
acceptable results in a proxy that adds hints to web pages on the fly.

There are, of course, several things that make
this difficult:

need to distinguish between end-of-sentence
and embedded punctuation, including both
abbreviations and textual references to
Ruby methods such as eof? and split!

The only way I can think of to do that is to keep a list of methods and
abbreviations to ignore.
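
A minimal sketch of the ignore-list idea, not the attached hack. The two lists and the name split_sentences are made up for illustration, and real lists would need to be much longer. It assumes the text is already free of tags.

```ruby
# Hypothetical ignore lists; a real run would need far more entries.
ABBREVIATIONS = %w[i.e. e.g. etc. Mr. Dr. vs.].freeze
RUBY_METHODS  = %w[eof? split! empty? gsub!].freeze

# Split plain text into sentences. A token closes a sentence only when it
# ends in ".", "!" or "?" and is not on one of the ignore lists.
def split_sentences(text)
  sentences = []
  current = []
  text.split.each do |token|
    current << token
    bare = token.delete('"()') # ignore surrounding quotes and parentheses
    next if ABBREVIATIONS.include?(bare) || RUBY_METHODS.include?(bare)

    if bare.match?(/[.!?]\z/)
      sentences << current.join(' ')
      current = []
    end
  end
  # Whatever is left over is kept as a sentence fragment.
  sentences << current.join(' ') unless current.empty?
  sentences
end
```

One known weakness: a sentence that genuinely ends with an ignored token (say, a question ending in "eof?") will run on into the next one, so this is only "mostly" right.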

need to treat sentence fragments as sentences

need to ignore blocks of code

Kind of doable if you have a dictionary (/usr/share/dict/words should
be enough). For each candidate sentence, you count how many of its words
are in the dictionary and keep it if the percentage is above some threshold.
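
A rough sketch of that dictionary check, under some assumptions: it uses a plain Set rather than whatever the attached files do, the method names are invented here, and the 0.6 threshold is a guess that would need tuning.

```ruby
require 'set'

# Load a word list into a Set for fast lookup. Returns an empty Set if the
# file is not readable. The default path is the one mentioned in the post.
def load_dictionary(path = '/usr/share/dict/words')
  return Set.new unless File.readable?(path)

  Set.new(File.foreach(path).map { |word| word.strip.downcase })
end

# Fraction of the words in a candidate that appear in the dictionary.
def known_ratio(candidate, dictionary)
  words = candidate.downcase.scan(/[a-z']+/)
  return 0.0 if words.empty?

  words.count { |word| dictionary.include?(word) }.to_f / words.size
end

# Keep a candidate only if enough of its words are real words.
# The threshold is an assumption, not a measured value.
def prose?(candidate, dictionary, threshold = 0.6)
  known_ratio(candidate, dictionary) >= threshold
end
```

Typical use would be something like `candidates.select { |c| prose?(c, load_dictionary) }`, loading the dictionary once and reusing it.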

etc.

My current approach is to start with htmlsplit
from the RAA. This is fairly simplistic, but
at least it doesn’t have any dependencies.

Not sure whether to do it in two steps or not:

Convert to text

Process

Might be just as easy to do it in one step if
I knew what I was doing.

IMHO it can be done in one pass.

bloom.c (3.83 KB)

extconf.rb (40 Bytes)

scanhtml.rb (2.67 KB)

On Thu, Jun 12, 2003 at 11:38:13AM +0900, Hal E. Fulton wrote:

Given http://www.rubygarden.org/ruby?ClassMethodsTutorial, my hack
returns

["This", "is", "simply", "an", "extract", "from", "a", "post", "to", "ruby-talk", "by", "DavidBlack", "on", "the", "topic", "of", "class", "methods."]
["It", "is", "stored", "here", "in", "the", "hope", "that", "it", "will", "be", "useful!"]
["It", "actually", "goes", "beyond", "the", "surface", "of", "class", "methods", "to", "describe", "the", "nature", "of", "classes", "as", "objects", "so", "is", "interesting", "reading", "for", "anyone", "progressing", "to", "intermediate-level", "Ruby."]
["See", "also", "ClassMethods", "for", "an", "overview", "of", "the", "options", "available", "in", "Ruby", "for", "creating", "class", "methods", "and", "SingletonTutorial", "for", "a", "detailed", "explanation", "of", "singleton", "methods."]
["Every", "object", "responds", "to", "certain", "messages", "i.e.", "can", "call", "methods", "with", "certain", "names", "."]
["Usually", "those", "methods", "are", "the", "instance", "methods", "defined", "by", "the", "object's", "class", "However", "it's", "also", "possible", "to", "add", "methods", "to", "individual", "objects", "Now", "c", "will", "respond", "to", "speak", "–", "but", "other", "instances", "of", "class", "C", "will", "not", "This", "means", "that", "speak", "is", "a", "singleton", "method", "of", "c."]
["now", "look", "at", "this", "Notice", "the", "similarity", "between", "the", "syntax", "involved", "in", "creating", "a", "new", "singleton", "method", "for", "c", "and", "creating", "a", "class", "method", "of", "class", "D", "In", "fact", "these", "are", "essentially", "the", "same", "thing."]
["In", "both", "cases", "what's", "happening", "is", "that", "a", "singleton", "method", "is", "being", "added", "to", "a", "particular", "object."]
["It", "just", "happens", "to", "be", "that", "in", "the", "second", "case", "the", "object", "getting", "the", "new", "method", "is", "a", "Class", "object", "as", "opposed", "to", "a", "String", "an", "Array", "an", "instance", "of", "MyClass", "?"]
["So", "now", "D", "responds", "to", "greet", "just", "as", "c", "responds", "to", "speak", "."]
["In", "other", "words", "the", "term", "class", "method", "is", "just", "a", "special", "term", "for", "something", "which", "you", "can", "do", "with", "any", "mutable", "object", "namely", "add", "a", "singleton", "method", "to", "it."]
["It", "has", "a", "special", "name", "because", "in", "actual", "program", "design", "class", "methods", "have", "a", "special", "role", "to", "play."]
["But", "what", "they", "are", "at", "heart", "is", "singleton", "methods", "defined", "on", "objects", "where", "those", "objects", "happen", "to", "be", "instances", "of", "a", "class", "called", "Class."]
["The", "use", "of", "uppercase", "names", "constants", "for", "classes", "can", "obscure", "the", "fact", "that", "classes", "are", "just", "objects."]
["Also", "the", "usual", "style", "is", "to", "put", "class", "method", "definitions", "inside", "the", "class", "definition", "which", "makes", "it", "look", "like", "they", "have", "some", "special", "status."]
["But", "look", "at", "this", "etc."]
["You", "can", "see", "that", "some", "of", "the", "special", "treatment", "of", "classes", "–", "constants", "as", "names", "the", "separate", "notion", "of", "class", "method", "for", "their", "singleton", "methods", "–", "is", "just", "that", "special", "treatment."]
["Underneath", "a", "class", "is", "indeed", "an", "object."]
["CategoryDocumentation", "CategoryTutorial", "HomePage", "RecentChanges", "Preferences", "RubyGarden", "Edit", "text", "of", "this", "page", "View", "other", "revisions", "Last", "edited", "May", "am", "diff", "Search"]

Note that “i.e.” was recognized.
However, several problems are yet to be solved:

how to get rid of meaningful lone words? (last line)

what happens to things like

    As seen here:
    CODE
    bla bla bla.

etc.

However, solving that would turn the 30-minute hack into a one-hour kludge,
so it had better stay this way.

–
Running Debian GNU/Linux Sid (unstable)
batsman dot geo at yahoo dot com

Because I don’t need to worry about finances I can ignore Microsoft
and take over the (computing) world from the grassroots.
– Linus Torvalds

Dave_Oshel1 · 12 June 2003 14:29

It depends on what you mean by “sentence”, 'ey? Do you mean natural
language (English? Rumanian? Urdu? Hakka? Thai? Japanese?), or
artificial formalisms like programming languages (Perl, Ruby, FORTH)?

But someone went to a lot of trouble to carve up their perceptions of
reality (heh) into procrustean HTML, so you may as well begin there.
Determine the major syntactical units (TABLE, DIV, P, HR, PRE, TT, H1,
etc.). Recursing, determine what is a “sentence” on semantic,
idiomatic (BR, B, U), or at least grammatical （カ、ネー、ニ、ヘ、。。。）, grounds.
Collect these purely formal “sentences” and send the list to
post-processing (possibly human inspection) to be vetted and refined
(e.g., does your system account for utterances which are meaningful but
grammatically abbreviated, like “What up?” (MTV argot used by
advertisers to slide nickels out of pockets) or “Annta desu” (kids
choosing sides for oni in Osaka)?).
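
A crude sketch of that first pass, cutting the document at block-level tags to get the purely formal units. It uses regexes rather than a real HTML parser, the tag list is illustrative rather than complete, and the name formal_units is invented here.

```ruby
# Tags treated as unit boundaries; illustrative, not a complete list.
BLOCK_TAGS = %w[table tr td div p hr pre h1 h2 h3 h4 h5 h6 ul ol li br].freeze
BLOCK_BOUNDARY = %r{</?(?:#{BLOCK_TAGS.join('|')})\b[^>]*>}i

# Cut the document at block-level tags, strip the remaining inline tags,
# collapse whitespace, and return the non-empty chunks as candidate units.
def formal_units(html)
  html.split(BLOCK_BOUNDARY)
      .map { |chunk| chunk.gsub(/<[^>]+>/, '').gsub(/\s+/, ' ').strip }
      .reject(&:empty?)
end
```

The resulting list still mixes prose, headings and code, so it is only the input to the vetting step described above, not a finished list of sentences.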

If you have access to a page’s CSS, your hints about what the author(s)
intended are much expanded. Maybe not so impossible after all? This
does not seem like a difficult task to me, but maybe I haven’t
appreciated the context from which the question is posed? Does the
solution have to be extremely general, or is it a one-shot?

David

–
David C. Oshel mailto:dcoshel@mac.com
Cedar Rapids, Iowa http://homepage.mac.com/dcoshel
``I think most pleasantly in metaphors, and smoking brings metaphors to
mind." - Augustus Srb, in Alexei Panshin’s Star Well

Aredridel1 · 12 June 2003 16:16

Not sure whether to do it in two steps or not:

Convert to text

Process

I would parse into a tree, process there, then strip tags. The reason
being, Ruby code and other nongrammatical entities are likely to be
set off by tags – , , , things like that. Not always, but it’s a useful
heuristic.

It’s not a trivial task – I’ve done a lot of natural-language work for
the Wiki that I run (its markup is one of the least code-like of any
wiki). How good you need the results to be is a big deciding factor in
how to implement, for sure. Natural language parsing is a big CPU
cruncher.

HAL_9000 · 12 June 2003 23:09

Attached is a 30-minute hack of mine that does something like that.
The scanning part is really a kludge, but I’ve been using it with
acceptable results in a proxy that adds hints to web pages on the fly.

The world is full of kludges. One more won’t hurt.

Given http://www.rubygarden.org/ruby?ClassMethodsTutorial, my hack
returns

[snippage]

Great. I will look into your source.

Thanks,
Hal

----- Original Message -----
From: “Mauricio Fernández” batsman.geo@yahoo.com
To: “ruby-talk ML” ruby-talk@ruby-lang.org
Sent: Thursday, June 12, 2003 1:43 AM
Subject: Re: HTML → list of sentences? (semi-impossible task)

HAL_9000 · 12 June 2003 23:16

It depends on what you mean by “sentence”, 'ey? Do you mean natural
language (English? Rumanian? Urdu? Hakka? Thai? Japanese?), or
artificial formalisms like programming languages (Perl, Ruby, FORTH)?

In this case, English sentences. Not as in formal grammars, or as
in prison sentences. Not that those two are so different.

But someone went to a lot of trouble to carve up their perceptions of
reality (heh) into procrustean HTML, so you may as well begin there.
Determine the major syntactical units (TABLE, DIV, P, HR, PRE, TT, H1,
etc.). Recursing, determine what is a “sentence” on semantic,
idiomatic (BR, B, U), or at least grammatical （カ、ネー、ニ、
ヘ、。。。）, grounds.
Collect these purely formal “sentences” and send the list to
post-processing (possibly human inspection) to be vetted and refined
(e.g., does your system account for utterances which are meaningful but
grammatically abbreviated, like “What up?” (MTV argot used by
advertisers to slide nickels out of pockets) or “Annta desu” (kids
choosing sides for oni in Osaka)?).

I think even that is perhaps too much intelligence. I don’t want to
build in knowledge about nouns and verbs.

If you have access to a page’s CSS, your hints about what the author(s)
intended are much expanded. Maybe not so impossible after all? This
does not seem like a difficult task to me, but maybe I haven’t
appreciated the context from which the question is posed?

My parents sometimes quote a comedian from before I was born: “Easy for
you, difficult for me.”

Does the
solution have to be extremely general, or is it a one-shot?

Ehh, somewhat general in the sense of several chapters. But very
one-shot in that I’m looking at one particular document, and it’s
about Ruby.

I think the replies I’ve got are fairly promising along with my
own dirty hack from last night.

Cheers,
Hal

----- Original Message -----
From: “Dave Oshel” dcoshel@vcmails.com
To: “ruby-talk ML” ruby-talk@ruby-lang.org
Sent: Thursday, June 12, 2003 9:29 AM
Subject: Re: HTML → list of sentences? (semi-impossible task)

HAL_9000 · 12 June 2003 23:18

Yes, in this case, large code fragments are always set off by “pre”
tags. That does simplify.

As I said, I’m not interested in true natural-language parsing.
Something “mostly” accurate is good enough.
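
Given that, a "mostly accurate" first step might look like the sketch below: drop everything inside “pre” tags, then strip the remaining tags. It assumes the pre blocks are properly closed and not nested, and it uses regexes instead of a real parser. The name prose_text is made up for illustration.

```ruby
# Remove <pre>...</pre> blocks (code), then every remaining tag, and
# collapse whitespace. An unclosed <pre> is not removed; only its tag is.
def prose_text(html)
  html.gsub(%r{<pre\b[^>]*>.*?</pre\s*>}mi, ' ')
      .gsub(/<[^>]+>/, ' ')
      .gsub(/\s+/, ' ')
      .strip
end
```

Replacing each tag with a space keeps adjacent paragraphs from running together, at the cost of a stray space when punctuation directly follows an inline tag.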

Thanks,
Hal

----- Original Message -----
From: “Aredridel” aredridel@nbtsc.org
To: “ruby-talk ML” ruby-talk@ruby-lang.org
Sent: Thursday, June 12, 2003 11:16 AM
Subject: Re: HTML → list of sentences? (semi-impossible task)
