Text analyzer

I have created a little text analysis tool that tries to extract
words that are important in a given text. I have implemented one of my
strange ideas and, to my own surprise, it works. I have no idea whether any
similar tool exists, so I do not know where to post this. It is written
in Ruby, so I am just posting it here :-)

To use this tool, you first have to index a large number of text files.
This generates an index, which is later used when analyzing a text.
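
Indexing and analysis are two separate runs, along these lines (the file
names here are made up):

  $ cat training/*.txt | ruby textanalyze.rb c   # index the training texts into wordcount.dat
  $ cat document.txt   | ruby textanalyze.rb a   # print the most characteristic words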

For example, I have indexed several fairy tales, and used this index to
extract important words. Here are some results:

Little Red Riding Hood.txt: hood, grandma, riding, hunter, red
Little Mermaid.txt: sirenetta, mermaid, sea, waves, sisters
Alladin.txt: aladdin, lamp, genie, sultan, wizard

The algorithm works with HTML files, and probably with any other format
that contains text. Here is an example of the analysis results when HTML
files are indexed:

SSL-RedHat-HOWTO.htm: certificate, ssl, private, key, openssl
META-FAQ.html: newsgroup, comp, sunsite, questions, announce
TeTeX-HOWTO.html: tetex, tex, ctan, latex, archive

And now my question: Does anyone know where to find such tools or
algorithms?

You can get it here; it's public domain:
http://martinus.geekisp.com/rublog.cgi/Projects/TextAnalyzer

martinus

Would you care to explain what one could use this for?

Alex


On Wed, 2004-09-22 at 00:54, martinus wrote:

I have created a little text analysis tool that tries to extract
words that are important in a given text.

Hi Martin,


--- martinus <martin.ankerl@gmail.com> wrote:

And now my question: Does anyone know where to find such tools or
algorithms?

Word (text) analysis is a very active branch of Information Theory.

Just Google for "word entropy" and spend the rest of your life surfing ;-)
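
For a first taste: the entropy of a word-frequency distribution is just
H = -sum(p * log2(p)) over the word probabilities p. A quick Ruby sketch
with toy numbers (nothing to do with the posted tool):

  # entropy, in bits, of a word-frequency distribution
  def word_entropy(counts)
    total = counts.values.inject(0) { |s, n| s + n }.to_f
    counts.values.inject(0.0) do |h, n|
      p = n / total
      h - p * Math.log(p) / Math.log(2)   # Math.log is the natural log
    end
  end

  puts word_entropy("the" => 10, "cat" => 3, "sat" => 1)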

HTH,
-- shanko


Thought I'd give this simple program a go and review it for those curious as to how well it works. Two tests were carried out: in the first I used only the following texts; the second repeated the first with additional training texts. The texts are from Project Gutenberg (except openbsd35.readme.txt, which for some reason was in the same directory).

$ ls *.txt
8ldvc10.txt openbsd35.readme.txt
grimm10.txt sunzu10.txt

$ cat *.txt |ruby textanalyze.rb c
reading...
Indexed 49916 words in 1.001535 seconds, 49839.4963730673 words per second
Indexed 117545 words in 2.005191 seconds, 58620.3508792928 words per second
Indexed 184142 words in 3.013597 seconds, 61103.7242205909 words per second
Indexed 245471 words in 4.035581 seconds, 60826.6814617276 words per second
Indexed 300307 words in 5.045199 seconds, 59523.3210820822 words per second
Indexed 351646 words in 6.052536 seconds, 58098.9522408458 words per second
Indexed 414601 words in 7.055078 seconds, 58766.3240576504 words per second
Indexed 416108 words in 8.056517 seconds, 51648.6218548288 words per second
storing into wordcount.dat...
Indexed 416108 words in 8.159865 seconds, 50994.4711095098 words per second

I then fed it a text version of my marketing essay, which should have very little, if anything, in common with the training texts.

$ cat ../assignment1.txt|ruby textanalyze.rb a
loading wordcount.dat...
reading...
analyzing...
most characteristic words:
marketing, customers, customer, purchase, interaction

I then added more texts:

$ ls *.txt
8ldvc10.txt openbsd35.readme.txt tprnc11.txt
dracu13.txt repub11.txt warw12.txt
grimm10.txt sunzu10.txt

and reran the creation and analysis steps above, to get:

most characteristic words:
marketing, customers, customer, 4ps, interaction

So, not bad for such a simple algorithm. I would have picked the keywords as relationship, marketing, 4Ps,
and customer retention. I'm surprised coffee didn't show up, as I kept using it in examples. It doesn't do too badly in this simple test, especially considering that the training texts were chosen at random and are not related to the text analyzed. A dictionary of plurals, or some other means of dealing with plurals, would be my only suggestion.
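
Something along these lines is the kind of normalization I mean (a rough
sketch, not part of the tool):

  # naive plural folding: collapse common plural endings before counting,
  # so e.g. "customers" and "customer" are counted as one word
  def naive_singular(word)
    return word if word.length < 4
    return word.sub(/ies$/, "y") if word =~ /ies$/
    return word.sub(/es$/, "") if word =~ /(ches|shes|sses|xes)$/
    return word.chomp("s") if word =~ /[^s]s$/
    word
  end

  puts naive_singular("customers")   # => customer
  puts naive_singular("4ps")         # too short, left alone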

NB:

Jeff.


On Wed, 22 Sep 2004, Alexey Verkhovsky defenestrated me:

> I have created a little text analysis tool that tries to extract
> words that are important in a given text.

Would you care to explain what one could use this for?

I am not the author, but I can think of two...

I think it could be useful for classification of spam: apply
this filter and then do the Bayesian stuff. I bet it would significantly help
in classifying wordy spams as spam (Bayes will not do so well with things
like Nigerian spam messages).
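
Roughly the kind of glue I have in mind, as a purely hypothetical sketch
(none of these names exist in the tool or in any real filter):

  # feed the analyzer's characteristic words into a tiny naive-Bayes-style
  # score built from word counts of previously labelled mail
  def spam_score(words, spam_counts, ham_counts)
    spam_total = spam_counts.values.inject(1) { |s, n| s + n }
    ham_total  = ham_counts.values.inject(1) { |s, n| s + n }
    words.inject(0.0) do |score, w|
      p_spam = (spam_counts[w].to_i + 1).to_f / spam_total
      p_ham  = (ham_counts[w].to_i + 1).to_f / ham_total
      score + Math.log(p_spam / p_ham)   # > 0 leans spam, < 0 leans ham
    end
  end

  words = %w[mobile offer winner]               # e.g. output of the analyzer
  spam  = { "offer" => 30, "winner" => 12 }     # counts from known spam
  ham   = { "meeting" => 40, "offer" => 3 }     # counts from known ham
  puts spam_score(words, spam, ham)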

Another place I bet something like this is used is in Google page
ranking; they need algorithms for cutting out the noise.

-Tom


--
+ http://www.tc.umn.edu/~enebo +---- mailto:enebo@acm.org ----+

Thomas E Enebo, Protagonist | "A word is worth a thousand |
                            |  pictures" -Bruce Tognazzini |

I decided to do a somewhat more ambitious test. After training on
a thousand arbitrary .doc files and a thousand arbitrary .html files
(and tweaking it to return the top 15 words instead of just the top 5), I
fed it Why the lucky stiff's latest opus:

loading wordcount.dat...
reading...
analyzing...
most characteristic words:
he, his, cham, dr, said, ruby, goat, method,
irb, ree, paij, sentence, him, had, end

Not bad at all. Although I haven't read it yet myself, this looks like
a quite reasonable summary. I'm a little surprised at the absence of
flugel and trisomatic, but perhaps WTLS has gotten less predictable in
his vocabulary since the last time I read him.

-- Markus


The goal is to automatically create summaries of a text. For example,
if you have a large text file and no idea what it is about, the
analyzer should be able to give you a short summary of the file.
Another nice idea might be to add such a feature to a blog: each
entry could show a short summary, or at least the most important
words.

martinus

Markus wrote:

Not bad at all. Although I haven't read it yet myself, this looks like
a quite reasonable summary. I'm a little surprised at the absence of
flugel and trisomatic, but perhaps WTLS has gotten less predictable in
his vocabulary since the last time I read him.

That post made me smile, since it was ambiguous in its heading at the
very least. Do you actually like reading WTLS as much as you seem to?

--
kaspar

semantics & semiotics
code manufacture

www.tua.ch/ruby

You should use training material that is similar to the text you want
to analyze, for best results. I don't think it is useful to train on .doc
documents when you want to analyze HTML files.

martinus

I think it could be useful for classification of spam. Apply
this filter and then do bayesian stuff. I bet it would significantly help
in classifying wordy spams as spams (Bayes will not do so well with things
like Nigerian spam messages).

Not sure why you'd think this, but POPFile (a "pure", i.e.
non-Grahamesque) Bayesian filter does extraordinarily well with 419s,
Nigerian, and similar spam.

For me, anyway.

I love reading his stuff. I have, however, been asked to wait a
day or two after reading anything he wrote before writing any
documentation or client proposals. I almost made one of our lawyers
turn blue once, but the shade did not suit him.

     -- Markus


On Wed, 2004-09-22 at 05:56, Kaspar Schiess wrote:


That post made me smile, since it was ambiguous in its heading at the
very least. Do you actually like reading WTLS as much as you seem to?

martinus wrote:

You should use training material that is similar to the text you want
to analyze, for best results. I don't think it is useful to train on .doc
documents when you want to analyze HTML files.

Can you clarify this? Do you mean:

1. The text is not pulled from the format but retains some residue from
where it came from (JuliusCaesar.doc will train differently from
JuliusCaesar.html).

2. The material should be of the same general type, coming from the same
type of source; but the actual format does not affect training.

3. Something else?

Hal

Michael Campbell wrote:

I think it could be useful for classification of spam. Apply
this filter and then do bayesian stuff. I bet it would significantly help
in classifying wordy spams as spams (Bayes will not do so well with things
like Nigerian spam messages).

Not sure why you'd think this, but POPFile (a "pure", i.e.
non-Grahamesque) Bayesian filter does extraordinarily well with 419s,

Pardon my profound ignorance, but what do you call '419s'?
TIA
Bruno

The text is never pulled from any format. If you train on only HTML files
and then analyze HTML files, the HTML tags are treated just like normal
words. They simply don't show up in the results, because they are used
roughly equally often in both the training texts and the analyzed text.
The algorithm is very simple and makes absolutely no assumptions about
the input.
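
In Ruby, the idea boils down to roughly this (a condensed sketch of the
idea, not the actual code of the tool):

  # a word is "characteristic" if it occurs much more often in the analyzed
  # text than the training index would lead you to expect
  def characteristic_words(text, index, index_total, top = 5)
    counts = Hash.new(0)
    text.downcase.scan(/[a-z']+/) { |w| counts[w] += 1 }
    total = counts.values.inject(0) { |s, n| s + n }
    scored = counts.map do |word, n|
      expected = (index[word].to_i + 1).to_f / index_total  # from the training index
      observed = n.to_f / total                             # in the analyzed text
      [word, observed / expected]
    end
    scored.sort_by { |pair| -pair[1] }.first(top).map { |pair| pair[0] }
  end

  index = { "the" => 50000, "ruby" => 20 }   # toy training counts
  text  = "the genie gave aladdin the lamp"
  puts characteristic_words(text, index, 1_000_000).join(", ")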

Markus wrote:


I love reading his stuff. I have, however, been asked to wait a
day or two after reading anything he wrote before writing any
documentation or client proposals. I almost made one of our lawyers
turn blue once, but the shade did not suit him.

LOL

bruno modulix wrote:


Pardon my profound ignorance, but what do you call '419s'?

I didn't know either, but a Google search for '419 spam' gave me
this interesting link: http://home.rica.net/alphae/419coal/

Hal