Thought I'd give this simple program a go and review it for those
curious as to how well it works. Two tests were carried out: in the
first I used only the following texts; the second repeated the first
with additional training texts. The texts are from Project Gutenberg
(except openbsd35.readme.txt, which for some reason was in the same
directory).
$ ls *.txt
8ldvc10.txt openbsd35.readme.txt
grimm10.txt sunzu10.txt
$ cat *.txt |ruby textanalyze.rb c
reading...
Indexed 49916 words in 1.001535 seconds, 49839.4963730673 words per second
Indexed 117545 words in 2.005191 seconds, 58620.3508792928 words per second
Indexed 184142 words in 3.013597 seconds, 61103.7242205909 words per second
Indexed 245471 words in 4.035581 seconds, 60826.6814617276 words per second
Indexed 300307 words in 5.045199 seconds, 59523.3210820822 words per second
Indexed 351646 words in 6.052536 seconds, 58098.9522408458 words per second
Indexed 414601 words in 7.055078 seconds, 58766.3240576504 words per second
Indexed 416108 words in 8.056517 seconds, 51648.6218548288 words per second
storing into wordcount.dat...
Indexed 416108 words in 8.159865 seconds, 50994.4711095098 words per second
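The indexing step can be sketched in a few lines of Ruby. This is only a guess at the internals based on the output above; the tokenizer, the use of Marshal, and the wordcount.dat filename are assumptions, not the actual source:

```ruby
# Hypothetical sketch of the "c" (create) step: tokenize text,
# count word frequencies, and persist the table to wordcount.dat.
def index_words(text, counts = Hash.new(0))
  # Lowercase and split on runs of letters/apostrophes (assumed tokenizer).
  text.downcase.scan(/[a-z']+/) { |w| counts[w] += 1 }
  counts
end

counts = index_words("the cat and the hat")
File.open("wordcount.dat", "wb") { |f| Marshal.dump(counts, f) }
```

The analysis step would then reload the table with `Marshal.load` before scoring the new text against it.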
I then fed it a text version of my marketing essay, which should have
very little, if anything, in common with the training texts.
$ cat ../assignment1.txt|ruby textanalyze.rb a
loading wordcount.dat...
reading...
analyzing...
most characteristic words:
marketing, customers, customer, purchase, interaction
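One plausible way to score "characteristic" words (an assumption on my part; not necessarily the actual algorithm) is to rank each word by how much more frequent it is in the analyzed text than in the training corpus:

```ruby
# Hypothetical scoring: relative frequency in the document divided by
# relative frequency in the training corpus, with +1 smoothing so
# words unseen in training (like "marketing" here) score highest.
def characteristic_words(doc_counts, corpus_counts, top = 5)
  doc_total = doc_counts.values.inject(0) { |s, n| s + n }.to_f
  corpus_total = corpus_counts.values.inject(0) { |s, n| s + n }.to_f
  scores = doc_counts.map do |word, n|
    corpus_freq = (corpus_counts[word] || 0) + 1
    [word, (n / doc_total) / (corpus_freq / corpus_total)]
  end
  scores.sort_by { |_, s| -s }.first(top).map(&:first)
end
```

Under this scheme common words like "the" score near 1 and drop out, while domain words absent from the fairy tales float to the top, which matches the behaviour seen above.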
I then added more texts:
$ ls *.txt
8ldvc10.txt openbsd35.readme.txt tprnc11.txt
dracu13.txt repub11.txt warw12.txt
grimm10.txt sunzu10.txt
and reran the creation and analysis steps above, to get:
most characteristic words:
marketing, customers, customer, 4ps, interaction
So, not bad for such a simple algorithm. I would have picked the
keywords as relationship, marketing, 4Ps, and customer retention. I'm
surprised coffee didn't show up, as I kept using it in examples. It
doesn't do too badly in this simple test, especially considering that
the training texts were chosen at random and are not related to the
text analyzed. A dictionary of plurals, or some other means of dealing
with plurals, would be my only suggestion.
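The plural suggestion could be prototyped without a full dictionary. This naive pass (hypothetical; a real stemmer such as the Porter stemmer would do better) folds a plural into its singular only when the singular already appears in the count table, so words like "waves" with no singular entry are left alone:

```ruby
# Merge trailing-s plurals into their singular form, but only when
# the singular itself occurs in the table (avoids mangling words
# that merely end in "s").
def fold_plurals(counts)
  folded = Hash.new(0)
  counts.each do |word, n|
    singular = word.sub(/s\z/, "")
    key = (word.end_with?("s") && counts.key?(singular)) ? singular : word
    folded[key] += n
  end
  folded
end
```

Run over the index before scoring, this would merge "customer" and "customers" into one entry.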
Jeff.
On 22/09/2004, at 7:54 AM, martinus wrote:
> I have created a little text analysis tool that tries to extract
> words that are important in a given text. I have implemented one of my
> strange ideas and, to my own surprise, it works. I have no idea if any
> similar tool exists, so I do not know where to post this. It is
> written in Ruby, so I just post it here.
>
> To use this tool, you first have to index a large amount of text files.
> It generates an index, which is later used when analyzing text.
>
> For example, I have indexed several fairy tales, and used this index to
> extract important words. Here are some results:
>
> Little Red Riding Hood.txt: hood, grandma, riding, hunter, red
> Little Mermaid.txt: sirenetta, mermaid, sea, waves, sisters
> Alladin.txt: aladdin, lamp, genie, sultan, wizard
>
> The algorithm works with HTML files and probably any other format that
> contains text. Here is an example of analysis results when HTML
> files are indexed:
>
> SSL-RedHat-HOWTO.htm: certificate, ssl, private, key, openssl
> META-FAQ.html: newsgroup, comp, sunsite, questions, announce
> TeTeX-HOWTO.html: tetex, tex, ctan, latex, archive
>
> And now my question: Does anyone know where to find such tools or
> algorithms?
>
> You can get it from here, it's public domain:
> http://martinus.geekisp.com/rublog.cgi/Projects/TextAnalyzer
>
> martinus