[ANN] Ferret 0.1.0 (Port of Java Lucene) released

Yes, and also one called rubylucene. Unfortunately Erik Hatcher never had the
time to get those projects off the ground. Hopefully he'll have time to help
me out now that the port is finished, though. :wink:

···

On 10/22/05, Hal Fulton <hal9000@hypermetrics.com> wrote:

I'm probably wrong, but I thought we already had
a port of this? A thing called Rucene?

Hi Miles,
Currently the query parser struggles with UTF-8, but apart from that you
should be able to use UTF-8. You'll need to write your own analyzer for
whatever language you are using, as well as implement your own sort
procedure if you want to sort results by strings. Here is an example I
tried. The strings are Chinese, so apologies if your browser can't display
them. (I have no idea what they mean.)

require 'rubygems'
require 'ferret'
include Ferret

class ChineseAnalyzer
  def token_stream(field, string)
    tokenizer = Analysis::RegExpTokenizer.new(string)
    # Tokenize one character at a time; Chinese has no whitespace
    # between words, so each character becomes its own token.
    class << tokenizer
      def token_re() /./ end
    end
    return tokenizer
  end
end

docs = ["道德经", "搜索所有网页", "搜索所有中文网页", "搜索简体中文网页"]

index = Index::Index.new(:analyzer => ChineseAnalyzer.new)

docs.each { |doc| index << doc }

puts index[3][""] # show the stored contents of document 3's default field

tq = Search::TermQuery.new(Index::Term.new("", "网"))

index.search_each(tq) do |doc, score|
puts "Document #{doc} found with score #{score}"
end

index.close
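The per-character trick in the analyzer above can be demonstrated without Ferret at all. Splitting on /./ yields one token per character, which is a common fallback for languages that don't delimit words with whitespace (this is just an illustration of the tokenizing strategy, not Ferret code):

```ruby
# Pure-Ruby illustration of the tokenizing strategy used in the
# ChineseAnalyzer above: with no word delimiters in the text, each
# character becomes its own token.
text = "搜索所有网页"
tokens = text.scan(/./)
p tokens      # => ["搜", "索", "所", "有", "网", "页"]
p tokens.size # => 6
```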

···

On 10/22/05, David Balmain <dbalmain.ml@gmail.com> wrote:

On 10/22/05, Miles Keaton <mileskeaton@gmail.com> wrote:

> Have you looked into what it would take to allow all text to be UTF-8?
> Would it take a complete overhaul or a somewhat-minor tweak?
> If a tweak, what would you charge to make that lovely update? :slight_smile:

Hi Devin,
I'm afraid I've only briefly looked at those other IR systems, but I'll try
to answer your question as best I can. I think Ferret is currently pretty
easy to learn and use through the Index interface described in my original
post, so ease of use shouldn't turn you off. Once I've done a bit more work
on the documentation, I think it'll be a lot easier to find your way around
than some of the other ones. But it'll be significantly slower than the
C-library-backed search engines. I'm certainly not the type of person to say
speed isn't important; however, I think Ferret should easily handle the kind
of website you are talking about.

Ferret should be a lot faster than SimpleSearch for large document sets.
Having said that, there is a Ruby Quiz coming up for which I intend to write
a quick and simple search engine that will easily outperform SimpleSearch,
so if people are interested, I might make that a project too.

== As for the others, the main advantages of Ferret are:

* a more powerful, extensible query language. You can do Boolean, phrase,
range, fuzzy (for misspellings etc.), wildcard, sloppy phrase (out-of-order
phrases) queries and more. Check out the QueryParser in the API for more
info on the query language.
http://ferret.davebalmain.com/api/classes/Ferret/QueryParser.html

* a more powerful document structure. I could be wrong about this, so
someone please correct me if I am, but I think most of the other IRs just
take a string as a document. Ferret's documents can have multiple fields,
and each field can have a different analyzer (which parses the field into
tokens). You can store binary fields like images or compress your data. In
fact, you could do away with a database altogether and just use Ferret.
(You can also store term vectors if you want to compare document
similarities, but that's getting pretty technical.)

* Ferret is pure Ruby (at least it can be, if you don't install the C
extension), so it'll run anywhere Ruby does.

* If you are patient, Ferret will one day match or beat the speed of those
other search engines. Hopefully by Christmas, but it all depends on how much
help I can get between now and then.
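To make the multi-field document idea above concrete, here is a toy fielded inverted index in plain Ruby. This is only a sketch of the concept (the class and method names are made up for illustration), not Ferret's actual API or data structures:

```ruby
# A toy fielded index: each document is a hash of field => text, and the
# index maps [field, token] pairs to the ids of documents containing them.
# Unlike a flat string index, a search can be scoped to a single field.
class ToyFieldedIndex
  def initialize
    @postings = Hash.new { |h, k| h[k] = [] }
    @docs = []
  end

  # Add a document (a hash of field => text) and return its id.
  def add(doc)
    id = @docs.size
    @docs << doc
    doc.each do |field, text|
      text.downcase.scan(/\w+/).uniq.each do |token|
        @postings[[field, token]] << id
      end
    end
    id
  end

  # Look up a single term in a single field, like a simple TermQuery.
  def search(field, term)
    @postings[[field, term.downcase]]
  end
end

idx = ToyFieldedIndex.new
idx.add(title: "Ferret released", body: "a port of Java Lucene")
idx.add(title: "Lucene in Action", body: "a book about Lucene")

p idx.search(:title, "lucene") # => [1] -- field-scoped match
p idx.search(:body, "lucene")  # => [0, 1]
```

The same term matches different documents depending on which field you search, which is the point of fielded documents.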

== And the main disadvantages:

* Ferret is still alpha and has not been put into production yet. Hopefully
that will change soon.

* Ferret is currently slower than the C-backed IRs.
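Of the query types listed under the advantages, fuzzy queries deserve a quick illustration: they match terms within a small edit (Levenshtein) distance of the query term, which is what catches misspellings. Here is a minimal pure-Ruby sketch of that metric; it only illustrates the idea and is not Ferret's implementation:

```ruby
# Levenshtein distance: the minimum number of single-character insertions,
# deletions, and substitutions needed to turn string a into string b.
# Fuzzy queries match terms whose distance from the query term is small.
def edit_distance(a, b)
  # dp[j] holds the distance between the current prefix of a and b[0, j].
  dp = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    prev_diag = dp[0] # distance for the previous prefixes of both strings
    dp[0] = i
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      prev_diag, dp[j] =
        dp[j], [dp[j] + 1, dp[j - 1] + 1, prev_diag + cost].min
    end
  end
  dp[b.length]
end

# A fuzzy query like roam~ would match terms within some distance cutoff:
puts edit_distance("roam", "foam")  # => 1
puts edit_distance("roam", "roams") # => 1
```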

Anyway, sorry for such a long email. It's really hard to describe all the
features available. In fact, there is a whole book on Lucene (Lucene in
Action) by Erik Hatcher and Otis Gospodnetic, which I highly recommend if
you want to take full advantage of all the features in Ferret. Most of the
examples should translate pretty easily into Ruby.

Please let me know if you have any more questions.
Regards,
Dave

···

On 10/21/05, Devin Mullins <twifkak@comcast.net> wrote:

Question for those (soon to be) in the know:

How does this compare to Estraier, Hyper Estraier, Ruby-Odeum, SimpleSearch,
and the other 'IR' systems with Ruby bindings, in terms of ease of learning,
ease of use, ease of maintenance, speed, and any other noteworthy
attributes? To put it simply, which one should I choose? :slight_smile:

one correction:
RegExpTokenizer should be RETokenizer
(at least in the version you've released publicly)

···

require 'rubygems'
require 'ferret'
include Ferret

class ChineseAnalyzer
  def token_stream(field, string)
    tokenizer = Analysis::RegExpTokenizer.new(string)
    class << tokenizer
      def token_re() /./ end
    end
    return tokenizer
  end
end

docs = ["道德经", "搜索所有网页", "搜索所有中文网页", "搜索简体中文网页"]
index = Index::Index.new(:analyzer => ChineseAnalyzer.new)
docs.each { |doc| index << doc }
puts index[3][""]
tq = Search::TermQuery.new(Index::Term.new("", "网"))
index.search_each(tq) do |doc, score|
puts "Document #{doc} found with score #{score}"
end
index.close

Doh, I meant to change that for the email. Anyway, in case you're
interested, I also fixed the problem with the query parser, so it should now
parse UTF-8 queries as well. That'll be out in the next release.

Dave

···

On 10/23/05, Miles Keaton <mileskeaton@gmail.com> wrote:

one correction:
RegExpTokenizer should be RETokenizer
(at least in the version you've released publicly)