Rainbow is a C program that performs document classification using one of several different methods, including naive Bayes, TFIDF/Rocchio, K-nearest neighbor, Maximum Entropy, Support Vector Machines, Fuhr's Probabilitistic Indexing, and a simple-minded form a shrinkage with naive Bayes.
Rainbow's accompanying library, `libbow', is a library of C code intended for support of statistical text-processing programs. The current source distribution includes the library, a text classification front-end (rainbow), a simple TFIDF-based document retrieval front-end (arrow), an AltaVista-style document retrieval front-end (archer), and a unsupported document clustering front-end with hierarchical clustering and deterministic annealing (crossbow).
The library provides facilities for:
* Recursively descending directories, finding text files.
* Finding `document' boundaries when there are multiple docs per file.
* Tokenizing a text file, according to several different methods.
* Including N-grams among the tokens.
* Mapping strings to integers and back again, very efficiently.
* Building a sparse matrix of document/token counts.
* Pruning vocabulary by occurrence counts or by information gain.
* Building and manipulating word vectors.
* Setting word vector weights according to NaiveBayes, TFIDF, and a
simple form of Probabilistic Indexing.
* Scoring queries for retrieval or classification.
* Writing all data structures to disk in a compact format.
* Reading the document/token matrix from disk in an efficient,
sparse fashion.
* Performing test/train splits, and automatic classification tests.
* Operating in server mode, receiving and answering queries over a
socket.
|
It is relatively efficient. Reading, tokenizing and indexing the raw text of 20,000 UseNet articles takes about 3 minutes. Building a naive Bayes classifier from 10,000 articles, and classifying the other 10,000 takes about 1 minute.
The code conforms to the GNU coding standards. It is released under the Library GNU Public License (LGPL).
The library does not:
Have parsing facilities.
Do smoothing across N-gram models.
Claim to be finished.
Have good documentation.
Claim to be bug-free.
...many other things.
|