reverend package

Submodules

reverend.thomas module

class reverend.thomas.Bayes(tokenizer=None, combiner=None, data_class=None, training_data=None)[source]

Bases: object

build_cache()[source]

Merges corpora and computes probabilities.

commit()[source]
get_probs(pool, words)[source]

Extracts the probabilities of tokens in a message.

get_tokens(obj)[source]

By default, we expect obj to be a string and split it on whitespace.

Note that this does not change the case. In some applications you may want to lowercase everything so that “king” and “King” generate the same token.

Override this in your subclass for objects other than text.

Alternatively, you can pass in a tokenizer as part of instance creation.

guess(message)[source]

Guess which buckets the message belongs to.

Parameters:
  • message (str) – The message string to tokenize and subsequently classify.
Returns:

List of tuple pairs indicating which bucket(s) the message string is guessed to be classified under, and the ratio of certainty for this guess. As an example, a 99.99% probability that the input is a fowl would look like [('fowl', 0.9999)].

Return type:

list of tuple

load(file_path='bayesdata.dat')[source]

Load trained model data from a file path.

Parameters:file_path (str) – Path of database file.
load_handler(file_handler)[source]

Load trained model data from an open file handler.

Parameters:file_handler (file) – Open file pointer, or file-like object.
merge_pools(dest_pool, source_pool)[source]

Merge an existing pool into another.

The data from source_pool is merged into dest_pool. The arguments are the names of the pools to be merged. The pool named source_pool is left intact, and you may want to call remove_pool() to get rid of it.
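
Conceptually, assuming each pool boils down to a mapping of token to count (the function and variable names below are hypothetical, not reverend's internals), merging amounts to summing counts while leaving the source untouched:

```python
# Hypothetical sketch of pool merging: sum per-token counts from
# `source` into `dest`; `source` itself is not modified.
def merge_counts(dest, source):
    for token, count in source.items():
        dest[token] = dest.get(token, 0) + count
    return dest

spam = {"viagra": 3, "free": 1}
ham = {"free": 2, "meeting": 4}
merge_counts(spam, ham)
# spam now holds the combined counts; ham is unchanged.
```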

new_pool(pool_name)[source]

Create a new pool, without actually doing any training.

pool_data(pool_name)[source]

Return a list of (token, count) tuples for the given pool.

pool_names()[source]

Return a sorted list of Pool names.

Does not include the system pool ‘__Corpus__’.

pool_probs()[source]
pool_tokens(pool_name)[source]

Return a list of the tokens in this pool.

remove_pool(pool_name)[source]
rename_pool(pool_name, new_name)[source]
static robinson(probs, _)[source]

Computes the probability of a message being spam (Robinson’s method):

P = 1 - prod(1 - p)^(1/n)
Q = 1 - prod(p)^(1/n)
S = (1 + (P - Q) / (P + Q)) / 2

Courtesy of http://christophe.delord.free.fr/en/index.html
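
The formulas above can be sketched in plain Python. This is an illustrative re-derivation, assuming probs is a flat list of per-token probabilities (the library's combiner signature takes a second, ignored argument):

```python
import math

def robinson(probs, _=None):
    n = len(probs)
    # P = 1 - prod(1 - p)^(1/n), Q = 1 - prod(p)^(1/n)
    P = 1.0 - math.prod(1.0 - p for p in probs) ** (1.0 / n)
    Q = 1.0 - math.prod(probs) ** (1.0 / n)
    # S rescales (P - Q) / (P + Q) from [-1, 1] into [0, 1]
    return (1.0 + (P - Q) / (P + Q)) / 2.0

robinson([0.9, 0.9, 0.9])  # 0.9 — uniformly "spammy" tokens push S up
robinson([0.5, 0.5, 0.5])  # 0.5 — neutral tokens leave S at the midpoint
```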

static robinson_fisher(probs, _)[source]

Computes the probability of a message being spam (Robinson-Fisher method):

H = C^-1(-2 * ln(prod(p)), 2*n)
S = C^-1(-2 * ln(prod(1 - p)), 2*n)
I = (1 + H - S) / 2

Courtesy of http://christophe.delord.free.fr/en/index.html
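
A self-contained sketch of the same computation, taking the C^-1 term to be the inverse chi-square survival function described under chi_2_p() further below (assumption: probs is a flat list of per-token probabilities):

```python
import math

def chi_2_p(chi, df):
    # P(chisq >= chi) for even df, via the closed-form series
    m = chi / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def robinson_fisher(probs, _=None):
    n = len(probs)
    H = chi_2_p(-2.0 * math.log(math.prod(probs)), 2 * n)
    S = chi_2_p(-2.0 * math.log(math.prod(1.0 - p for p in probs)), 2 * n)
    # I sits at 0.5 when the evidence is perfectly balanced
    return (1.0 + H - S) / 2.0

robinson_fisher([0.5, 0.5, 0.5, 0.5])  # 0.5 by symmetry
```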

save(file_path='bayesdata.dat')[source]

Save the trained model to the given file path.

Parameters:file_path (str) – Path of database file.
save_handler(file_handler)[source]

Save the trained model to the open file handler.

Parameters:file_handler (file) – Open file pointer, or file-like object.
train(pool, item, uid=None)[source]

Train Bayes by telling him that item belongs in pool.

uid is optional and may be used to uniquely identify the item that is being trained on.
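
As a toy sketch of the train/guess cycle (not reverend's actual probability math; the storage layout and scoring rule here are illustrative only):

```python
from collections import defaultdict

# pool name -> token -> count
pools = defaultdict(lambda: defaultdict(int))

def train(pool, item):
    # tokenize on whitespace and count token occurrences per pool
    for token in item.split():
        pools[pool][token] += 1

def guess(message):
    # score each pool by the fraction of the message's tokens it has seen
    tokens = message.split()
    scores = {
        name: sum(1 for t in tokens if t in counts) / max(len(tokens), 1)
        for name, counts in pools.items()
    }
    return sorted(scores.items(), key=lambda kv: -kv[1])

train("fowl", "cluck cluck gobble")
train("hound", "woof woof bark")
guess("gobble cluck")  # ranks 'fowl' ahead of 'hound'
```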

trained_on(msg)[source]
untrain(pool, item, uid=None)[source]
class reverend.thomas.BayesData(name='', pool=None)[source]

Bases: dict

trained_on(item)[source]
class reverend.thomas.Tokenizer(lower=False)[source]

A simple regex-based whitespace tokenizer.

It expects a string and can return all tokens lower-cased or in their existing case.

WORD_RE = <_sre.SRE_Pattern object>
tokenize(obj)[source]
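
An illustrative equivalent of such a tokenizer; the regex below is an assumption about WORD_RE, not necessarily the library's actual pattern:

```python
import re

class Tokenizer:
    WORD_RE = re.compile(r"\S+")  # runs of non-whitespace

    def __init__(self, lower=False):
        self.lower = lower

    def tokenize(self, obj):
        tokens = self.WORD_RE.findall(obj)
        return [t.lower() for t in tokens] if self.lower else tokens

Tokenizer().tokenize("King and king")        # ['King', 'and', 'king']
Tokenizer(lower=True).tokenize("King king")  # ['king', 'king']
```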
reverend.thomas.chi_2_p(chi, df)[source]

Return P(chisq >= chi) with df degrees of freedom.

df must be even.
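
For even df = 2k the chi-square survival function has a closed form, P(chisq >= chi) = exp(-chi/2) * sum over i < k of (chi/2)^i / i!, which a sketch implementation might compute as:

```python
import math

def chi_2_p(chi, df):
    # closed form valid only for even degrees of freedom
    assert df % 2 == 0, "df must be even"
    m = chi / 2.0
    term = math.exp(-m)   # i = 0 term of the series
    total = term
    for i in range(1, df // 2):
        term *= m / i     # next term: (chi/2)^i / i!
        total += term
    return min(total, 1.0)

chi_2_p(0.0, 2)  # 1.0 — every chi-square value is >= 0
```

For df = 2 this reduces to exp(-chi/2), which gives a quick sanity check.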

Module contents