MorphologicalTokenizer

class tkseem.MorphologicalTokenizer(unk_token='<UNK>', pad_token='<PAD>', vocab_size=10000, special_tokens=[])[source]

Bases: tkseem._base.BaseTokenizer

Morphological tokenization using a saved dictionary

Methods Summary

decode(encoded)

Decode ids

detokenize(tokens)

Convert tokens to a string

encode(text)

Convert string to a list of ids

encode_sentences(sentences[, boundries, …])

Encode a list of sentences using the trained model

id_to_token(id)

Convert id to token

load_model(file_path)

Load a saved model as a frequency dictionary

save_model(file_path)

Save a model as a frequency dictionary

token_to_id(piece)

Convert a token to its id

tokenize(text[, use_cache, max_cache_size])

Tokenize text into a list of tokens

train()

Use a default dictionary for training

Methods Documentation

decode(encoded)

Decode ids

Args:

encoded (list): list of ids to decode

Returns:

list: tokens
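
Example (a minimal sketch; assumes a trained tokenizer, and the Arabic sample phrase is a placeholder):

>>> import tkseem as tk
>>> tokenizer = tk.MorphologicalTokenizer()
>>> tokenizer.train()
>>> ids = tokenizer.encode("السلام عليكم")
>>> tokens = tokenizer.decode(ids)  # maps each id back to its token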

detokenize(tokens)

Convert tokens to a string

Args:

tokens (list): list of tokens

Returns:

str: detokenized string
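
Example (a sketch; continues the trained tokenizer from the decode example above):

>>> tokens = tokenizer.tokenize("السلام عليكم")
>>> text = tokenizer.detokenize(tokens)  # joins the tokens back into a string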

encode(text)

Convert string to a list of ids

Args:

text (str): input string

Returns:

list: list of ids
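
Example (a sketch; encode is the id-level counterpart of tokenize):

>>> ids = tokenizer.encode("السلام عليكم")  # one id per token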

encode_sentences(sentences, boundries=('', ''), out_length=None)

Encode a list of sentences using the trained model

Args:

sentences (list): list of sentences

boundries (tuple): boundary tokens to add around each sentence

out_length (int, optional): specify the max length of encodings. Defaults to None.

Returns:

np.ndarray: numpy array of encodings
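
Example (a sketch; the sentences and out_length value are placeholders, and the padded output shape is an assumption):

>>> batch = tokenizer.encode_sentences(["السلام عليكم", "كيف الحال"], out_length=16)
>>> batch.shape  # expected: (2, 16), one padded row per sentence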

id_to_token(id)

Convert id to token

Args:

id (int): input id

Returns:

str: token
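
Example (a sketch; the inverse of token_to_id):

>>> tokenizer.id_to_token(0)  # token stored at id 0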

load_model(file_path)

Load a saved model as a frequency dictionary

Args:

file_path (str): file path of the dictionary
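
Example (a sketch; the 'vocab.pl' file name follows the project README and is assumed to exist):

>>> tokenizer = tk.MorphologicalTokenizer()
>>> tokenizer.load_model("vocab.pl")  # restores the saved frequency dictionary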

save_model(file_path)

Save a model as a frequency dictionary

Args:

file_path (str): file path to save the model
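
Example (a sketch; 'vocab.pl' is a file name of your choosing):

>>> tokenizer.save_model("vocab.pl")  # writes the frequency dictionary to disk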

token_to_id(piece)

Convert a token to its id

Args:

piece (str): input token

Returns:

int: id of the token
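
Example (a sketch; '<UNK>' is the default unknown token from the constructor):

>>> tokenizer.token_to_id("<UNK>")  # id of the unknown token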

tokenize(text, use_cache=False, max_cache_size=1000)

Tokenize text into a list of tokens

Args:

text (str): input text

use_cache (bool, optional): speed up tokenization by caching results. Defaults to False.

max_cache_size (int, optional): max cache size. Defaults to 1000.

Returns:

list: output list of tokens
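
Example (a sketch; caching mainly helps when the same words repeat across calls):

>>> tokens = tokenizer.tokenize("السلام عليكم", use_cache=True)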

train()[source]

Use a default dictionary for training
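
Example (a minimal sketch; per the docstring, train() needs no corpus because it uses the default dictionary):

>>> import tkseem as tk
>>> tokenizer = tk.MorphologicalTokenizer(vocab_size=10000)
>>> tokenizer.train()
>>> tokenizer.tokenize("السلام عليكم")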