DisjointLetterTokenizer¶
-
class tkseem.DisjointLetterTokenizer(unk_token='<UNK>', pad_token='<PAD>', vocab_size=10000, special_tokens=[])[source]¶
Bases: tkseem._base.BaseTokenizer
Tokenization based on Arabic disjoint letters, i.e. letters that do not connect to the letter that follows them.
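As a rough illustration of the scheme, a minimal sketch that splits a word after every non-connecting Arabic letter (the exact letter set is an assumption for illustration; the library may use a slightly different one):

```python
import re

# Arabic letters that do not connect to the following letter
# (assumed set for illustration purposes)
DISJOINT = "اأإآءدذرزوؤة"

def split_disjoint(word):
    """Split a word after each non-connecting (disjoint) letter."""
    pieces = re.split(f"(?<=[{DISJOINT}])", word)
    return [p for p in pieces if p]

# "ولد": the waw does not connect, so the word splits into "و" + "لد"
print(split_disjoint("ولد"))
```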
Methods Summary
decode(encoded): Decode ids
detokenize(tokens): Convert tokens to a string
encode(text): Convert a string to a list of ids
encode_sentences(sentences[, boundries, out_length]): Encode a list of sentences using the trained model
id_to_token(id): Convert an id to a token
load_model(file_path): Load a saved model as a frequency dictionary
save_model(file_path): Save a model as a frequency dictionary
token_to_id(piece): Convert a token to its id
tokenize(text[, use_cache, max_cache_size]): Convert a string into a list of tokens
train(file_path): Train the tokenizer using disjoint letters
Methods Documentation
-
decode(encoded)¶
Decode ids
- Args:
encoded (list): list of ids to decode
- Returns:
list: tokens
-
detokenize(tokens)¶
Convert tokens to a string
- Args:
tokens (list): list of tokens
- Returns:
str: detokenized string
-
encode(text)¶
Convert a string to a list of ids
- Args:
text (str): input string
- Returns:
list: list of ids
-
encode_sentences(sentences, boundries=('', ''), out_length=None)¶
Encode a list of sentences using the trained model
- Args:
sentences (list): list of sentences
boundries (tuple): boundaries for each sentence.
out_length (int, optional): specify the max length of encodings. Defaults to 100.
- Returns:
[np.array]: numpy array of encodings
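The out_length behavior can be pictured as padding or truncating each id sequence to a fixed width before stacking into the array. A minimal sketch (the pad id and the exact padding policy are assumptions, not taken from the library):

```python
import numpy as np

def pad_encodings(encodings, out_length, pad_id=0):
    """Pad or truncate each list of ids to out_length, then stack
    into a 2-D array (sketch of what out_length does; pad_id assumed)."""
    out = np.full((len(encodings), out_length), pad_id, dtype=int)
    for i, ids in enumerate(encodings):
        ids = ids[:out_length]          # truncate long sequences
        out[i, :len(ids)] = ids         # left-align, pad the rest
    return out

print(pad_encodings([[5, 6], [7, 8, 9, 10]], out_length=3))
```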
-
id_to_token(id)¶
Convert an id to a token
- Args:
id (int): input id
- Returns:
str: token
-
load_model(file_path)¶
Load a saved model as a frequency dictionary
- Args:
file_path (str): file path of the dictionary
-
save_model(file_path)¶
Save a model as a frequency dictionary
- Args:
file_path (str): file path to save the model
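Since save_model and load_model round-trip a frequency dictionary, the idea can be sketched with plain pickle (the library's actual on-disk format is an assumption; only the dictionary round-trip is illustrated):

```python
import os
import pickle
import tempfile

# toy frequency dictionary standing in for a trained model
freq = {"و": 10, "لد": 4}

path = os.path.join(tempfile.mkdtemp(), "freq.pl")
with open(path, "wb") as f:   # save_model analogue
    pickle.dump(freq, f)
with open(path, "rb") as f:   # load_model analogue
    loaded = pickle.load(f)
```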
-
token_to_id(piece)¶
Convert a token to its id
- Args:
piece (str): input token
- Returns:
int: token id
-
tokenize(text, use_cache=False, max_cache_size=1000)¶
Convert a string into a list of tokens
- Args:
text (str): input text
use_cache (bool, optional): speed up using caching. Defaults to False.
max_cache_size (int, optional): max cache size. Defaults to 1000.
- Returns:
list: output list of tokens
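The use_cache and max_cache_size options amount to memoizing tokenization per input string. A minimal sketch with functools.lru_cache (the eviction policy and the whitespace-split stand-in are assumptions, not the library's implementation):

```python
from functools import lru_cache

@lru_cache(maxsize=1000)           # plays the role of max_cache_size
def cached_tokenize(text):
    # stand-in for the real tokenizer: a plain whitespace split
    return tuple(text.split())

cached_tokenize("مرحبا بالعالم")   # first call: computed
cached_tokenize("مرحبا بالعالم")   # second call: served from the cache
```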
-