DisjointLetterTokenizer¶
-
class tkseem.DisjointLetterTokenizer(unk_token='<UNK>', pad_token='<PAD>', vocab_size=10000, special_tokens=[])[source]¶
Bases: tkseem._base.BaseTokenizer
Tokenization based on Arabic disjoint letters, i.e. letters that do not connect to the letter that follows them.
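As a rough illustration of the scheme, a minimal sketch that splits a word after every non-connecting Arabic letter (the exact letter set is an assumption for illustration; the library may use a slightly different one):

```python
import re

# Arabic letters that do not connect to the following letter
# (assumed set for illustration purposes)
DISJOINT = "اأإآءدذرزوؤة"

def split_disjoint(word):
    """Split a word after each non-connecting (disjoint) letter."""
    pieces = re.split(f"(?<=[{DISJOINT}])", word)
    return [p for p in pieces if p]

# "ولد": the waw does not connect, so the word splits into "و" + "لد"
print(split_disjoint("ولد"))
```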
Methods Summary
decode(encoded): Decode ids
detokenize(tokens): Convert tokens to a string
encode(text): Convert a string to a list of ids
encode_sentences(sentences[, boundries, out_length]): Encode a list of sentences using the trained model
id_to_token(id): Convert an id to a token
load_model(file_path): Load a saved model as a frequency dictionary
save_model(file_path): Save a model as a frequency dictionary
token_to_id(piece): Convert a token to its id
tokenize(text[, use_cache, max_cache_size]): Convert a string into a list of tokens
train(file_path): Train the tokenizer using disjoint letters
Methods Documentation
-
decode(encoded)¶
Decode ids
- Args:
encoded (list): list of ids to decode
- Returns:
list: tokens
-
detokenize(tokens)¶
Convert tokens to a string
- Args:
tokens (list): list of tokens
- Returns:
str: detokenized string
-
encode(text)¶
Convert a string to a list of ids
- Args:
text (str): input string
- Returns:
list: list of ids
-
encode_sentences(sentences, boundries=('', ''), out_length=None)¶
Encode a list of sentences using the trained model
- Args:
sentences (list): list of sentences
boundries (tuple): boundaries for each sentence.
out_length (int, optional): specify the max length of encodings. Defaults to 100.
- Returns:
[np.array]: numpy array of encodings
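The out_length behavior can be pictured as padding or truncating each id sequence to a fixed width before stacking into the array. A minimal sketch (the pad id and the exact padding policy are assumptions, not taken from the library):

```python
import numpy as np

def pad_encodings(encodings, out_length, pad_id=0):
    """Pad or truncate each list of ids to out_length, then stack
    into a 2-D array (sketch of what out_length does; pad_id assumed)."""
    out = np.full((len(encodings), out_length), pad_id, dtype=int)
    for i, ids in enumerate(encodings):
        ids = ids[:out_length]          # truncate long sequences
        out[i, :len(ids)] = ids         # left-align, pad the rest
    return out

print(pad_encodings([[5, 6], [7, 8, 9, 10]], out_length=3))
```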
-
id_to_token(id)¶
Convert an id to a token
- Args:
id (int): input id
- Returns:
str: token
-
load_model(file_path)¶
Load a saved model as a frequency dictionary
- Args:
file_path (str): file path of the dictionary
-
save_model(file_path)¶
Save a model as a frequency dictionary
- Args:
file_path (str): file path to save the model
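Since save_model and load_model round-trip a frequency dictionary, the idea can be sketched with plain pickle (the library's actual on-disk format is an assumption; only the dictionary round-trip is illustrated):

```python
import os
import pickle
import tempfile

# toy frequency dictionary standing in for a trained model
freq = {"و": 10, "لد": 4}

path = os.path.join(tempfile.mkdtemp(), "freq.pl")
with open(path, "wb") as f:   # save_model analogue
    pickle.dump(freq, f)
with open(path, "rb") as f:   # load_model analogue
    loaded = pickle.load(f)
```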
-
token_to_id(piece)¶
Convert a token to its id
- Args:
piece (str): input token
- Returns:
int: token id
-
tokenize(text, use_cache=False, max_cache_size=1000)¶
Convert a string into a list of tokens
- Args:
text (str): input text
use_cache (bool, optional): speed up using caching. Defaults to False.
max_cache_size (int, optional): max cache size. Defaults to 1000.
- Returns:
list: output list of tokens
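The use_cache and max_cache_size options amount to memoizing tokenization per input string. A minimal sketch with functools.lru_cache (the eviction policy and the whitespace-split stand-in are assumptions, not the library's implementation):

```python
from functools import lru_cache

@lru_cache(maxsize=1000)           # plays the role of max_cache_size
def cached_tokenize(text):
    # stand-in for the real tokenizer: a plain whitespace split
    return tuple(text.split())

cached_tokenize("مرحبا بالعالم")   # first call: computed
cached_tokenize("مرحبا بالعالم")   # second call: served from the cache
```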
-