SentencePieceTokenizer¶
class tkseem.SentencePieceTokenizer(unk_token='<UNK>', pad_token='<PAD>', vocab_size=10000, special_tokens=[])[source]¶
Bases: tkseem._base.BaseTokenizer
SentencePiece-based tokenization.
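Example usage (a minimal sketch; 'data.txt' is a hypothetical plain-text corpus with one sentence per line):

    import tkseem as tk

    # Train a SentencePiece model on the corpus, then tokenize a sample.
    tokenizer = tk.SentencePieceTokenizer(vocab_size=10000)
    tokenizer.train('data.txt')
    print(tokenizer.tokenize('السلام عليكم'))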
Methods Summary
decode(encoded): Decode ids
detokenize(tokens): Convert tokens to a string
encode(text): Convert a string to a list of ids
encode_sentences(sentences[, boundries, …]): Encode a list of sentences using the trained model
id_to_token(id): Convert an id to its token
load_model(file_path): Load a saved sp model
save_model(file_path): Save the trained sp model
token_to_id(token): Convert a token to its id
tokenize(text): Tokenize using the trained SentencePiece model
train(file_path[, model_type]): Train using SentencePiece
Methods Documentation
detokenize(tokens)[source]¶
Convert tokens to a string.
- Args:
tokens (list): list of tokens
- Returns:
str: detokenized string
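For example, detokenize inverts tokenize (a sketch; assumes a tokenizer trained as in the example above):

    tokens = tokenizer.tokenize('السلام عليكم')
    text = tokenizer.detokenize(tokens)  # reconstructs a string from the tokens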
encode(text)[source]¶
Convert a string to a list of ids.
- Args:
text (str): input string
- Returns:
list: list of ids
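A short sketch of encode together with its counterpart decode (assumes a trained tokenizer):

    ids = tokenizer.encode('السلام عليكم')  # list of vocabulary ids
    tokens = tokenizer.decode(ids)          # maps the ids back to tokens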
encode_sentences(sentences, boundries=('', ''), out_length=None)¶
Encode a list of sentences using the trained model.
- Args:
sentences (list): list of sentences
boundries (tuple): boundaries for each sentence
out_length (int, optional): specify the max length of encodings. Defaults to None.
- Returns:
[np.array]: numpy array of encodings
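A sketch of batch encoding (assumes a trained tokenizer; the shape comment reflects the documented padding behavior, not a verified output):

    batch = tokenizer.encode_sentences(['sentence one', 'sentence two'], out_length=10)
    print(batch.shape)  # expected: (2, 10), one padded row of ids per sentence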
load_model(file_path)[source]¶
Load a saved sp model.
- Args:
file_path (str): file path of the trained model
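For example (a sketch; 'sp.model' is a hypothetical path produced earlier by save_model):

    tokenizer = tk.SentencePieceTokenizer()
    tokenizer.load_model('sp.model')  # restore the trained SentencePiece model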
save_model(file_path)[source]¶
Save the trained sp model.
- Args:
file_path (str): file path to save the model
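For example (a sketch; 'sp.model' is a hypothetical output path):

    tokenizer.save_model('sp.model')  # writes the serialized SentencePiece model to disk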