SentencePieceTokenizer
-
class tkseem.SentencePieceTokenizer(unk_token='<UNK>', pad_token='<PAD>', vocab_size=10000, special_tokens=[])
Bases: tkseem._base.BaseTokenizer
SentencePiece-based tokenization.
Methods Summary
decode(encoded): Decode ids
detokenize(tokens): Convert tokens to a string
encode(text): Convert string to a list of ids
encode_sentences(sentences[, boundries, …]): Encode a list of sentences using the trained model
id_to_token(id): Convert id to token
load_model(file_path): Load a saved SentencePiece model
save_model(file_path): Save a model as a frequency dictionary
token_to_id(token): Convert token to id
tokenize(text): Tokenize using the frequency dictionary
train(file_path[, model_type]): Train using SentencePiece
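A typical workflow chains several of the methods listed above. The following is a minimal sketch rather than an official example: it uses only the method names and constructor arguments shown in this reference, the function name `train_and_round_trip` and its parameters are hypothetical, and the return types of `tokenize` and `decode` are not assumed.

```python
def train_and_round_trip(corpus_path, text, model_path):
    """Train a tokenizer on a text file, then tokenize, encode and decode `text`.

    Hypothetical helper for illustration; the paths are supplied by the caller.
    """
    # Imported lazily so this sketch can be loaded without tkseem installed.
    import tkseem as tk

    tokenizer = tk.SentencePieceTokenizer(vocab_size=10000)
    tokenizer.train(corpus_path)       # train(file_path[, model_type])

    tokens = tokenizer.tokenize(text)  # text -> tokens
    ids = tokenizer.encode(text)       # text -> list of ids
    decoded = tokenizer.decode(ids)    # ids -> decoded form

    tokenizer.save_model(model_path)   # persist the trained model
    return tokens, ids, decoded
```

A model saved this way would later be restored by creating a new tokenizer and calling `load_model(model_path)`.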
Methods Documentation
-
detokenize(tokens)
Convert tokens to a string
- Args:
tokens (list): list of tokens
- Returns:
str: detokenized string
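To illustrate what detokenization usually involves for SentencePiece-style pieces, here is a minimal sketch. It assumes the pieces carry the "▁" (U+2581) marker that SentencePiece puts where a space preceded the piece. The helper name `detokenize_pieces` is hypothetical, and the code is not taken from the tkseem source.

```python
# Hypothetical helper, not part of tkseem: joins SentencePiece-style pieces.
WORD_BOUNDARY = "\u2581"  # "▁", the marker SentencePiece uses for a leading space


def detokenize_pieces(tokens):
    """Join pieces into one string, turning boundary markers back into spaces."""
    text = "".join(tokens).replace(WORD_BOUNDARY, " ")
    return text.strip()


pieces = ["\u2581Hello", "\u2581wor", "ld"]
print(detokenize_pieces(pieces))  # Hello world
```

Pieces without the marker ("ld" above) continue the previous word, which is why plain concatenation followed by one replacement is enough.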
-
encode(text)
Convert string to a list of ids
- Args:
text (str): input string
- Returns:
list: list of ids
-
encode_sentences(sentences, boundries=('', ''), out_length=None)
Encode a list of sentences using the trained model
- Args:
sentences (list): list of sentences
boundries (tuple): boundaries for each sentence.
out_length (int, optional): maximum length of the encodings. Defaults to None.
- Returns:
[np.array]: numpy array of encodings
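Returning a single array means every row of ids must have the same length. The sketch below shows one way that step could work, using plain lists for simplicity. It is not the tkseem implementation: the helper name `pad_encodings`, the truncation of long rows, and the choice of the longest row when `out_length` is None are all assumptions.

```python
# Hypothetical sketch, not the tkseem implementation.
def pad_encodings(encodings, pad_id, out_length=None):
    """Pad (or truncate) each list of ids so that all rows share one length."""
    if out_length is None:
        # Assumption: with no explicit length, use the longest sequence.
        out_length = max((len(ids) for ids in encodings), default=0)
    rows = []
    for ids in encodings:
        ids = ids[:out_length]  # truncate rows that are too long
        rows.append(ids + [pad_id] * (out_length - len(ids)))  # pad short rows
    return rows


batch = [[5, 8, 2], [7], [4, 4, 9, 1, 3]]
print(pad_encodings(batch, pad_id=0, out_length=4))
# [[5, 8, 2, 0], [7, 0, 0, 0], [4, 4, 9, 1]]
```

Rows of equal length can then be converted to a 2-D numpy array in one step.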
-
load_model(file_path)
Load a saved SentencePiece model
- Args:
file_path (str): file path of the trained model
-
save_model(file_path)
Save a model as a frequency dictionary
- Args:
file_path (str): file path to save the model
-