SentencePieceTokenizer¶
class tkseem.SentencePieceTokenizer(unk_token='<UNK>', pad_token='<PAD>', vocab_size=10000, special_tokens=[])[source]¶
Bases: tkseem._base.BaseTokenizer
SentencePiece-based tokenization.
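Example usage (a minimal sketch; 'data.txt' is a hypothetical plain-text corpus with one sentence per line):

    import tkseem as tk

    # Train a SentencePiece model on the corpus, then tokenize a sample.
    tokenizer = tk.SentencePieceTokenizer(vocab_size=10000)
    tokenizer.train('data.txt')
    print(tokenizer.tokenize('السلام عليكم'))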
Methods Summary
decode(encoded): Decode ids
detokenize(tokens): Convert tokens to a string
encode(text): Convert a string to a list of ids
encode_sentences(sentences[, boundries, …]): Encode a list of sentences using the trained model
id_to_token(id): Convert an id to its token
load_model(file_path): Load a saved sp model
save_model(file_path): Save the trained sp model
token_to_id(token): Convert a token to its id
tokenize(text): Tokenize using the trained SentencePiece model
train(file_path[, model_type]): Train using SentencePiece
Methods Documentation
detokenize(tokens)[source]¶
Convert tokens to a string.
- Args:
tokens (list): list of tokens
- Returns:
str: detokenized string
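For example, detokenize inverts tokenize (a sketch; assumes a tokenizer trained as in the example above):

    tokens = tokenizer.tokenize('السلام عليكم')
    text = tokenizer.detokenize(tokens)  # reconstructs a string from the tokens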
encode(text)[source]¶
Convert a string to a list of ids.
- Args:
text (str): input string
- Returns:
list: list of ids
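A short sketch of encode together with its counterpart decode (assumes a trained tokenizer):

    ids = tokenizer.encode('السلام عليكم')  # list of vocabulary ids
    tokens = tokenizer.decode(ids)          # maps the ids back to tokens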
encode_sentences(sentences, boundries=('', ''), out_length=None)¶
Encode a list of sentences using the trained model.
- Args:
sentences (list): list of sentences
boundries (tuple): boundaries for each sentence
out_length (int, optional): specify the max length of encodings. Defaults to None.
- Returns:
[np.array]: numpy array of encodings
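A sketch of batch encoding (assumes a trained tokenizer; the shape comment reflects the documented padding behavior, not a verified output):

    batch = tokenizer.encode_sentences(['sentence one', 'sentence two'], out_length=10)
    print(batch.shape)  # expected: (2, 10), one padded row of ids per sentence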
load_model(file_path)[source]¶
Load a saved sp model.
- Args:
file_path (str): file path of the trained model
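For example (a sketch; 'sp.model' is a hypothetical path produced earlier by save_model):

    tokenizer = tk.SentencePieceTokenizer()
    tokenizer.load_model('sp.model')  # restore the trained SentencePiece model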
save_model(file_path)[source]¶
Save the trained sp model.
- Args:
file_path (str): file path to save the model
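For example (a sketch; 'sp.model' is a hypothetical output path):

    tokenizer.save_model('sp.model')  # writes the serialized SentencePiece model to disk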