SentencePieceTokenizer

class tkseem.SentencePieceTokenizer(unk_token='<UNK>', pad_token='<PAD>', vocab_size=10000, special_tokens=[])

Bases: tkseem._base.BaseTokenizer

SentencePiece-based tokenization.

Methods Summary

decode(encoded)

Decode ids

detokenize(tokens)

Convert tokens to a string

encode(text)

Convert string to a list of ids

encode_sentences(sentences[, boundries, …])

Encode a list of sentences using the trained model

id_to_token(id)

Convert id to token

load_model(file_path)

Load a saved sp model

save_model(file_path)

Save a model as a frequency dictionary

token_to_id(token)

Convert token to id

tokenize(text)

Tokenize using the trained SentencePiece model

train(file_path[, model_type])

Train using SentencePiece

Methods Documentation

decode(encoded)

Decode ids

Args:

encoded (list): list of ids to decode

Returns:

list: tokens

detokenize(tokens)

Convert tokens to a string

Args:

tokens (list): list of tokens

Returns:

str: detokenized string

encode(text)

Convert string to a list of ids

Args:

text (str): input string

Returns:

list: list of ids
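
As a minimal sketch of how `encode` and `decode` relate, the helper below maps a string to ids and then maps those ids back to tokens. The name `ids_and_tokens` is hypothetical and not part of tkseem; the sketch assumes only that the tokenizer passed in has already been trained or loaded and exposes the `encode` and `decode` methods documented here.

```python
def ids_and_tokens(tokenizer, text):
    """Return (ids, tokens) for `text` using an already trained tokenizer.

    Hypothetical helper, not part of tkseem. It relies only on the
    documented `encode` and `decode` methods.
    """
    ids = tokenizer.encode(text)    # str -> list of ids
    tokens = tokenizer.decode(ids)  # list of ids -> list of tokens
    return ids, tokens
```

With a trained `tkseem.SentencePieceTokenizer` instance as `tokenizer`, the call would be `ids_and_tokens(tokenizer, text)`. Note that, per the documentation above, `decode` returns tokens rather than a joined string; `detokenize` is the method that produces a string.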

encode_sentences(sentences, boundries=('', ''), out_length=None)

Encode a list of sentences using the trained model

Args:

sentences (list): list of sentences

boundries (tuple): boundaries for each sentence.

out_length (int, optional): specify the max length of encodings. Defaults to None.

Returns:

np.array: NumPy array of encodings
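
To make the role of `out_length` concrete, here is a toy illustration of fixed-length encoding in plain Python. It is not tkseem's implementation: the helper name `pad_or_truncate`, the pad id of 0, and the choice to cut long rows on the right are all assumptions made for this example.

```python
def pad_or_truncate(encoded_rows, out_length, pad_id=0):
    """Force every row of ids to exactly `out_length` entries.

    Hypothetical helper: pad_id=0 and right-side truncation are
    assumptions, not documented tkseem behavior.
    """
    fixed = []
    for row in encoded_rows:
        row = list(row)[:out_length]               # drop ids past the limit
        row += [pad_id] * (out_length - len(row))  # fill short rows with pad_id
        fixed.append(row)
    return fixed


rows = pad_or_truncate([[5, 6, 7], [8], [1, 2, 3, 4, 9]], out_length=4)
print(rows)
```

Rows of equal length are what allow the encodings to be stacked into a single rectangular array, which is presumably why `encode_sentences` returns a NumPy array.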

id_to_token(id)

Convert id to token

Args:

id (int): input id

Returns:

str: token

load_model(file_path)

Load a saved sp model

Args:

file_path (str): file path of the trained model

save_model(file_path)

Save a model as a frequency dictionary

Args:

file_path (str): file path to save the model

token_to_id(token)

Convert token to id

Args:

token (str): input token

Returns:

int: id
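
The two lookup methods `token_to_id` and `id_to_token` can be combined into a simple vocabulary membership check. The sketch below is hedged: `is_known_token` is a hypothetical helper, and it assumes that a token missing from the vocabulary is mapped to the id of the unknown token, so that the round trip does not return the original string.

```python
def is_known_token(tokenizer, token):
    """Return True if `token` survives a token -> id -> token round trip.

    Hypothetical helper, not part of tkseem. Assumes out-of-vocabulary
    tokens are mapped to the unknown-token id.
    """
    token_id = tokenizer.token_to_id(token)
    return tokenizer.id_to_token(token_id) == token
```

Called as `is_known_token(tokenizer, token)` on a trained tokenizer, this also returns True for the unknown token itself, since that token maps to its own id.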

tokenize(text)

Tokenize using the trained SentencePiece model

Args:

text (str): input string

Returns:

list: generated tokens

train(file_path, model_type='bpe')

Train using SentencePiece

Args:

file_path (str): file to train on

model_type (str, optional): SentencePiece model type. Defaults to "bpe".
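
A typical workflow chains `train`, `save_model`, and `load_model`. The sketch below shows one way to do that using only the signatures documented on this page. The helper `train_and_reload` and its `make_tokenizer` argument are hypothetical: `make_tokenizer` is assumed to be a zero-argument callable that returns a fresh tokenizer, for example the `tkseem.SentencePieceTokenizer` class itself.

```python
def train_and_reload(make_tokenizer, corpus_path, model_path, model_type="bpe"):
    """Train a tokenizer, save it, then load it into a fresh instance.

    Hypothetical helper, not part of tkseem. `make_tokenizer` is any
    zero-argument callable returning an object with the documented
    `train`, `save_model`, and `load_model` methods.
    """
    trained = make_tokenizer()
    trained.train(corpus_path, model_type=model_type)  # fit on the text file
    trained.save_model(model_path)                     # persist the model

    restored = make_tokenizer()
    restored.load_model(model_path)                    # reload without retraining
    return trained, restored
```

Returning both instances makes it easy to confirm that the reloaded tokenizer produces the same output as the one that was just trained, for instance by comparing `tokenize` results on a sample sentence.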