tokenizers

tkseem Package

Classes

CharacterTokenizer([unk_token, pad_token, …])

Character-based tokenization
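
To make the idea concrete, here is a minimal plain-Python sketch of character-level tokenization with unknown and padding tokens. It does not use tkseem and is not its implementation; the helper names `build_char_vocab` and `encode_chars`, and the token strings `<UNK>` and `<PAD>`, are assumptions chosen for illustration.

```python
def build_char_vocab(corpus, unk_token="<UNK>", pad_token="<PAD>"):
    """Map every non-whitespace character in the corpus to an integer id."""
    # Reserve the first two ids for the special tokens (illustrative choice).
    vocab = {pad_token: 0, unk_token: 1}
    for ch in corpus:
        if not ch.isspace() and ch not in vocab:
            vocab[ch] = len(vocab)
    return vocab


def encode_chars(text, vocab, unk_token="<UNK>"):
    """Encode text one character at a time; unseen characters map to unk_token."""
    return [vocab.get(ch, vocab[unk_token]) for ch in text if not ch.isspace()]


char_vocab = build_char_vocab("abc ab")
char_ids = encode_chars("cab z", char_vocab)
print(char_ids)  # -> [4, 2, 3, 1]
```

The character `z` never appears in the training text, so it falls back to the id reserved for the unknown token.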

DisjointLetterTokenizer([unk_token, …])

Disjoint-letter-based tokenization

MorphologicalTokenizer([unk_token, …])

Auto tokenization using a saved dictionary

RandomTokenizer([unk_token, pad_token, …])

Randomized tokenization

SentencePieceTokenizer([unk_token, …])

SentencePiece-based tokenization

WordTokenizer([unk_token, pad_token, …])

Whitespace-based tokenization
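
As a rough illustration of whitespace-based tokenization with a capped vocabulary, the sketch below uses only the Python standard library. It is not tkseem's code; `train_word_vocab`, `word_tokenize`, the `vocab_size` parameter, and the special-token strings are hypothetical names introduced for this example.

```python
from collections import Counter


def train_word_vocab(corpus, vocab_size=4, unk_token="<UNK>", pad_token="<PAD>"):
    """Keep the most frequent whitespace-separated words, up to vocab_size entries."""
    counts = Counter(corpus.split())
    # Special tokens occupy the first slots (an assumption for this sketch).
    vocab = {pad_token: 0, unk_token: 1}
    for word, _ in counts.most_common(vocab_size - len(vocab)):
        vocab[word] = len(vocab)
    return vocab


def word_tokenize(text, vocab, unk_token="<UNK>"):
    """Split on whitespace and replace out-of-vocabulary words with unk_token."""
    return [w if w in vocab else unk_token for w in text.split()]


word_vocab = train_word_vocab("the cat saw the dog the cat", vocab_size=4)
word_tokens = word_tokenize("the cat ran", word_vocab)
print(word_tokens)  # -> ['the', 'cat', '<UNK>']
```

With `vocab_size=4`, two slots go to the special tokens and the remaining two to the most frequent words, so `ran` is reported as unknown.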

Class Inheritance Diagram

Inheritance diagram of tkseem.character_tokenizer.CharacterTokenizer, tkseem.disjoint_letters_tokenizer.DisjointLetterTokenizer, tkseem.morphological_tokenizer.MorphologicalTokenizer, tkseem.random_tokenizder.RandomTokenizer, tkseem.sentencepiece_tokenizer.SentencePieceTokenizer, tkseem.word_tokenizer.WordTokenizer