Welcome to tkseem’s documentation!¶
Installation¶
pip install tkseem
tokenizers¶
tkseem Package¶
Classes¶
CharacterTokenizer       | Character based tokenization
DisjointLetterTokenizer  | Disjoint Letters based tokenization
MorphologicalTokenizer   | Auto tokenization using a saved dictionary
RandomTokenizer          | Randomized based tokenization
SentencePieceTokenizer   | SentencePiece based tokenization
WordTokenizer            | White space based tokenization
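All of these tokenizers share the same basic interface (train, tokenize, encode, decode, detokenize), so they can be swapped freely. A minimal sketch, assuming a training corpus at samples/data.txt (any class from the table above can be substituted for WordTokenizer):

import tkseem as tk

# any tokenizer class from the table above works the same way
tokenizer = tk.WordTokenizer()
tokenizer.train('samples/data.txt')   # MorphologicalTokenizer.train() takes no file argument

tokens = tokenizer.tokenize('السلام عليكم')
ids = tokenizer.encode('السلام عليكم')
print(tokens, ids)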
Class Inheritance Diagram¶

Docs¶
[1]:
#!pip3 install tkseem
Frequency Tokenizer¶
[2]:
import tkseem as tk
Read, preprocess, then train the tokenizer
[3]:
tokenizer = tk.WordTokenizer()
tokenizer.train('samples/data.txt')
Training WordTokenizer ...
[4]:
print(tokenizer)
WordTokenizer
Tokenize
[5]:
tokenizer.tokenize("السلام عليكم")
[5]:
['السلام', 'عليكم']
Encode as ids
[6]:
encoded = tokenizer.encode("السلام عليكم")
print(encoded)
[557, 798]
Decode back to tokens
[7]:
decoded = tokenizer.decode(encoded)
print(decoded)
['السلام', 'عليكم']
[8]:
detokenized = tokenizer.detokenize(decoded)
print(detokenized)
السلام عليكم
SentencePiece Tokenizer¶
Read, preprocess, then train the tokenizer
[9]:
tokenizer = tk.SentencePieceTokenizer()
tokenizer.train('samples/data.txt')
Training SentencePiece ...
Tokenize
[10]:
tokenizer.tokenize("صباح الخير يا أصدقاء")
[10]:
['▁صباح', '▁الخير', '▁يا', '▁أص', 'د', 'قاء']
Encode as ids
[11]:
encoded = tokenizer.encode("السلام عليكم")
print(encoded)
[1799, 2741]
Decode back to tokens
[12]:
decoded = tokenizer.decode(encoded)
print(decoded)
['▁السلام', '▁عليكم']
[13]:
detokenized = tokenizer.detokenize(decoded)
print(detokenized)
السلام عليكم
Morphological Tokenizer¶
Read, preprocess, then train the tokenizer
[14]:
tokenizer = tk.MorphologicalTokenizer()
tokenizer.train()
Training MorphologicalTokenizer ...
Tokenize
[15]:
tokenizer.tokenize("السلام عليكم")
[15]:
['ال', '##سلام', 'علي', '##كم']
Encode as ids
[16]:
encoded = tokenizer.encode("السلام عليكم")
print(encoded)
[2, 367, 764, 184]
Decode back to tokens
[17]:
decoded = tokenizer.decode(encoded)
print(decoded)
['ال', '##سلام', 'علي', '##كم']
Random Tokenizer¶
[18]:
tokenizer = tk.RandomTokenizer()
tokenizer.train('samples/data.txt')
Training RandomTokenizer ...
[19]:
tokenizer.tokenize("السلام عليكم أيها الأصدقاء")
[19]:
['السل', '##ام', 'علي', '##كم', 'أي', '##ها', 'الأص', '##دقا', '##ء']
Disjoint Letter Tokenizer¶
[20]:
tokenizer = tk.DisjointLetterTokenizer()
tokenizer.train('samples/data.txt')
Training DisjointLetterTokenizer ...
[21]:
print(tokenizer.tokenize("السلام عليكم أيها الأصدقاء"))
['ا', '##لسلا', '##م', 'عليكم', 'أ', '##يها', 'ا', '##لأ', '##صد', '##قا', '##ء']
Character Tokenizer¶
[22]:
tokenizer = tk.CharacterTokenizer()
tokenizer.train('samples/data.txt')
Training CharacterTokenizer ...
[23]:
tokenizer.tokenize("السلام عليكم")
[23]:
['ا', '##ل', '##س', '##ل', '##ا', '##م', 'ع', '##ل', '##ي', '##ك', '##م']
Export Models¶
Models can be saved for deployment and reloading.
[24]:
tokenizer = tk.WordTokenizer()
tokenizer.train('samples/data.txt')
tokenizer.save_model('freq.pl')
Training WordTokenizer ...
Saving as pickle file ...
Load the saved model without retraining
[25]:
tokenizer = tk.WordTokenizer()
tokenizer.load_model('freq.pl')
Loading as pickle file ...
[26]:
tokenizer.tokenize('السلام عليكم')
[26]:
['السلام', 'عليكم']
Benchmarking¶
Comparing tokenizers in terms of training time
[27]:
import seaborn as sns
import pandas as pd
import time
def calc_time(fun):
    tokenizer = fun()
    start_time = time.time()
    # morph tokenizer doesn't take arguments
    if str(tokenizer) == 'MorphologicalTokenizer':
        tokenizer.train()
    else:
        tokenizer.train('samples/data.txt')
    return time.time() - start_time

running_times = {}

running_times['Word'] = calc_time(tk.WordTokenizer)
running_times['SP'] = calc_time(tk.SentencePieceTokenizer)
running_times['Random'] = calc_time(tk.RandomTokenizer)
running_times['Disjoint'] = calc_time(tk.DisjointLetterTokenizer)
running_times['Character'] = calc_time(tk.CharacterTokenizer)
running_times['Morph'] = calc_time(tk.MorphologicalTokenizer)

ax = sns.barplot(data=pd.DataFrame.from_dict([running_times]))
Training WordTokenizer ...
Training SentencePiece ...
Training RandomTokenizer ...
Training DisjointLetterTokenizer ...
Training CharacterTokenizer ...
Training MorphologicalTokenizer ...
(bar plot comparing training times of the tokenizers)
Comparing tokenizers in terms of tokenization time
[28]:
import seaborn as sns
import pandas as pd
import time
def calc_time(fun):
    tokenizer = fun()
    # morph tokenizer doesn't take arguments
    if str(tokenizer) == 'MorphologicalTokenizer':
        tokenizer.train()
    else:
        tokenizer.train('samples/data.txt')
    start_time = time.time()
    tokenizer.tokenize(open('samples/data.txt', 'r').read())
    return time.time() - start_time

running_times = {}

running_times['Word'] = calc_time(tk.WordTokenizer)
running_times['SP'] = calc_time(tk.SentencePieceTokenizer)
running_times['Random'] = calc_time(tk.RandomTokenizer)
running_times['Disjoint'] = calc_time(tk.DisjointLetterTokenizer)
running_times['Character'] = calc_time(tk.CharacterTokenizer)
running_times['Morph'] = calc_time(tk.MorphologicalTokenizer)

ax = sns.barplot(data=pd.DataFrame.from_dict([running_times]))
Training WordTokenizer ...
Training SentencePiece ...
Training RandomTokenizer ...
Training DisjointLetterTokenizer ...
Training CharacterTokenizer ...
Training MorphologicalTokenizer ...
(bar plot comparing tokenization times of the tokenizers)
Caching¶
Caching can be used to speed up the tokenization process.
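Conceptually, the cache memoizes the token sequence for words that have already been segmented, so repeated words are looked up instead of being re-tokenized. A minimal sketch of that memoization pattern (illustrative only; tkseem's own cache may differ, and a trained tokenizer is assumed to be in scope):

from functools import lru_cache

@lru_cache(maxsize=10000)  # plays the role of max_cache_size
def tokenize_word(word):
    # tuples are immutable, so the cached value is safe to reuse
    return tuple(tokenizer.tokenize(word))

def tokenize_text(text):
    tokens = []
    for word in text.split():
        tokens.extend(tokenize_word(word))
    return tokens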
[32]:
import tkseem as tk
tokenizer = tk.MorphologicalTokenizer()
tokenizer.train()
Training MorphologicalTokenizer ...
[33]:
%%timeit
out = tokenizer.tokenize(open('samples/data.txt', 'r').read(), use_cache = False)
8.82 s ± 277 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
[34]:
%%timeit
out = tokenizer.tokenize(open('samples/data.txt', 'r').read(), use_cache = True, max_cache_size = 10000)
7.14 s ± 296 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Sentiment Analysis¶
[ ]:
!pip install tkseem
!pip install tnkeeh
[ ]:
!wget https://raw.githubusercontent.com/ARBML/tkseem/master/tasks/sentiment_analysis/sentiment/data.txt
!wget https://raw.githubusercontent.com/ARBML/tkseem/master/tasks/sentiment_analysis/sentiment/labels.txt
Imports¶
[3]:
import numpy as np
import tkseem as tk
import tnkeeh as tn
from tensorflow.keras.models import Sequential
from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import GRU, Embedding, Dense, Input, Dropout, Bidirectional
Process data¶
[4]:
tn.clean_data(file_path = 'sentiment/data.txt', save_path = 'sentiment/cleaned_data.txt', remove_diacritics=True,
              execluded_chars=['!', '.', '?'])
tn.split_classification_data('sentiment/cleaned_data.txt', 'sentiment/labels.txt')
train_data, test_data, train_lbls, test_lbls = tn.read_data(mode = 1)
Remove diacritics
Remove Tatweel
Saving to sentiment/cleaned_data.txt
Split data
Save to data
Read data ['test_data.txt', 'test_lbls.txt', 'train_data.txt', 'train_lbls.txt']
[5]:
max_length = max(len(data) for data in train_data)
Tokenize¶
[6]:
tokenizer = tk.SentencePieceTokenizer()
tokenizer.train('data/train_data.txt')
Training SentencePiece ...
Tokenize data¶
[7]:
def preprocess(tokenizer, data, labels):
    X = tokenizer.encode_sentences(data)
    y = np.array([int(lbl) for lbl in labels])
    return X, y
[8]:
# process training data
X_train, y_train = preprocess(tokenizer, train_data, train_lbls)
# process test data
X_test, y_test = preprocess(tokenizer, test_data, test_lbls)
Model¶
[9]:
model = Sequential()
model.add(Embedding(tokenizer.vocab_size, 32))
model.add(Bidirectional(GRU(units = 32)))
model.add(Dense(32, activation = 'tanh'))
model.add(Dropout(0.3))
model.add(Dense(1, activation = 'sigmoid'))
model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
Train¶
[10]:
history = model.fit(X_train, y_train, epochs = 12, validation_split = 0.1, batch_size= 128, shuffle = True)
Epoch 1/12
6/6 [==============================] - 3s 445ms/step - loss: 0.6936 - accuracy: 0.4986 - val_loss: 0.6990 - val_accuracy: 0.3625
Epoch 2/12
6/6 [==============================] - 2s 324ms/step - loss: 0.6883 - accuracy: 0.5097 - val_loss: 0.6986 - val_accuracy: 0.3625
Epoch 3/12
6/6 [==============================] - 1s 193ms/step - loss: 0.6827 - accuracy: 0.6139 - val_loss: 0.6890 - val_accuracy: 0.5875
Epoch 4/12
6/6 [==============================] - 2s 254ms/step - loss: 0.6706 - accuracy: 0.8222 - val_loss: 0.6814 - val_accuracy: 0.6625
Epoch 5/12
6/6 [==============================] - 1s 238ms/step - loss: 0.6473 - accuracy: 0.8861 - val_loss: 0.6730 - val_accuracy: 0.6875
Epoch 6/12
6/6 [==============================] - 1s 214ms/step - loss: 0.6117 - accuracy: 0.9014 - val_loss: 0.6543 - val_accuracy: 0.7125
Epoch 7/12
6/6 [==============================] - 2s 266ms/step - loss: 0.5536 - accuracy: 0.9167 - val_loss: 0.6210 - val_accuracy: 0.7500
Epoch 8/12
6/6 [==============================] - 1s 237ms/step - loss: 0.4579 - accuracy: 0.9347 - val_loss: 0.5906 - val_accuracy: 0.7500
Epoch 9/12
6/6 [==============================] - 1s 197ms/step - loss: 0.3353 - accuracy: 0.9500 - val_loss: 0.5605 - val_accuracy: 0.7375
Epoch 10/12
6/6 [==============================] - 1s 219ms/step - loss: 0.2050 - accuracy: 0.9639 - val_loss: 0.5069 - val_accuracy: 0.7625
Epoch 11/12
6/6 [==============================] - 1s 216ms/step - loss: 0.1315 - accuracy: 0.9694 - val_loss: 0.5215 - val_accuracy: 0.7250
Epoch 12/12
6/6 [==============================] - 1s 166ms/step - loss: 0.1063 - accuracy: 0.9625 - val_loss: 0.5699 - val_accuracy: 0.7125
Test¶
[11]:
def classify(sentence):
    sequence = tokenizer.encode_sentences([sentence], out_length = max_length)[0]
    pred = model.predict(sequence)[0][0]
    print(pred)
[12]:
classify("سيئة جدا جدا")
classify("رائعة جدا")
0.06951779
0.89656436
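As an optional check, the held-out split prepared above can also be scored with Keras' evaluate; a minimal sketch:

# X_test / y_test were produced by preprocess() above
loss, acc = model.evaluate(X_test, y_test, verbose=0)
print(f'test accuracy: {acc:.3f}')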
Poetry Classification¶
[1]:
!pip install tkseem
!pip install tnkeeh
/bin/bash: pip: command not found
/bin/bash: pip: command not found
[ ]:
!wget https://raw.githubusercontent.com/ARBML/tkseem/master/tasks/meter_classification/meters/data.txt
!wget https://raw.githubusercontent.com/ARBML/tkseem/master/tasks/meter_classification/meters/labels.txt
Imports¶
[3]:
import tensorflow as tf
import tkseem as tk
import tnkeeh as tn
import numpy as np
from tensorflow.keras.layers import GRU, Embedding, Dense, Input, Dropout, Bidirectional, BatchNormalization, Flatten, Reshape
from tensorflow.keras.models import Sequential
from sklearn.model_selection import train_test_split
Process data¶
[4]:
tn.clean_data(file_path = 'meters/data.txt', save_path = 'meters/cleaned_data.txt', remove_diacritics=True,
              execluded_chars=['!', '.', '?', '#'])
tn.split_classification_data('meters/cleaned_data.txt', 'meters/labels.txt')
train_data, test_data, train_lbls, test_lbls = tn.read_data(mode = 1)
Remove diacritics
Remove Tatweel
Saving to meters/cleaned_data.txt
Split data
Save to data
Read data ['test_data.txt', 'test_lbls.txt', 'train_data.txt', 'train_lbls.txt']
Tokenization¶
[5]:
tokenizer = tk.CharacterTokenizer()
tokenizer.train('data/train_data.txt')
Training CharacterTokenizer ...
Tokenize data¶
[6]:
def preprocess(tokenizer, data, labels):
    X = tokenizer.encode_sentences(data)
    y = np.array([int(lbl) for lbl in labels])
    return X, y
[7]:
# process training data
X_train, y_train = preprocess(tokenizer, train_data, train_lbls)
# process test data
X_test, y_test = preprocess(tokenizer, test_data, test_lbls)
[8]:
max_length = max(len(sent) for sent in X_train)
Model¶
[9]:
model = Sequential()
model.add(Input((max_length,)))
model.add(Embedding(tokenizer.vocab_size, 256))
model.add(Bidirectional(GRU(units = 256, return_sequences=True)))
model.add(Bidirectional(GRU(units = 256, return_sequences=True)))
model.add(Bidirectional(GRU(units = 256)))
model.add(Dense(128, activation = 'relu'))
model.add(Dropout(0.3))
model.add(Dense(14, activation = 'softmax'))
model.compile(optimizer = 'adam', loss = 'sparse_categorical_crossentropy', metrics = ['accuracy'])
[10]:
model.fit(X_train, y_train, validation_split = 0.1, epochs = 10, batch_size= 256, shuffle = True)
Epoch 1/10
133/133 [==============================] - 465s 3s/step - loss: 2.3899 - accuracy: 0.1572 - val_loss: 1.9431 - val_accuracy: 0.2902
Epoch 2/10
133/133 [==============================] - 452s 3s/step - loss: 1.8384 - accuracy: 0.3214 - val_loss: 1.6722 - val_accuracy: 0.3905
Epoch 3/10
133/133 [==============================] - 436s 3s/step - loss: 1.5614 - accuracy: 0.4314 - val_loss: 1.5018 - val_accuracy: 0.4581
Epoch 4/10
133/133 [==============================] - 381s 3s/step - loss: 1.1860 - accuracy: 0.5879 - val_loss: 0.8718 - val_accuracy: 0.7109
Epoch 5/10
133/133 [==============================] - 370s 3s/step - loss: 0.7501 - accuracy: 0.7595 - val_loss: 0.5991 - val_accuracy: 0.8085
Epoch 6/10
133/133 [==============================] - 360s 3s/step - loss: 0.5233 - accuracy: 0.8410 - val_loss: 0.5352 - val_accuracy: 0.8332
Epoch 7/10
133/133 [==============================] - 361s 3s/step - loss: 0.4070 - accuracy: 0.8807 - val_loss: 0.4281 - val_accuracy: 0.8708
Epoch 8/10
133/133 [==============================] - 355s 3s/step - loss: 0.3229 - accuracy: 0.9074 - val_loss: 0.3947 - val_accuracy: 0.8841
Epoch 9/10
133/133 [==============================] - 356s 3s/step - loss: 0.2724 - accuracy: 0.9241 - val_loss: 0.3725 - val_accuracy: 0.8926
Epoch 10/10
133/133 [==============================] - 355s 3s/step - loss: 0.2301 - accuracy: 0.9352 - val_loss: 0.3540 - val_accuracy: 0.8989
[10]:
<tensorflow.python.keras.callbacks.History at 0x7ff3a7692160>
Test¶
[11]:
label2name = ['السريع', 'الكامل', 'المتقارب', 'المتدارك', 'المنسرح', 'المديد',
              'المجتث', 'الرمل', 'البسيط', 'الخفيف', 'الطويل', 'الوافر', 'الهزج', 'الرجز']
[14]:
def classify(sentence):
    sequence = tokenizer.encode_sentences([sentence], out_length = max_length)
    pred = model.predict(sequence)[0]
    print(label2name[np.argmax(pred, 0).astype('int')], np.max(pred))
[15]:
classify("ما تردون على هذا المحب # دائبا يشكو إليكم في الكتب")
classify("ولد الهدى فالكائنات ضياء # وفم الزمان تبسم وسناء")
classify(" لك يا منازل في القلوب منازل # أقفرت أنت وهن منك أواهل")
classify("ومن لم يمت بالسيف مات بغيره # تعددت الأسباب والموت واحد")
classify("أنا النبي لا كذب # أنا ابن عبد المطلب")
classify("هذه دراهم اقفرت # أم ربور محتها الدهور")
classify("هزجنا في بواديكم # فأجزلتم عطايانا")
classify("بحر سريع ماله ساحل # مستفعلن مستفعلن فاعلن")
classify("مَا مَضَى فَاتَ وَالْمُؤَمَّلُ غَيْبٌ # وَلَكَ السَّاعَةُ الَّتِيْ أَنْتَ فِيْهَا")
classify("يا ليلُ الصبّ متى غدهُ # أقيامُ الساعة موعدهُ")
الرمل 0.9957462
الكامل 0.98703927
الكامل 0.9792284
الطويل 0.99692947
الهزج 0.94578993
المديد 0.3755584
الهزج 0.981885
الرجز 0.8000305
المتدارك 0.7176092
المتدارك 0.99850094
[ ]:
# modified version from https://www.tensorflow.org/tutorials/text/nmt_with_attention
Translation¶
[4]:
!wget https://raw.githubusercontent.com/ARBML/tkseem/master/tasks/translation/data/ar_data.txt
!wget https://raw.githubusercontent.com/ARBML/tkseem/master/tasks/translation/data/en_data.txt
Will not apply HSTS. The HSTS database must be a regular and non-world-writable file.
ERROR: could not open HSTS store at '/home/zaid/.wget-hsts'. HSTS will be disabled.
--2020-08-28 14:49:14-- https://raw.githubusercontent.com/ARBML/tkseem/master/tasks/translation/data/ar_data.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.112.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.112.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3705050 (3.5M) [text/plain]
Saving to: ‘ar_data.txt’
ar_data.txt 100%[===================>] 3.53M 719KB/s in 5.1s
2020-08-28 14:49:21 (708 KB/s) - ‘ar_data.txt’ saved [3705050/3705050]
Will not apply HSTS. The HSTS database must be a regular and non-world-writable file.
ERROR: could not open HSTS store at '/home/zaid/.wget-hsts'. HSTS will be disabled.
--2020-08-28 14:49:21-- https://raw.githubusercontent.com/ARBML/tkseem/master/tasks/translation/data/en_data.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.112.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.112.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2510593 (2.4M) [text/plain]
Saving to: ‘en_data.txt’
en_data.txt 100%[===================>] 2.39M 588KB/s in 4.2s
2020-08-28 14:49:26 (588 KB/s) - ‘en_data.txt’ saved [2510593/2510593]
[ ]:
!pip install tkseem
!pip install tnkeeh
[1]:
import re
import nltk
import time
import numpy as np
import tkseem as tk
import tnkeeh as tn
import tensorflow as tf
import matplotlib.ticker as ticker
import matplotlib.pyplot as plt
Data Preprocessing¶
[5]:
tn.clean_data('ar_data.txt','ar_clean_data.txt', remove_diacritics=True)
tn.clean_data('en_data.txt','en_clean_data.txt')
tn.split_parallel_data('ar_clean_data.txt', 'en_clean_data.txt', split_ratio=0.3)
train_inp_text, train_tar_text, test_inp_text, test_tar_text = tn.read_data(mode = 2)
Remove diacritics
Remove Tatweel
Saving to ar_clean_data.txt
Remove Tatweel
Saving to en_clean_data.txt
Split data
Save to data
Read data ['ar_data.txt', 'en_data.txt', 'test_inp_data.txt', 'test_tar_data.txt', 'train_inp_data.txt', 'train_tar_data.txt']
Tokenization¶
[6]:
ar_tokenizer = tk.SentencePieceTokenizer(special_tokens=['<s>', '</s>'])
ar_tokenizer.train('data/train_inp_data.txt')
en_tokenizer = tk.SentencePieceTokenizer(special_tokens=['<s>', '</s>'])
en_tokenizer.train('data/train_tar_data.txt')
train_inp_data = ar_tokenizer.encode_sentences(train_inp_text, boundries = ('<s>', '</s>'))
train_tar_data = en_tokenizer.encode_sentences(train_tar_text, boundries = ('<s>', '</s>'))
Training SentencePiece ...
Training SentencePiece ...
Create Dataset¶
[7]:
BATCH_SIZE = 64
BUFFER_SIZE = len(train_inp_data)
dataset = tf.data.Dataset.from_tensor_slices((train_inp_data, train_tar_data)).shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)
Encoder, Decoder¶
[8]:
class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
        super(Encoder, self).__init__()
        self.batch_sz = batch_sz
        self.enc_units = enc_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.enc_units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')

    def call(self, x, hidden):
        x = self.embedding(x)
        output, state = self.gru(x, initial_state = hidden)
        return output, state

    def initialize_hidden_state(self):
        return tf.zeros((self.batch_sz, self.enc_units))

class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, query, values):
        # query hidden state shape == (batch_size, hidden size)
        # query_with_time_axis shape == (batch_size, 1, hidden size)
        # values shape == (batch_size, max_len, hidden size)
        # we are doing this to broadcast addition along the time axis to calculate the score
        query_with_time_axis = tf.expand_dims(query, 1)

        # score shape == (batch_size, max_length, 1)
        # we get 1 at the last axis because we are applying score to self.V
        # the shape of the tensor before applying self.V is (batch_size, max_length, units)
        score = self.V(tf.nn.tanh(
            self.W1(query_with_time_axis) + self.W2(values)))

        # attention_weights shape == (batch_size, max_length, 1)
        attention_weights = tf.nn.softmax(score, axis=1)

        # context_vector shape after sum == (batch_size, hidden_size)
        context_vector = attention_weights * values
        context_vector = tf.reduce_sum(context_vector, axis=1)

        return context_vector, attention_weights

class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
        super(Decoder, self).__init__()
        self.batch_sz = batch_sz
        self.dec_units = dec_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.dec_units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')
        self.fc = tf.keras.layers.Dense(vocab_size)

        # used for attention
        self.attention = BahdanauAttention(self.dec_units)

    def call(self, x, hidden, enc_output):
        # enc_output shape == (batch_size, max_length, hidden_size)
        context_vector, attention_weights = self.attention(hidden, enc_output)

        # x shape after passing through embedding == (batch_size, 1, embedding_dim)
        x = self.embedding(x)

        # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

        # passing the concatenated vector to the GRU
        output, state = self.gru(x)

        # output shape == (batch_size * 1, hidden_size)
        output = tf.reshape(output, (-1, output.shape[2]))

        # output shape == (batch_size, vocab)
        x = self.fc(output)

        return x, state, attention_weights

def get_loss_object():
    return tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')

def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 1))
    loss_ = get_loss_object()(real, pred)

    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask

    return tf.reduce_mean(loss_)
Initialize models
[9]:
units = 1024
embedding_dim = 256
max_length_inp = train_inp_data.shape[1]
max_length_tar = train_tar_data.shape[1]
steps_per_epoch = len(train_inp_data)//BATCH_SIZE
vocab_inp_size = ar_tokenizer.vocab_size
vocab_tar_size = en_tokenizer.vocab_size
encoder = Encoder(vocab_inp_size, embedding_dim, units, BATCH_SIZE)
decoder = Decoder(vocab_tar_size, embedding_dim, units, BATCH_SIZE)
Training Procedure¶
[10]:
@tf.function
def train_step(inp, targ, enc_hidden, encoder, decoder, optimizer, en_tokenizer):
    loss = 0

    with tf.GradientTape() as tape:
        enc_output, enc_hidden = encoder(inp, enc_hidden)

        dec_hidden = enc_hidden

        dec_input = tf.expand_dims([en_tokenizer.token_to_id('<s>')] * BATCH_SIZE, 1)

        # Teacher forcing - feeding the target as the next input
        for t in range(1, targ.shape[1]):
            # passing enc_output to the decoder
            predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)

            loss += loss_function(targ[:, t], predictions)

            # using teacher forcing
            dec_input = tf.expand_dims(targ[:, t], 1)

    batch_loss = (loss / int(targ.shape[1]))

    variables = encoder.trainable_variables + decoder.trainable_variables

    gradients = tape.gradient(loss, variables)

    optimizer.apply_gradients(zip(gradients, variables))

    return batch_loss

def train(epochs = 10, verbose = 0):
    optimizer = tf.keras.optimizers.Adam()
    for epoch in range(epochs):
        start = time.time()

        enc_hidden = encoder.initialize_hidden_state()
        total_loss = 0

        for (batch, (inp, targ)) in enumerate(dataset.take(steps_per_epoch)):
            batch_loss = train_step(inp, targ, enc_hidden, encoder, decoder, optimizer, en_tokenizer)
            total_loss += batch_loss

            if batch % 100 == 0 and verbose:
                print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1,
                                                             batch,
                                                             batch_loss.numpy()))
        if verbose:
            print('Epoch {} Loss {:.4f}'.format(epoch + 1,
                                                total_loss / steps_per_epoch))
            print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))
Start training
[11]:
train(epochs = 10, verbose = 1)
Epoch 1 Batch 0 Loss 8.7736
Epoch 1 Batch 100 Loss 2.1184
Epoch 1 Batch 200 Loss 1.7768
Epoch 1 Batch 300 Loss 1.7248
Epoch 1 Batch 400 Loss 1.6401
Epoch 1 Loss 2.0055
Time taken for 1 epoch 1444.5116345882416 sec
Epoch 2 Batch 0 Loss 1.6100
Epoch 2 Batch 100 Loss 1.5598
Epoch 2 Batch 200 Loss 1.5922
Epoch 2 Batch 300 Loss 1.5228
Epoch 2 Batch 400 Loss 1.4033
Epoch 2 Loss 1.5530
Time taken for 1 epoch 1424.0314059257507 sec
Epoch 3 Batch 0 Loss 1.2111
Epoch 3 Batch 100 Loss 1.4820
Epoch 3 Batch 200 Loss 1.3912
Epoch 3 Batch 300 Loss 1.4882
Epoch 3 Batch 400 Loss 1.2942
Epoch 3 Loss 1.3888
Time taken for 1 epoch 1441.213187456131 sec
Epoch 4 Batch 0 Loss 1.2663
Epoch 4 Batch 100 Loss 1.3889
Epoch 4 Batch 200 Loss 1.1667
Epoch 4 Batch 300 Loss 1.2853
Epoch 4 Batch 400 Loss 1.2746
Epoch 4 Loss 1.2559
Time taken for 1 epoch 1422.2563009262085 sec
Epoch 5 Batch 0 Loss 1.1258
Epoch 5 Batch 100 Loss 1.1021
Epoch 5 Batch 200 Loss 1.1365
Epoch 5 Batch 300 Loss 1.1450
Epoch 5 Batch 400 Loss 1.3664
Epoch 5 Loss 1.1176
Time taken for 1 epoch 1378.149689912796 sec
Epoch 6 Batch 0 Loss 0.9396
Epoch 6 Batch 100 Loss 1.0216
Epoch 6 Batch 200 Loss 1.1066
Epoch 6 Batch 300 Loss 1.0084
Epoch 6 Batch 400 Loss 1.1767
Epoch 6 Loss 0.9732
Time taken for 1 epoch 1328.8411734104156 sec
Epoch 7 Batch 0 Loss 0.9608
Epoch 7 Batch 100 Loss 0.8912
Epoch 7 Batch 200 Loss 0.8274
Epoch 7 Batch 300 Loss 0.8302
Epoch 7 Batch 400 Loss 0.7896
Epoch 7 Loss 0.8303
Time taken for 1 epoch 1294.177453994751 sec
Epoch 8 Batch 0 Loss 0.6882
Epoch 8 Batch 100 Loss 0.6465
Epoch 8 Batch 200 Loss 0.7108
Epoch 8 Batch 300 Loss 0.7176
Epoch 8 Batch 400 Loss 0.7323
Epoch 8 Loss 0.7000
Time taken for 1 epoch 1367.661788702011 sec
Epoch 9 Batch 0 Loss 0.5313
Epoch 9 Batch 100 Loss 0.4794
Epoch 9 Batch 200 Loss 0.6126
Epoch 9 Batch 300 Loss 0.6033
Epoch 9 Batch 400 Loss 0.5891
Epoch 9 Loss 0.5853
Time taken for 1 epoch 1372.0978388786316 sec
Epoch 10 Batch 0 Loss 0.5009
Epoch 10 Batch 100 Loss 0.5200
Epoch 10 Batch 200 Loss 0.4687
Epoch 10 Batch 300 Loss 0.4556
Epoch 10 Batch 400 Loss 0.4321
Epoch 10 Loss 0.4802
Time taken for 1 epoch 1334.807544708252 sec
Test¶
[12]:
def evaluate(sentence):
    attention_plot = np.zeros((max_length_tar, max_length_inp))

    inputs = ar_tokenizer.encode_sentences([sentence], boundries = ('<s>', '</s>'),
                                           out_length = max_length_inp)
    inputs = tf.convert_to_tensor(inputs)

    result = ''

    hidden = [tf.zeros((1, units))]
    enc_out, enc_hidden = encoder(inputs, hidden)

    dec_hidden = enc_hidden
    dec_input = tf.expand_dims([en_tokenizer.token_to_id('<s>')], 0)

    for t in range(max_length_tar):
        predictions, dec_hidden, attention_weights = decoder(dec_input,
                                                             dec_hidden,
                                                             enc_out)

        # storing the attention weights to plot later on
        attention_weights = tf.reshape(attention_weights, (-1, ))
        attention_plot[t] = attention_weights.numpy()

        predicted_id = tf.argmax(predictions[0]).numpy()

        result += en_tokenizer.id_to_token(predicted_id) + ' '

        if en_tokenizer.id_to_token(predicted_id) == '</s>':
            return result, sentence

        # the predicted ID is fed back into the model
        dec_input = tf.expand_dims([predicted_id], 0)

    return result, sentence

def translate(sentences, translations, verbose = 1):
    inputs = sentences
    outputs = []

    for i, sentence in enumerate(sentences):
        result, sentence = evaluate(sentence)
        result = ar_tokenizer.detokenize(result)
        result = result.replace('<s>', '').replace('</s>', '')
        result = re.sub(' +', ' ', result)
        outputs.append(result)

        if verbose:
            print('inpt: %s' % (sentence))
            print('pred: {}'.format(result))
            print('true: {}'.format(translations[i]))
[13]:
translate(test_inp_text[:50], test_tar_text[:50], verbose = 1)
inpt: حسنا هناك بنك لك
pred: Well there ' s the name for you
true: Well there ' s a bank for you
inpt: ماذا حدث يا أبي
pred: What happened Dad
true: What happened Father
inpt: حسنا لقد مرت أربع سنوات تقريبا
pred: Well I ' ll be years since
true: Well it ' s almost four years now
inpt: هذا صحيح أليس كذلك ما
pred: That ' s right isn ' t it
true: That ' s right ain ' t it Ma
inpt: أربع سنوات أربع سنوات 5 يونيو بنسلفانيا
pred: Four months years of the floor
true: Four years Four years 5th June Pa
inpt: لم أستطع مواكبة المدفوعات
pred: I couldn ' t steal up for the prisoner ' s jewels
true: I couldn ' t keep up the payments
inpt: تتذكره
pred: You remember him
true: You remember him
inpt: راندي دنلاب
pred: The Potem oin the toxic ms
true: Randy Dunlap
inpt: لحاء الشجر
pred: l journey less
true: Bark
inpt: لذلك دخلت وطلب مني أن أجلس
pred: So he died and keep me to stop
true: So I dropped in and he asked me to sit down
inpt: جورج هل تعرف ماذا كان يرتدي
pred: George do you know what was
true: George do you know what he was wearing
inpt: كيمونو
pred: Kim was
true: A kimono
inpt: لا
pred: No
true: No
inpt: بلى
pred: Yeah
true: Yeah
inpt: أوه الآن اللحاء
pred: Oh now ' s silly
true: Oh now Bark
inpt: يجب أن يكون ثوب خلع الملابس
pred: The ve got a big room
true: It must have been a dressing gown
inpt: أنا أعرف ثوب خلع الملابس عندما أراه
pred: I know the whole bedroom
true: I know a dressing gown when I see it
inpt: كان كيمونو جورج
pred: He was amazing mom
true: It was a kimono George
inpt: هل لفساتين الملابس الزهور على م
pred: Are you a rooster with the prisoner glasses
true: Do dressing gowns have flowers on ' em
inpt: أوه اللحاء
pred: Oh Merna
true: Oh Bark
inpt: لا مانع من ذلك يا أبي
pred: Don ' t mind who is my father
true: Never mind that Father
inpt: ماذا قال
pred: What did he say
true: What did he say
inpt: أوه لقد كان لطيفا بما فيه الكفاية
pred: Oh he ' s my mom enough
true: Oh he was nice enough
inpt: أوه الآن اللحاء
pred: Oh now ' s silly
true: Oh now Bark
inpt: نعم لقد فعل
pred: Yes he ' s done
true: Yeah he did
inpt: كم من الوقت أعطاك يا أبي
pred: How long time is still Dad
true: How much time did he give you Father
inpt: ستة أشهر
pred: Nine months
true: Six months
inpt: يا حسنا إذن لا يوجد اندفاع فوري
pred: Oh Well uh we ' s no coffin
true: Oh Oh well then there ' s no immediate rush
inpt: متى تصل الشهور الستة
pred: When did you a pot s I inquire
true: When are the six months up
inpt: الثلاثاء
pred: Paper
true: Tuesday
inpt: لكن ولكن لماذا لم تخبرنا عاجلا
pred: But but why didn ' t you stop them
true: But but why didn ' t you tell us sooner
inpt: الثلاثاء
pred: Paper
true: Tuesday
inpt: لا يعطينا الكثير من الوقت أليس كذلك
pred: Don ' t give them a lot of time is it
true: Doesn ' t give us much time does it
inpt: أو
pred: Or
true: Or
inpt: هذا صحيح
pred: That ' s right
true: That ' s right
inpt: بالطبع
pred: Of course
true: Oh sure
inpt: الذي أعطاك هذا اللباس جيش الخلاص
pred: The d put this place is a few weeks
true: Who gave you that dress the Salvation Army
inpt: و
pred: And
true: And uh
inpt: بلى
pred: Yeah
true: Yeah
inpt: حسنا لا أستطيع فعل ذلك بمفردي
pred: Well I can ' t do it on your own
true: Well I can ' t do it alone
inpt: لا إنها لم ترسل لنا البرتقالي
pred: No she ' s not your delicate ed and the Israelites
true: No she ' s never even sent us an orange
inpt: نعم ولكن ماذا عن هارفي
pred: Yes but what brings about
true: Yes but what about Harvey
inpt: أوه نحن لا نريد أن نسأل هارفي
pred: Oh we don ' t want to remember my mom
true: Oh we wouldn ' t want to ask Harvey
inpt: أوه لا لن نسأل هارفي
pred: Oh no I wouldn ' t die
true: Oh no we wouldn ' t ask Harvey
inpt: لا طلبنا من هارفي الزواج من نيلي
pred: Don ' t you caught my mom did you Rachel
true: No we asked Harvey to marry Nellie
inpt: لا يمكننا أن نتوقع من الرجل أن يفعل أكثر من ذلك
pred: We can ' t we ' ve got a man can do that
true: We can ' t expect the guy to do more than that
inpt: روبرت توقف عن الحديث بهذه الطريقة
pred: Elizabeth stop talking to the way
true: Robert stop talking that way
inpt: قصها يا روبرت
pred: A spear it Robert
true: Cut it out Robert
inpt: ليس لدي مجال لكلا منكما
pred: I haven ' t the whole world I ' re in
true: I haven ' t room for both of you
inpt: لا يوجد سوى أريكة صغيرة في غرفة المعيشة
pred: There ' s nothing for a big tree in the street
true: There ' s only a small couch in the living room