CODE REVIEW OF LANGUAGE TRANSLATION WITH NN.TRANSFORMER AND TORCHTEXT - PART 2

Posted on 2022-04-02, 8 min read
Cover image

Token Embedding

The second part of the source code covers token embedding and positional embedding. We review the former first.

from torch import Tensor
import torch
import torch.nn as nn
from torch.nn import Transformer
import math
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# helper Module to convert tensor of input indices into corresponding tensor of token embeddings
class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size: int, emb_size):
        super(TokenEmbedding, self).__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.emb_size = emb_size

    def forward(self, tokens: Tensor):
        return self.embedding(tokens.long()) * math.sqrt(self.emb_size)

The TokenEmbedding class contains only an __init__ function and a forward function. The __init__ function (i.e. the constructor) is called when the class is instantiated as an object, and the forward function is called when arguments are passed directly to the object: doing so invokes the __call__ method, which the class inherits from nn.Module and which in turn calls forward. The test code below presents an example.

>>> class Test(nn.Module):
...     def __init__(self):
...         super(Test, self).__init__()
...         print('Test class __init__')
...     def forward(self, x):
...         return x + 1
...
>>> t = Test()
Test class __init__
>>> t(1)
2
>>>

Back to the source code: the initialization of the TokenEmbedding class instantiates the nn.Embedding class with two parameters, vocab_size and emb_size, which refer to the size of the token dictionary vocab_transform mentioned here and the length of each word vector. So each token in the dictionary will be associated with a numeric vector (tensor) of length emb_size.
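As a quick sanity check, the snippet below uses made-up sizes (a 10-token vocabulary and emb_size of 8, not the values used in the tutorial) and assumes the TokenEmbedding class above is already defined in the session; it shows that the embedding table holds one learnable row of length emb_size per token.

>>> tok_emb = TokenEmbedding(vocab_size=10, emb_size=8)
>>> tok_emb.embedding.weight.shape
torch.Size([10, 8])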

The nn.Embedding(num_embeddings, embedding_dim) call initializes a tensor of learnable weights with shape (num_embeddings, embedding_dim), sampled from $\mathcal{N}(0, 1)$. We can inspect the values of this tensor through the .weight attribute and check whether they are normally distributed using the p-value from scipy.stats.normaltest, as the code below shows. A p-value close to 1 means the test finds no evidence against normality.

>>> from torch import nn
>>> from scipy import stats
>>>
>>> embedding = nn.Embedding(10,512)
>>> print(embedding.weight.shape)
torch.Size([10, 512])
>>> print(embedding.weight)
Parameter containing:
tensor([[ 0.1744,  1.3013, -0.9791,  ..., -0.0872,  0.4686, -0.9148],
        [-0.5932,  0.0042, -0.0580,  ..., -1.7171,  2.0935, -1.3774],
        [-0.6436, -0.4488,  2.2102,  ..., -0.2626, -0.0759,  0.7769],
        ...,
        [-0.0236, -1.0380,  1.0186,  ..., -1.6911,  0.4438, -0.1033],
        [ 0.3624, -0.3315, -0.2723,  ...,  0.8990, -0.5651, -0.2654],
        [ 0.3786,  0.9338,  0.7280,  ..., -1.8523, -1.1715, -0.9778]],
       requires_grad=True)
>>> w = embedding.weight.reshape(1,-1).tolist()[0]
>>> stats.normaltest(w)
NormaltestResult(statistic=0.2605386827757071, pvalue=0.8778589553262913)

We can also plot those values with the code below as a visual check.

>>> import pandas as pd
>>> import matplotlib.pyplot as plt
>>>
>>> s = pd.DataFrame(w, columns=['value'])
>>>
>>> fig = plt.figure(figsize = (10,6))
>>> ax1 = fig.add_subplot(2,1,1)
>>> ax1.scatter(s.index, s.values)
<matplotlib.collections.PathCollection object at 0x000001B56A6985E0>
>>> plt.grid()
>>>
>>> ax2 = fig.add_subplot(2,1,2)
>>> s.hist(bins=30,alpha = 0.5,ax = ax2)
array([<AxesSubplot:title={'center':'value'}>], dtype=object)
>>> s.plot(kind = 'kde', secondary_y=True,ax = ax2)
<AxesSubplot:>
>>> plt.grid()
>>> plt.show()

[Plot: scatter of the embedding weights (top) and histogram with KDE (bottom)]

After the embedding object is created, we can pass tensors of integers directly to the object, and it returns, for each integer, the row of length embedding_dim from embedding.weight at that index. In the instance below, the embedding is a 5x3 tensor whose values are normally distributed, and the input is a 2x4 tensor whose values are drawn from [0, 5). The first dimension of the output tensor is 2, equal to the first dimension of the input. The second dimension is 4, since each row of the input contains 4 integers and each integer is treated as an index into embedding.weight. So the first input row [0, 2, 0, 1] simply copies the 0th, 2nd, 0th, and 1st rows of the embedding tensor to the output, and that is also why the third dimension of the output is 3: the rows are copied from the embedding without changing their shape.

>>> import torch
>>> from torch import nn
>>> from scipy import stats
>>>
>>> embedding = nn.Embedding(5,3)
>>> print(embedding.weight)
Parameter containing:
tensor([[-1.5359,  1.3167, -0.4135],
        [-0.1170,  0.9554, -0.7263],
        [-0.3082, -1.0919, -1.2622],
        [ 0.3853, -1.9481, -0.1821],
        [ 1.0909, -1.1010,  0.6301]], requires_grad=True)
>>> input = torch.LongTensor([[0, 2, 0, 1], [1, 3, 4, 4]])
>>> print(input.size())
torch.Size([2, 4])
>>> output = embedding(input)
>>> print(output.size())
torch.Size([2, 4, 3])
>>> print(output)
tensor([[[-1.5359,  1.3167, -0.4135],
         [-0.3082, -1.0919, -1.2622],
         [-1.5359,  1.3167, -0.4135],
         [-0.1170,  0.9554, -0.7263]],

        [[-0.1170,  0.9554, -0.7263],
         [ 0.3853, -1.9481, -0.1821],
         [ 1.0909, -1.1010,  0.6301],
         [ 1.0909, -1.1010,  0.6301]]], grad_fn=<EmbeddingBackward0>)
>>>         

Therefore, knowing how nn.Embedding works, we can use it to embed words. In the sample code below, the word dictionary only has 2 words, so we only need 2 embedding rows, and the embedding dimension is set to 5 because we want each word to be transformed into a 5-dimensional tensor; we could set the embedding dimension to any other positive integer. According to the example, the word hello is embedded as [ 1.5414, -0.8476, 1.2966, -1.1901, -0.0852], which is the 0th row of embeds.weight, because the word maps to 0 in the word_to_ix dictionary.

>>> import torch
>>> import torch.nn as nn
>>> from torch.autograd import Variable
>>>
>>> word_to_ix = {'hello': 0, 'world': 1}
>>> embeds = nn.Embedding(2, 5)
>>> print(embeds.weight)
Parameter containing:
tensor([[ 1.5414, -0.8476,  1.2966, -1.1901, -0.0852],
        [-1.9759, -0.4362, -0.9985,  1.3360,  0.1116]], requires_grad=True)
>>>
>>> hello_idx = torch.LongTensor([word_to_ix['hello']])
>>> hello_idx = Variable(hello_idx)
>>> hello_embed = embeds(hello_idx)
>>> print(hello_embed)
tensor([[ 1.5414, -0.8476,  1.2966, -1.1901, -0.0852]],
       grad_fn=<EmbeddingBackward0>)
>>>

Back to the source code: we now know that self.embedding() in the forward function maps a tensor of indices to a tensor of embeddings. The tokens.long() call casts the values in the tensor to 64-bit integers, discarding any fractional part. The reason for multiplying the embedded tensor by math.sqrt(self.emb_size) is given in section 3.4 of the original paper; the details refer to another paper, and so far I don't fully get it.
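
To make the scaling concrete, here is a small sketch with made-up sizes (emb_size=4) showing that the forward pass is just an embedding lookup followed by a multiplication with math.sqrt(emb_size), so every looked-up row is scaled by 2 in this case:

import math
import torch
import torch.nn as nn

emb_size = 4
embedding = nn.Embedding(10, emb_size)      # 10 tokens, 4-dimensional vectors
tokens = torch.tensor([[1, 5, 3]])          # a batch with one sequence of 3 token indices

looked_up = embedding(tokens.long())        # shape (1, 3, 4), plain lookup
scaled = looked_up * math.sqrt(emb_size)    # what TokenEmbedding.forward returns

print(scaled.shape)                             # torch.Size([1, 3, 4])
print(torch.allclose(scaled, looked_up * 2.0))  # True, since sqrt(4) == 2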

