Data Sourcing and Processing
The first part of the source code related to data processing is shown below.
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.datasets import Multi30k
from typing import Iterable, List
SRC_LANGUAGE = 'de'
TGT_LANGUAGE = 'en'
# Place-holders
token_transform = {}
vocab_transform = {}
# Create source and target language tokenizer. Make sure to install the dependencies.
# pip install -U spacy
# python -m spacy download en_core_web_sm
# python -m spacy download de_core_news_sm
token_transform[SRC_LANGUAGE] = get_tokenizer('spacy', language='de_core_news_sm')
token_transform[TGT_LANGUAGE] = get_tokenizer('spacy', language='en_core_web_sm')
token_transform and vocab_transform are two dictionaries: the first stores the tokenizers of the source (German) and target (English) languages, and the second will later hold their vocabularies. The tokenizer returned by get_tokenizer is a word-segmentation function that converts a string sentence into a list of words and punctuation characters. The piece of code below shows an example of splitting an English sentence into a list of English words by spaces with the basic_english tokenizer.
>>> import torchtext
>>> from torchtext.data import get_tokenizer
>>> tokenizer = get_tokenizer("basic_english")
>>> tokens = tokenizer("You can now install TorchText using pip!")
>>> tokens
>>> ['you', 'can', 'now', 'install', 'torchtext', 'using', 'pip', '!']
Like the code presented before, the get_tokenizer function can also call other tokenizer libraries (e.g. spacy, moses, toktok, revtok, subword) if they are preinstalled. Note that the tokenizer from spacy performs better than the basic basic_english one. For example, we can run a comparison test as shown below.
>>> get_tokenizer("basic_english")("Let's go to N.Y.!")
>>> ['let', "'", 's', 'go', 'to', 'n', '.', 'y', '.', '!']
>>> get_tokenizer("spacy", language="en_core_web_sm")("Let's go to N.Y.!")
>>> ['Let', "'s", 'go', 'to', 'N.Y.', '!']
Moreover, I've also tested the spacy tokenizer on Chinese, but the result shown below is worse than expected, because 清华大学 (Tsinghua University) should not be split into two tokens.
>>> get_tokenizer("spacy", language="zh_core_web_sm")("我在清华大学读书。")
>>> ['我', '在', '清华', '大学', '读书', '。']
The details of spaCy tokenization, quoted from the spaCy documentation, are as follows.
Tokenization is the task of splitting a text into meaningful segments, called tokens. The input to the tokenizer is a unicode text, and the output is a Doc object. To construct a Doc object, you need a Vocab instance, a sequence of word strings, and optionally a sequence of spaces booleans, which allow you to maintain alignment of the tokens into the original string.
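To make the relationship between the Vocab, the word strings, and the spaces booleans concrete, here is a minimal sketch of constructing a Doc by hand (the token strings and the spaces flags are made up purely for illustration):
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")                    # a blank pipeline, used only for its Vocab
words = ["Hello", ",", "world", "!"]       # the token strings
spaces = [False, True, False, False]       # is each token followed by a whitespace?
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)                            # "Hello, world!" (the spaces flags restore the original string)
Because the spaces flags are stored on the tokens, doc.text can reproduce the original string exactly, which is what maintaining alignment into the original string means here.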
During processing, spaCy first tokenizes the text, i.e. segments it into words, punctuation and so on. This is done by applying rules specific to each language. For example, punctuation at the end of a sentence should be split off – whereas “U.K.” should remain one token. Each Doc consists of individual tokens, and we can iterate over them:
>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")
>>> doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
>>> for token in doc:
...     print(token.text)
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|----|
| Apple | is | looking | at | buying | U.K. | startup | for | $ | 1 | billion |
First, the raw text is split on whitespace characters, similar to text.split(' '). Then, the tokenizer processes the text from left to right. On each substring, it performs two checks:
- Does the substring match a tokenizer exception rule? For example, “don’t” does not contain whitespace, but should be split into two tokens, “do” and “n’t”, while “U.K.” should always remain one token.
- Can a prefix, suffix or infix be split off? For example punctuation like commas, periods, hyphens or quotes.
If there’s a match, the rule is applied and the tokenizer continues its loop, starting with the newly split substrings. This way, spaCy can split complex, nested tokens like combinations of abbreviations and multiple punctuation marks.
While punctuation rules are usually pretty general, tokenizer exceptions strongly depend on the specifics of the individual language. This is why each available language has its own subclass, like English or German, that loads in lists of hard-coded data and exception rules.
Algorithm details: How spaCy's tokenizer works
spaCy introduces a novel tokenization algorithm that gives a better balance between performance, ease of definition and ease of alignment into the original string.
After consuming a prefix or suffix, we consult the special cases again. We want the special cases to handle things like “don’t” in English, and we want the same rule to work for “(don’t)!“. We do this by splitting off the open bracket, then the exclamation, then the closed bracket, and finally matching the special case. Here’s an implementation of the algorithm in Python optimized for readability rather than performance:
def tokenizer_pseudo_code(
    text,
    special_cases,
    prefix_search,
    suffix_search,
    infix_finditer,
    token_match,
    url_match
):
    tokens = []
    for substring in text.split():
        suffixes = []
        while substring:
            while prefix_search(substring) or suffix_search(substring):
                if token_match(substring):
                    tokens.append(substring)
                    substring = ""
                    break
                if substring in special_cases:
                    tokens.extend(special_cases[substring])
                    substring = ""
                    break
                if prefix_search(substring):
                    split = prefix_search(substring).end()
                    tokens.append(substring[:split])
                    substring = substring[split:]
                    if substring in special_cases:
                        continue
                if suffix_search(substring):
                    split = suffix_search(substring).start()
                    suffixes.append(substring[split:])
                    substring = substring[:split]
            if token_match(substring):
                tokens.append(substring)
                substring = ""
            elif url_match(substring):
                tokens.append(substring)
                substring = ""
            elif substring in special_cases:
                tokens.extend(special_cases[substring])
                substring = ""
            elif list(infix_finditer(substring)):
                infixes = infix_finditer(substring)
                offset = 0
                for match in infixes:
                    if offset == 0 and match.start() == 0:
                        continue
                    tokens.append(substring[offset : match.start()])
                    tokens.append(substring[match.start() : match.end()])
                    offset = match.end()
                if substring[offset:]:
                    tokens.append(substring[offset:])
                substring = ""
            elif substring:
                tokens.append(substring)
                substring = ""
        tokens.extend(reversed(suffixes))
    for match in matcher(special_cases, text):
        tokens.replace(match, special_cases[match])
    return tokens
The algorithm can be summarized as follows:
- Iterate over space-separated substrings.
- Look for a token match. If there is a match, stop processing and keep this token.
- Check whether we have an explicitly defined special case for this substring. If we do, use it.
- Otherwise, try to consume one prefix. If we consumed a prefix, go back to #2, so that the token match and special cases always get priority.
- If we didn’t consume a prefix, try to consume a suffix and then go back to #2.
- If we can’t consume a prefix or a suffix, look for a URL match.
- If there’s no URL match, then look for a special case.
- Look for “infixes” – stuff like hyphens etc. and split the substring into tokens on all infixes.
- Once we can’t consume any more of the string, handle it as a single token.
- Make a final pass over the text to check for special cases that include spaces or that were missed due to the incremental processing of affixes.
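If you want to see which of these rules fires for each token, spaCy's tokenizer exposes an explain() debugging method that returns the (rule, token) pairs it produced for a given string. A small sketch using the “(don’t)!” example from above (the exact rule labels may vary between spaCy versions):
import spacy

nlp = spacy.load("en_core_web_sm")
# explain() re-runs the tokenization and reports which rule produced each token
for rule, token_text in nlp.tokenizer.explain("(don't)! Let's go to N.Y.!"):
    print(f"{rule:12} {token_text}")
# Expect PREFIX/SUFFIX entries for the brackets and exclamation marks, SPECIAL-n
# entries for the pieces of "don't" and "Let's", and TOKEN for ordinary words.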
Global and language-specific tokenizer data is supplied via the language data in spacy/lang. The tokenizer exceptions define special cases like “don’t” in English, which needs to be split into two tokens: {ORTH: "do"} and {ORTH: "n't", NORM: "not"}. The prefixes, suffixes and infixes mostly define punctuation rules – for example, when to split off periods (at the end of a sentence), and when to leave tokens containing periods intact (abbreviations like “U.S.”).
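Tokenizer exceptions of this kind can also be registered at runtime with tokenizer.add_special_case. Below is a minimal sketch using the contraction "gimme" purely for illustration:
import spacy
from spacy.symbols import ORTH

nlp = spacy.load("en_core_web_sm")
print([t.text for t in nlp("gimme that")])   # without the rule: ['gimme', 'that']

# register a special case so that "gimme" is always split into two tokens
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])
print([t.text for t in nlp("gimme that")])   # with the rule: ['gim', 'me', 'that']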
The next part of the data processing code is a helper generator that yields the list of tokens for each sentence in a given dataset.
# helper function to yield list of tokens
def yield_tokens(data_iter: Iterable, language: str) -> List[str]:
    language_index = {SRC_LANGUAGE: 0, TGT_LANGUAGE: 1}

    for data_sample in data_iter:
        yield token_transform[language](data_sample[language_index[language]])
Since the input dataset Multi30k consists of a series of tuples, where each tuple contains two strings (one in German and the other its English translation, and their order can be switched via the parameter language_pair), the language_index dictionary is used to determine the index of each language within a tuple. So if the input parameter language = 'de', language_index[language] returns 0, and it returns 1 if language = 'en', because language_index = {SRC_LANGUAGE: 0, TGT_LANGUAGE: 1}. Therefore, data_sample[language_index[language]] in the for loop traverses each tuple in the given dataset as data_sample[0] or data_sample[1], depending on the value of language, and returns the corresponding sentence string. After that, token_transform[language] calls the corresponding tokenizer and converts the sentence into a list of tokens. The following code may help you understand the yield_tokens function.
>>> [i for i in Multi30k(split='valid', language_pair=('de', 'en'))][0]
>>> ('Eine Gruppe von Männern lädt Baumwolle auf einen Lastwagen', 'A group of men are loading cotton onto a truck')
>>> [i for i in Multi30k(split='valid', language_pair=('en', 'de'))][0]
>>> ('A group of men are loading cotton onto a truck', 'Eine Gruppe von Männern lädt Baumwolle auf einen Lastwagen')
>>>
>>> for data_sample in Multi30k(split='valid', language_pair=('de', 'en')):
... print(data_sample[0], token_transform['de'](data_sample[0]))
... print(data_sample[1], token_transform['en'](data_sample[1]))
... break
...
>>> Eine Gruppe von Männern lädt Baumwolle auf einen Lastwagen ['Eine', 'Gruppe', 'von', 'Männern', 'lädt', 'Baumwolle', 'auf', 'einen', 'Lastwagen']
>>> A group of men are loading cotton onto a truck ['A', 'group', 'of', 'men', 'are', 'loading', 'cotton', 'onto', 'a', 'truck']
The next part of the code builds two vocabularies (German and English) in which each token is assigned a unique integer index, starting from 0 and going up to the vocabulary size minus one (the vocabulary contains all unique tokens in the given dataset plus four special symbols). The tokens and their indices will be used to build word vectors later.
# Define special symbols and indices
UNK_IDX, PAD_IDX, BOS_IDX, EOS_IDX = 0, 1, 2, 3
# Make sure the tokens are in order of their indices to properly insert them in vocab
special_symbols = ['<unk>', '<pad>', '<bos>', '<eos>']

for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    # Training data Iterator
    train_iter = Multi30k(split='train', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
    # Create torchtext's Vocab object
    vocab_transform[ln] = build_vocab_from_iterator(yield_tokens(train_iter, ln),
                                                    min_freq=1,
                                                    specials=special_symbols,
                                                    special_first=True)

# Set UNK_IDX as the default index. This index is returned when the token is not found.
# If not set, it throws RuntimeError when the queried token is not found in the Vocabulary.
for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    vocab_transform[ln].set_default_index(UNK_IDX)
The build_vocab_from_iterator function returns a Vocab object from torchtext. We can use its get_stoi() method to list all the keys (tokens) and values (indices) in that object. Notice that specials and special_first are set, which means the four special_symbols are assigned the indices 0 to 3 in the vocabulary. We can also play around with the Vocab object and do some checks on its keys and values.
>>> vocab_transform['en'].get_stoi()
>>> {'pitch': 1533,
'pouring': 1021,
'point': 2242,
'wires': 1869,
'fruit': 529,
'Some': 431,
'.': 5,
'advice': 7488,
'audio': 4055,
'park': 120,
'<bos>': 2,
...
}
>>>
>>> all_unique_words_in_training_data = set([ j for i in yield_tokens(train_iter, 'en') for j in i ])
>>> all_unique_words_in_vocab = set(vocab_transform['en'].get_itos())
>>> all_unique_words_in_vocab.difference(all_unique_words_in_training_data)
>>> {'<unk>', '<bos>', '<eos>', '<pad>'}
>>>
>>> print(len(vocab_transform['en'].get_stoi()))
>>> 10837
>>> d = vocab_transform['en'].get_stoi()
>>> print(min(d.values()), max(d.values()))
>>> 0 10836
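Because set_default_index(UNK_IDX) was called earlier, looking up a token that never appeared in the training data falls back to index 0 instead of raising a RuntimeError. A quick sketch (the out-of-vocabulary string is made up; note that a Vocab object can also be called directly on a list of tokens):
>>> vocab_transform['en']['<bos>']        # special symbol, inserted first
2
>>> vocab_transform['en']['qwertyzzz']    # unseen token, falls back to UNK_IDX
0
>>> vocab_transform['en'](['<bos>', 'qwertyzzz', '<eos>'])
[2, 0, 3]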
In addition, the index assigned to each word in vocab_transform follows the decreasing order of the word's occurrence frequency in the dataset. We can verify this with the code below. Notice that the first 4 entries of vocab_transform are the special symbols, so the most frequently occurring word "a" is assigned the index 4, followed by ".", "A", "in", and "the" with indices 5 to 8 respectively.
>>> import pandas as pd
>>>
>>> all_words_in_training_data = [ j for i in yield_tokens(train_iter, 'en') for j in i ]
>>> df = pd.DataFrame(all_words_in_training_data, columns=['word'])
>>> word_freq = df.groupby(["word"])["word"].count().reset_index(name="count").sort_values(by=["count"], ascending=False)
>>> print(word_freq.head(5))
word count
1875 a 31707
16 . 27623
97 A 17458
5816 in 14847
9859 the 9923
>>> [ vocab_transform['en'].get_stoi()[i] for i in word_freq.head(5)['word'].tolist()]
[4, 5, 6, 7, 8]
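The same check can be done without pandas by using collections.Counter from the standard library. A minimal sketch, assuming the imports, yield_tokens, and vocab_transform from above are still in scope:
from collections import Counter

# Recreate the training iterator, since the earlier loops may have exhausted it
train_iter = Multi30k(split='train', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))

# Count every English token in the training data
counter = Counter(tok for sentence in yield_tokens(train_iter, TGT_LANGUAGE) for tok in sentence)

# The five most frequent tokens should map to indices 4..8 in the vocabulary
top5 = [word for word, _ in counter.most_common(5)]
print(top5)
print([vocab_transform[TGT_LANGUAGE][word] for word in top5])   # expected: [4, 5, 6, 7, 8]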