Byte pair encoding


Intro

This post is a follow-up to my previous post, where I gave a complete introduction to tokenizers. In that post I talked about the 3 main tokenization types: word-based, character-based and subword-based. In this follow-up I would like to delve deeper into the inner workings of one popular subword-based tokenization algorithm named Byte-Pair Encoding.

This algorithm can be divided into 2 parts: the Learner and the Segmenter. The Learner is responsible for taking our training data and tokenizing it in order to create our vocabulary, whereas the Segmenter uses what the Learner produced to split the new data that we wish to tokenize.

Learner

Here is, loosely, the process that the Learner applies to tokenize our training data (for easier readability I’m using the underscore character to represent spaces):

  1. Initialize the vocabulary with all the unique characters contained in the training data, also referred to as the corpus. For instance, if our corpus was this text: “carpet car”, our initial vocabulary would be:

    • “c”, “a”, “r”, “p”, “e”, “t”, “_”
  2. Now that we have our vocabulary, we will perform what is called pre-tokenization to avoid merging characters that we specifically want to keep separated. The purpose of pre-tokenization is to define the groups of characters that are allowed to merge together; without it we would be at risk of ending up with a nonsensical set of tokens like this one, for example: [“carpet_c”, “ar”]. There are different ways of pre-tokenizing our corpus. For simplicity, in this example our pre-tokenization will be based on the spaces:

    • [“c a r p e t _”, “c a r”]
  3. Next, split the corpus into single characters again, but this time keep the repeated characters:

    • “c a r p e t _ c a r”
  4. Count the occurrences of all consecutive pairs of the characters obtained in step 3. For our corpus “carpet car”, the counts of consecutive pairs would be:

    • c a - 2 occurrences
    • a r - 2 occurrences
    • r p - 1 occurrence
    • p e - 1 occurrence
    • e t - 1 occurrence
    • t _ - 1 occurrence
    • _ c - 1 occurrence
  5. Merge all occurrences of the pair with the highest count. If more than one pair ties for the maximum, just pick one of them at random. In this case the maximum number of occurrences is 2 and both < c a > and < a r > reach it, so I have randomly chosen to concatenate the pair < c a >. After concatenating that pair, our corpus will look like this:

    • “ca r p e t _ ca r”
  6. Now we simply treat the previously concatenated pair of characters as a single character in and of itself and repeat the previous two steps for as many iterations as we want (a minimal code sketch of one such iteration follows this list). Note that if you repeat the counting from step 4, you will find that the pair with the maximum number of occurrences is now < ca r >, because < ca > is treated as a single character. Now you must be wondering “for how many iterations should I run this algorithm?”. Well, the answer is: it depends (one of data science’s favorite answers). What is usually done in practice is to define the number of tokens we want in the vocabulary and let the algorithm run until that threshold is met.
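
To make steps 4 and 5 concrete, here is a minimal sketch of a single merge iteration on the “carpet car” example. The helper one_merge_iteration is my own, purely illustrative name; a fuller implementation is given in the Code section below.

import collections

def one_merge_iteration(pre_tokens):
    # pre_tokens: list of pre-tokens, each written as space-separated symbols,
    # e.g. ["c a r p e t _", "c a r"]
    pair_count = collections.defaultdict(int)
    for pre_token in pre_tokens:
        symbols = pre_token.split()
        # count consecutive pairs inside each pre-token only,
        # so merges never cross pre-token boundaries
        for left, right in zip(symbols, symbols[1:]):
            pair_count[left, right] += 1

    # max() keeps the first pair that reaches the top count,
    # which is one simple way of breaking ties
    best_pair = max(pair_count, key=pair_count.get)
    merged = "".join(best_pair)

    # replace every occurrence of the best pair with the merged symbol
    updated = []
    for pre_token in pre_tokens:
        symbols = pre_token.split()
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best_pair:
                out.append(merged)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        updated.append(" ".join(out))
    return best_pair, updated

print(one_merge_iteration(["c a r p e t _", "c a r"]))
# (('c', 'a'), ['ca r p e t _', 'ca r'])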

Segmenter

Okay cool, we have built our beloved vocabulary using the Learner, but how do we apply it to new data? Well, the official way to do that is to repeat, on the new data, the same sequence of merges that was performed by the Learner, in the same order. In our previous example from the Learner section, the first merge performed by the algorithm was of the characters < c a >, which eventually became the single character < ca >, so that should also be the first merge that happens in the Segmenter. If the new data that we are tokenizing is the word “carpenter”, the segmentation would look like this:

“c a r p e n t e r”

The first merge in the Learner was of the pair < c a >:

“ca r p e n t e r”

In the Learner section of this article, specifically in steps 4 and 5, after concatenating < c a > and repeating the process we found that the second merge is < ca r >, which means that here in the Segmenter our next merge will be < ca r > as well:

“car p e n t e r”

This process keeps going until we have exhausted all the merge steps recorded in the Learner phase; a minimal code sketch of this replay follows the result below. Assuming we let the Learner run for only 2 iterations, these would be our final tokens:

“car”, “p”, “e”, “n”, “t”, “e”, “r”
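
Below is a minimal sketch of that replay. The merge list [("c", "a"), ("ca", "r")] comes from the worked example above, and the function segment is my own illustrative name; the regex-based version used in practice is in the Code section below.

def segment(word, steps):
    # start from single characters: "c a r p e n t e r"
    symbols = list(word)
    # apply each learned merge, in the same order the Learner discovered them
    for pair in steps:
        merged = "".join(pair)
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

print(segment("carpenter", [("c", "a"), ("ca", "r")]))
# ['car', 'p', 'e', 'n', 't', 'e', 'r']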

Code

import collections
import re

# Both definitions below are based on the ones shown in this paper: https://arxiv.org/abs/1508.07909

def byte_pair_encoding_learner(train_data: dict) -> tuple:
    steps = []
    # initialize the vocabulary with every unique character found in the corpus
    vocabulary = list({char for string in train_data.keys() for char in string.split()})
    # special token for characters that were never seen during training
    vocabulary.append("[UNK]")
    # run 8 merge iterations; in practice this number is a hyperparameter,
    # chosen so that the vocabulary reaches the desired size
    for _ in range(8):
        # count the number of occurrences of each consecutive pair
        pair_count = collections.defaultdict(int)
        for chars, count in train_data.items():
            chars = chars.split()
            for i in range(len(chars) - 1):
                pair_count[chars[i], chars[i + 1]] += count

        # find the pair with maximum count
        pair_max = max(pair_count, key=pair_count.get)

        # store the merge steps needed for the segmenter phase
        steps.append(pair_max)

        # add the merged pair to the vocabulary
        vocabulary.append("".join(pair_max))

        # update the train_data dict by concatenating the max_pair
        updated_train_data = {}
        bigram = re.escape(" ".join(pair_max))
        p = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
        for word in train_data.keys():
            updated_word = p.sub("".join(pair_max), word)
            updated_train_data[updated_word] = train_data[word]
        train_data = updated_train_data
    return steps, train_data, vocabulary

def byte_pair_encoding_segmenter(steps: list, test_data: list) -> list:
    for pair_max in steps:
        updated_test_data = []
        bigram = re.escape(" ".join(pair_max))
        p = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
        for word in test_data:
            updated_word = p.sub("".join(pair_max), word)
            updated_test_data.append(updated_word)
        test_data = updated_test_data
    return test_data

def main():

    # Our data is assumed to come in this format: each key represents
    # a pre-token and the value is how many times that pre-token
    # appears in the training data. In this case our training text
    # would look something like:
    # "low low low low low lowest lowest newer newer newer newer newer newer wider wider wider new new "
    # The underscores represent the spaces.
    pre_tokenized_text = {
        "l o w _": 5,
        "l o w e s t _": 2,
        "n e w e r _": 6,
        "w i d e r _": 3,
        "n e w _": 2,
    }

    steps, final_train_data, vocabulary = byte_pair_encoding_learner(pre_tokenized_text)
    print("final state of the training data:")
    print(final_train_data)
    print()
    print("final state of the vocabulary:")
    print(vocabulary)
    print()
    input1, input2 = ["n e w e r _"], ["l o w e r _"]
    print("segmenter")
    print()
    print("steps: ")
    print(steps)
    print()
    print("results:")
    result = byte_pair_encoding_segmenter(steps, input1)
    result2 = byte_pair_encoding_segmenter(steps, input2)
    print(result)
    print(result2)

if __name__ == "__main__":
    main()

Final Notes

It’s important to realize that this may not be the best implementation of the byte-pair algorithm, but it’s good enough for you to see it working in practice. Also note that we applied the algorithm at a high level, simply merging characters, but in the real world this algorithm is a bit more involved, as it works on the byte representation of the characters rather than on the characters themselves. If you would like more information on this, I highly recommend watching this tutorial made by one of the best publicly known experts in the field of Artificial Intelligence, Andrej Karpathy (he’s a co-founder of OpenAI, by the way).
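
As a small illustration of what working at the byte level means (this snippet is my own and just uses Python’s built-in UTF-8 encoding, it is not taken from the tutorial): the text is first converted into its UTF-8 bytes, so the base vocabulary is always the 256 possible byte values, and the merges then happen over sequences of bytes.

# the text is first encoded to UTF-8 bytes, so the base vocabulary is always
# the 256 possible byte values, and characters like "é" span more than one byte
text = "carpet café"
byte_ids = list(text.encode("utf-8"))
print(byte_ids)  # [99, 97, 114, 112, 101, 116, 32, 99, 97, 102, 195, 169]
print(len(byte_ids), "bytes for", len(text), "characters")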

Conclusion

“I’ve come to the conclusion that conclusions are too boring to write” - Sun Tzu, The Art of War