
Implementing a Korean Tokenizer at the CUDA C Level (English Version)

Author: 김태은
Category: Hands-on
To process natural language, LMs (language models) must encounter words during training, deconstruct and manipulate them to suit their processing capabilities, and "learn" them through vectorization and semantic embedding.
In this context, to "learn" a word is roughly equivalent to registering it in the LM's vocabulary.
As promising as this sounds, practical limitations impose themselves and we are forced to strike a balance. The vocabulary is restricted in size, bounded by computing efficiency and available resources. However, reducing the vocabulary's size ultimately means reducing the material the LM learns, implying a trade-off in model performance: the model becomes less able to generalize to and understand unseen words.
Even if we possessed unlimited resources, it would not be pragmatic to simply have the model learn every word it encounters verbatim during training:
1. Language evolves. New words are constantly being created, most of them simple manipulations of existing words. Even a model trained on a huge dataset is bound to face words it has never seen.
2. Word-level learning requires huge amounts of training data. For an LM to understand the relationship between sleep, sleeping, slept, and the other forms of sleep, each form would need to be tokenized and learned separately.
Thus the question was this:
How do we minimize vocabulary size while maintaining satisfactory model performance?
And further:
Is there an approach flexible enough to account for the ever-changing nature of language?
"A satisfactory model performance" describes a LM's approach to unlearned words, or in other words, words beyond its Vocabulary. This is known as the OoV (out of vocabulary) problem.
Contemporary NLP provides a multitude of solutions to OoV, one of them being BPE (Byte Pair Encoding) and its derived algorithms. The main concept is fairly intuitive: it is a tokenization technique that registers subword units, such as roots and grammatical affixes, in the vocabulary, allowing unseen words to be inferred by analogy.
Here is a generalized overview of the algorithm (a minimal sketch follows the list):
1. Deconstruct the unknown word, <UNK>, into its fundamental bytes.
2. Iteratively replace the most frequent pair of bytes with a single new token.
3. Repeat until no frequent pairs remain.
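Below is a minimal, illustrative sketch of that merge loop in plain C; it is not the post's actual CUDA implementation, and the function names (most_frequent_pair, merge_pair) and the toy corpus are assumptions for demonstration. Symbols are represented as ints, with ids 0-255 for raw bytes and higher ids for merged pairs.

```c
/* Minimal sketch of BPE merging, assuming a toy corpus held as an array of
 * int symbol ids (0-255 are raw bytes; merged pairs get new ids). */
#include <stdio.h>
#include <string.h>

/* Find the most frequent adjacent pair in seq[0..len); returns its count. */
static int most_frequent_pair(const int *seq, int len, int *a, int *b) {
    int best = 0;
    for (int i = 0; i + 1 < len; i++) {
        int count = 0;
        for (int j = 0; j + 1 < len; j++)
            if (seq[j] == seq[i] && seq[j + 1] == seq[i + 1]) count++;
        if (count > best) { best = count; *a = seq[i]; *b = seq[i + 1]; }
    }
    return best;
}

/* Replace every occurrence of the pair (a, b) with the new symbol id. */
static int merge_pair(int *seq, int len, int a, int b, int id) {
    int w = 0;
    for (int r = 0; r < len; r++) {
        if (r + 1 < len && seq[r] == a && seq[r + 1] == b) {
            seq[w++] = id; r++;            /* consume both halves of the pair */
        } else {
            seq[w++] = seq[r];
        }
    }
    return w;                              /* new, shorter length */
}

int main(void) {
    const char *text = "low lower lowest";
    int seq[64], len = (int)strlen(text);
    for (int i = 0; i < len; i++) seq[i] = (unsigned char)text[i];

    int next_id = 256;                     /* ids above 255 are merged tokens */
    for (;;) {
        int a = 0, b = 0;
        if (most_frequent_pair(seq, len, &a, &b) < 2) break; /* no repeats left */
        printf("merge (%d, %d) -> %d\n", a, b, next_id);
        len = merge_pair(seq, len, a, b, next_id++);
    }
    printf("final sequence length: %d\n", len);
    return 0;
}
```

Running it on the toy string prints each merge as it happens. Real tokenizers track pair counts with hash maps instead of this quadratic scan, but the mechanics are the same.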
The implementation of BPE poses a particular challenge for multi-byte characters.
A byte represents 8 bits, which translates to 2^8 = 256 different values. A byte is conventionally written as 0xYY, where 0x indicates hexadecimal notation and each Y is a hexadecimal digit.
Although English letters and basic numerals can be represented with single bytes, there are far more than 256 characters in the world, so not all of them can fit in a single byte. UTF-8 therefore uses a multi-byte system that represents such characters with 2-4 byte sequences. Within a single byte, values 0-127 express the ASCII characters, while 128-255 are reserved for the bytes of multi-byte sequences.
To return to the consequence for BPE: when the algorithm breaks words down into bytes, multi-byte characters are also split into 2-4 separate byte tokens, which are not displayed properly when each is interpreted back into character format on its own.
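A quick illustration of this in plain C (not from the original post; the string and variable names are chosen for demonstration): the Korean syllable 한 occupies three bytes in UTF-8, and printing any one of those bytes alone yields mojibake, while the full sequence renders correctly.

```c
/* Why byte-level tokenization mangles multi-byte characters: each byte of a
 * UTF-8 sequence is meaningless on its own. */
#include <stdio.h>
#include <string.h>

int main(void) {
    const char *hangul = "한";            /* U+D55C, 3 bytes in UTF-8 */
    size_t n = strlen(hangul);

    printf("bytes:");
    for (size_t i = 0; i < n; i++)
        printf(" %02X", (unsigned char)hangul[i]);   /* ED 95 9C */
    printf("\n");

    /* Each byte as a lone "token": none is a valid character by itself. */
    for (size_t i = 0; i < n; i++)
        printf("token %zu: [%c]\n", i, hangul[i]);   /* mojibake */

    printf("whole sequence: %s\n", hangul);          /* prints 한 */
    return 0;
}
```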
Reverse Engineering example:
```
C3 B0 00 ð 2   # 👌 - TrailByte and bit masking
C2 9F 00   2   # 👌 - 2 Second byte
C2 91 00   2   # 👌 - 3 Third byte
C2 8C 00   2   # 👌 - 4 Fourth byte
```
👌 is represented by F0 9F 91 8C, but the first byte in the output is B0, giving B0 9F 91 8C. It is easy to identify this as a recurring pattern for multi-byte OoV characters.
```
  1 1 1 1 0 0 0 0   -- F0
- 1 0 1 1 0 0 0 0   -- B0
------------------------
  0 1 0 0 0 0 0 0   -- 40
```
This is assumed to be the result of bit masking, applied to troubleshoot the unexpected rendering of bytes when the LM encounters a multi-byte OoV character.
To bypass the bit masking and display the OoV multi-byte character properly, an OR 0x40 bit operation is applied to the masked byte:
```
   1 0 1 1 0 0 0 0   --> B0
OR 0 1 0 0 0 0 0 0   --> 40
------------------------
   1 1 1 1 0 0 0 0   --> F0
```
In the dump, C3 indicates the beginning of a multi-byte sequence and is followed by the masked lead byte; C2 indicates the continuation of that sequence, and the byte after it is an unmodified byte of the multi-byte character.
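Putting the two rules together, a reconstruction pass might look like the sketch below. This is an assumed plain-C rendering of the technique described above, not the post's actual code; unmask_utf8 and the buffer names are hypothetical.

```c
/* Rebuild original UTF-8 bytes from the escaped stream described above:
 * C3 means "next byte had its 0x40 bit masked off", C2 means "next byte is
 * verbatim". Names are illustrative, not from the original post. */
#include <stdio.h>
#include <stddef.h>

static size_t unmask_utf8(const unsigned char *in, size_t n, unsigned char *out) {
    size_t w = 0;
    for (size_t r = 0; r < n; r++) {
        if (in[r] == 0xC3 && r + 1 < n) {
            out[w++] = in[++r] | 0x40;   /* restore the masked lead byte */
        } else if (in[r] == 0xC2 && r + 1 < n) {
            out[w++] = in[++r];          /* continuation byte, passed through */
        } else {
            out[w++] = in[r];            /* plain ASCII byte */
        }
    }
    return w;                            /* length of the rebuilt sequence */
}

int main(void) {
    /* Escaped form of the 👌 emoji observed in the dump above. */
    const unsigned char masked[] = {0xC3, 0xB0, 0xC2, 0x9F, 0xC2, 0x91, 0xC2, 0x8C};
    unsigned char fixed[16];
    size_t len = unmask_utf8(masked, sizeof masked, fixed);

    fwrite(fixed, 1, len, stdout);       /* prints 👌 on a UTF-8 terminal */
    putchar('\n');
    return 0;
}
```

Feeding it the escaped bytes from the dump (C3 B0 C2 9F C2 91 C2 8C) recovers F0 9F 91 8C, which a UTF-8 terminal renders as 👌.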
With this fix, CJK and other multi-byte characters are deconstructed and reconstructed properly, producing output identical to the input.
This kind of engine-level control is significant for understanding the model's working principles and allows for efficient device-level control of the model's output. The approach provides a novel intuition and has potential implementations in other low-level environments.
Tests
Korean example
Japanese example