BPE (Byte-Pair Encoding) Code Implementation

Published: January 11, 2024

BPE is one of the most widely used sub-word tokenization algorithms. Although it is a greedy algorithm, it performs well and is one of the preferred tokenization methods for mainstream NLP tasks such as machine translation.

For the principles behind the BPE algorithm, see the linked companion article.
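
Before turning to the tokenizers library, the core idea can be illustrated with a minimal pure-Python sketch of the greedy merge loop (illustrative only; the toy corpus and helper functions below are not part of the original post):

from collections import Counter

def get_pair_counts(words):
    # Count how often each adjacent symbol pair occurs across the corpus
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    # Replace every occurrence of the chosen pair with one merged symbol
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word -> frequency, with each word split into characters
words = {tuple("lower"): 2, tuple("lowest"): 1, tuple("newer"): 3}
for _ in range(5):  # perform 5 greedy merges
    pair = get_pair_counts(words).most_common(1)[0][0]
    words = merge_pair(words, pair)
    print("merged", pair)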

1. Byte-Pair Encoding Tokenizer Training


import pandas as pd

# Import gc, a library for controlling the garbage collector
import gc

# Import various classes and functions from the tokenizers library, which is used for creating and using custom tokenizers 
from tokenizers import (
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
)

# Import PreTrainedTokenizerFast, a class for using fast tokenizers from the transformers library
from transformers import PreTrainedTokenizerFast

# Import TfidfVectorizer, a class for transforming text into TF-IDF features
from sklearn.feature_extraction.text import TfidfVectorizer

# Import tqdm, a library for displaying progress bars 
from tqdm.auto import tqdm

# Import Dataset, a class for working with datasets in a standardized way 
from datasets import Dataset

# Set the LOWERCASE flag to False
LOWERCASE = False 

# Set VOCAB_SIZE to 10000000.
# This caps the vocabulary at 10 million tokens (training stops earlier if the corpus cannot produce that many merges).
VOCAB_SIZE = 10000000

# Load the test texts and keep only the first 6666 rows
test = pd.read_csv('data/test_text.csv').iloc[:6666]

# Create a tokenizer object using the Byte Pair Encoding (BPE) algorithm
# Define an unknown token as "[UNK]"
raw_tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))

# Normalize the text with Unicode Normalization Form C (NFC), optionally followed by lowercasing
raw_tokenizer.normalizer = normalizers.Sequence([normalizers.NFC()] + ([normalizers.Lowercase()] if LOWERCASE else []))

# Pre-tokenize at the byte level: the text is split GPT-2 style and every byte is mapped to a printable character (a leading space becomes the "Ġ" marker)
raw_tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()

# Define the special tokens that will be used for the downstream task
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]

# Create a trainer object that will train the tokenizer on the given vocabulary size and special tokens
trainer = trainers.BpeTrainer(vocab_size=VOCAB_SIZE, special_tokens=special_tokens)

# Load the test dataset from a pandas dataframe and select only the text column
dataset = Dataset.from_pandas(test[['text']])

# Define a generator function that will yield batches of text from the dataset
def train_corp_iter(): 
    for i in range(0, len(dataset), 1000):
        yield dataset[i : i + 1000]["text"]

# Train the tokenizer on the batches of text using the trainer object
raw_tokenizer.train_from_iterator(train_corp_iter(), trainer=trainer)

# Wrap the raw tokenizer object into a PreTrainedTokenizerFast object that is compatible with the HuggingFace library
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=raw_tokenizer,
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)


# Initialize an empty list to store the tokenized texts for the test set
tokenized_texts_test = []

# Loop over the texts in the test set and tokenize them using the tokenizer object
for text in tqdm(test['text'].tolist()):
    tokenized_texts_test.append(tokenizer.tokenize(text))
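
As a quick sanity check (not part of the original notebook; the sample sentence below is made up), the trained tokenizer can be applied to a single string and saved for later reuse:

# Illustrative check on the trained tokenizer (hypothetical sample text)
sample = "Modern humans today are always on their phone."
print(tokenizer.tokenize(sample))      # sub-word tokens; words preceded by a space carry the byte-level "Ġ" marker
print(tokenizer(sample)["input_ids"])  # the corresponding token ids

# Persist the tokenizer so it can be reloaded later with PreTrainedTokenizerFast.from_pretrained("bpe_tokenizer")
tokenizer.save_pretrained("bpe_tokenizer")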

2. TF-IDF Vectorization

In general, after the data has been tokenized with the BPE algorithm, a TF-IDF vectorization step is applied to turn the token sequences into a form that is convenient for modeling.

# Define a dummy function that returns the input text as it is
def dummy(text):
    return text

# Create a TfidfVectorizer over word n-grams of length 3 to 5; the texts are already tokenized, so a dummy tokenizer and preprocessor are passed to leave the input untouched
vectorizer = TfidfVectorizer(ngram_range=(3, 5), lowercase=False, sublinear_tf=True, 
                            analyzer = 'word',
                            tokenizer = dummy,
                            preprocessor = dummy,
                            token_pattern = None, strip_accents='unicode'
                            )

# Fit and transform the vectorizer on the tokenized texts of the test set, and get the sparse matrix of tf-idf values
tf_test = vectorizer.fit_transform(tokenized_texts_test)

# Get the vocabulary of the vectorizer, which is a dictionary of n-grams and their indices
vocab = vectorizer.vocabulary_

# Print the first 10 vocabulary entries
print(list(vocab.items())[:10])

# Delete the vectorizer object to free up memory
del vectorizer

# Invoke the garbage collector to reclaim unused memory
gc.collect()

[('ĠPhones Ġ Ġ', 1023935), ('Ġ Ġ Modern', 716662), ('Ġ Modern Ġhumans', 679728), ('Modern Ġhumans Ġtoday', 534040), ('Ġhumans Ġtoday Ġare', 3237252), ('Ġtoday Ġare Ġalways', 5977665), ('Ġare Ġalways Ġon', 1675978), ('Ġalways Ġon Ġtheir', 1455005), ('Ġon Ġtheir Ġphone', 4210093), ('Ġtheir Ġphone .', 5562309)]
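
In a full train/test pipeline, the vocabulary collected here would typically be reused so that both splits share the same feature space. A minimal sketch, assuming a list tokenized_texts_train has been prepared for the training texts in the same way as above (that list is hypothetical and not part of this post):

# Sketch: build a second vectorizer restricted to the vocabulary learned above
vectorizer = TfidfVectorizer(ngram_range=(3, 5), lowercase=False, sublinear_tf=True,
                             analyzer='word',
                             tokenizer=dummy,
                             preprocessor=dummy,
                             token_pattern=None, strip_accents='unicode',
                             vocabulary=vocab)

# The columns of tf_train now line up with those of tf_test
tf_train = vectorizer.fit_transform(tokenized_texts_train)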





Source: https://blog.csdn.net/PyDarren/article/details/135524864