Previously we built a simple GPT model, but it is still a long way from a modern LLM. This time I want to tweak the model architecture to bring it a bit more up to date.

(The code below may differ from the final version; see the repository for details.)

Tokenizer

First, our previous model was trained at the character level: learning to spell words was already hard enough for it, let alone producing fluent sentences. So the first step is to bring in OpenAI's tiktoken library and replace the character-level encode and decode with:

import tiktoken
 
enc = tiktoken.get_encoding("gpt2")
 
encode = lambda s: enc.encode_ordinary(s)
decode = lambda l: enc.decode(l)
 
vocab_size = 50304

GPT-2's actual vocabulary size is 50257, but we round it up to a multiple of 64 for better performance.
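Rounding up to a multiple of 64 is just integer arithmetic; a quick sketch (the helper name round_up is mine, not from the original code):

```python
def round_up(n, multiple=64):
    # round n up to the nearest multiple (here: pad the vocab so GPU kernels see a nicer size)
    return ((n + multiple - 1) // multiple) * multiple

print(round_up(50257))  # 50304
```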

We can check that encode and decode work correctly:

print(decode(encode('Hello World!')))

If we try to train this model directly on shakespeare-char, we run into severe overfitting:

Data loaded
Start to train...
Iter 0:
train loss: 9.97460651397705, val loss: 9.930878639221191
Best model saved!
Iter 1000:
train loss: 3.0411689281463623, val loss: 4.627305507659912
Best model saved!
Iter 2000:
train loss: 1.4448552131652832, val loss: 5.289175510406494
No improvement. Early stopping counter: 1
Iter 3000:
train loss: 0.4857630133628845, val loss: 6.246184825897217
No improvement. Early stopping counter: 2
Iter 4000:
train loss: 0.2052956372499466, val loss: 6.989179611206055
No improvement. Early stopping counter: 3
Iter 5000:
train loss: 0.128025621175766, val loss: 7.53853702545166
No improvement. Early stopping counter: 4
Iter 6000:
train loss: 0.09932415187358856, val loss: 8.035560607910156
No improvement. Early stopping counter: 5
Early stopping at 6000

This is because introducing the tokenizer grew the vocabulary by a factor of several hundred, and the parameter count grew accordingly, while the total number of input tokens actually went down after tokenization compressed the text. The model may now have more parameters than there are training tokens, so it can simply memorize the data, which naturally leads to overfitting.
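To put rough numbers on that claim (the embedding width below is an assumption for illustration, not the model's actual config):

```python
n_embed = 384        # hypothetical embedding width
char_vocab = 65      # tiny-shakespeare character vocabulary
bpe_vocab = 50304    # padded GPT-2 vocabulary

# token-embedding (and tied LM-head) parameter counts
emb_char = char_vocab * n_embed
emb_bpe = bpe_vocab * n_embed
print(emb_bpe // emb_char)  # 773 -- the embedding table grows by ~770x
```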

The fix is to switch to a larger dataset; here I chose TinyStories. But with a larger dataset, loading the data becomes a new problem. Previously we read the whole text into memory and encoded it, which is fine for a small dataset, but on a larger one the resulting tensor eats a lot of memory, and every training run spends a long time re-encoding. The solution is preprocessing plus memory mapping.

Data Loading

Preprocessing

Write a standalone Python script that downloads the data, tokenizes it, and saves it as binary files.

Normally you could load the dataset online with the datasets library, but because of network issues on my cluster, datasets.load_dataset could not download it directly, so I downloaded the dataset ahead of time; here we only need to load and process it:

import os
import numpy as np
import tiktoken
from datasets import load_dataset
from tqdm import tqdm
 
num_proc = 8
dtype = np.uint16
 
print("Loading dataset...")
split_dataset = load_dataset("text", data_files={
    'train': "datasets/TinyStories/TinyStories-train.txt",
    'val': "datasets/TinyStories/TinyStories-valid.txt"
})
print("Dataset loaded.")

Then tokenize the text:

enc = tiktoken.get_encoding("gpt2")
def process(example):
    ids = enc.encode_ordinary(example['text'])
    return {'ids': ids, 'len': len(ids)}
 
tokenized = split_dataset.map(
    process,
    remove_columns=['text'],
    desc="tokenizing the splits",
    num_proc=num_proc,
)

Next, use np.memmap to write the processed data to disk in batches:

for split, dset in tokenized.items():  # avoid shadowing the builtin `set`
    arr_len = np.sum(dset['len'], dtype=np.uint64)
    filename = os.path.join('datasets/TinyStories/', f'{split}.bin')
    print(f"Writing {filename}")

    arr = np.memmap(filename, dtype=dtype, mode='w+', shape=(arr_len,))

    total_batches = 1024
    idx = 0
    for batch_idx in tqdm(range(total_batches), desc=f"Writing {filename}"):
        batch = dset.shard(num_shards=total_batches, index=batch_idx, contiguous=True).with_format('numpy')
        arr_batch = np.concatenate(batch['ids'])
        arr[idx:idx+len(arr_batch)] = arr_batch
        idx += len(arr_batch)
        
    arr.flush()
 
print("Done!")

This only needs to run once; afterwards you get two .bin files.
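To sanity-check a .bin file you can map it back read-only. Here is a self-contained sketch that uses a temporary file in place of the real dataset (the token ids are arbitrary):

```python
import os
import tempfile
import numpy as np

# write a few uint16 token ids the same way the preprocessing script does
tokens = np.array([15496, 2159, 0], dtype=np.uint16)
path = os.path.join(tempfile.mkdtemp(), 'train.bin')
arr = np.memmap(path, dtype=np.uint16, mode='w+', shape=(len(tokens),))
arr[:] = tokens
arr.flush()

# map it back without loading the whole file into RAM
data = np.memmap(path, dtype=np.uint16, mode='r')
print(data[:3])  # the same ids come back
```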

Training

Change the original data-loading code to:

data_dir = 'datasets/TinyStories'
train_data = np.memmap(os.path.join(data_dir, 'train.bin'), dtype=np.uint16, mode='r')
val_data = np.memmap(os.path.join(data_dir, 'val.bin'), dtype=np.uint16, mode='r')
print("Data loaded")

get_batch also needs to change:

def get_batch(split):
    data = train_data if split == 'train' else val_data
    # sample indices on CPU: the memmap lives in host memory
    ix = torch.randint(len(data) - block_size, (batch_size,))
    # slices must be converted to int64 before handing them to torch
    x = torch.stack([torch.from_numpy((data[i:i+block_size]).astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy((data[i+1:i+1+block_size]).astype(np.int64)) for i in ix])
    if device == 'cuda':
        x, y = x.pin_memory().to(device, non_blocking=True), y.pin_memory().to(device, non_blocking=True)
    else:
        x, y = x.to(device), y.to(device)
    return x, y

Model

Now for some tweaks to the model architecture.

RoPE

First, replace the absolute position embedding with RoPE: remove pos_embedding in PicoGPT.forward, and add RoPE to the attention head:

class Head(nn.Module):
    def __init__(self, n_embed, head_dim, block_size):
        # ...
        self.RoPE = RoPE(head_dim)
        # ...
 
    def forward(self, x):
        _, T, _ = x.shape
        # (B, T, n_embed)
        k = self.RoPE(self.k_proj(x)) # (B, T, head_dim)
        q = self.RoPE(self.q_proj(x))
        v = self.v_proj(x)
        # ...

RoPE is applied right after the Q and K projections; V is left untouched.
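The RoPE module itself is not shown above; here is a minimal sketch of what it could look like, rotating interleaved even/odd dimension pairs (the class name matches the usage above, but the details are my own, not necessarily the repository's):

```python
import torch
import torch.nn as nn

class RoPE(nn.Module):
    """Minimal rotary position embedding sketch; head_dim must be even."""
    def __init__(self, head_dim, base=10000.0):
        super().__init__()
        inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
        self.register_buffer('inv_freq', inv_freq)

    def forward(self, x):
        # x: (B, T, head_dim)
        T = x.size(-2)
        t = torch.arange(T, dtype=self.inv_freq.dtype, device=x.device)
        freqs = torch.outer(t, self.inv_freq)   # (T, head_dim/2), angle per position/pair
        cos, sin = freqs.cos(), freqs.sin()
        x1, x2 = x[..., 0::2], x[..., 1::2]     # even/odd dims form rotation pairs
        out = torch.empty_like(x)
        out[..., 0::2] = x1 * cos - x2 * sin
        out[..., 1::2] = x1 * sin + x2 * cos
        return out
```

Since each pair is rotated by a position-dependent angle, position 0 is the identity and vector norms are preserved, which makes the module easy to unit-test.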

RMSNorm

RMSNorm is more common than LayerNorm in today's models: it is simpler to compute and faster, with essentially the same quality.
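For reference, what nn.RMSNorm computes can be written out in a few lines (a sketch of the math, not torch's actual implementation):

```python
import torch
import torch.nn as nn

class SimpleRMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # scale by the root mean square over the last dim;
        # unlike LayerNorm: no mean subtraction and no bias term
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x / rms * self.weight
```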

Change the two norm layers in the Transformer block to RMSNorm:

class TransformerBlock(nn.Module):
    def __init__(self, num_head, n_embed, block_size):
        super().__init__()
        self.attn = MultiHeadAttn(num_head, n_embed, block_size)
        self.ffn = FFN(n_embed)
        self.norm1 = nn.RMSNorm(n_embed, eps=1e-6)
        self.norm2 = nn.RMSNorm(n_embed, eps=1e-6)
 
    def forward(self, x):
        # pre-norm residuals: normalize the branch input, add back the un-normalized x
        x = x + self.attn(self.norm1(x))
        x = x + self.ffn(self.norm2(x))
        return x

generate

I also adjusted the generate function here, adding temperature and top_k parameters:

def generate(self, x, max_new_tokens, temperature=0.7, top_k=50):
    # x: (B, T)
    for _ in range(max_new_tokens):
        x_cond = x[:, -self.block_size:] # crop to the context window; avoid shadowing builtin `input`
        logits, _ = self(x_cond) # (B, T, vocab_size)
        logits = logits[:, -1, :] / temperature # (B, vocab_size)
        v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
        logits[logits < v[:, [-1]]] = float('-inf')
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1) # (B, 1)
        x = torch.cat([x, next_token], dim=-1)
    return x
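The top-k trick above (set every logit below the k-th largest to -inf before the softmax) can be checked in isolation:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[1.0, 3.0, 2.0, 0.5, -1.0]])
top_k = 2
v, _ = torch.topk(logits, top_k)
# everything below the k-th largest logit gets zero probability
logits = logits.masked_fill(logits < v[:, [-1]], float('-inf'))
probs = F.softmax(logits, dim=-1)
print(probs)  # only indices 1 and 2 remain nonzero
```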

Experiments

Now we can start training again:

Data loaded
Trainable parameters: 29959680
Start to train...
Iter 0:
train loss: 143.59780883789062, val loss: 143.64588928222656
Best model saved!
Iter 1000:
train loss: 4.27586030960083, val loss: 4.286266326904297
Best model saved!
Iter 2000:
train loss: 3.5111801624298096, val loss: 3.483233690261841
Best model saved!
...
Iter 19000:
train loss: 1.9200068712234497, val loss: 1.9404504299163818
Best model saved!
<|endoftext|> and wanted to try it. He took a big bite but it was bitter and sour. He wanted to eat it anyway. He started to feel sick and sick. He wanted the dog to bite him. He did not feel very sick. He did not want to hurt the cat.The cat did not want to play anymore. He wished he had a friend. He wished he had listened to Lily's sister's story. He wished he had been nicer to the medicine and not played with it. He wished he had not been angry. He wished Lily had listened to him."I am sorry, Ben. I was wrong. I did not know. I was a good friend. I did not know it was a bad ending. I will not eat it without you. I will be sorry." The cat meowed.Lily's brother saw what happened. He felt sad and scared. He ran to the cat, but the cat did not want to be friends. He wanted to be friends with the cat. He tried to help the cat. He took the cat. He put the cat on the cat.Lily was very happy. She hugged the cat. She was not hurt. She was not hurt. She had a bad dream.<|endoftext|>Anna and Ben are playing in the park. They like to look for bugs and flowers. They find a big, soft butterfly. They are happy and excited.But then, a big dog comes. The

Since the model has only about 30M parameters, I trained it for just 20000 iterations as a trial run, and the results are decent: at the very least, the grammar of the generated text is correct. Still, the logic remains fairly muddled; for example, it produces semantically redundant words and sentences, and the stories read like fever dreams...

Summary

This time we added a tokenizer, swapped LayerNorm for RMSNorm, replaced absolute position embeddings with RoPE, and trained on the larger TinyStories dataset. We can finally see the model starting to speak like a human! Next I may dig into optimization: switching to FlashAttention, adding a KV cache, and so on.