I spent two days following Andrej Karpathy's "Let's build GPT: from scratch, in code, spelled out" and wrote a simplified GPT. Since it's the first small model I've trained with my own hands, I decided to document the whole training process.
Training runs on a single RTX 5090, and since I left out the slightly more advanced tricks from nanoGPT such as distributed training, I decided to call it picoGPT :)
Data processing
First, download the shakespeare-char dataset, save it as input.txt, and read it in:
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()
Treat each distinct character directly as a token and count them for later use:
chars = sorted(list(set(text)))
vocab_size = len(chars)
Map characters to integer indices and back:
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])
Train/validation split
import torch

device = 'cuda'
data = torch.tensor(encode(text), dtype=torch.long, device=device)
# split
n = int(0.9 * len(data))
train_data = data[:n]
test_data = data[n:]
Training
Getting a batch
block_size = 8
batch_size = 32

def get_batch(split):
    data = train_data if split == 'train' else test_data
    ix = torch.randint(len(data) - block_size, (batch_size,), device=device)
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+1+block_size] for i in ix])
    return x, y
Both x and y returned here have shape (batch_size, block_size). Following the video, batch_size is written as B and block_size as T below.
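A quick sanity check of what a batch looks like (just for illustration, not part of the training script): each row of y is the corresponding row of x shifted one character to the right.

xb, yb = get_batch('train')
print(xb.shape, yb.shape)              # torch.Size([32, 8]) torch.Size([32, 8])
print(repr(decode(xb[0].tolist())))
print(repr(decode(yb[0].tolist())))    # same text, shifted right by one character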
Training loop
First define the model and the optimizer:
from model import PicoGPT

n_embed = 32
model = PicoGPT(vocab_size, n_embed)
model = model.to(device)
lr = 1e-3
optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
Then write the training-loop logic:
print("Start to train...")
train_iter = 20000
eval_interval = 2000
for iter in range(train_iter):
xb, yb = get_batch('train')
logits, loss = model(xb, yb)
optimizer.zero_grad(set_to_none=True)
loss.backward()
optimizer.step()
if iter % eval_interval == 0:
print(f"Iter {iter}:")
estimate_loss()一个完整的 Training loop 就包含以下几步:
- Get a batch
- Run the forward pass
- Compute the loss
- Zero the gradients
- Run the backward pass
- Have the optimizer update the model parameters
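estimate_loss() evaluates the average loss on both splits; it isn't spelled out above, but a minimal version looks roughly like this (eval_iters = 200 is an arbitrary choice; it prints in the format of the logs below and returns a dict so that the early-stopping loop near the end can read losses['val']):

eval_iters = 200

@torch.no_grad()
def estimate_loss():
    # Average the loss over several random batches for each split.
    model.eval()
    out = {}
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            xb, yb = get_batch(split)
            _, loss = model(xb, yb)
            losses[k] = loss.item()
        out[split] = losses.mean().item()
    model.train()
    print(f"Loss in train set: {out['train']}, val set: {out['val']}")
    return out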
Next, following the video, we improve the model step by step:
Model
V0: Bigram
Start with a Bigram model as the baseline:
import torch
import torch.nn as nn
from torch.nn import functional as F

class PicoGPT(nn.Module):
    def __init__(self, vocab_size, n_embed):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, n_embed)
        self.proj = nn.Linear(n_embed, vocab_size)
The model contains only two layers:
- nn.Embedding: maps each token index to an embedding vector
- nn.Linear: maps from n_embed dimensions back to vocab_size dimensions, producing a score for each character
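To make the shapes concrete, a tiny standalone example (assuming the 65-character shakespeare-char vocabulary and n_embed = 32):

import torch
import torch.nn as nn

emb = nn.Embedding(65, 32)           # vocab_size = 65, n_embed = 32
proj = nn.Linear(32, 65)
idx = torch.randint(0, 65, (4, 8))   # (B, T) batch of token indices
print(emb(idx).shape)                # torch.Size([4, 8, 32])
print(proj(emb(idx)).shape)          # torch.Size([4, 8, 65])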
Next, write the forward pass:
def forward(self, x, target=None):
    # x, target: (B, T)
    y = self.token_embedding(x) # (B, T, n_embed)
    y = self.proj(y) # (B, T, vocab_size)
    if target is None:
        loss = None
    else:
        B, T, C = y.shape
        loss = F.cross_entropy(y.view(B*T, C), target.view(-1))
    return y, loss
target is what the model should learn to predict. If it is None (as during generation) no loss is computed; otherwise we are training and cross entropy is used as the loss function.
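The reshape before the loss is needed because F.cross_entropy expects logits of shape (N, C) and integer targets of shape (N,); a tiny standalone example:

import torch
import torch.nn.functional as F

B, T, C = 2, 3, 65
logits = torch.randn(B, T, C)
target = torch.randint(0, C, (B, T))
loss = F.cross_entropy(logits.view(B*T, C), target.view(-1))  # scalar loss over all B*T positions
print(loss.item())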
To see the change in output after training more directly, we can write a simple generate method:
def generate(self, x, max_new_tokens):
    # x: (B, T)
    for _ in range(max_new_tokens):
        logits, _ = self(x) # (B, T, vocab_size)
        logits = logits[:, -1, :] # (B, vocab_size)
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1) # (B, 1)
        x = torch.cat([x, next_token], dim=-1)
    return x
In the main function, feed a single \n character as the first character:
def sample():
    input = torch.zeros((1, 1), dtype=torch.long, device=device)
    print(decode(model.generate(input, max_new_tokens=100)[0].tolist()))
Let's first see what it outputs before any training:
rN;$!K&D;a
ZAVFA;N;mIQluoAt:FzPH,U&:jh!Ncl$T
T FyZxv!Hlt:yeBmQeKEUh&fzHLtU' OuPdqgtpF;gT:H;FzWo
Ou.
Gibberish, as expected…
Let's see what training this simple model gets us!
Iter 0:
Loss in train set: 4.29146671295166, val set: 4.288028717041016
Iter 2000:
Loss in train set: 2.5236124992370605, val set: 2.529052257537842
Iter 4000:
Loss in train set: 2.511962413787842, val set: 2.4780426025390625
Iter 6000:
Loss in train set: 2.4409923553466797, val set: 2.477735757827759
Iter 8000:
Loss in train set: 2.4723761081695557, val set: 2.503343105316162
Iter 10000:
Loss in train set: 2.4729974269866943, val set: 2.503154754638672
Iter 12000:
Loss in train set: 2.472877264022827, val set: 2.472346305847168
Iter 14000:
Loss in train set: 2.4608867168426514, val set: 2.4862093925476074
Iter 16000:
Loss in train set: 2.4073214530944824, val set: 2.4834976196289062
Iter 18000:
Loss in train set: 2.461843252182007, val set: 2.454488515853882
After 20000 iterations the loss has dropped noticeably, from about 4.29 to around 2.45. Now the generated text:
I Mysutes?
NEDrds re r t:
TANo ierwhed ar elll se t ouf.
an
Ces!'l sa eah ghem
M:
Sey isatlath
It's no longer pure gibberish: the model has at least learned that words are separated by spaces and that punctuation can't be sprinkled in at random.
V1: Context (Average)
Basic implementation
It's easy to see why: the Bigram model has no memory at all. It guesses the next character purely from the current single character, so it's entirely understandable that it can't produce anything sensible.
Giving the model memory really means letting it understand context. A simple approach is to feed, instead of a single character, the average of all characters in the current character's prefix (including the current character itself).
A naive implementation would loop over every position and take the mean, but that's far too slow. Notice that this is equivalent to multiplying the input x by a row-normalized lower-triangular mask matrix, and the normalization can be done with softmax (chosen to stay consistent with the Self Attention used later).
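A quick standalone check that softmax over masked zeros really produces prefix-averaging weights:

import torch
import torch.nn.functional as F

T = 4
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros(T, T).masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
print(wei)
# tensor([[1.0000, 0.0000, 0.0000, 0.0000],
#         [0.5000, 0.5000, 0.0000, 0.0000],
#         [0.3333, 0.3333, 0.3333, 0.0000],
#         [0.2500, 0.2500, 0.2500, 0.2500]])
# Row t puts equal weight on tokens 0..t, so wei @ x averages each position's prefix.

The model then becomes: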
class PicoGPT(nn.Module):
    def __init__(self, vocab_size, n_embed, block_size):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, n_embed)
        self.proj = nn.Linear(n_embed, vocab_size)
        tril = torch.tril(torch.ones(block_size, block_size))
        wei = torch.zeros(block_size, block_size)
        wei = wei.masked_fill(tril == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        self.register_buffer('wei', wei)
        self.block_size = block_size

    def forward(self, x, target=None):
        # x, target: (B, T)
        _, T = x.shape
        y = self.token_embedding(x) # (B, T, n_embed)
        y = self.wei[:T, :T] @ y
        y = self.proj(y) # (B, T, vocab_size)
        # ...
generate also needs a small change so that it doesn't error out once the generated sequence grows beyond block_size:
def generate(self, x, max_new_tokens):
    # x: (B, T)
    for _ in range(max_new_tokens):
        x_cond = x[:, -self.block_size:]  # only condition on the last block_size tokens
        logits, _ = self(x_cond) # (B, T, vocab_size)
        logits = logits[:, -1, :] # (B, vocab_size)
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1) # (B, 1)
        x = torch.cat([x, next_token], dim=-1)
    return x
After training it, the loss actually went up:
Iter 0:
Loss in train set: 4.23388147354126, val set: 4.218968868255615
Iter 2000:
Loss in train set: 2.8795955181121826, val set: 2.8728349208831787
Iter 4000:
Loss in train set: 2.8516783714294434, val set: 2.825944423675537
Iter 6000:
Loss in train set: 2.8000307083129883, val set: 2.8197808265686035
Iter 8000:
Loss in train set: 2.782620429992676, val set: 2.833695411682129
Iter 10000:
Loss in train set: 2.8480095863342285, val set: 2.8165011405944824
Iter 12000:
Loss in train set: 2.7942678928375244, val set: 2.7880728244781494
Iter 14000:
Loss in train set: 2.7924342155456543, val set: 2.8363823890686035
Iter 16000:
Loss in train set: 2.745555877685547, val set: 2.8346755504608154
Iter 18000:
Loss in train set: 2.785167694091797, val set: 2.8249351978302
My guesses at possible reasons:
- Adding context makes the task harder: before, the model only had to learn frequent character pairs, and now it is forced to also digest extra context information
- The simple causal averaging mask used here has no learnable parameters, so the model can't choose which parts of the history matter more
- Token positions matter when processing context, but right now the model sees no positional information at all
  - This didn't matter for the Bigram model (it only ever looks at a single character pair)
Positional encoding
So I tried adding a positional embedding:
def __init__(self, vocab_size, n_embed, block_size):
    # ...
    self.pos_embedding = nn.Embedding(block_size, n_embed)

def forward(self, x, target=None):
    # ...
    tok_emb = self.token_embedding(x)                               # (B, T, n_embed)
    pos_emb = self.pos_embedding(torch.arange(T, device=x.device))  # (T, n_embed)
    y = tok_emb + pos_emb
    y = self.wei[:T, :T] @ y
    y = self.proj(y) # (B, T, vocab_size)
    # ...
Training results:
Iter 0:
Loss in train set: 4.239902019500732, val set: 4.263148307800293
Iter 2000:
Loss in train set: 2.857059955596924, val set: 2.84916615486145
Iter 4000:
Loss in train set: 2.8142058849334717, val set: 2.825568914413452
Iter 6000:
Loss in train set: 2.818707227706909, val set: 2.8426122665405273
Iter 8000:
Loss in train set: 2.8047380447387695, val set: 2.8363466262817383
Iter 10000:
Loss in train set: 2.81005859375, val set: 2.8199987411499023
Iter 12000:
Loss in train set: 2.8029580116271973, val set: 2.8087267875671387
Iter 14000:
Loss in train set: 2.8055031299591064, val set: 2.8205807209014893
Iter 16000:
Loss in train set: 2.7775373458862305, val set: 2.8117828369140625
Iter 18000:
Loss in train set: 2.7859320640563965, val set: 2.8083486557006836
Residual connection & ReLU
No obvious improvement. My guess is that averaging over the prefix is too destructive, so I also tried adding a residual connection and a ReLU:
def forward(self, x, target=None):
    # ...
    res = tok_emb + pos_emb
    y = self.wei[:T, :T] @ res
    y = F.relu(y + res)
    y = self.proj(y) # (B, T, vocab_size)
And that did bring some progress!
Iter 0:
Loss in train set: 4.522894382476807, val set: 4.508083343505859
Iter 2000:
Loss in train set: 2.534034490585327, val set: 2.52205753326416
Iter 4000:
Loss in train set: 2.4724555015563965, val set: 2.4907724857330322
Iter 6000:
Loss in train set: 2.439121723175049, val set: 2.4694595336914062
Iter 8000:
Loss in train set: 2.3941609859466553, val set: 2.4373888969421387
Iter 10000:
Loss in train set: 2.4100425243377686, val set: 2.409940719604492
Iter 12000:
Loss in train set: 2.3916189670562744, val set: 2.3958818912506104
Iter 14000:
Loss in train set: 2.391474485397339, val set: 2.4080934524536133
Iter 16000:
Loss in train set: 2.35304594039917, val set: 2.3878097534179688
Iter 18000:
Loss in train set: 2.350313663482666, val set: 2.407989263534546
The validation loss dropped to a low of about 2.39.
Of course, you can't really tell the difference from the generated text yet:
Malacthant bun-sopotn il gou;
Tlow w evoficach; lin cy, ghesit go of he she! couldobus wan; ssabe ak
V2: Context (Self Attention)
Single Head
Now it's time to introduce Self Attention.
First, define an Attention Head:
class Head(nn.Module):
    def __init__(self, n_embed, head_dim, block_size):
        super().__init__()
        self.k_proj = nn.Linear(n_embed, head_dim)
        self.q_proj = nn.Linear(n_embed, head_dim)
        self.v_proj = nn.Linear(n_embed, head_dim)
        self.head_dim = head_dim
        tril = torch.tril(torch.ones(block_size, block_size))
        self.register_buffer('tril', tril)

    def forward(self, x):
        _, T, _ = x.shape
        # x: (B, T, n_embed)
        k = self.k_proj(x) # (B, T, head_dim)
        q = self.q_proj(x)
        v = self.v_proj(x)
        wei = q @ k.transpose(-2, -1) * self.head_dim**-0.5 # (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        y = wei @ v # (B, T, head_dim)
        return y
Then replace the part of the model that gathers context with this Attention Head:
class PicoGPT(nn.Module):
    def __init__(self, vocab_size, n_embed, block_size):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, n_embed)
        self.pos_embedding = nn.Embedding(block_size, n_embed)
        self.attn = Head(n_embed, n_embed, block_size)
        self.attn_proj = nn.Linear(n_embed, n_embed)
        self.proj = nn.Linear(n_embed, vocab_size)
        self.block_size = block_size

    def forward(self, x, target=None):
        # x, target: (B, T)
        _, T = x.shape
        tok_emb = self.token_embedding(x) # (B, T, n_embed)
        pos_emb = self.pos_embedding(torch.arange(T, device=x.device))
        res = tok_emb + pos_emb
        y = self.attn(res) # (B, T, head_dim)
        y = self.attn_proj(y) # (B, T, n_embed)
        y = F.relu(y + res)
        y = self.proj(y) # (B, T, vocab_size)
        # ...
Training results:
Iter 0:
Loss in train set: 4.489755630493164, val set: 4.494349479675293
Iter 2000:
Loss in train set: 2.3774020671844482, val set: 2.3949503898620605
Iter 4000:
Loss in train set: 2.300098180770874, val set: 2.3203039169311523
Iter 6000:
Loss in train set: 2.2467241287231445, val set: 2.2703487873077393
Iter 8000:
Loss in train set: 2.2371342182159424, val set: 2.2568199634552
Iter 10000:
Loss in train set: 2.182363271713257, val set: 2.2608296871185303
Iter 12000:
Loss in train set: 2.1943764686584473, val set: 2.239030361175537
Iter 14000:
Loss in train set: 2.173494338989258, val set: 2.2454118728637695
Iter 16000:
Loss in train set: 2.175842046737671, val set: 2.2316362857818604
Iter 18000:
Loss in train set: 2.169466972351074, val set: 2.2129435539245605
Another clear improvement. Let's look at the generated text again (with the number of generated characters turned up a bit to make it easier to inspect):
Yourt'd virs.
Joort.
IF Muntuksbedrin shorblestan.
BRAMND:
This bethal; mors wreave of have musod sset pakavee at my hay no loore. RCAY:
Cen case;
There hearoue ing.
Youls sive hade bafing hay?
Thane but lorseave a deaver bliive; keme yes ow ES MARKING:'d
HEY VINCE: Fort's to futich untirtie thand
The model has clearly picked up some patterns, such as all-caps words (character names) followed by a colon, line breaks, and punctuation at the ends of sentences. Some recognizable words also appear, like This and There, although the text as a whole is still meaningless.
MultiHead
Next, upgrade single-head attention to multi-head:
class MultiHeadAttn(nn.Module):
    def __init__(self, num_head, n_embed, block_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(n_embed, n_embed // num_head, block_size) for _ in range(num_head)])

    def forward(self, x):
        return torch.cat([h(x) for h in self.heads], dim=-1)

class PicoGPT(nn.Module):
    def __init__(self, vocab_size, n_embed, block_size):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, n_embed)
        self.pos_embedding = nn.Embedding(block_size, n_embed)
        self.attn = MultiHeadAttn(4, n_embed, block_size)
        # ...
Let's look at the training results:
Multi Head vs Single Head
- By running multiple Heads in parallel, each with its own independent softmax normalization, the model can capture several kinds of relationships at the same time
- In contrast, if a single Head tries to capture several features at once, their scores compete inside one softmax, and weaker but equally important features can be drowned out
Data loaded
Start to train...
Iter 0:
Loss in train set: 4.307563304901123, val set: 4.316357612609863
Iter 2000:
Loss in train set: 2.350501537322998, val set: 2.3540198802948
Iter 4000:
Loss in train set: 2.234166383743286, val set: 2.3033454418182373
Iter 6000:
Loss in train set: 2.200052261352539, val set: 2.225684642791748
Iter 8000:
Loss in train set: 2.1654365062713623, val set: 2.2060325145721436
Iter 10000:
Loss in train set: 2.1473195552825928, val set: 2.207012891769409
Iter 12000:
Loss in train set: 2.1218271255493164, val set: 2.2064049243927
Iter 14000:
Loss in train set: 2.103039026260376, val set: 2.198835849761963
Iter 16000:
Loss in train set: 2.0987212657928467, val set: 2.1964051723480225
Iter 18000:
Loss in train set: 2.106106758117676, val set: 2.165680408477783
Ford, sbroues is?
Thasod.
Dour, con! whongs awithe that here the it, pler I up.
Sve a he's
and so'r.
PELORK:
Camue nole lary she.
O tompurcils, my se whond;
ROMORBIO:
Preest,
Whand, and the so-mane to more lif and a bratherecanpe'est me heso.
he umb
No palerys mourn
Jut in uring mods commenink,
A
The loss dropped a little further compared with the single-head version.
V3: Transformer
Block
Now we're ready to add a Transformer Block. First, fold the original attn_proj into MultiHeadAttn:
class MultiHeadAttn(nn.Module):
    def __init__(self, num_head, n_embed, block_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(n_embed, n_embed // num_head, block_size) for _ in range(num_head)])
        self.out_proj = nn.Linear(n_embed, n_embed)

    def forward(self, x):
        y = torch.cat([h(x) for h in self.heads], dim=-1)
        y = self.out_proj(y)
        return y
Here out_proj mixes the information from the individual heads back together. Next we need an FFN layer so the model can do some further "thinking" on that information:
class FFN(nn.Module):
    def __init__(self, n_embed):
        super().__init__()
        hidden_dim = 4 * n_embed
        self.net = nn.Sequential(
            nn.Linear(n_embed, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_embed)
        )

    def forward(self, x):
        return self.net(x)
A Transformer Block is just Attention + FFN (with residual connections):
class TransformerBlock(nn.Module):
    def __init__(self, num_head, n_embed, block_size):
        super().__init__()
        self.attn = MultiHeadAttn(num_head, n_embed, block_size)
        self.ffn = FFN(n_embed)

    def forward(self, x):
        y = self.attn(x) + x
        y = self.ffn(y) + y
        return y
Update the model:
class PicoGPT(nn.Module):
    def forward(self, x, target=None):
        # x, target: (B, T)
        _, T = x.shape
        tok_emb = self.token_embedding(x) # (B, T, n_embed)
        pos_emb = self.pos_embedding(torch.arange(T, device=x.device))
        y = tok_emb + pos_emb
        y = self.transformer(y) # (B, T, n_embed)
        y = self.proj(y) # (B, T, vocab_size)
Lower the learning rate to 5e-4, correspondingly increase the number of iterations to 80000, and first test the effect of a single Transformer Block.
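The matching __init__ isn't spelled out above; for the single-block version it would look roughly like this (4 heads assumed, as before):

def __init__(self, vocab_size, n_embed, block_size):
    super().__init__()
    self.token_embedding = nn.Embedding(vocab_size, n_embed)
    self.pos_embedding = nn.Embedding(block_size, n_embed)
    self.transformer = TransformerBlock(4, n_embed, block_size)  # a single block for now
    self.proj = nn.Linear(n_embed, vocab_size)
    self.block_size = block_size

Training results: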
Data loaded
Start to train...
Iter 0:
Loss in train set: 4.518762588500977, val set: 4.504559516906738
Iter 2000:
Loss in train set: 2.3470914363861084, val set: 2.365039587020874
Iter 4000:
Loss in train set: 2.2133140563964844, val set: 2.2356655597686768
Iter 6000:
Loss in train set: 2.152195692062378, val set: 2.2316629886627197
Iter 8000:
Loss in train set: 2.163936138153076, val set: 2.1596953868865967
Iter 10000:
Loss in train set: 2.116147994995117, val set: 2.1599292755126953
...
Iter 72000:
Loss in train set: 1.916203260421753, val set: 2.038317918777466
Iter 74000:
Loss in train set: 1.8803707361221313, val set: 2.052816152572632
Iter 76000:
Loss in train set: 1.890195369720459, val set: 2.053239107131958
Iter 78000:
Loss in train set: 1.9111782312393188, val set: 2.0241119861602783
GLOUCESTREDORIOLANUSIO:
Things, and you knower
Yet thee;
Which hoes,
Lee I woectriked strer frieds, crowade age contre and heir che comforl'd
To his I way be time saulse of abousinced,
Lasine trainst reasuntrains man;
But that enjudure my frease manted
it pition.
KING, I'll yound beld it come telf
The loss fell fairly steadily to around 2.02, although the generated text doesn't look much different.
Stacking more Blocks
class PicoGPT(nn.Module):
    def __init__(self, vocab_size, n_embed, block_size):
        # ...
        num_head = 4
        num_block = 4
        self.transformer = nn.Sequential(
            *[TransformerBlock(num_head, n_embed, block_size) for _ in range(num_block)],
        )
        # ...
The video and the paper add LayerNorm to the Block to keep training stable:
class TransformerBlock(nn.Module):
    def __init__(self, num_head, n_embed, block_size):
        super().__init__()
        self.attn = MultiHeadAttn(num_head, n_embed, block_size)
        self.ffn = FFN(n_embed)
        self.norm1 = nn.LayerNorm(n_embed)
        self.norm2 = nn.LayerNorm(n_embed)

    def forward(self, x):
        x = self.norm1(x)
        y = self.attn(x) + x
        y = self.norm2(y)
        y = self.ffn(y) + y
        return y
And add one more LayerNorm after the whole Transformer stack:
class PicoGPT(nn.Module):
    def __init__(self, vocab_size, n_embed, block_size):
        # ...
        num_head = 4
        num_block = 4
        self.transformer = nn.Sequential(
            *[TransformerBlock(num_head, n_embed, block_size) for _ in range(num_block)],
            nn.LayerNorm(n_embed)
        )
        # ...
I trained both with and without LayerNorm and the results were basically the same; on a model this small the effect of LayerNorm probably just isn't very visible.
Data loaded
Start to train...
Iter 0:
Loss in train set: 4.24537467956543, val set: 4.242423057556152
Iter 10000:
Loss in train set: 2.014112710952759, val set: 2.0962538719177246
Iter 20000:
Loss in train set: 1.9180679321289062, val set: 2.0395662784576416
Iter 30000:
Loss in train set: 1.8539788722991943, val set: 2.002016067504883
Iter 40000:
Loss in train set: 1.8302638530731201, val set: 1.9723260402679443
Iter 50000:
Loss in train set: 1.7979899644851685, val set: 1.9632230997085571
Iter 60000:
Loss in train set: 1.7942384481430054, val set: 1.9395010471343994
Iter 70000:
Loss in train set: 1.7818547487258911, val set: 1.9350775480270386
Iter 80000:
Loss in train set: 1.7698839902877808, val set: 1.931738257408142
Iter 90000:
Loss in train set: 1.7602676153182983, val set: 1.9081381559371948
By a less a need
No leave the dim see a lady,
And of we'kizeness in at king, Gead mandure
hat you cust
walt, and here's he, is is us at I came a kingred as like then. For tragewered action
his entemeon alone, wonder to stay nother; as broke must less child with sut, all no knea;
That bestect sing-fa
The output now contains plenty of recognizable words, but the model still can't produce fluent sentences.
V4: Scale up
I first tried scaling up directly, setting batch_size = 64, block_size = 256, num_head = 6, num_block = 6, n_embed = 384:
Data loaded
Start to train...
Iter 0:
Loss in train set: 3.988142967224121, val set: 4.023208141326904
Iter 5000:
Loss in train set: 0.4517122507095337, val set: 2.6705238819122314
Iter 10000:
Loss in train set: 0.15824486315250397, val set: 4.559512615203857
Iter 15000:
Loss in train set: 0.1352931708097458, val set: 5.0354108810424805
Iter 20000:
Loss in train set: 0.11824750155210495, val set: 5.2586750984191895
After training for a while I saw severe overfitting, so I added Dropout to both the Attention and the FFN and trained again.
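Roughly, the Dropout goes on the attention weights inside Head and at the end of FFN (as in the video; the 0.2 rate below is just an example value, the exact rate isn't recorded here):

dropout = 0.2  # example value

class Head(nn.Module):
    def __init__(self, n_embed, head_dim, block_size):
        super().__init__()
        self.k_proj = nn.Linear(n_embed, head_dim)
        self.q_proj = nn.Linear(n_embed, head_dim)
        self.v_proj = nn.Linear(n_embed, head_dim)
        self.head_dim = head_dim
        tril = torch.tril(torch.ones(block_size, block_size))
        self.register_buffer('tril', tril)
        self.attn_dropout = nn.Dropout(dropout)      # new

    def forward(self, x):
        _, T, _ = x.shape
        k = self.k_proj(x)
        q = self.q_proj(x)
        v = self.v_proj(x)
        wei = q @ k.transpose(-2, -1) * self.head_dim**-0.5
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        wei = self.attn_dropout(wei)                 # new: randomly drop attention weights
        return wei @ v

class FFN(nn.Module):
    def __init__(self, n_embed):
        super().__init__()
        hidden_dim = 4 * n_embed
        self.net = nn.Sequential(
            nn.Linear(n_embed, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_embed),
            nn.Dropout(dropout),                     # new: dropout at the end of the FFN
        )

    def forward(self, x):
        return self.net(x)

Results after retraining: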
Iter 0:
Loss in train set: 3.9877395629882812, val set: 4.027334213256836
Iter 1000:
Loss in train set: 1.5057505369186401, val set: 1.6999385356903076
Iter 2000:
Loss in train set: 1.2680703401565552, val set: 1.5151699781417847
Iter 3000:
Loss in train set: 1.1544594764709473, val set: 1.4766182899475098
Iter 4000:
Loss in train set: 1.0669773817062378, val set: 1.4809764623641968
Iter 5000:
Loss in train set: 0.9773276448249817, val set: 1.4964184761047363
Iter 6000:
Loss in train set: 0.8882049322128296, val set: 1.549694538116455
Iter 7000:
Loss in train set: 0.8099228143692017, val set: 1.5853855609893799
Iter 8000:
Loss in train set: 0.7335563898086548, val set: 1.644248604774475
That helped, but there is still overfitting, probably because the dataset is simply too small; at best the validation loss only gets down to about 1.48. Next, add Early Stopping and train once more (I also set weight_decay=0.1 on AdamW):
train_iter = 50000
eval_interval = 1000
trigger_times = 0
max_trigger_times = 5
best_val_loss = float('inf')

for iter in range(train_iter):
    xb, yb = get_batch('train')
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    if iter % eval_interval == 0:
        print(f"Iter {iter}:")
        losses = estimate_loss()
        cur_val_loss = losses['val']
        print(f"train loss: {losses['train']}, val loss: {cur_val_loss}")
        if cur_val_loss < best_val_loss:
            best_val_loss = cur_val_loss
            trigger_times = 0
            torch.save(model.state_dict(), 'ckpt/best_model.pth')
            print("Best model saved!")
        else:
            trigger_times += 1
            print(f"No improvement. Early stopping counter: {trigger_times}")
            if trigger_times >= max_trigger_times:
                print(f"Early stopping at {iter}")
                break
The results:
Data loaded
Start to train...
Iter 0:
train loss: 3.987602949142456, val loss: 4.027195930480957
Best model saved!
Iter 1000:
train loss: 1.5142121315002441, val loss: 1.707746148109436
Best model saved!
Iter 2000:
train loss: 1.2764283418655396, val loss: 1.5163389444351196
Best model saved!
Iter 3000:
train loss: 1.162351131439209, val loss: 1.470208764076233
Best model saved!
Iter 4000:
train loss: 1.0790393352508545, val loss: 1.472249150276184
No improvement. Early stopping counter: 1
Iter 5000:
train loss: 0.9904430508613586, val loss: 1.4795347452163696
No improvement. Early stopping counter: 2
Iter 6000:
train loss: 0.9032699465751648, val loss: 1.5264209508895874
No improvement. Early stopping counter: 3
Iter 7000:
train loss: 0.8287800550460815, val loss: 1.5565990209579468
No improvement. Early stopping counter: 4
Iter 8000:
train loss: 0.752350389957428, val loss: 1.6219748258590698
No improvement. Early stopping counter: 5
Early stopping at 8000
Could and brave for thee! March our withdraw with thee?
Captain Verona? O, give us leave and keep
Unto the chyse; squake, as 'tis the face of triumph,
At liberty, such a name of blessed
With dial clouds are on those ways.
Richard! Away with him! and plagues
More than thou wast from thy fault;
Tell p
Trainable parameters: 10795841
The model actually gets the validation loss down to about 1.47 quite quickly (within the first few thousand iterations), and it can now produce some meaningful short phrases, such as "Away with him!" and "such a name of blessed" (okay, maybe not that meaningful...).
That basically wraps up the training of picoGPT. On the shakespeare-char dataset this is already a fairly solid baseline. A few directions I might explore next:
- Tweak the architecture to be more modern: swap LayerNorm for RMSNorm, swap the positional embedding for RoPE, and so on
- Add a Tokenizer, so the model assembles "words" instead of characters and can generate more coherent sentences
- Performance optimizations: some of the code above trades performance for readability; for example, the three Q/K/V projections could be merged into one big projection, and FlashAttention could be applied (see the sketch after this list)
- Add distributed training
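For example, a fused-QKV multi-head attention could look roughly like this (a sketch only, not code I've trained with; F.scaled_dot_product_attention in PyTorch >= 2.0 dispatches to FlashAttention kernels when available):

import torch
import torch.nn as nn
from torch.nn import functional as F

class FusedAttention(nn.Module):
    # All heads share one fused QKV projection instead of three nn.Linear layers per Head.
    def __init__(self, num_head, n_embed):
        super().__init__()
        assert n_embed % num_head == 0
        self.num_head = num_head
        self.head_dim = n_embed // num_head
        self.qkv_proj = nn.Linear(n_embed, 3 * n_embed)
        self.out_proj = nn.Linear(n_embed, n_embed)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv_proj(x).split(C, dim=-1)
        # (B, T, C) -> (B, num_head, T, head_dim)
        q = q.view(B, T, self.num_head, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.num_head, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.num_head, self.head_dim).transpose(1, 2)
        # Fused attention kernel; is_causal=True replaces the explicit tril mask.
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.out_proj(y)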