You can follow my "Running Large Models on a Beginner's Laptop" series hands-on. Ideally, supplement your deep learning and machine learning fundamentals as you go (the link below is well worth studying):
How do people rate "Dive into Deep Learning" by Mu Li and team? - 知乎 (zhihu.com)
I have previously shared three installments of the series; if you are interested, see:
求索: Running Large Models on a Beginner's Laptop, Part 1: Running large models with Ollama
求索: Running Large Models on a Beginner's Laptop, Part 2: Fine-tuning Google's Gemma model
求索: Running Large Models on a Beginner's Laptop, Part 3: Fine-tuning Google's Gemma model with Hugging Face
This time we take on the fine-tuning method from a recent paper, "ORPO: Monolithic Preference Optimization without Reference Model". The paper proposes ORPO (Odds Ratio Preference Optimization), which shows that a modest penalty on disfavored generations during SFT is enough to achieve preference alignment. By combining SFT and alignment into a single new objective (loss function) for training the base language model, it eliminates the need for an additional, time- and labor-intensive alignment phase. As the paper's architecture diagram shows (below), ORPO requires no supervised fine-tuning warm-up, no reward model, and no reference model.
If you are interested in the paper's details, see:
求索: ORPO: direct preference optimization for large models without fine-tuning, with impressive performance!
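To make the combined objective concrete, here is a minimal sketch of the ORPO loss as I understand it from the paper: the usual SFT negative log-likelihood on the chosen answer, plus a small penalty term built from the odds ratio of the chosen versus rejected responses. The function names and the weight `lam` are illustrative, not taken from any library.

```python
import math

def log_odds(avg_logprob):
    # odds(p) = p / (1 - p), with p = exp(mean token log-prob) of a response
    p = math.exp(avg_logprob)
    return math.log(p / (1.0 - p))

def orpo_loss(nll_chosen, avg_lp_chosen, avg_lp_rejected, lam=0.1):
    # L_ORPO = L_SFT + lam * L_OR, where L_OR = -log sigmoid(log-odds ratio)
    ratio = log_odds(avg_lp_chosen) - log_odds(avg_lp_rejected)
    l_or = -math.log(1.0 / (1.0 + math.exp(-ratio)))
    return nll_chosen + lam * l_or
```

When the model favors the chosen answer, the odds-ratio term shrinks below log 2; when both answers are equally likely, it contributes exactly -log sigmoid(0) = log 2 times `lam`, which is the "small penalty" the paper describes.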
Without further ado, let's use the paper's ORPO method to optimize Gemma 2B.
To jump straight to the code, visit the GitHub repository:
keyonzeng/llm_tuning: large language model tuning examples (github.com)
Laptop configuration and environment
Laptop specs: i9-13900HX / 32 GB RAM, RTX 4090 Laptop GPU / 16 GB VRAM
Main development tools: JetBrains PyCharm, Microsoft VS Code, Jupyter Notebook
Operating system: Windows 11
Overall optimization approach
The basic idea: we apply ORPO optimization to the Gemma 2B model on the argilla/dpo-mix-7k dataset, using Hugging Face's Transformers, Transformer Reinforcement Learning (TRL), and Parameter-Efficient Fine-Tuning (PEFT) frameworks together with QLoRA and TRL's ORPOTrainer, and monitor training metrics with wandb.
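As a rough sanity check on why this fits in 16 GB of VRAM, here is a back-of-the-envelope estimate. The ~2.5B parameter count for Gemma 2B is my assumption; activations, the LoRA adapters' optimizer state, and CUDA overhead come on top of the quantized weights.

```python
# Rough VRAM needed just for the base weights under 4-bit NF4 quantization
n_params = 2.5e9        # Gemma 2B is roughly 2.5B parameters (assumption)
bits_per_param = 4      # NF4 stores each weight in 4 bits
weights_gb = n_params * bits_per_param / 8 / 1e9
print(f"~{weights_gb:.2f} GB for quantized weights")
```

About 1.25 GB for the quantized weights leaves plenty of the 16 GB budget for activations and gradient checkpointing, which is what makes single-GPU training feasible here.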
Step-by-step approach
- Prerequisites: CUDA 12.3, PyTorch 2.2.1, PyCharm/VS Code, Jupyter Notebook. For installation and usage, see: How to install and run multiple CUDA versions on Windows?
- Set environment variables. The project uses wandb to monitor training, so request an API key on the wandb site; likewise, uploading the model to Hugging Face requires a Hugging Face key. Store both keys in a .env file, like so:
wandb="xxxx"
huggingface="xxxxxx"
- Install the required dependencies: python-dotenv, bitsandbytes, peft, trl (install from source; only the latest version includes ORPOTrainer), accelerate, datasets, transformers. Since bitsandbytes has no official Windows build, use the wheel package from GitHub.
#install the required dependencies
!pip3 install -q -U python-dotenv
!pip3 install -q -U https://github.com/jakaline-dev/bitsandbytes_win/releases/download/0.42.0/bitsandbytes-0.42.0-cp311-cp311-win_amd64.whl
!pip3 install -q -U peft==0.8.2
!pip3 install -q -U git+https://github.com/huggingface/trl.git
!pip3 install -q -U accelerate==0.27.2
!pip3 install -q -U datasets==2.17.0
!pip3 install -q -U transformers==4.40.0.dev0
- Because the model files are large, download the gemma-2b files from Hugging Face to a local folder in advance (access from mainland China is unstable; you may need a proxy). The file page is google/gemma-2b at main (huggingface.co); download the files listed there.
import torch
torch.cuda.is_available(), torch.version.cuda
# load argilla/dpo-mix-7k dataset
from datasets import load_dataset
#dataset = load_dataset("argilla/distilabel-capybara-dpo-7k-binarized",split="train")
#dataset2 = load_dataset("allenai/ultrafeedback_binarized_cleaned",split="train")
dataset = load_dataset("argilla/dpo-mix-7k",split="train")
dataset[0]["chosen"][0]["content"]
#dataset.to_csv("a.csv")
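For orientation, each dpo-mix-7k record stores its "chosen" and "rejected" fields as full chat transcripts (lists of role/content messages), which is why the code above indexes `[0]["content"]`. A schematic record with made-up contents:

```python
# Schematic shape of one argilla/dpo-mix-7k record (contents are made up)
example = {
    "chosen": [
        {"role": "user", "content": "What is 2+2?"},
        {"role": "assistant", "content": "2+2 equals 4."},
    ],
    "rejected": [
        {"role": "user", "content": "What is 2+2?"},
        {"role": "assistant", "content": "5."},
    ],
}
# Both transcripts open with the same user prompt; only the answers differ
assert example["chosen"][0] == example["rejected"][0]
assert example["chosen"][1]["role"] == "assistant"
```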
# format the dataset into prompt/chosen/rejected fields
from datasets import load_dataset
from transformers import AutoTokenizer
def chatml_format(example):
    message = {"role": "user", "content": example['chosen'][0]['content']}
    # Format instruction
    prompt = tokenizer.apply_chat_template([message], tokenize=False, add_generation_prompt=True)
    # Format chosen answer
    chosen = example['chosen'][1]['content'] + tokenizer.eos_token
    # Format rejected answer
    rejected = example['rejected'][1]['content'] + tokenizer.eos_token
    return {
        "prompt": prompt,
        "chosen": chosen,
        "rejected": rejected,
    }
# Load dataset
dataset = load_dataset("argilla/dpo-mix-7k",split="train")
#dataset = load_dataset("argilla/distilabel-intel-orca-dpo-pairs", split="train")
#dataset = dataset.filter(
# lambda r:
# r["status"] != "tie" and
# r["chosen_score"] >= 5
# and not r["in_gsm8k_train"]
#)
# Save columns
original_columns = dataset.column_names
# Tokenizer
model_name = "c:/ai/models/gemma"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"
# Format dataset
dataset = dataset.map(
chatml_format,
remove_columns=original_columns
)
# Print sample
dataset[1]
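After the map, each record should look roughly like the following. The `<bos>`/`<start_of_turn>`/`<end_of_turn>` markers reflect my understanding of Gemma's chat template; the exact string in practice is whatever `tokenizer.apply_chat_template` emits.

```python
# Illustrative output of chatml_format for one record (a hand-written literal,
# not generated by the tokenizer)
record = {
    "prompt": "<bos><start_of_turn>user\nWhat is 2+2?<end_of_turn>\n<start_of_turn>model\n",
    "chosen": "2+2 equals 4.<eos>",
    "rejected": "5.<eos>",
}
# The prompt ends where the model's turn begins; both answers carry EOS
assert record["prompt"].endswith("<start_of_turn>model\n")
assert record["chosen"].endswith("<eos>")
```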
- Optimize the model with ORPOTrainer and QLoRA. Because the Gemma 2B model files are large and consume a lot of VRAM, I ran only one epoch. Adjust the key parameters to your hardware: per_device_train_batch_size=1, gradient_accumulation_steps=16, learning_rate=5e-5, num_train_epochs=1, optim="adamw_bnb_8bit", max_prompt_length=256, max_length=1024. You can tune these further during training.
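A quick estimate of how many optimizer steps one epoch takes with these settings. The ~6,750-example size of the dpo-mix-7k train split is my assumption:

```python
# Effective batch size and approximate optimizer steps per epoch
per_device_train_batch_size = 1
gradient_accumulation_steps = 16
effective_batch = per_device_train_batch_size * gradient_accumulation_steps
train_examples = 6750   # approximate dpo-mix-7k train split size (assumption)
steps_per_epoch = train_examples // effective_batch
print(effective_batch, steps_per_epoch)
```

With an effective batch of 16, one epoch works out to roughly 420 steps, which matches the scale of run reported later in this post.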
# Use ORPOTrainer and QLoRA to tune Gemma 2B
import os
import gc
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, PeftModel
import wandb
from dotenv import load_dotenv, find_dotenv
from trl import ORPOTrainer
from trl import ORPOConfig
# init env
env = load_dotenv(find_dotenv())
hf_token = os.getenv("huggingface")
wb_token = os.getenv("wandb")
wandb.login(key=wb_token)
# local model path
local_model_path = "c:/ai/models/gemma"
# LoRA configuration
peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=['k_proj', 'gate_proj', 'v_proj', 'up_proj', 'q_proj', 'o_proj', 'down_proj']
)
# Model to fine-tune
model = AutoModelForCausalLM.from_pretrained(
    local_model_path,
    torch_dtype=torch.bfloat16,
    #torch_dtype="auto",
    trust_remote_code=True,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type='nf4',
    )
)
model.config.use_cache = False
new_model = "lion-gemma-2b"
torch.cuda.empty_cache()
# Training arguments
training_args = ORPOConfig(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={'use_reentrant': True},
    remove_unused_columns=False,
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    #max_steps=400,
    num_train_epochs=1,
    save_strategy="no",
    logging_steps=1,
    output_dir=new_model,
    optim="adamw_bnb_8bit",
    warmup_steps=80,
    bf16=True,
    max_prompt_length=256,
    max_length=1024,
    report_to="wandb",
)
# Create ORPO trainer
orpo_trainer = ORPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    peft_config=peft_config
)
# Fine-tune model with ORPO
orpo_trainer.train()
Training took a little over 3 hours, with the following result:
TrainOutput(global_step=421, training_loss=1.4132320338643363, metrics={'train_runtime': 11275.7226, 'train_samples_per_second': 0.599, 'train_steps_per_second': 0.037, 'total_flos': 0.0, 'train_loss': 1.4132320338643363, 'epoch': 1.0})
The wandb metrics dashboard is shown below:
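The reported metrics are internally consistent; a quick cross-check using the numbers from the TrainOutput above and the effective batch size of 16:

```python
# Cross-check the reported TrainOutput numbers
train_runtime = 11275.7226   # seconds, from TrainOutput
global_step = 421
effective_batch = 16         # per_device_train_batch_size * gradient_accumulation_steps
samples = global_step * effective_batch
samples_per_second = samples / train_runtime
steps_per_second = global_step / train_runtime
print(round(samples_per_second, 3), round(steps_per_second, 3))
```

Both derived rates land within rounding distance of the reported 0.599 samples/s and 0.037 steps/s.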
# save tuning checkpoint
final_checkpoint = "gemma_final_checkpoint"
orpo_trainer.model.save_pretrained(final_checkpoint)
tokenizer.save_pretrained(final_checkpoint)
# merge checkpoint with original llm
env = load_dotenv(find_dotenv(), override=True)
hf_token = os.getenv("huggingface")
#Flush memory
del orpo_trainer, model
gc.collect()
torch.cuda.empty_cache()
# Reload model in FP16 (instead of NF4)
base_model = AutoModelForCausalLM.from_pretrained(
    local_model_path,
    return_dict=True,
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(local_model_path)
# Merge base model with the adapter
model = PeftModel.from_pretrained(base_model, final_checkpoint)
model = model.merge_and_unload()
# Save model and tokenizer
model.save_pretrained(new_model)
tokenizer.save_pretrained(new_model)
- Push the model to Hugging Face (depending on your network, you may need a proxy).
# Push them to the HF Hub
#os.environ["HTTPS_PROXY"] ="http://127.0.0.1:7890"
model.push_to_hub(new_model, use_temp_dir=False, token=hf_token)
tokenizer.push_to_hub(new_model, use_temp_dir=False, token=hf_token)
# Test the new llm
from transformers import AutoModelForCausalLM, AutoTokenizer,BitsAndBytesConfig
import torch
new_model = "lion-gemma-2b"
tokenizer = AutoTokenizer.from_pretrained(new_model)
model = AutoModelForCausalLM.from_pretrained(
    new_model,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type='nf4',
    ),
)
messages = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "how can i help you?"},
    {"role": "user", "content": "what is large language model?"},
]
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
tokenized_chat = tokenized_chat.to('cuda')
outputs = model.generate(tokenized_chat, max_new_tokens=128)
print(tokenizer.decode(outputs[0]))
And that's it: optimizing Gemma 2B with ORPO is complete. You can reproduce this on your own laptop or desktop PC.
Code repository: keyonzeng/llm_tuning: large language model tuning examples (github.com)