InstructGLM: Fine-Tuning ChatGLM-6B on Instruction Datasets

520jefferson, 2023-04-11, posted from Japan


InstructGLM

Fine-tuning ChatGLM-6B with LoRA on instruction datasets

https://github.com/yanqiangmiffy/InstructGLM

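Main contents of this project:

2023/4/9: Released LoRA weights trained on 1 million Chinese instruction examples generated by the BELLE project; see output/belle/chatglm-lora.pt.

2023/4/8: Added multi-GPU fine-tuning based on DeepSpeed, which is 8-9 times faster than single-GPU training. For the settings, see "Fine-tuning 3: LoRA fine-tuning with DeepSpeed".

2023/3/28: Open-sourced the LoRA weights obtained by instruction fine-tuning on the alpaca and BELLE data; see output for details.

2023/3/25: Fine-tuned the ChatGLM-6B model with the LoRA technique.

2023/3/23: Improved the gradio-based demo.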

Todo

[x] DeepSpeed support

[ ] Model evaluation: how to assess the quality of the fine-tuned model

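Open-source instruction datasets

Stanford 52k English instruction data

instruction: each of the 52K instructions is unique, and the answers were generated by the text-davinci-003 model.

Chinese instruction data generated by the BELLE project: 0.5M and 1M

1 million examples: https:///datasets/BelleGroup/generated_train_1M_CN

The data is generated from seed prompts by calling the OpenAI API to produce Chinese instructions.

GuanacoDataset, a multilingual instruction dataset

Guanaco is an instruction-following language model trained on Meta's LLaMA 7B model. On top of the original 52K examples from the Alpaca model, its authors added 98,369 entries covering English, Simplified Chinese, Traditional Chinese (Taiwan), Traditional Chinese (Hong Kong), Japanese, and German, as well as a variety of linguistic and grammatical tasks. After retraining and optimizing the model on this richer data, Guanaco is reported to show strong performance and potential in multilingual settings. Project link: https://guanaco-model./

alpaca Chinese instruction fine-tuning dataset

It uses the same JSON format as the original alpaca data. The data was produced by machine translation and self-instruct.

Manually refined Chinese dialogue dataset

This dataset adds manually refined Chinese chat dialogues beyond alpaca. For questions that do not fit a Chinese-language context, the authors re-asked ChatGPT or ERNIE Bot (文心一言) and replaced the alpaca answers with the new responses.

firefly-train-1.1M, a high-quality Chinese multi-task instruction fine-tuning dataset with 1.1M examples, covering 23 common Chinese NLP tasks. For each task, several instruction templates were written by hand to keep the data high-quality and diverse.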

Fine-tuning 1: alpaca English instruction data

This uses the Stanford Alpaca 52k data. The raw data format is as follows:

{
    "instruction": "Evaluate this sentence for spelling and grammar mistakes",
    "input": "He finnished his meal and left the resturant",
    "output": "He finished his meal and left the restaurant."
}

Dataset: https://github.com/tatsu-lab/stanford_alpaca

1. Data preprocessing

Convert the alpaca dataset to jsonl. In this step you can also choose the format of the converted data, for example: ###Instruction:xxx###Input:xxxx###Response:xxx

python cover_alpaca2jsonl.py \
    --data_path data/alpaca_data.json \
    --save_path data/alpaca_data.jsonl
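The conversion step can be sketched roughly as follows. This is a minimal illustration and not the repository's cover_alpaca2jsonl.py: the helper names `format_example` and `convert`, and the output keys `context` and `target`, are assumptions made for this example. Only the ###Instruction / ###Input / ###Response layout comes from the text above.

```python
import json


def format_example(example: dict) -> dict:
    """Turn one alpaca record into a prompt/response pair (hypothetical helper)."""
    context = f"###Instruction:{example['instruction']}"
    # The input field can be empty in alpaca records; skip it in that case.
    if example.get("input"):
        context += f"###Input:{example['input']}"
    context += "###Response:"
    # The key names "context" and "target" are assumptions for this sketch.
    return {"context": context, "target": example["output"]}


def convert(data_path: str, save_path: str) -> int:
    """Read a JSON array of records and write one JSON object per line."""
    with open(data_path, encoding="utf-8") as f:
        examples = json.load(f)
    with open(save_path, "w", encoding="utf-8") as f:
        for example in examples:
            # ensure_ascii=False keeps non-ASCII text readable in the file.
            line = json.dumps(format_example(example), ensure_ascii=False)
            f.write(line + "\n")
    return len(examples)
```

Writing one object per line lets later steps stream the file instead of loading the whole JSON array into memory.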

Tokenize the text in advance to speed up training. Set the maximum text length according to the resources you have available:

python tokenize_dataset_rows.py \
    --jsonl_path data/alpaca_data.jsonl \
    --save_path data/alpaca \
    --max_seq_length 320
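What a pre-tokenization pass like this might do per row is sketched below. Everything here is an assumption for illustration: `tokenize_row` and `toy_encode` are hypothetical names, the `context`/`target` keys are invented, and the source does not say whether the real script truncates over-long text or what its `seq_len` column means. In this sketch the prompt and the response are each truncated to `max_seq_length`, and `seq_len` records the prompt length.

```python
from typing import Callable, Dict, List


def tokenize_row(
    example: Dict[str, str],
    encode: Callable[[str], List[int]],
    max_seq_length: int,
    eos_id: int,
) -> Dict[str, object]:
    """Encode one prompt/response pair into a single id sequence."""
    # Truncate each side independently (an assumption of this sketch).
    prompt_ids = encode(example["context"])[:max_seq_length]
    target_ids = encode(example["target"])[:max_seq_length]
    # Append an end-of-sequence id so the model learns where to stop.
    input_ids = prompt_ids + target_ids + [eos_id]
    # seq_len here is the prompt length, e.g. for masking the prompt in the loss.
    return {"input_ids": input_ids, "seq_len": len(prompt_ids)}


def toy_encode(text: str) -> List[int]:
    """Hypothetical stand-in tokenizer: one id per character."""
    return [ord(ch) for ch in text]


row = tokenize_row({"context": "abc", "target": "de"}, toy_encode, 2, eos_id=0)
print(row)
```

In practice `encode` would be the model's own tokenizer. Doing this once up front means the training loop reads ready-made id lists instead of re-tokenizing text every epoch.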

2. Model training

python train_lora.py \
    --dataset_path data/alpaca \
    --lora_rank 8 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --max_steps 52000 \
    --save_steps 1000 \
    --save_total_limit 2 \
    --learning_rate 2e-5 \
    --fp16 \
    --remove_unused_columns false \
    --logging_steps 50 \
    --output_dir output
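To get a feel for how much data these settings cover, the arithmetic can be worked through as below. This is a back-of-the-envelope sketch: the function names are hypothetical, a single device is assumed, and the dataset size is rounded to the "52k" figure quoted above.

```python
def samples_seen(max_steps: int, per_device_batch: int,
                 grad_accum: int, num_devices: int = 1) -> int:
    """Number of training examples consumed over the whole run."""
    return max_steps * per_device_batch * grad_accum * num_devices


def approx_epochs(total_samples: int, dataset_size: int) -> float:
    """Approximate number of passes over the dataset."""
    return total_samples / dataset_size


# Values taken from the command above; one device is an assumption.
seen = samples_seen(max_steps=52_000, per_device_batch=2, grad_accum=1)
print(seen)                         # 104000
print(approx_epochs(seen, 52_000))  # 2.0
```

Under these assumptions the run amounts to roughly two passes over the alpaca data. Raising the batch size or the number of devices without lowering `--max_steps` increases the number of passes proportionally.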

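Fine-tuning 2: BELLE Chinese instruction data

This dataset contains 543,314 Chinese instruction examples generated by the BELLE project. Each record has an input field and a target field, for example:

input: 用一句话描述地球为什么是独一无二的。\n
target: 地球上有适宜生命存在的条件和多样化的生命形式

(In English, input: "Describe in one sentence why the Earth is unique."; target: "The Earth has conditions suitable for life and diverse forms of life.")

Dataset: https:///datasets/BelleGroup/generated_train_0.5M_CN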

1. Data preprocessing

Convert the BELLE dataset to jsonl:

python cover_alpaca2jsonl.py \
    --dataset_name BelleGroup/generated_train_0.5M_CN \
    --save_path data/belle_data.jsonl

Text length statistics:

statistic    input_len        target_len
count        543314.000000    543314.000000
mean         83.536944        121.079030
std          95.665178        165.472722
min          4.000000         1.000000
25%          33.000000        27.000000
50%          51.000000        67.000000
75%          88.000000        151.000000
90%          194.000000       296.000000
max          4410.000000      9463.000000
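Statistics like these help when choosing a value for --max_seq_length. The sketch below uses only the standard library to summarize a list of lengths and to estimate what share of examples would exceed a given limit. The function names are hypothetical, the sample lengths are made up, and the source does not state whether its lengths are counted in characters or in tokens.

```python
import statistics
from typing import Dict, List


def summarize_lengths(lengths: List[int]) -> Dict[str, float]:
    """Summary similar in spirit to the table above (quartiles only)."""
    # method="inclusive" interpolates linearly between observed values.
    q1, q2, q3 = statistics.quantiles(lengths, n=4, method="inclusive")
    return {
        "count": len(lengths),
        "mean": statistics.mean(lengths),
        "min": min(lengths),
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": max(lengths),
    }


def share_over_limit(lengths: List[int], limit: int) -> float:
    """Fraction of examples strictly longer than the limit."""
    return sum(1 for n in lengths if n > limit) / len(lengths)


# Made-up lengths, for illustration only.
sample = [12, 40, 51, 88, 194, 320, 900]
print(summarize_lengths(sample))
print(share_over_limit(sample, 320))
```

If the share above the limit is small, truncating or dropping those examples costs little data while keeping memory use per batch bounded.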

Tokenization:

python tokenize_dataset_rows.py \
    --jsonl_path data/belle_data.jsonl \
    --save_path data/belle \
    --max_seq_length 320

Converted data:

     input_ids                                            seq_len
0    [20005, 92863, 20012, 20005, 83864, 87784, 871...    20
1    [20005, 92863, 20012, 20005, 91432, 86523, 885...    80
2    [20005, 92863, 20012, 104069, 85056, 86334, 89...    61
3    [20005, 92863, 20012, 91492, 89122, 83866, 852...    24
4    [20005, 92863, 20012, 20005, 83834, 99899, 927...    24

2. Model training

Training from the original chatglm-6b:

python train_lora.py \
    --dataset_path data/belle \
    --lora_rank 8 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --max_steps 52000 \
    --save_steps 1000 \
    --save_total_limit 2 \
    --learning_rate 2e-5 \
    --fp16 \
    --remove_unused_columns false \
    --logging_steps 50 \
    --output_dir output
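The --lora_rank 8 flag controls how many parameters LoRA trains. For one weight matrix of shape d_out x d_in, LoRA keeps the original matrix frozen and learns two small factors, B (d_out x r) and A (r x d_in). The sketch below works out the parameter counts. The function names are hypothetical, and the 4096 x 4096 layer shape is an illustrative assumption, not a figure taken from this project.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two low-rank factors: B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


def full_params(d_out: int, d_in: int) -> int:
    """Parameters in the frozen dense weight matrix."""
    return d_out * d_in


# Hypothetical square projection layer; the real shapes depend on the model.
d = 4096
trainable = lora_trainable_params(d, d, rank=8)
full = full_params(d, d)
print(trainable)                     # 65536
print(full)                          # 16777216
print(f"{trainable / full:.4%}")     # 0.3906%
```

Under these assumed shapes, rank 8 trains well under 1% of the layer's weights, which is why the resulting chatglm-lora.pt files are small compared with the base model.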

Continuing fine-tuning from the alpaca LoRA weights:

python train_lora.py \
    --dataset_path data/belle \
    --lora_rank 8 \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 1 \
    --max_steps 52000 \
    --save_steps 10000 \
    --save_total_limit 2 \
    --learning_rate 2e-5 \
    --fp16 \
    --remove_unused_columns false \
    --logging_steps 50 \
    --output_dir output/belle \
    --is_resume True \
    --resume_path output/alpaca/chatglm-lora.pt

Fine-tuning 3: LoRA fine-tuning with DeepSpeed

Supports multi-GPU training with ZeRO; training is roughly 8 times faster:

accelerate launch --config_file config/default_config.yaml train_new.py

Experimental environment

Install the required packages:

pip install -r requirements.txt -i https://pypi.tuna./simple

GPUs: 2x A100 80G

Experimental results

Trained LoRA weights:

output
├─alpaca: LoRA weights fine-tuned on the 52k data
├─belle: LoRA weights fine-tuned on the 52k data, then fine-tuned on BELLE for 52,000 steps
└─belle_raw: weights fine-tuned on BELLE only, 104,000 steps

Link: https://pan.baidu.com/s/1c-zRSEUn4151YLoowPN4YA?pwd=hxbr


Results of fine-tuning on the alpaca data

Results of fine-tuning on the BELLE data

Reference

Many thanks to the following authors for generously open-sourcing their work:

https://github.com/mymusise/ChatGLM-Tuning

https:///BelleGroup/BELLE-7B-2M

https://github.com/LianjiaTech/BELLE

https:///datasets/BelleGroup/generated_train_0.5M_CN

https:///datasets/JosephusCheung/GuanacoDataset

https://guanaco-model./

https://github.com/carbonz0/alpaca-chinese-dataset

https://github.com/THUDM/ChatGLM-6B

https:///THUDM/chatglm-6b

https://github.com/lich99/ChatGLM-finetune-LoRA

Bugs

Upgrading gcc:

yum install centos-release-scl -y
yum install devtoolset-9 -y
# Temporarily override the system's default gcc
scl enable devtoolset-9 bash
# Check the gcc version
gcc -v
