LLMs | Vicuna: "Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality" — Translation and Commentary
Summary: The authors present Vicuna-13B, an open-source chatbot produced by fine-tuning the LLaMA base model on user-shared conversations collected from ShareGPT. According to a preliminary GPT-4 evaluation, Vicuna-13B reaches 90% of the quality of ChatGPT and Bard, surpassing other open-source models such as LLaMA and Alpaca. The authors propose using GPT-4 as a judge to assess different chatbots through the answers and scores it produces; despite the limitations of this approach, it demonstrates the potential of automated evaluation. Training Vicuna-13B is cheap, roughly $300, thanks to memory optimizations, improved handling of multi-round conversations, and cost reduction via spot instances. The model's code, weights, and an online demo are open to the public. Finally, the authors stress Vicuna's remaining limitations, such as trouble with tasks involving reasoning and mathematics and the lack of safety optimization, but note that it can serve as an open starting point for future research on these problems.
| Date | March 30, 2023 |
| Source | Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality — LMSYS Org |
| Authors | The Vicuna Team, a joint effort with collaborators from multiple institutions, including UC Berkeley, CMU, Stanford, UC San Diego, and MBZUAI |
"Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality" — Translation and Commentary
Vicuna (generated by stable diffusion 2.1)
We introduce Vicuna-13B, an open-source chatbot trained by fine-tuning LLaMA on user-shared conversations collected from ShareGPT. Preliminary evaluation using GPT-4 as a judge shows Vicuna-13B achieves more than 90%* of the quality of OpenAI ChatGPT and Google Bard while outperforming other models like LLaMA and Stanford Alpaca in more than 90%* of cases. The cost of training Vicuna-13B is around $300. The code and weights, along with an online demo, are publicly available for non-commercial use.

*According to a fun and non-scientific evaluation with GPT-4. Further rigorous evaluation is needed.
How Good is Vicuna?
Figure 1. Relative Response Quality Assessed by GPT-4*
After fine-tuning Vicuna with 70K user-shared ChatGPT conversations, we discover that Vicuna becomes capable of generating more detailed and well-structured answers compared to Alpaca (see examples below), with quality on par with ChatGPT.

However, evaluating chatbots is never a simple task. With recent advancements in GPT-4, we are curious whether its capabilities have reached a human-like level that could enable an automated evaluation framework for benchmark generation and performance assessments. Our initial finding indicates that GPT-4 can produce highly consistent ranks and detailed assessments when comparing chatbots' answers (see the example of GPT-4 judgment above). Preliminary evaluations based on GPT-4, summarized in Figure 1, show that Vicuna achieves 90%* of the capability of Bard/ChatGPT. While this proposed framework shows the potential to automate chatbot assessment, it is not yet a rigorous approach. Building an evaluation system for chatbots remains an open question requiring further research. More details are provided in the evaluation section.
Online Demo
Overview
Figure 2. Workflow Overview
The rapid advancement of large language models (LLMs) has revolutionized chatbot systems, resulting in unprecedented levels of intelligence as seen in OpenAI's ChatGPT. However, despite its impressive performance, the training and architecture details of ChatGPT remain unclear, hindering research and open-source innovation in this field. Inspired by the Meta LLaMA and Stanford Alpaca projects, we introduce Vicuna-13B, an open-source chatbot backed by an enhanced dataset and an easy-to-use, scalable infrastructure. By fine-tuning a LLaMA base model on user-shared conversations collected from ShareGPT.com, Vicuna-13B has demonstrated competitive performance compared to other open-source models like Stanford Alpaca. This blog post provides a preliminary evaluation of Vicuna-13B's performance and describes its training and serving infrastructure. We also invite the community to interact with our online demo to test the capabilities of this chatbot.

Figure 2 provides an overview of our work. To begin, we collected around 70K conversations from ShareGPT.com, a website where users can share their ChatGPT conversations. Next, we enhanced the training scripts provided by Alpaca to better handle multi-round conversations and long sequences. The training was done with PyTorch FSDP on 8 A100 GPUs in one day. For serving the demo, we implemented a lightweight distributed serving system. We conducted a preliminary evaluation of the model quality by creating a set of 80 diverse questions and utilizing GPT-4 to judge the model outputs. To compare two different models, we combine the outputs from each model into a single prompt for each question; the prompts are then sent to GPT-4, which assesses which model provides better responses (a minimal sketch of this pairwise judging step appears after Table 1). A detailed comparison of LLaMA, Alpaca, ChatGPT, and Vicuna is shown in Table 1 below.
Table 1. Comparison between several notable models
| Model Name | LLaMA | Alpaca | Vicuna | Bard/ChatGPT |
| --- | --- | --- | --- | --- |
| Dataset | Publicly available datasets (1T tokens) | Self-instruct from davinci-003 API (52K samples) | User-shared conversations (70K samples) | N/A |
| Training code | N/A | Available | Available | N/A |
| Evaluation metrics | Academic benchmark | Author evaluation | GPT-4 assessment | Mixed |
| Training cost (7B) | 82K GPU-hours | $500 (data) + $100 (training) | $140 (training) | N/A |
| Training cost (13B) | 135K GPU-hours | N/A | $300 (training) | N/A |
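To make the pairwise judging step concrete, here is a minimal sketch of how the two models' answers to a question could be combined into a single prompt and sent to GPT-4. It uses the pre-1.0 `openai` Python client; the prompt wording and rating rubric are illustrative assumptions, not the Vicuna team's actual template.

```python
# A minimal sketch of the pairwise judging step: both models' answers to a
# question are combined into a single prompt and sent to GPT-4. The prompt
# wording and rating rubric below are illustrative assumptions.
import openai  # pre-1.0 client; assumes OPENAI_API_KEY is set in the environment

JUDGE_TEMPLATE = """[Question]
{question}

[Assistant 1's Answer]
{answer1}

[Assistant 2's Answer]
{answer2}

Rate the helpfulness, relevance, accuracy, and level of detail of each answer
on a scale of 1 to 10. Output the two scores on the first line, separated by
a space, followed by a short explanation of your ratings."""

def judge_pair(question: str, answer1: str, answer2: str) -> str:
    """Ask GPT-4 to compare two chatbots' answers to the same question."""
    prompt = JUDGE_TEMPLATE.format(question=question, answer1=answer1, answer2=answer2)
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,  # keep the judge close to deterministic
    )
    return response["choices"][0]["message"]["content"]
```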
Training: recipe built on top of Alpaca + memory optimizations + cost reduction via spot instances
Vicuna is created by fine-tuning a LLaMA base model on approximately 70K user-shared conversations gathered from ShareGPT.com with public APIs. To ensure data quality, we convert the HTML back to markdown and filter out some inappropriate or low-quality samples. Additionally, we divide lengthy conversations into smaller segments that fit the model's maximum context length.

Our training recipe builds on top of Stanford's Alpaca with the following improvements (a simplified sketch of the data handling follows the list).

- Memory optimizations: To enable Vicuna's understanding of long context, we expand the max context length from 512 in Alpaca to 2048, which substantially increases GPU memory requirements. We tackle the memory pressure with gradient checkpointing and flash attention.
- Multi-round conversations: We adjust the training loss to account for multi-round conversations and compute the fine-tuning loss solely on the chatbot's output.
- Cost reduction via spot instances: The 40x larger dataset and 4x longer sequences pose a considerable challenge in training expenses. We employ SkyPilot managed spot to reduce the cost, leveraging cheaper spot instances with auto-recovery from preemptions and automatic zone switching. This solution slashes the cost of training the 7B model from $500 to around $140, and the 13B model from around $1K to $300.
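The sketch below illustrates the two data-handling ideas under assumed data structures (`turn_lengths` and `turns` are hypothetical; this is not FastChat's actual code): splitting long conversations into segments that fit the context window, and masking the loss so that only the chatbot's tokens are trained on.

```python
# Simplified sketch of the data handling described above, under assumed
# data structures (not FastChat's actual code).
from typing import List, Tuple

IGNORE_INDEX = -100   # label value that PyTorch's CrossEntropyLoss ignores
MAX_CONTEXT = 2048    # expanded from Alpaca's 512

def split_conversation(turn_lengths: List[int], max_len: int = MAX_CONTEXT) -> List[List[int]]:
    """Greedily group whole turns into segments of at most max_len tokens.

    A single turn longer than max_len becomes its own segment; real code
    would additionally truncate such turns.
    """
    segments, current, used = [], [], 0
    for n in turn_lengths:
        if used + n > max_len and current:
            segments.append(current)
            current, used = [], 0
        current.append(n)
        used += n
    if current:
        segments.append(current)
    return segments

def build_labels(token_ids: List[int], turns: List[Tuple[str, int]]) -> List[int]:
    """Copy token_ids into labels, masking every non-assistant span.

    `turns` is a list of (role, num_tokens) pairs covering token_ids in order,
    so the cross-entropy loss is computed solely on the chatbot's output.
    """
    labels, pos = list(token_ids), 0
    for role, n in turns:
        if role != "assistant":
            labels[pos:pos + n] = [IGNORE_INDEX] * n
        pos += n
    return labels
```

Note that masked user tokens still provide context through attention; they simply contribute no gradient, which is what "computing the fine-tuning loss solely on the chatbot's output" amounts to.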
Serving: distributed workers + flexible plug-in of GPU nodes
We build a serving system that is capable of serving multiple models with distributed workers. It supports flexible plug-in of GPU workers from both on-premise clusters and the cloud. By utilizing a fault-tolerant controller and the managed spot feature in SkyPilot, this serving system can work well with cheaper spot instances from multiple clouds to reduce serving costs. It is currently a lightweight implementation, and we are working on integrating more of our latest research into it.
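As an illustration of the controller/worker pattern described above, here is a toy dispatcher that tracks which workers serve which model and retries when a worker is unreachable. This is not FastChat's implementation; the `/generate` endpoint and its JSON schema are hypothetical.

```python
# Toy controller/worker dispatch: route a request to a random live worker,
# dropping workers that fail (e.g. a preempted spot instance).
import random
import requests

workers = {}  # model name -> list of worker base URLs

def register_worker(model: str, url: str) -> None:
    """Called by a GPU worker (on-premise or cloud) when it comes online."""
    workers.setdefault(model, []).append(url)

def generate(model: str, prompt: str, max_retries: int = 3) -> str:
    """Route a request to a random live worker, retrying on failure."""
    for _ in range(max_retries):
        candidates = workers.get(model, [])
        if not candidates:
            raise RuntimeError(f"no live workers for {model}")
        url = random.choice(candidates)
        try:
            resp = requests.post(f"{url}/generate", json={"prompt": prompt}, timeout=60)
            resp.raise_for_status()
            return resp.json()["text"]
        except requests.RequestException:
            # Treat the worker as failed; a real controller would re-probe it later.
            candidates.remove(url)
    raise RuntimeError("all retries exhausted")
```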
How To Evaluate a Chatbot? — a GPT-4-based framework to automate chatbot performance assessment
Figure 3. Response Comparison Assessed by GPT-4
Evaluating AI chatbots is a challenging task, as it requires examining language understanding, reasoning, and context awareness. With AI chatbots becoming more advanced, current open benchmarks may no longer suffice. For instance, the evaluation dataset used in Stanford's Alpaca, self-instruct, can be effectively answered by SOTA chatbots, making it difficult for humans to discern differences in performance. Further limitations include training/test data contamination and the potentially high cost of creating new benchmarks. To tackle these issues, we propose an evaluation framework based on GPT-4 to automate chatbot performance assessment.

First, we devised eight question categories, such as Fermi problems, roleplay scenarios, and coding/math tasks, to test various aspects of a chatbot's performance. Through careful prompt engineering, GPT-4 is able to generate diverse, challenging questions that baseline models struggle with. We select ten questions per category and collect answers from five chatbots: LLaMA, Alpaca, ChatGPT, Bard, and Vicuna. We then ask GPT-4 to rate the quality of their answers based on helpfulness, relevance, accuracy, and detail. We discover that GPT-4 can produce not only relatively consistent scores but also detailed explanations of why such scores are given (detailed examples link). However, we also notice that GPT-4 is not very good at judging coding/math tasks.

Figure 3 displays the comparison results between all baselines and Vicuna. GPT-4 prefers Vicuna over state-of-the-art open-source models (LLaMA, Alpaca) in more than 90% of the questions, and it achieves competitive performance against proprietary models (ChatGPT, Bard). In 45% of the questions, GPT-4 rates Vicuna's response as better than or equal to ChatGPT's. As GPT-4 assigns a quantitative score to each response on a scale of 10, we calculate the total score for each (baseline, Vicuna) comparison pair by adding up the scores obtained by each model on the 80 questions. As shown in Table 2, Vicuna's total score is 92% of ChatGPT's (see the short calculation after the table). Despite recent advancements, these chatbots still face limitations, such as struggling with basic math problems or having limited coding ability.

While this proposed evaluation framework demonstrates the potential for assessing chatbots, it is not yet a rigorous or mature approach, as large language models are prone to hallucinate. Developing a comprehensive, standardized evaluation system for chatbots remains an open question requiring further research.
Table 2. Total Scores Assessed by GPT-4.
| Baseline | Baseline Score | Vicuna Score |
| --- | --- | --- |
| LLaMA-13B | 513.0 | 694.0 |
| Alpaca-13B | 583.0 | 704.0 |
| Bard | 664.0 | 655.5 |
| ChatGPT | 693.0 | 638.0 |
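The 92% headline figure follows directly from the table: each total is the sum of per-question GPT-4 ratings over the 80 questions, and the ratio is taken within the ChatGPT-vs-Vicuna comparison pair.

```python
# Reproducing the headline number from Table 2.
chatgpt_total, vicuna_total = 693.0, 638.0
print(f"Vicuna / ChatGPT = {vicuna_total / chatgpt_total:.1%}")  # -> 92.1%
```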
Limitations — weak at tasks involving reasoning or mathematics
We have noticed that, similar to other large language models, Vicuna has certain limitations. For instance, it is not good at tasks involving reasoning or mathematics, and it may have limitations in accurately identifying itself or ensuring the factual accuracy of its outputs. Additionally, it has not been sufficiently optimized to guarantee safety or mitigate potential toxicity or bias. To address the safety concerns, we use the OpenAI moderation API to filter out inappropriate user inputs in our online demo. Nonetheless, we anticipate that Vicuna can serve as an open starting point for future research to tackle these limitations.
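A minimal sketch of the input filtering mentioned above, using the OpenAI moderation endpoint via the pre-1.0 `openai` Python client; the rejection message is an assumption, not the demo's actual wording.

```python
# Screening user input with the OpenAI moderation endpoint before it reaches
# the model, as the online demo does.
import openai  # pre-1.0 client; assumes OPENAI_API_KEY is set in the environment

def is_flagged(user_input: str) -> bool:
    """Return True if the moderation endpoint flags the input as inappropriate."""
    result = openai.Moderation.create(input=user_input)
    return result["results"][0]["flagged"]

if is_flagged("example user message"):
    print("Your input violates our usage policy; please rephrase.")  # assumed wording
```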
Release發(fā)行
In our first release, we will share the training, serving, and evaluation code on a GitHub repo: https://github.com/lm-sys/FastChat. We have also released the Vicuna-13B model weights; please find the instructions here. There is no plan to release the dataset. Join our Discord server and follow our Twitter to get the latest updates.
License
The online demo is a research preview intended for non-commercial use only, subject to the model license of LLaMA, the Terms of Use of the data generated by OpenAI, and the Privacy Practices of ShareGPT. Please contact us if you find any potential violation. The code is released under the Apache License 2.0.
The Team
This is a joint effort with collaborators from multiple institutions, including UC Berkeley, CMU, Stanford, UC San Diego, and MBZUAI.
Students (alphabetical order):
Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang

Advisors (alphabetical order): Joseph E. Gonzalez, Ion Stoica, Eric P. Xing
Acknowledgment
We would like to thank Xinyang Geng, Hao Liu, and Eric Wallace from BAIR, and Xuecheng Li and Tianyi Zhang from the Stanford Alpaca team, for their insightful discussion and feedback, and Qirong Ho from MBZUAI for providing support on the serving cluster. Please check out a blog post from BAIR about a concurrent effort on their chatbot, Koala.
Citation
@misc{vicuna2023,
title = {Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90\%* ChatGPT Quality},
url = {https://lmsys.org/blog/2023-03-30-vicuna/},
author = {Chiang, Wei-Lin and Li, Zhuohan and Lin, Zi and Sheng, Ying and Wu, Zhanghao and Zhang, Hao and Zheng, Lianmin and Zhuang, Siyuan and Zhuang, Yonghao and Gonzalez, Joseph E. and Stoica, Ion and Xing, Eric P.},
month = {March},
year = {2023}
}