A collection of open-source instruction-tuning datasets for training chat-based LLMs (ChatGPT, LLaMA, Alpaca).
This is an ongoing project. The format and explanation of the following contents will be updated soon. (By Zhihan)
Dataset | Size | Language | Generation method |
---|---|---|---|
tatsu-lab/stanford_alpaca | 52002 | EN | Self-instruct with 175 human-written seed tasks, using text-davinci-003. |
gururise/AlpacaDataCleaned | 51713 | EN | A cleaned version of the Stanford Alpaca dataset that fixes issues such as hallucinations, merged instructions, and empty outputs. |
carbonz0/alpaca-chinese-dataset | 20456 | CH | Machine-translate the Stanford Alpaca dataset into Chinese, then apply self-instruct. |
hikarming/alpaca_chinese_dataset | 19442 | CH | Translate the Stanford Alpaca dataset into Chinese with ChatGPT, then have humans check the translations. |
ymcui/Chinese-LLaMA-Alpaca | 51458 | CH | Translate the Stanford Alpaca dataset into Chinese with the ChatGPT API, discarding some examples. |
A-baoYang/alpaca-7b-chinese/.../alpaca-zhTW.json | 20465 | TC | Translate the Stanford Alpaca dataset into Traditional Chinese using OpenCC. |
A-baoYang/alpaca-7b-chinese/.../alpaca-en-zh.json | 124469 | CH, EN | Combine the English instruction/input with Traditional Chinese output from the ChatGPT API (gpt-3.5-turbo). |
ntunlplab/traditional-chinese-alpaca/.../alpaca-tw_en_instruction.json | 52002 | CH, EN | A Traditional Chinese version of the Alpaca dataset in which the instruction part is left in English. |
ntunlplab/traditional-chinese-alpaca/.../alpaca-tw_en-align.json | 52002 | CH, EN | A Traditional Chinese version of the Alpaca dataset in which each instruction has both an English and a Traditional Chinese version. |
LC1332/Chinese-alpaca-lora | 51672 | CH | Translate the Stanford Alpaca dataset into Chinese with the ChatGPT API. |
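
Several rows above refer to the `instruction`, `input`, and `output` parts of a record. As a rough sketch, assuming a file is a JSON array of objects with exactly those three fields (an assumption; the schema of each repository is not verified here), the record count in the Size column and the empty outputs that AlpacaDataCleaned targets could be checked like this. The `summarize_records` helper and the sample data are hypothetical.

```python
import json


def summarize_records(records):
    """Count all records and those whose `output` field is empty or missing."""
    total = len(records)
    # Treat missing, None, and whitespace-only outputs as empty.
    empty_outputs = sum(
        1 for r in records if not (r.get("output") or "").strip()
    )
    return {"total": total, "empty_outputs": empty_outputs}


# Hypothetical two-record sample in the assumed Alpaca-style layout.
sample = json.loads(
    '[{"instruction": "Add the numbers.", "input": "1, 2", "output": "3"},'
    ' {"instruction": "Say hi.", "input": "", "output": ""}]'
)
print(summarize_records(sample))
```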

Dataset | Size | Language | Domain | Generation method |
---|---|---|---|---|
hikarming/alpaca_chinese_dataset | 226 | CH | topic-specific | Human-written Chinese instructions on various topics, such as business management, education, Romance of the Three Kingdoms, etc. |
sahil280114/codealpaca | 20023 | EN | Code | Self-instruct with prompts focused on code generation/editing/optimization tasks, using text-davinci-003. |
XueFuzhao/InstructionWild | 52191 (479 seeds) | CH, EN | | Collect 429 instructions from ChatGPT usage screenshots and release both English and Chinese versions, using text-davinci-003. |
BelleGroup/train_0.5M_CN | 500000 (175 seeds) | CH | | Self-instruct with 175 Chinese seed tasks translated from the seed tasks in Stanford Alpaca dataset, using text-davinci-003. |
BelleGroup/train_1M_CN | 1000000 (175 seeds) | CH | | Self-instruct with 175 Chinese seed tasks translated from the seed tasks in Stanford Alpaca dataset. |
BelleGroup/school_math_0.25M | 250000 | CH | Math | Chinese math questions and answers generated by ChatGPT. |
BelleGroup/multiturn_chat_0.8M | 800000 | CH | Multiturn Chat | The instruction contains the dialog history, with turns marked by `Human:` and `Assistant:`; the output contains the assistant's current reply. |
GuanacoDataset/.../guanaco_chat_all-utf8.json | 48967 | CH, DE, EN, JA, TC | Multiturn Chat, Multi-lingual | The dataset for the Guanaco model builds upon the 175 tasks from the Alpaca model by providing rewrites of seed tasks in different languages and adding new tasks specifically designed for English grammar analysis, natural language understanding, cross-lingual self-awareness, and explicit content recognition. |
GuanacoDataset/.../guanaco_non_chat-utf8.json | 279644 | CH, DE, EN, JA, TC | Multi-lingual | The original 175 tasks were translated into 4 versions and regenerated independently. |
GuanacoDataset/.../guanaco_non_chat_mini_52K-utf8.json | 52224 | CH, DE, EN, JA, TC | Multi-lingual | A 52K mini version of the multi-lingual dataset. |
GuanacoDataset/.../general_ans-utf8.json | 75899 | CH, DE, EN, JA, TC | paragraph-level QA, Multi-lingual | |
GuanacoDataset/.../general_questions-utf8.json | 82867 | CH, DE, EN, JA, TC | paragraph-level QA, Multi-lingual | Similar questions are combined to form a tree-like structure, and graph theory algorithms are used to process user questions, content summaries, and contextual logic. |
GuanacoDataset/.../paper_answers-utf8.json | 23393 | CH, DE, EN, JA, TC | paragraph-level QA, paper QA, Multi-lingual | |
GuanacoDataset/.../paper_questions-utf8.json | 23840 | CH, DE, EN, JA, TC | paragraph-level QA, paper QA, Multi-lingual | |
PhoebusSi/alpaca-CoT | | EN | Chain-of-Thought | |
QingyiSi/Alpaca-CoT |
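
The multiturn chat row above describes instructions that embed the dialog history, with turns marked by `Human:` and `Assistant:`. A minimal sketch of splitting such a string back into turns follows, assuming the markers appear literally and never occur inside a message (an assumption that real data may violate). The `split_turns` helper and the sample string are hypothetical.

```python
import re


def split_turns(instruction):
    """Split a dialog history string into (speaker, text) pairs."""
    # The capturing group keeps the markers in the split result:
    # [prefix, marker, text, marker, text, ...]
    parts = re.split(r"(Human:|Assistant:)", instruction)
    turns = []
    for marker, text in zip(parts[1::2], parts[2::2]):
        text = text.strip()
        if text:  # skip a trailing marker that has no text yet
            turns.append((marker.rstrip(":"), text))
    return turns


# Hypothetical history ending with an open Assistant turn.
sample = (
    "Human: Hello\n"
    "Assistant: Hi, how can I help?\n"
    "Human: Tell me a joke.\n"
    "Assistant:"
)
for speaker, text in split_turns(sample):
    print(f"{speaker}: {text}")
```

Splitting on literal markers is the simplest option; if a message itself can contain `Human:` or `Assistant:`, a stricter rule such as matching markers only at the start of a line would be needed.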