InstructionZoo

A collection of open-source instruction-tuning datasets for training chat-based LLMs (ChatGPT, LLaMA, Alpaca).

This is an ongoing project. The format and explanation of the following contents will be updated soon. (By Zhihan)

Alpaca in different languages

Dataset	Size	Language	Generation method
tatsu-lab/stanford_alpaca	52002	EN	Self-instruct with 175 human-written seed tasks, using text-davinci-003.
gururise/AlpacaDataCleaned	51713	EN	A cleaned version of the Stanford Alpaca dataset that fixes issues such as hallucinations, merged instructions, and empty outputs.
carbonz0/alpaca-chinese-dataset	20456	CH	Machine-translate the Stanford Alpaca dataset into Chinese, then self-instruct.
hikarming/alpaca_chinese_dataset	19442	CH	Translate the Stanford Alpaca dataset into Chinese with ChatGPT, then have humans check the results.
ymcui/Chinese-LLaMA-Alpaca	51458	CH	Translate the Stanford Alpaca dataset into Chinese with the ChatGPT API, discarding some examples.
A-baoYang/alpaca-7b-chinese/.../alpaca-zhTW.json	20465	TC	Translate the Stanford Alpaca dataset into Traditional Chinese using OpenCC.
A-baoYang/alpaca-7b-chinese/.../alpaca-en-zh.json	124469	CH, EN	Combine the English instruction/input and Traditional Chinese output, using the ChatGPT API (gpt-3.5-turbo).
ntunlplab/traditional-chinese-alpaca/.../alpaca-tw_en_instruction.json	52002	CH, EN	A Traditional Chinese version of the Alpaca dataset whose instruction part is left in English.
ntunlplab/traditional-chinese-alpaca/.../alpaca-tw_en-align.json	52002	CH, EN	A Traditional Chinese version of the Alpaca dataset in which each instruction appears in both English and Traditional Chinese.
LC1332/Chinese-alpaca-lora	51672	CH	Translate the Stanford Alpaca dataset into Chinese with the ChatGPT API.
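
The datasets in the table above are described in terms of an instruction, an optional input, and an output. As a minimal sketch, assuming each file is a JSON array of records with the keys instruction, input, and output, the following shows how such records might be filtered for empty outputs (one of the issues the cleaned dataset addresses) and rendered into a prompt. The prompt template and the function names are hypothetical and for illustration only; they are not taken from any of the listed repositories.

```python
import json


def load_records(text):
    """Parse a JSON array of Alpaca-style records (assumed layout)."""
    return json.loads(text)


def drop_empty_outputs(records):
    """Keep only records whose output is non-empty after stripping whitespace."""
    return [r for r in records if r.get("output", "").strip()]


def to_prompt(record):
    """Render one record with a hypothetical template (not an official one)."""
    if record.get("input", "").strip():
        return (
            f"Instruction: {record['instruction']}\n"
            f"Input: {record['input']}\n"
            "Response:"
        )
    return f"Instruction: {record['instruction']}\nResponse:"


# Two made-up records: the second has a whitespace-only output.
sample = (
    '[{"instruction": "Translate to French.", "input": "Hello", "output": "Bonjour"},'
    ' {"instruction": "Say hi.", "input": "", "output": "  "}]'
)
records = drop_empty_outputs(load_records(sample))
prompts = [to_prompt(r) for r in records]
```

Individual files may use different keys or a different layout, so check one record by hand before relying on this.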

Instruction Dataset Collection

Dataset	Size	Language	Domain	Generation method
hikarming/alpaca_chinese_dataset	226	CH	topic-specific	Human-written Chinese instructions on various topics, such as business management, education, and Romance of the Three Kingdoms.
sahil280114/codealpaca	20023	EN	Code	Self-instruct with prompts focused on code generation, editing, and optimization tasks, using text-davinci-003.
XueFuzhao/InstructionWild	52191 (479 seeds)	CH, EN		Collect 429 instructions from ChatGPT usage screenshots and release both English and Chinese versions, using text-davinci-003.
BelleGroup/train_0.5M_CN	500000 (175 seeds)	CH		Self-instruct with 175 Chinese seed tasks translated from the seed tasks in Stanford Alpaca dataset, using text-davinci-003.
BelleGroup/train_1M_CN	1000000 (175 seeds)	CH		Self-instruct with 175 Chinese seed tasks translated from the seed tasks in Stanford Alpaca dataset.
BelleGroup/school_math_0.25M	250000	CH	Math	Chinese math questions and answers generated by ChatGPT.
BelleGroup/multiturn_chat_0.8M	800000	CH	Multiturn Chat	The instruction contains the dialogue history, with turns marked by Human: and Assistant:; the output contains the assistant's current reply.
GuanacoDataset/.../guanaco_chat_all-utf8.json	48967	CH, DE, EN, JA, TC	Multiturn Chat, Multi-lingual	The dataset for the Guanaco model builds upon the 175 tasks from the Alpaca model by providing rewrites of seed tasks in different languages and adding new tasks specifically designed for English grammar analysis, natural language understanding, cross-lingual self-awareness, and explicit content recognition.
GuanacoDataset/.../guanaco_non_chat-utf8.json	279644	CH, DE, EN, JA, TC	Multi-lingual	The original 175 tasks were translated into 4 versions and regenerated independently.
GuanacoDataset/.../guanaco_non_chat_mini_52K-utf8.json	52224	CH, DE, EN, JA, TC	Multi-lingual	A mini version of 52K multi-lang dataset.
GuanacoDataset/.../general_ans-utf8.json	75899	CH, DE, EN, JA, TC	paragraph-level QA, Multi-lingual
GuanacoDataset/.../general_questions-utf8.json	82867	CH, DE, EN, JA, TC	paragraph-level QA, Multi-lingual	Similar questions are combined to form a tree-like structure, and graph theory algorithms are used to process user questions, content summaries, and contextual logic.
GuanacoDataset/.../paper_answers-utf8.json	23393	CH, DE, EN, JA, TC	paragraph-level QA, paper QA, Multi-lingual
GuanacoDataset/.../paper_questions-utf8.json	23840	CH, DE, EN, JA, TC	paragraph-level QA, paper QA, Multi-lingual
PhoebusSi/alpaca-CoT	EN	Chain-of-Thought
QingyiSi/Alpaca-CoT
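
The multiturn chat entry in the table above packs the dialogue history into the instruction field, with turns marked by Human: and Assistant:. A rough sketch of how such a record could be unpacked into a list of (role, text) turns is given below. It assumes the markers appear literally as "Human:" and "Assistant:" and that records carry instruction and output keys; the function names are hypothetical, and the sample record is made up.

```python
import re

# Assumed literal turn markers; the capturing group keeps them in the split result.
TURN_MARKER = re.compile(r"(Human:|Assistant:)")


def split_turns(instruction):
    """Split a dialogue-history string into (role, text) pairs."""
    parts = TURN_MARKER.split(instruction)
    turns = []
    # parts[0] is any text before the first marker; after that, markers and
    # their texts alternate.
    for marker, text in zip(parts[1::2], parts[2::2]):
        role = "human" if marker == "Human:" else "assistant"
        text = text.strip()
        if text:  # skip the trailing empty "Assistant:" slot
            turns.append((role, text))
    return turns


def to_turns(record):
    """Return the dialogue history plus the current assistant reply."""
    turns = split_turns(record["instruction"])
    turns.append(("assistant", record["output"].strip()))
    return turns


# Made-up record for illustration.
example = {
    "instruction": "Human: Hi\nAssistant: Hello!\nHuman: How are you?\nAssistant:",
    "output": "I am fine.",
}
dialogue = to_turns(example)
```

A marker string that occurs inside a message would be split incorrectly by this approach, so real data may need a stricter rule, such as matching markers only at the start of a line.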

tangzhenyu/InstructionZoo
