tangzhenyu/InstructionZoo
InstructionZoo

A collection of open-source instruction-tuning datasets for training chat-based LLMs (ChatGPT, LLaMA, Alpaca).

This is an ongoing project. The format and explanation of the following contents will be updated soon. (By Zhihan)

Alpaca in different languages

| Dataset | Size | Language | Generation method |
| --- | --- | --- | --- |
| tatsu-lab/stanford_alpaca | 52002 | EN | Self-instruct from 175 human-written seed tasks, using text-davinci-003. |
| gururise/AlpacaDataCleaned | 51713 | EN | A cleaned version of the Stanford Alpaca dataset that fixes issues such as hallucinations, merged instructions, and empty outputs. |
| carbonz0/alpaca-chinese-dataset | 20456 | CH | Machine-translate the Stanford Alpaca dataset into Chinese, then self-instruct. |
| hikarming/alpaca_chinese_dataset | 19442 | CH | Translate the Stanford Alpaca dataset into Chinese with ChatGPT, then have humans check the translations. |
| ymcui/Chinese-LLaMA-Alpaca | 51458 | CH | Translate the Stanford Alpaca dataset into Chinese with the ChatGPT API, discarding some examples. |
| A-baoYang/alpaca-7b-chinese/.../alpaca-zhTW.json | 20465 | TC | Translate the Stanford Alpaca dataset into Traditional Chinese using OpenCC. |
| A-baoYang/alpaca-7b-chinese/.../alpaca-en-zh.json | 124469 | CH, EN | Combine the English instruction/input and Traditional Chinese output, using the ChatGPT API (gpt-3.5-turbo). |
| ntunlplab/traditional-chinese-alpaca/.../alpaca-tw_en_instruction.json | 52002 | CH, EN | A Traditional Chinese version of the Alpaca dataset whose instruction part is left in English. |
| ntunlplab/traditional-chinese-alpaca/.../alpaca-tw_en-align.json | 52002 | CH, EN | A Traditional Chinese version of the Alpaca dataset in which each instruction is given in both English and Traditional Chinese. |
| LC1332/Chinese-alpaca-lora | 51672 | CH | Translate the Stanford Alpaca dataset into Chinese with the ChatGPT API. |
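
Several entries above differ mainly in how the Alpaca records were cleaned, for example by removing empty outputs. The sketch below shows one way such a filter might look. It assumes each record is a JSON object with `instruction`, `input`, and `output` fields, as in the Stanford Alpaca release; the helper name `drop_empty_outputs` and the sample records are hypothetical and are not taken from any of the listed repositories.

```python
import json

def drop_empty_outputs(records):
    """Keep only records whose 'output' field contains non-whitespace text.

    Assumes Alpaca-style dicts with 'instruction', 'input', 'output' keys.
    """
    return [r for r in records if (r.get("output") or "").strip()]

# Hypothetical sample data in the assumed Alpaca-style layout.
sample = json.loads("""
[
  {"instruction": "Name a primary color.", "input": "", "output": "Red."},
  {"instruction": "Translate to French.", "input": "cat", "output": "   "},
  {"instruction": "Add the numbers.", "input": "2 and 3", "output": "5"}
]
""")

cleaned = drop_empty_outputs(sample)
print(f"kept {len(cleaned)} of {len(sample)} records")
```

This covers only the empty-output case; the other issues mentioned for the cleaned datasets (hallucinations, merged instructions) need manual or model-assisted review and are not addressed here.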

Instruction Dataset Collection

| Dataset | Size | Language | Domain | Generation method |
| --- | --- | --- | --- | --- |
| hikarming/alpaca_chinese_dataset | 226 | CH | Topic-specific | Chinese instructions written by humans on various topics, such as business management, education, and Romance of the Three Kingdoms. |
| sahil280114/codealpaca | 20023 | EN | Code | Self-instruct with prompts focused on code generation, editing, and optimization tasks, using text-davinci-003. |
| XueFuzhao/InstructionWild | 52191 (429 seeds) | CH, EN | | Collect 429 instructions from ChatGPT usage screenshots and release both English and Chinese versions, using text-davinci-003. |
| BelleGroup/train_0.5M_CN | 500000 (175 seeds) | CH | | Self-instruct with 175 Chinese seed tasks translated from the seed tasks of the Stanford Alpaca dataset, using text-davinci-003. |
| BelleGroup/train_1M_CN | 1000000 (175 seeds) | CH | | Self-instruct with 175 Chinese seed tasks translated from the seed tasks of the Stanford Alpaca dataset. |
| BelleGroup/school_math_0.25M | 250000 | CH | Math | Chinese math questions and answers generated by ChatGPT. |
| BelleGroup/multiturn_chat_0.8M | 800000 | CH | Multi-turn chat | The instruction holds the earlier dialog context, with turns marked by `Human:` and `Assistant:`; the output holds the assistant's current reply. |
| GuanacoDataset/.../guanaco_chat_all-utf8.json | 48967 | CH, DE, EN, JA, TC | Multi-turn chat, multilingual | The dataset for the Guanaco model builds on the 175 tasks from the Alpaca model by rewriting the seed tasks in different languages and adding new tasks designed for English grammar analysis, natural language understanding, cross-lingual self-awareness, and explicit content recognition. |
| GuanacoDataset/.../guanaco_non_chat-utf8.json | 279644 | CH, DE, EN, JA, TC | Multilingual | The original 175 tasks were translated into 4 versions and regenerated independently. |
| GuanacoDataset/.../guanaco_non_chat_mini_52K-utf8.json | 52224 | CH, DE, EN, JA, TC | Multilingual | A mini version (52K) of the multilingual dataset. |
| GuanacoDataset/.../general_ans-utf8.json | 75899 | CH, DE, EN, JA, TC | Paragraph-level QA, multilingual | |
| GuanacoDataset/.../general_questions-utf8.json | 82867 | CH, DE, EN, JA, TC | Paragraph-level QA, multilingual | Similar questions are combined into a tree-like structure, and graph theory algorithms are used to process user questions, content summaries, and contextual logic. |
| GuanacoDataset/.../paper_answers-utf8.json | 23393 | CH, DE, EN, JA, TC | Paragraph-level QA, paper QA, multilingual | |
| GuanacoDataset/.../paper_questions-utf8.json | 23840 | CH, DE, EN, JA, TC | Paragraph-level QA, paper QA, multilingual | |
| PhoebusSi/alpaca-CoT | | EN | Chain-of-Thought | |
| QingyiSi/Alpaca-CoT | | | | |
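
The multi-turn chat entry above describes an instruction that carries the dialog history, with turns marked by `Human:` and `Assistant:`. The following is a minimal sketch of how such an instruction string could be split back into turns. The exact layout of the real files (line breaks, spacing, whether a trailing `Assistant:` marker is present) is an assumption, and `split_turns` and the sample string are hypothetical.

```python
import re

# Capturing group so re.split keeps the speaker markers in the result.
TURN_MARKER = re.compile(r"(Human:|Assistant:)")

def split_turns(instruction):
    """Split a dialog-history string into (speaker, text) pairs.

    Assumes turns are introduced by the literal markers 'Human:' and
    'Assistant:'. Markers followed by no text are skipped.
    """
    parts = TURN_MARKER.split(instruction)
    # parts looks like [prefix, marker, text, marker, text, ...]
    turns = []
    for marker, text in zip(parts[1::2], parts[2::2]):
        text = text.strip()
        if text:
            turns.append((marker.rstrip(":"), text))
    return turns

# Hypothetical example in the assumed layout.
history = (
    "Human: Hi\n"
    "Assistant: Hello! How can I help?\n"
    "Human: Tell me a joke.\n"
    "Assistant:"
)
turns = split_turns(history)
for speaker, text in turns:
    print(f"{speaker}: {text}")
```

A literal-marker split like this will misfire if a reply itself contains the text `Human:` or `Assistant:`, so it is best treated as a starting point for inspecting the data rather than a robust parser.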
