BookCorpus
BookCorpus (also sometimes referred to as the Toronto Book Corpus) is a dataset consisting of the text of around 11,000 unpublished books scraped from the Internet. It was the main corpus used to train the initial GPT model by OpenAI,[1] and has been used as training data for other early large language models including Google's BERT.[2] The dataset consists of around 985 million words, and the books that comprise it span a range of genres, including romance, science fiction, and fantasy.[2]
The corpus was introduced in a 2015 paper by researchers from the University of Toronto and MIT titled "Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books". The authors described it as consisting of "free books written by yet unpublished authors".[3][4] The dataset was initially hosted on a University of Toronto webpage.[4] An official version of the original dataset is no longer publicly available, though at least one substitute, BookCorpusOpen, has been created.[5] Though not documented in the original 2015 paper, the site from which the corpus's books were scraped is now known to be Smashwords.[4][5]
References
- ^ "Improving Language Understanding by Generative Pre-Training" (PDF). Archived (PDF) from the original on January 26, 2021. Retrieved June 9, 2020.
2. Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (11 October 2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". arXiv:1810.04805v2 [cs.CL].
3. Zhu, Yukun; Kiros, Ryan; Zemel, Rich; Salakhutdinov, Ruslan; Urtasun, Raquel; Torralba, Antonio; Fidler, Sanja (2015). "Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books". Proceedings of the IEEE International Conference on Computer Vision (ICCV).
4. Lea, Richard (28 September 2016). "Google swallows 11,000 novels to improve AI's conversation". The Guardian.
5. Bandy, Jack; Vincent, Nicholas (2021). "Addressing "Documentation Debt" in Machine Learning: A Retrospective Datasheet for BookCorpus" (PDF). Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks.