BookCorpus

From Wikipedia, the free encyclopedia

BookCorpus (also sometimes referred to as the Toronto Book Corpus) is a dataset consisting of the text of around 11,000 unpublished books scraped from the Internet. It was the main corpus used to train the initial GPT model by OpenAI,[1] and has been used as training data for other early large language models including Google's BERT.[2] The dataset consists of around 985 million words, and the books that comprise it span a range of genres, including romance, science fiction, and fantasy.[2]

The corpus was introduced in a 2015 paper by researchers from the University of Toronto and MIT titled "Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books". The authors described it as consisting of "free books written by yet unpublished authors".[3][4] The dataset was initially hosted on a University of Toronto webpage.[4] An official version of the original dataset is no longer publicly available, though at least one substitute, BookCorpusOpen, has been created.[5] Though not documented in the original 2015 paper, the site from which the corpus's books were scraped is now known to be Smashwords.[4][5]
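Because the original corpus is no longer officially distributed, current work typically relies on mirrors or recreations such as BookCorpusOpen. As a minimal sketch (not part of the original article), the following Python snippet shows how such a mirror might be loaded with the Hugging Face datasets library; the dataset name "bookcorpus", its availability on the Hugging Face Hub, and the per-record "text" field are assumptions rather than facts documented here.

    # Minimal sketch: stream a BookCorpus-style mirror with the Hugging Face
    # `datasets` library. The dataset name "bookcorpus" and its availability
    # are assumptions; the original corpus is no longer officially distributed,
    # and newer library versions may require trust_remote_code=True for
    # script-based datasets.
    from datasets import load_dataset

    # Streaming avoids downloading the full ~985-million-word corpus up front.
    books = load_dataset("bookcorpus", split="train", streaming=True)

    # Count words in a small sample of records (assumed "text" field per record).
    sample_words = 0
    for i, record in enumerate(books):
        sample_words += len(record["text"].split())
        if i >= 999:
            break
    print(f"Words in first 1,000 records: {sample_words:,}")

Streaming iteration is used here so that the roughly 985-million-word corpus does not need to be downloaded in full before inspection; a recreated substitute such as BookCorpusOpen could, in principle, be loaded the same way under a different dataset name.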

References

  1. ^ "Improving Language Understanding by Generative Pre-Training" (PDF). Archived (PDF) from the original on January 26, 2021. Retrieved June 9, 2020.
  2. Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (11 October 2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". arXiv:1810.04805v2 [cs.CL].
  3. Zhu, Yukun; Kiros, Ryan; Zemel, Rich; Salakhutdinov, Ruslan; Urtasun, Raquel; Torralba, Antonio; Fidler, Sanja (2015). "Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books". Proceedings of the IEEE International Conference on Computer Vision (ICCV).
  4. Lea, Richard (28 September 2016). "Google swallows 11,000 novels to improve AI's conversation". The Guardian.
  5. Bandy, Jack; Vincent, Nicholas (2021). "Addressing 'Documentation Debt' in Machine Learning: A Retrospective Datasheet for BookCorpus" (PDF). Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks.