Beautiful Soup (HTML parser): Difference between revisions

Beautiful Soup
Original author(s)	Leonard Richardson
Initial release	2004
Stable release	4.12.3 / 17 January 2024
Repository	code.launchpad.net/beautifulsoup/
Written in	Python
Platform	Python
Type	HTML parser library, Web scraping
License	Python Software Foundation License (Beautiful Soup 3); MIT License (versions 4 and up)
Website	www.crummy.com/software/BeautifulSoup/

Revision as of 12:40, 29 May 2024

Beautiful Soup is a Python package for parsing HTML and XML documents, including those with malformed markup. It creates a parse tree for documents that can be used to extract data from HTML,^[3] which is useful for web scraping.^[2]^[4]
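
As a minimal sketch of the malformed-markup handling described above, the snippet below parses a fragment whose tags are never closed. The sample markup is invented for illustration, and the standard library's html.parser backend is an assumed choice; other parser backends may repair the same input differently.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: neither <p> nor <b> is ever closed.
broken_html = "<p>Unclosed paragraph <b>bold text"

# 'html.parser' is Python's built-in backend (an assumed choice here).
soup = BeautifulSoup(broken_html, "html.parser")

# The missing end tags do not prevent navigation of the parse tree.
bold_text = soup.b.get_text()
paragraph_text = soup.p.get_text()

print(bold_text)
print(paragraph_text)
```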

History

Beautiful Soup was started in 2004 by Leonard Richardson.^[citation needed] It takes its name from the poem "Beautiful Soup" in Alice's Adventures in Wonderland^[5] and also refers to the term "tag soup", meaning poorly structured HTML code.^[6] Richardson continues to contribute to the project,^[7] which is additionally supported by paid open-source maintainers from the company Tidelift.^[8]

Versions

Beautiful Soup 3 was the official release line of Beautiful Soup from May 2006 to March 2012. The current release is Beautiful Soup 4.x. Beautiful Soup 4 can be installed with pip install beautifulsoup4.

Python 2.7 support was retired in 2021; release 4.9.3 was the last to support it.^[9]

Usage

Beautiful Soup represents parsed data as a tree which can be searched and iterated over with ordinary Python loops.^[10]
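
The sketch below illustrates searching and iterating over the tree on a small in-memory document. The HTML, the id and class values, and the variable names are hypothetical; find_all, find, and .children are part of the Beautiful Soup 4 API, and html.parser is an assumed backend.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical document used only for illustration.
html = """
<ul id="fruits">
  <li class="fresh">apple</li>
  <li>banana</li>
  <li class="fresh">cherry</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Search: find_all returns every matching tag in document order.
all_items = [li.get_text() for li in soup.find_all("li")]

# Search with a filter: class_ matches the CSS class attribute.
fresh_items = [li.get_text() for li in soup.find_all("li", class_="fresh")]

# Iterate: walk the direct children of <ul> with an ordinary loop,
# keeping tags and skipping the whitespace text nodes between them.
child_tags = []
for child in soup.find("ul", id="fruits").children:
    if isinstance(child, Tag):
        child_tags.append(child.name)

print(all_items)
print(fresh_items)
print(child_tags)
```

The isinstance check is needed because the children of a tag include text nodes (here, the newlines and indentation between the list items) as well as tags.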

Code example

The example below uses the Python standard library's urllib^[11] to load Wikipedia's main page, then uses Beautiful Soup to parse the document and search for all links within.

#!/usr/bin/env python3
# Anchor extraction from HTML document
from bs4 import BeautifulSoup
from urllib.request import urlopen
with urlopen('https://en.wikipedia.org/wiki/Main_Page') as response:
    soup = BeautifulSoup(response, 'html.parser')
    for anchor in soup.find_all('a'):
        print(anchor.get('href', '/'))
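
An offline variant of the same idea may be easier to experiment with. The sketch below parses an in-memory string rather than fetching a live page and resolves relative links against a base URL with the standard library's urljoin. The base URL, the sample markup, and the variable names are assumptions made for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical page content and base URL, used instead of a live request.
base_url = "https://example.org/wiki/Main_Page"
html = """
<a href="/wiki/Python">Python</a>
<a href="https://www.crummy.com/software/BeautifulSoup/">Beautiful Soup</a>
<a name="no-href">anchor without a link target</a>
"""

soup = BeautifulSoup(html, "html.parser")

# href=True keeps only <a> tags that actually carry an href attribute.
links = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]

for link in links:
    print(link)
```

Filtering with href=True skips anchors that have no link target, whereas the example above substitutes '/' for them.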

References

[1] https://git.launchpad.net/beautifulsoup/tree/CHANGELOG. Retrieved 18 January 2024.
[2] "Beautiful Soup website". Retrieved 18 April 2012. "Beautiful Soup is licensed under the same terms as Python itself"
[3] Hajba, Gábor László (2018), Hajba, Gábor László (ed.), "Using Beautiful Soup", Website Scraping with Python: Using BeautifulSoup and Scrapy, Apress, pp. 41–96, doi:10.1007/978-1-4842-3925-4_3, ISBN 978-1-4842-3925-4
[4] Real Python. "Beautiful Soup: Build a Web Scraper With Python – Real Python". realpython.com. Retrieved 2023-06-01.
[5] makcorps (2022-12-13). "BeautifulSoup tutorial: Let's Scrape Web Pages with Python". Retrieved 2024-01-24.
[6] "Python Web Scraping". Udacity. 2021-02-11. Retrieved 2024-01-24.
[7] "Code : Leonard Richardson". Launchpad. Retrieved 2020-09-19.
[8] Tidelift. "beautifulsoup4 | pypi via the Tidelift Subscription". tidelift.com. Retrieved 2020-09-19.
[9] Richardson, Leonard (7 Sep 2021). "Beautiful Soup 4.10.0". beautifulsoup. Google Groups. Retrieved 27 September 2022.
[10] "How To Scrape Web Pages with Beautiful Soup and Python 3 | DigitalOcean". www.digitalocean.com. Retrieved 2023-06-01.
[11] Real Python. "Python's urllib.request for HTTP Requests – Real Python". realpython.com. Retrieved 2023-06-01.

@@ Line 1: / Line 1: @@
 {{Short description|Python HTML/XML parser}}
-{{Other uses|Beautiful Soup (disambiguation){{!}}Beautiful Soup}}{{primary sources|date=May 2023}}
+{{Other uses|Beautiful Soup (disambiguation){{!}}Beautiful Soup}}
 {{Infobox software
 | name = Beautiful Soup
@@ Line 28: / Line 28: @@
 }}
-'''Beautiful Soup''' is a [[Python (programming language)|Python]] package for parsing [[HTML]] and [[XML]] documents, including those with malformed markup. It creates a parse tree for documents that can be used to extract data from HTML,<ref>{{Citation|last=Hajba|first=Gábor László|title=Using Beautiful Soup|date=2018|work=Website Scraping with Python: Using BeautifulSoup and Scrapy|pages=41–96|editor-last=Hajba|editor-first=Gábor László|publisher=Apress|language=en|doi=10.1007/978-1-4842-3925-4_3|isbn=978-1-4842-3925-4}}</ref> which is useful for [[web scraping]].<ref name="crummy.com" /><ref>{{Cite web |last=Python |first=Real |title=Beautiful Soup: Build a Web Scraper With Python – Real Python |url=https://realpython.com/beautiful-soup-web-scraper-python/ |access-date=2023-06-01 |website=realpython.com |language=en}}</ref>
+'''Beautiful Soup''' is a [[Python (programming language)|Python]] package for parsing [[HTML]] and [[XML]] documents, including those with malformed markup. It creates a [[parse tree]] for documents that can be used to extract data from HTML,<ref>{{Citation|last=Hajba|first=Gábor László|title=Using Beautiful Soup|date=2018|work=Website Scraping with Python: Using BeautifulSoup and Scrapy|pages=41–96|editor-last=Hajba|editor-first=Gábor László|publisher=Apress|language=en|doi=10.1007/978-1-4842-3925-4_3|isbn=978-1-4842-3925-4}}</ref> which is useful for [[web scraping]].<ref name="crummy.com" /><ref>{{Cite web |last=Python |first=Real |title=Beautiful Soup: Build a Web Scraper With Python – Real Python |url=https://realpython.com/beautiful-soup-web-scraper-python/ |access-date=2023-06-01 |website=realpython.com |language=en}}</ref>
+==History==
-Beautiful Soup was started by Leonard Richardson, who continues to contribute to the project,<ref>{{Cite web |title=Code : Leonard Richardson |url=https://code.launchpad.net/%7Eleonardr/+branches |access-date=2020-09-19 |website=Launchpad |language=en-US}}</ref> and is additionally supported by Tidelift, a paid subscription to open-source maintenance.<ref>{{Cite web|last=Tidelift|title=beautifulsoup4 {{!}} pypi via the Tidelift Subscription|url=https://tidelift.com/subscription/pkg/pypi-beautifulsoup4|access-date=2020-09-19|website=tidelift.com|language=en}}</ref>
+Beautiful Soup was started by in 2004 by Leonard Richardson.{{cn|date=May 2024}} It takes its name from the poem ''Beautiful Soup'' from [[Alice's Adventures in Wonderland]]<ref>{{Cite web |last=makcorps |date=2022-12-13 |title=BeautifulSoup tutorial: Let's Scrape Web Pages with Python |url=https://www.scrapingdog.com/blog/beautifulsoup-tutorial-web-scraping-with-python/ |access-date=2024-01-24 |language=en-US}}</ref> and is a reference to the term "[[tag soup]]" meaning poorly-structured HTML code.<ref>{{Cite web |date=2021-02-11 |title=Python Web Scraping |url=https://www.udacity.com/blog/2021/02/python-web-scraping.html |access-date=2024-01-24 |website=Udacity |language=en-US}}</ref> Richardson continues to contribute to the project,<ref>{{Cite web |title=Code : Leonard Richardson |url=https://code.launchpad.net/%7Eleonardr/+branches |access-date=2020-09-19 |website=Launchpad |language=en-US}}</ref> which is additionally supported by paid open-source maintainers from the company Tidelift.<ref>{{Cite web|last=Tidelift|title=beautifulsoup4 {{!}} pypi via the Tidelift Subscription|url=https://tidelift.com/subscription/pkg/pypi-beautifulsoup4|access-date=2020-09-19|website=tidelift.com|language=en}}</ref>
-== Code example ==
+===Versions===
+Beautiful Soup 3 was the official release line of Beautiful Soup from May 2006 to March 2012. The current release is [https://www.crummy.com/software/BeautifulSoup/bs4/download/ Beautiful Soup 4.x]. '''Beautiful Soup 4''' can be installed with <code>pip install beautifulsoup4</code>.
-Beautiful Soup represents parsed data as a tree which can be searched and iterated over with ordinary Python [[Control flow#Loops|loops]].<ref>{{Cite web |title=How To Scrape Web Pages with Beautiful Soup and Python 3 {{!}} DigitalOcean |url=https://www.digitalocean.com/community/tutorials/how-to-scrape-web-pages-with-beautiful-soup-and-python-3 |access-date=2023-06-01 |website=www.digitalocean.com |language=en}}</ref> The example below uses the Python [[standard library]]'s urllib<ref>{{Cite web |last=Python |first=Real |title=Python's urllib.request for HTTP Requests – Real Python |url=https://realpython.com/urllib-request/ |access-date=2023-06-01 |website=realpython.com |language=en}}</ref> to load [[Wikipedia]]'s main page, then uses Beautiful Soup to parse the document and search for all links within. <syntaxhighlight lang="python">
+In 2021, Python 2.7 support was retired and the release 4.9.3 was the last to support Python 2.7.<ref>{{cite web |last1=Richardson |first1=Leonard |date=7 Sep 2021 |title=Beautiful Soup 4.10.0 |url=https://groups.google.com/g/beautifulsoup/c/flWqqlrcJ9s |access-date=27 September 2022 |website=beautifulsoup |publisher=Google Groups |language=en-US}}</ref>
+==Usage==
+Beautiful Soup represents parsed data as a tree which can be searched and iterated over with ordinary Python [[Control flow#Loops|loops]].<ref>{{Cite web |title=How To Scrape Web Pages with Beautiful Soup and Python 3 {{!}} DigitalOcean |url=https://www.digitalocean.com/community/tutorials/how-to-scrape-web-pages-with-beautiful-soup-and-python-3 |access-date=2023-06-01 |website=www.digitalocean.com |language=en}}</ref>
+=== Code example ===
+The example below uses the Python [[standard library]]'s urllib<ref>{{Cite web |last=Python |first=Real |title=Python's urllib.request for HTTP Requests – Real Python |url=https://realpython.com/urllib-request/ |access-date=2023-06-01 |website=realpython.com |language=en}}</ref> to load [[Wikipedia]]'s main page, then uses Beautiful Soup to parse the document and search for all links within. <syntaxhighlight lang="python">
 #!/usr/bin/env python3
 # Anchor extraction from HTML document
@@ Line 43: / Line 51: @@
         print(anchor.get('href', '/'))
 </syntaxhighlight>
-==History==
-Beautiful Soup is named both after a poem in [[Alice's Adventures in Wonderland]]<ref>{{Cite web |last=makcorps |date=2022-12-13 |title=BeautifulSoup tutorial: Let's Scrape Web Pages with Python |url=https://www.scrapingdog.com/blog/beautifulsoup-tutorial-web-scraping-with-python/ |access-date=2024-01-24 |language=en-US}}</ref> and [[tag soup]].<ref>{{Cite web |date=2021-02-11 |title=Python Web Scraping |url=https://www.udacity.com/blog/2021/02/python-web-scraping.html |access-date=2024-01-24 |website=Udacity |language=en-US}}</ref>
-Beautiful Soup 3 was the official release line of Beautiful Soup from May 2006 to March 2012. The current release is [https://www.crummy.com/software/BeautifulSoup/bs4/download/ Beautiful Soup 4.x]. '''Beautiful Soup 4''' can be installed with <code>pip install beautifulsoup4</code>.
-In 2021, Python 2.7 support was retired and the release 4.9.3 was the last to support Python 2.7.<ref>{{cite web |last1=Richardson |first1=Leonard |date=7 Sep 2021 |title=Beautiful Soup 4.10.0 |url=https://groups.google.com/g/beautifulsoup/c/flWqqlrcJ9s |access-date=27 September 2022 |website=beautifulsoup |publisher=Google Groups |language=en-US}}</ref>
 ==See also==