Beautiful Soup (HTML parser): Difference between revisions
→Code example: enable syntax highlighting
Python 2.7 support retired.
Revision as of 00:33, 27 September 2022
Original author(s) | Leonard Richardson |
---|---|
Initial release | 2004 |
Stable release | 4.12.3[1] |
Repository | |
Written in | Python |
Platform | Python |
Type | HTML parser library, Web scraping |
License | Python Software Foundation License (Beautiful Soup 3, an older version); MIT License (Beautiful Soup 4 and later)[2] |
Website | www |
Beautiful Soup is a Python package for parsing HTML and XML documents, including documents with malformed markup such as unclosed tags; it is named after tag soup. It creates a parse tree for parsed pages that can be used to extract data from HTML,[3] which is useful for web scraping.[2]
Beautiful Soup was started by Leonard Richardson, who continues to contribute to the project.[4] The project is additionally supported through Tidelift, a paid subscription service for open-source maintenance.[5]
It is available for Python 3.
Code example
#!/usr/bin/env python3
# Anchor extraction from HTML document
from bs4 import BeautifulSoup
from urllib.request import urlopen

# Fetch the page and parse the response with Python's built-in parser
with urlopen('https://en.wikipedia.org/wiki/Main_Page') as response:
    soup = BeautifulSoup(response, 'html.parser')
    # Print the href of every anchor, falling back to '/' when it is missing
    for anchor in soup.find_all('a'):
        print(anchor.get('href', '/'))
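The same extraction can be tried without network access by parsing a string. The following is a minimal sketch, not part of the original example: the sample markup and its link targets are made up for illustration, and several tags are deliberately left unclosed to show that a usable parse tree is still produced.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: the <li> and <p> tags are never closed.
html = """
<ul>
  <li><a href="/wiki/Python">Python</a>
  <li><a href="/wiki/HTML">HTML</a>
  <li><a>No link target</a>
</ul>
<p>Unclosed paragraph
"""

soup = BeautifulSoup(html, "html.parser")

# Collect (link text, href) pairs; get() returns None when href is absent.
links = [(a.get_text(), a.get("href")) for a in soup.find_all("a")]

# Attribute-style access returns the first matching tag in the tree.
first_href = soup.a["href"]
paragraph_text = soup.p.get_text(strip=True)

print(links)
print(first_href)
print(paragraph_text)
```

Calling `get()` without a default, as here, makes missing attributes visible as `None` instead of replacing them with a placeholder value.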
Advantages and disadvantages of parsers
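This table summarizes the advantages and disadvantages of each parser library.[2]
Parser | Typical usage | Advantages | Disadvantages |
---|---|---|---|
Python’s html.parser | BeautifulSoup(markup, "html.parser") | Batteries included; decent speed; lenient | Not as fast as lxml; less lenient than html5lib |
lxml’s HTML parser | BeautifulSoup(markup, "lxml") | Very fast; lenient | External C dependency |
lxml’s XML parser | BeautifulSoup(markup, "lxml-xml") | Very fast; the only currently supported XML parser | External C dependency |
html5lib | BeautifulSoup(markup, "html5lib") | Extremely lenient; parses pages the same way a web browser does; creates valid HTML5 | Very slow; external Python dependency |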
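Because lxml and html5lib are separate installs, a script may want to fall back to the built-in parser when they are absent. The sketch below is one possible way to do that, not a recommendation from the Beautiful Soup project: the helper name `make_soup` and the preference order are assumptions chosen for illustration. It relies on `bs4.FeatureNotFound`, the exception Beautiful Soup raises when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first installed parser from `preferred`.

    Hypothetical helper: returns the soup together with the parser name
    that was actually used.
    """
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser library is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is installed")


soup, parser_used = make_soup("<p>Hello<br>world")
print(parser_used)
print(soup.p.get_text())
```

Different parsers can build different trees from the same invalid markup, so pinning one parser explicitly is the safer choice when results must be reproducible across machines.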
Release
Beautiful Soup 3 was the official release line of Beautiful Soup from May 2006 to March 2012. The current release is Beautiful Soup 4.x. Beautiful Soup 4 can be installed with pip install beautifulsoup4.
In 2021, Python 2.7 support was retired; release 4.9.3 was the last to support Python 2.7.[6]
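To see which side of that cutoff an environment is on, the installed version can be read from the package metadata. This is a minimal sketch assuming the package was installed under the distribution name `beautifulsoup4` and that its version string begins with numeric major and minor components; the variable names are illustrative.

```python
import sys
from importlib.metadata import version

# Read the installed version of the beautifulsoup4 distribution.
installed = version("beautifulsoup4")

# Compare only the major and minor components, e.g. "4.12.3" -> (4, 12).
major_minor = tuple(int(part) for part in installed.split(".")[:2])

# Releases after 4.9.3 (that is, 4.10 and later) no longer support Python 2.7.
python3_only = major_minor >= (4, 10)

print(f"Beautiful Soup {installed} on Python {sys.version_info.major}")
print("Python 3 only:", python3_only)
```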
See also
References
- ^ https://git.launchpad.net/beautifulsoup/tree/CHANGELOG. Retrieved 18 January 2024.
- ^ "Beautiful Soup website". Retrieved 18 April 2012. "Beautiful Soup is licensed under the same terms as Python itself"
- ^ Hajba, Gábor László (2018), Hajba, Gábor László (ed.), "Using Beautiful Soup", Website Scraping with Python: Using BeautifulSoup and Scrapy, Apress, pp. 41–96, doi:10.1007/978-1-4842-3925-4_3, ISBN 978-1-4842-3925-4
- ^ "Code : Leonard Richardson". Launchpad. Retrieved 2020-09-19.
- ^ Tidelift. "beautifulsoup4 | pypi via the Tidelift Subscription". tidelift.com. Retrieved 2020-09-19.
- ^ Richardson, Leonard (7 Sep 2021). "Beautiful Soup 4.10.0". beautifulsoup. Google Groups. Retrieved 27 September 2022.