Beautiful Soup (HTML parser): Difference between revisions

Beautiful Soup
Original author(s)	Leonard Richardson
Initial release	2004
Stable release	4.12.3 / 17 January 2024
Repository	code.launchpad.net/beautifulsoup/
Written in	Python
Platform	Python
Type	HTML parser library, Web scraping
License	Python Software Foundation License (Beautiful Soup 3 - an older version); MIT License (versions 4 and up)
Website	www.crummy.com/software/BeautifulSoup/


Revision as of 08:53, 27 March 2023

Beautiful Soup is a Python package for parsing HTML and XML documents, including documents with malformed markup such as unclosed tags (the name alludes to tag soup). It creates a parse tree for parsed pages that can be used to extract data from HTML,^[3] which is useful for web scraping.^[2]
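
The tolerance for malformed markup can be illustrated with a minimal sketch. The markup string below is a hypothetical example rather than one taken from the project's documentation; the sketch assumes the beautifulsoup4 package is installed and uses Python's built-in html.parser, so the exact shape of the resulting tree may differ under other parsers.

```python
from bs4 import BeautifulSoup

# Hypothetical malformed snippet: the <li> and <a> tags are never closed.
markup = "<ul><li><a href='/one'>One<li><a href='/two'>Two</ul>"

# Build the parse tree with the parser from Python's standard library.
soup = BeautifulSoup(markup, "html.parser")

# Walk the tree and collect the href attribute of every anchor, in document order.
hrefs = [anchor["href"] for anchor in soup.find_all("a")]
print(hrefs)
```

Both anchors are found even though neither is closed; how the unclosed tags end up nested inside one another depends on the parser in use.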

Beautiful Soup was started by Leonard Richardson, who continues to contribute to the project.^[4] The project is additionally supported by Tidelift, a paid subscription service for open-source maintenance.^[5]

Code example

#!/usr/bin/env python3
# Anchor extraction from an HTML document
from urllib.request import urlopen

from bs4 import BeautifulSoup

# Fetch the page and parse the response with Python's built-in parser
with urlopen('https://en.wikipedia.org/wiki/Main_Page') as response:
    soup = BeautifulSoup(response, 'html.parser')
    # Print the href of every <a> tag; anchors without one print '/'
    for anchor in soup.find_all('a'):
        print(anchor.get('href', '/'))

Release

Beautiful Soup 3 was the official release line of Beautiful Soup from May 2006 to March 2012. The current release is Beautiful Soup 4.x. Beautiful Soup 4 can be installed with pip install beautifulsoup4.

In 2021, support for Python 2.7 was retired; release 4.9.3 was the last to support Python 2.7.^[6]

References

1. https://git.launchpad.net/beautifulsoup/tree/CHANGELOG. Retrieved 18 January 2024.
2. "Beautiful Soup website". Retrieved 18 April 2012. "Beautiful Soup is licensed under the same terms as Python itself."
3. Hajba, Gábor László (2018), Hajba, Gábor László (ed.), "Using Beautiful Soup", Website Scraping with Python: Using BeautifulSoup and Scrapy, Apress, pp. 41–96, doi:10.1007/978-1-4842-3925-4_3, ISBN 978-1-4842-3925-4
4. "Code : Leonard Richardson". Launchpad. Retrieved 19 September 2020.
5. Tidelift. "beautifulsoup4 | pypi via the Tidelift Subscription". tidelift.com. Retrieved 19 September 2020.
6. Richardson, Leonard (7 September 2021). "Beautiful Soup 4.10.0". beautifulsoup. Google Groups. Retrieved 27 September 2022.



@@ Line 44: / Line 44: @@
         print(anchor.get('href', '/'))
 </syntaxhighlight>
-==Advantages and disadvantages of parsers==
-This table summarizes the advantages and disadvantages of each parser library<ref name="crummy.com" />
-{| class="wikitable"
-|-
-! Parser
-! Typical usage
-! Advantages
-! Disadvantages
-|-
-| Python’s html.parser
-| BeautifulSoup(markup, "html.parser")
-|
-*Moderately fast
-*Lenient (As of Python 2.7.3 and 3.2.)
-|
-*Not as fast as lxml, less lenient than html5lib.
-|-
-| lxml’s HTML parser
-| BeautifulSoup(markup, "lxml")
-|
-*Very fast
-*Lenient
-|
-*External C dependency
-|-
-| lxml’s XML parser
-|
-BeautifulSoup(markup, "lxml-xml") <br/>
-BeautifulSoup(markup, "xml")
-|
-*Very fast
-*The only currently supported XML parser
-|
-*External C dependency
-|-
-| html5lib
-| BeautifulSoup(markup, "html5lib")
-|
-*Extremely lenient
-*Parses pages the same way a web browser does
-*Creates valid HTML5
-|
-*Very slow
-*External Python dependency
-|}
 ==Release==
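
The parser names in the table removed by this edit are passed as the second argument to the BeautifulSoup constructor. As a minimal sketch of how the trade-offs listed there might be handled in practice, the helper below tries the faster external parsers first and falls back to the built-in one. The function name make_soup and the preference order are assumptions made for illustration; they are not part of Beautiful Soup. The sketch relies on the FeatureNotFound exception that bs4 raises when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical preference order; only "html.parser" ships with Python itself,
# while "lxml" and "html5lib" are external dependencies.
PREFERRED_PARSERS = ("lxml", "html5lib", "html.parser")


def make_soup(markup, parsers=PREFERRED_PARSERS):
    """Return (soup, parser_name) using the first parser that is installed."""
    for name in parsers:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # The requested parser library is missing; try the next one.
            continue
    raise RuntimeError("no usable parser found")


soup, used = make_soup("<p>Hello, <b>world</p>")
print(used, soup.p.get_text())
```

Because the parsers repair broken markup differently, code that depends on the exact tree shape may prefer to name a single parser explicitly instead of falling back.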
