Beautiful Soup (HTML parser): Difference between revisions

From Wikipedia, the free encyclopedia
Edit summary of previous revision: "See also: Simplify Nokogiri reference page"
Edit summary of this revision: "current version is 4.9.3"
Line 12:
  | released = {{Start date|2004}}
  | discontinued =
- | latest release version = 4.9.1
+ | latest release version = 4.9.3
- | latest release date = {{Start date and age|2020|05|17|df=no}}
+ | latest release date = {{Start date and age|2020|10|03|df=no}}
  | latest preview version =
  | latest preview date = <!-- {{Start date and age|YYYY|MM|DD|df=yes/no}} -->

Revision as of 00:41, 19 December 2020

Beautiful Soup
Original author(s): Leonard Richardson
Initial release: 2004
Stable release: 4.9.3 / October 3, 2020
Repository:
Written in: Python
Platform: Python
Type: HTML parser library, Web scraping
License: Python Software Foundation License (Beautiful Soup 3, an older version); MIT License (Beautiful Soup 4+)[1]
Website: www.crummy.com/software/BeautifulSoup/

Beautiful Soup is a Python package for parsing HTML and XML documents, including documents with malformed markup such as unclosed tags (the name refers to tag soup). It creates a parse tree for each parsed page that can be used to extract data from HTML,[2] which is useful for web scraping.[1]
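
As a rough illustration of the parse tree described above, the following sketch feeds a deliberately malformed fragment (neither tag is closed) to Beautiful Soup and reads data back out of the resulting tree. The sample markup and variable names are made up for this example; it assumes the beautifulsoup4 package is installed and uses Python's built-in html.parser.

```python
from bs4 import BeautifulSoup

# Malformed input: neither <p> nor <b> is ever closed.
markup = "<p>Hello <b>world"

# Build the parse tree with the standard-library parser.
soup = BeautifulSoup(markup, "html.parser")

# The tree can be navigated by tag name even though the markup was broken.
paragraph_text = soup.p.get_text()
bold_text = soup.b.get_text()

print(paragraph_text)
print(bold_text)
```

Other parsers may repair the same fragment differently, for example by wrapping it in html and body elements, so the exact tree shape depends on the parser chosen.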

Beautiful Soup was started by Leonard Richardson, who continues to contribute to the project,[3] and is additionally supported by Tidelift, a paid subscription to open-source maintenance.[4]

It is available for Python 2.7 and Python 3.

Code example

#!/usr/bin/env python3
# Anchor extraction from HTML document
from bs4 import BeautifulSoup
from urllib.request import urlopen

with urlopen('https://en.wikipedia.org/wiki/Main_Page') as response:
    soup = BeautifulSoup(response, 'html.parser')
    for anchor in soup.find_all('a'):
        print(anchor.get('href', '/'))
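
A variant of the same idea that runs without network access is sketched below. It parses an in-memory document and resolves each href against a base URL with the standard library's urljoin. The page content and the base URL are hypothetical and chosen only for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical base URL and page content, used only for illustration.
base_url = "https://example.org/wiki/"
html = """
<ul>
  <li><a href="/wiki/Python">Python</a></li>
  <li><a href="Tag_soup">Tag soup</a></li>
  <li><a name="no-href">Anchor without href</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# href=True keeps only <a> tags that actually carry an href attribute.
links = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]

for link in links:
    print(link)
```

Filtering with href=True skips anchors that have no link target, instead of substituting a default value as the example above does.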

Advantages and disadvantages

The following summarizes the advantages and disadvantages of each parser library.[1]

Parser: Python's html.parser
Typical usage: BeautifulSoup(markup, "html.parser")
Advantages:
  • Moderately fast
  • Lenient (as of Python 2.7.3 and 3.2)
Disadvantages:
  • Not as fast as lxml, less lenient than html5lib

Parser: lxml's HTML parser
Typical usage: BeautifulSoup(markup, "lxml")
Advantages:
  • Very fast
  • Lenient
Disadvantages:
  • External C dependency

Parser: lxml's XML parser
Typical usage: BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml")
Advantages:
  • Very fast
  • The only currently supported XML parser
Disadvantages:
  • External C dependency

Parser: html5lib
Typical usage: BeautifulSoup(markup, "html5lib")
Advantages:
  • Extremely lenient
  • Parses pages the same way a web browser does
  • Creates valid HTML5
Disadvantages:
  • Very slow
  • External Python dependency
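
Because lxml and html5lib are external dependencies, code that prefers them may want a fallback. The sketch below tries parsers in a preferred order and falls back to the built-in html.parser. The helper name make_soup and the preference order are assumptions made for this example, not part of Beautiful Soup; FeatureNotFound is the exception Beautiful Soup raises when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: parse with the first parser that is installed.

    Returns a (soup, parser_name) pair so the caller can see which parser won.
    """
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser library is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is available")


soup, parser_used = make_soup("<p>hi</p>")
print(parser_used, soup.p.get_text())
```

Since the parsers differ in how they repair broken markup, a fallback like this is only appropriate when the calling code does not depend on one parser's exact tree.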

Release

Beautiful Soup 3 was the official release line of Beautiful Soup from May 2006 to March 2012. The current release is Beautiful Soup 4.9.3 (October 3, 2020). Beautiful Soup 4 can be installed with pip install beautifulsoup4.
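
After installation, one way to confirm which release line is in use is to read the version string that the bs4 module exposes. This is a minimal sketch; the variable names are illustrative, and it assumes the version string begins with a numeric major version.

```python
import bs4

# Version string of the installed package, e.g. "4.9.3".
installed_version = bs4.__version__

# The leading component distinguishes the Beautiful Soup 4 line from older ones.
major = int(installed_version.split(".")[0])

print(installed_version)
```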

See also

References

  1. "Beautiful Soup website". Retrieved 18 April 2012. Beautiful Soup is licensed under the same terms as Python itself
  2. Hajba, Gábor László (2018), Hajba, Gábor László (ed.), "Using Beautiful Soup", Website Scraping with Python: Using BeautifulSoup and Scrapy, Apress, pp. 41–96, doi:10.1007/978-1-4842-3925-4_3, ISBN 978-1-4842-3925-4
  3. "Code : Leonard Richardson". Launchpad. Retrieved 19 September 2020.
  4. Tidelift. "beautifulsoup4 | pypi via the Tidelift Subscription". tidelift.com. Retrieved 19 September 2020.