Small program to scrape model repository data from the Hugging Face Hub, including the full history of README.md files.

Kudos to @Fresh-P, who is responsible for the bulk of the initial implementation.

For an exploratory analysis of the data, see my website.
To run the program, consider the following information:

- Download statistics refer to downloads over the past 30 days.
- The program leverages the `requests` package to obtain the HTML pages. Consider using cookies so that each request carries your login information; that way it is possible to scrape repositories that require access permission (which can be requested beforehand), since the cookies identify you as a user with granted permission rights. The cookies file should be stored in the main folder and named `cookies`.
- If the `commit_history` field in the meta-file is empty, the repository likely requires permission rights.
- If the `commit_history` field in the meta-file contains a `4xx` status code, it is likely the result of a `requests` error.
- The first time you run `main.py`, it collects a list of all available model repositories. It will keep that exact same list unless you delete the `links.txt` file.
- Every time you run `main.py`, it checks which repositories from `links.txt` have already been scraped (by cross-checking with the meta-file(s)) and only retains the repository links that have not yet been scraped. In addition, it retries scraping all links whose `commit_history` field in the meta-file contains an error code or is empty (that way, you can request permission to access certain repositories and then retry scraping them).
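Attaching the cookies to a `requests` session might look like the following sketch. It assumes the `cookies` file holds a JSON object mapping cookie names to values — an assumption about the file format; adapt the loading step to however you exported your cookies.

```python
import json

import requests


def load_session(cookies_path="cookies"):
    """Create a requests session that carries the login cookies.

    Assumes `cookies_path` contains a JSON object of name -> value
    pairs (a hypothetical format; adjust to your cookie export).
    """
    session = requests.Session()
    with open(cookies_path) as fh:
        session.cookies.update(json.load(fh))
    return session


# Usage: fetch a gated model page with your login cookies attached.
# session = load_session()
# response = session.get("https://huggingface.co/<model-id>")
```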
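The filtering step described in the last bullet can be sketched as follows. The function and field names are illustrative (the real meta-file layout may differ): a link is (re)scraped when it is missing from the meta records, or when its `commit_history` field is empty or holds a `4xx` status code.

```python
def links_to_scrape(all_links, meta):
    """Return the links that still need to be scraped.

    `meta` maps a repository link to its meta record (a dict).
    Names and layout are assumptions about the meta-file format.
    """
    todo = []
    for link in all_links:
        record = meta.get(link)
        if record is None:
            todo.append(link)  # never scraped before
            continue
        history = record.get("commit_history")
        # Retry when the field is empty or records a 4xx status code.
        if not history or str(history).startswith("4"):
            todo.append(link)
    return todo
```

The `4xx` check is deliberately crude (a string prefix test); a stricter version could match the status code against `range(400, 500)`.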