🎼 The clean and modern way of accessing IMSLP data and scores programmatically. 🎶
The package is available on PyPI and can be installed using your favorite package manager:
pip install imslp
This project attempts to use robust sources of data that do not require web scraping:
- MediaWiki API. IMSLP is one of tens of thousands of websites built on top of MediaWiki, the framework created for Wikipedia.org. As such, it can be accessed through the MediaWiki API, for which there fortunately exists a fantastic Python wrapper library called `mwclient`.
- IMSLP API. For convenience, IMSLP built some ad-hoc scripts that can be used to get a list of people and a list of works in a variety of formats, including JSON.
It also uses scraping to collect additional information (such as the number of pages in a score, the number of times a score was downloaded, or the user-provided ratings).
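As a rough illustration of the second source, the list scripts take slash-separated parameters in the query string. The sketch below only builds such a URL and makes no request. The script path, the parameter names, and the type codes (1 for people, 2 for works) are recalled from IMSLP's public API notes and should be treated as assumptions to verify; `build_list_url` is a hypothetical helper, not part of this package.

```python
# Hypothetical helper: builds a URL for IMSLP's people/works list scripts.
# ASSUMPTION: the endpoint path, parameter names, and type codes below are
# recalled from IMSLP's public API notes and may have changed.
IMSLP_LIST_ENDPOINT = "https://imslp.org/imslpscripts/API.ISCR.php"

LIST_TYPES = {"people": 1, "works": 2}


def build_list_url(kind: str, start: int = 0, retformat: str = "json") -> str:
    """Return the URL for one page of the people or works list."""
    if kind not in LIST_TYPES:
        raise ValueError(f"kind must be one of {sorted(LIST_TYPES)}")
    if start < 0:
        raise ValueError("start must be non-negative")
    params = [
        "account=worklist",
        "disclaimer=accepted",
        "sort=id",
        f"type={LIST_TYPES[kind]}",
        f"start={start}",
        f"retformat={retformat}",
    ]
    # Parameters are joined with "/" rather than the usual "&".
    return IMSLP_LIST_ENDPOINT + "?" + "/".join(params)


if __name__ == "__main__":
    print(build_list_url("people"))
    print(build_list_url("works", start=1000))
```

Paging is done by increasing `start`; how many records each page returns is not assumed here.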
Although IMSLP fortunately uses MediaWiki, a widely used open-source wiki platform, it has a handful of quirks:
- Composers are stored as a `Category`, for instance `Category:Scarlatti, Domenico`. For each composer, there are usually three tabs: "Compositions", "Collaborations" and "Collections"; these are stored as separate categories resulting from the concatenation of the composer and subtype, such as `Category:Scarlatti, Domenico/Collections`.
- PDF files for sheet music are stored as "images"; unfortunately, for the time being, the scheme does not appear in the URLs computed for the files. These need to be manually patched.
- The `imslpdisclaimeraccepted` cookie must be set to `"yes"` for files to download properly (otherwise, downloading any file will result in the disclaimer page). With `mwclient`, this can be specified on login.

      cookies = {
          "imslp_wikiLanguageSelectorLanguage": "en",
          "imslpdisclaimeraccepted": "yes",
      }

- Much of the metadata associated with images, such as the internal ID or the download counter, is stored separately from the MediaWiki metadata. This makes scraping the rendered HTML page necessary.
Fortunately, all these quirks are handled by this package!
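For a sense of what that involves, three of the quirks above can be sketched with the standard library alone. This is a minimal sketch, not this package's implementation: the helper names (`composer_category`, `patch_scheme`, `cookie_header`) are hypothetical, and it assumes that the file URLs missing a scheme are protocol-relative (starting with `//`) and that `https` is the right scheme to add.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit

# Tab names as listed above; each maps to its own category.
SUBTYPES = ("Compositions", "Collaborations", "Collections")

# Cookie values quoted from the quirk list above.
DISCLAIMER_COOKIES = {
    "imslp_wikiLanguageSelectorLanguage": "en",
    "imslpdisclaimeraccepted": "yes",
}


def composer_category(name: str, subtype: Optional[str] = None) -> str:
    """Build a composer's category title, optionally for one of its tabs."""
    base = f"Category:{name}"
    if subtype is None:
        return base
    if subtype not in SUBTYPES:
        raise ValueError(f"subtype must be one of {SUBTYPES}")
    return f"{base}/{subtype}"


def patch_scheme(url: str, scheme: str = "https") -> str:
    """Add a scheme to a protocol-relative URL; leave complete URLs alone.

    ASSUMPTION: scheme-less file URLs look like "//host/path".
    """
    parts = urlsplit(url)
    if parts.scheme:
        return url
    if not parts.netloc:
        raise ValueError("expected an absolute or protocol-relative URL")
    return urlunsplit(
        (scheme, parts.netloc, parts.path, parts.query, parts.fragment)
    )


def cookie_header(cookies: dict) -> str:
    """Serialize cookies into the value of an HTTP Cookie header."""
    return "; ".join(f"{key}={value}" for key, value in cookies.items())


if __name__ == "__main__":
    print(composer_category("Scarlatti, Domenico", "Collections"))
    print(patch_scheme("//example.org/images/a/ab/score.pdf"))
    print(cookie_header(DISCLAIMER_COOKIES))
```

The example URL uses a placeholder host; real file paths on IMSLP are not assumed here.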
Here are a handful of other related projects available on GitHub to access the IMSLP data programmatically:
- jjjake/imslp-scrape: Last commit in May 2012 (32 commits), mix of Python and shell, scraping the website for data (people, score links) with HTML parsing.
- FrankTheCodeMonkey/IMSLP-Scraper: Last commit in June 2020 (6 commits), Python, scraping the website for data and scores, with HTML parsing and Selenium.
- josefleventon/imslp-api: Last commit in May 2020 (17 commits), JavaScript, uses IMSLP's custom API to get the list of people and list of works programmatically through a web API query.
More recently, and in other languages:
- IMSLP Instrument Information Parsing Program: Last commit in July 2020 (47 commits), uses scraping to extract instrumentation information.
Let's be clear that all the heavy lifting is done by mwclient, and by the volunteers who uploaded, scanned, or typeset the scores on IMSLP.
This project is licensed under the LGPLv3 license, with the understanding that importing a Python module is similar in spirit to dynamically linking against a library.
- You can use the library `imslp` in any project, for any purpose, as long as you provide some acknowledgement to this original project for use of the library.
- If you make improvements to `imslp`, you are required to make those changes publicly available.