Motivation
This package aims to provide statistics that help us better understand how Python is used and written.
A package maintainer might ask:
- Can certain functions be deprecated?
- How are my users using my package in tests vs. source vs. notebooks?
- What should I include in tutorials?
- Are new features being adopted?
Python core maintainers might ask:
- What are the most and least used stdlib modules?
- Is the community moving away from a particular module?
- Let's inform PEPs with actual statistics!
This work exposes a queryable SQLite web API via Datasette.
NOTE: this dataset is currently extremely biased, as we are parsing the top 4,000 repositories that depend on a few scientific libraries listed in data/whitelist. This is not a representative sample of the Python ecosystem, nor even of the entire scientific Python ecosystem. Further work is needed to make this dataset less biased.
Interesting Questions
As with any project that provides large datasets, interpretation is even more important than the data itself. Here we provide some guiding questions; a sketch of how one of them might be phrased as a query follows the list.
- How many files are we looking at?
- How many repositories are we looking at?
- How many distinct namespaces are we inspecting?
- What are the top 10 most popular pandas functions?
- What are the top 10 most popular numpy attributes?
- What are the most depended upon modules by function usage count?
- What are the top 100 most used stdlib module functions?
- What are the least and most used stdlib modules?
- How are the builtin functions used within source vs. notebooks vs. tests?
- How often are the dunder methods used?
- What is the average length of a line of code?
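To make this concrete, here is a minimal sketch of how the pandas question above might be answered against a local copy of the database. The table and column names (api_usage, namespace, attribute) are hypothetical placeholders; consult the actual schema through Datasette before querying.

```python
import sqlite3

# Hypothetical schema: one row per observed attribute access, recording the
# namespace (e.g. "pandas") and the attribute accessed (e.g. "DataFrame").
conn = sqlite3.connect("inspect.sqlite")  # placeholder filename

top_pandas = conn.execute(
    """
    SELECT attribute, COUNT(*) AS uses
    FROM api_usage
    WHERE namespace = 'pandas'
    GROUP BY attribute
    ORDER BY uses DESC
    LIMIT 10
    """
)
for attribute, uses in top_pandas:
    print(f"{attribute}: {uses}")

conn.close()
```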
Workflow
This package's components expose a SQLite database via Datasette. Originally this package provided CSV files with API usage statistics for each package, but static files cannot anticipate all the questions that users may have. Thus we provide a SQL interface for asking custom questions of the (currently) 6 GB database.
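Once the Datasette instance is running, any HTTP client can send SQL to it. Below is a minimal sketch using only the standard library, assuming a locally served database named inspect and the same hypothetical api_usage table as above; the real host, port, and schema may differ.

```python
import json
import urllib.parse
import urllib.request

# Datasette accepts raw SQL through the ?sql= query parameter and returns
# JSON; _shape=array yields a plain list of row objects. The host, database
# name, and table below are placeholders.
sql = """
    SELECT namespace, COUNT(*) AS uses
    FROM api_usage
    GROUP BY namespace
    ORDER BY uses DESC
    LIMIT 20
"""
url = "http://127.0.0.1:8001/inspect.json?" + urllib.parse.urlencode(
    {"sql": sql, "_shape": "array"}
)

with urllib.request.urlopen(url) as response:
    for row in json.load(response):
        print(row["namespace"], row["uses"])
```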
The scripts involved in this work:
- Assemble a list of important repositories/projects that depend on libraries such as numpy, scipy, requests, tensorflow, etc. This work would not be possible without libraries.io. (scripts/librariesio.sh)
- Construct the database by inspecting the source code and AST of every Python file and notebook in the repositories. (scripts/inspect.sh) A rough sketch of this technique follows the list.
- Expose the SQLite database via Datasette. (scripts/serve.sh)
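As a rough illustration of the AST-based technique used in the inspection step (not the package's actual implementation), the following sketch counts dotted attribute accesses such as numpy.array in a single piece of Python source:

```python
import ast
from collections import Counter

def count_attribute_usage(source: str) -> Counter:
    """Count dotted attribute accesses such as numpy.array in Python source.

    A rough sketch of AST-based inspection, not the actual implementation
    used by python-api-inspect.
    """
    counts = Counter()
    for node in ast.walk(ast.parse(source)):
        # Match patterns like <name>.<attr>, e.g. np.array or requests.get.
        if isinstance(node, ast.Attribute) and isinstance(node.value, ast.Name):
            counts[f"{node.value.id}.{node.attr}"] += 1
    return counts

if __name__ == "__main__":
    sample = "import numpy\nx = numpy.array([1, 2, 3])\ny = numpy.mean(x)\n"
    print(count_attribute_usage(sample))
    # Counter({'numpy.array': 1, 'numpy.mean': 1})
```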
Tests
The tests depend on pytest. They are a great demonstration of what python-api-inspect can capture. Run them with:

```
pytest
```