Generates profile reports from a pandas DataFrame.
The pandas df.describe() function is great but a little basic for serious exploratory data analysis. pandas_profiling extends the pandas DataFrame with df.profile_report() for quick data analysis.
For each column the following statistics - if relevant for the column type - are presented in an interactive HTML report:
- Type inference: detect the types of columns in a dataframe.
- Essentials: type, unique values, missing values
- Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
- Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
- Most frequent values
- Histogram
- Correlations: highlighting of highly correlated variables; Spearman, Pearson and Kendall matrices
- Missing values: matrix, count, heatmap and dendrogram of missing values
- Text analysis: learn about categories (Uppercase, Space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data.
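To give a feel for what the quantile and descriptive statistics in the list above mean, here is a minimal sketch that computes several of them for a single numeric column with plain pandas. The helper name `summarize` is hypothetical and this is not how pandas-profiling computes its report; it only illustrates the definitions.

```python
import numpy as np
import pandas as pd


def summarize(series: pd.Series) -> dict:
    """Compute a few of the listed statistics for one numeric column.

    Hypothetical helper for illustration; not part of pandas-profiling.
    Missing values are skipped by pandas in every aggregate used here.
    """
    q1, median, q3 = series.quantile([0.25, 0.5, 0.75])
    mean = series.mean()
    std = series.std()  # sample standard deviation (ddof=1)
    return {
        "n_missing": int(series.isna().sum()),
        "n_unique": int(series.nunique()),
        "min": series.min(),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": series.max(),
        "range": series.max() - series.min(),
        "iqr": q3 - q1,  # interquartile range
        "mean": mean,
        "std": std,
        "cv": std / mean,  # coefficient of variation
        "mad": (series - median).abs().median(),  # median absolute deviation
        "skewness": series.skew(),
        "kurtosis": series.kurt(),
    }


values = pd.Series([1.0, 2.0, 3.0, 4.0, np.nan])
stats = summarize(values)
```

The report presents these figures per column alongside histograms and frequency tables, so a helper like this is only useful for spot-checking a single value.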
The current dependency policy is suboptimal. Pinning the dependencies is great for reproducibility (high guarantee to work), but on the downside requires frequent maintenance and introduces compatibility issues with other packages. Therefore, we are moving away from pinning dependencies and instead specifying minimum versions.
Early releases of pandas v1 introduced many regressions that broke functionality (as acknowledged by the authors). At this point, pandas is more stable and we notice high demand for compatibility, so we now support pandas' latest versions. To ensure compatibility with both major versions, we have extended the test matrix to test against both pandas 0.x.y and 1.x.y.
Python 3.6 introduces ordered dicts and f-strings, which we now rely on. This means that from pandas-profiling 2.6 onwards, you need at least Python 3.6. Users who cannot update can use pandas-profiling 2.5.0, but unfortunately won't benefit from updates or maintenance.
Starting from this release, we use GitHub Actions and Travis CI combined to increase maintainability. Travis CI handles the testing, while GitHub Actions automates part of the development process by running black and building the docs.
With your help, we got approved for GitHub Sponsors! It's extra exciting that GitHub matches your contribution for the first year. Therefore, we welcome you to support the project through GitHub!
Find more information here:
April 14, 2020 💘
Contents: Examples | Installation | Documentation | Large datasets | Command line usage | Advanced usage | Types | How to contribute | Editor Integration | Dependencies
The following examples can give you an impression of what the package can do:
- Census Income (US Adult Census data relating income)
- NASA Meteorites (comprehensive set of meteorite landings)
- Titanic (the "Wonderwall" of datasets)
- NZA (open data from the Dutch Healthcare Authority)
- Stata Auto (1978 Automobile data)
- Vektis (Vektis Dutch Healthcare data)
- Website Inaccessibility (demonstrates the URL type)
- Colors (a simple colors dataset)
- Russian Vocabulary (demonstrates text analysis)
- Orange prices and Coal prices (showcase report themes)
- Tutorial: report structure using Kaggle data (advanced) (modify the report's structure)
You can install using the pip package manager by running
pip install pandas-profiling[notebook,html]
Alternatively, you can install the latest version directly from GitHub:
pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip
You can install using the conda package manager by running
conda install -c conda-forge pandas-profiling
Download the source code by cloning the repository or by pressing 'Download ZIP' on this page. Install by navigating to the proper directory and running
python setup.py install
The documentation for pandas_profiling can be found here.
Start by loading in your pandas DataFrame, e.g. by using
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport
df = pd.DataFrame(
    np.random.rand(100, 5),
    columns=['a', 'b', 'c', 'd', 'e']
)
To generate the report, run:
profile = ProfileReport(df, title='Pandas Profiling Report', html={'style':{'full_width':True}})
We recommend generating reports interactively by using the Jupyter notebook. There are two interfaces (see animations below): through widgets and through an HTML report.
This is achieved by simply displaying the report. In the Jupyter Notebook, run:
profile.to_widgets()
The HTML report can be included in a Jupyter notebook:
Run the following code:
profile.to_notebook_iframe()
If you want to generate an HTML report file, save the ProfileReport to an object and use the to_file() function:
profile.to_file(output_file="your_report.html")
Alternatively, you can obtain the data as JSON:
# As a string
json_data = profile.to_json()
# As a file
profile.to_file(output_file="your_report.json")
Version 2.4 introduces minimal mode. This is a default configuration that disables expensive computations (such as correlations and dynamic binning). Use the following syntax:
profile = ProfileReport(large_dataset, minimal=True)
profile.to_file(output_file="output.html")
For standard formatted CSV files that can be read immediately by pandas, you can use the pandas_profiling executable. Run
pandas_profiling -h
for information about options and arguments.
A set of options is available in order to adapt the report generated.
- title (str): Title for the report ('Pandas Profiling Report' by default).
- pool_size (int): Number of workers in thread pool. When set to zero, it is set to the number of CPUs available (0 by default).
- progress_bar (bool): If True, pandas-profiling will display a progress bar.
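The pool_size rule above (zero means "use all available CPUs") can be sketched as follows. The function name `resolve_pool_size` is hypothetical and only illustrates the described behaviour; it is not taken from the pandas-profiling source.

```python
import os


def resolve_pool_size(pool_size: int = 0) -> int:
    """Return the number of workers implied by a pool_size setting.

    Hypothetical helper: a value of zero is replaced by the CPU count,
    any positive value is used as given.
    """
    if pool_size < 0:
        raise ValueError("pool_size must be zero or a positive integer")
    if pool_size == 0:
        # os.cpu_count() may return None on some platforms; fall back to 1.
        return os.cpu_count() or 1
    return pool_size


workers_default = resolve_pool_size()   # all available CPUs
workers_fixed = resolve_pool_size(4)    # exactly four workers
```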
More settings can be found in the default configuration file, minimal configuration file and dark themed configuration file.
Example
profile = df.profile_report(title='Pandas Profiling Report', plot={'histogram': {'bins': 8}})
profile.to_file(output_file="output.html")
Types are a powerful abstraction for effective data analysis that goes beyond the logical data types (integer, float etc.).
pandas-profiling currently recognizes the following types:
- Boolean
- Numerical
- Date
- Categorical
- URL
- Path
We have developed a type system for Python, tailored for data analysis: visions.
Selecting the right typeset drastically reduces the complexity of the code of your analysis.
Future versions of pandas-profiling will have extended type support through visions!
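To illustrate how the types listed in this section differ from the dtypes pandas stores, here is a minimal sketch of a heuristic that maps a column to one of those type names. The functions `looks_like_url` and `infer_type` are hypothetical; this is neither the pandas-profiling nor the visions implementation, and the Path type is left out for brevity.

```python
from urllib.parse import urlparse

import pandas as pd
from pandas.api import types as ptypes


def looks_like_url(value) -> bool:
    """True if value is a string with both a scheme and a network location."""
    if not isinstance(value, str):
        return False
    parsed = urlparse(value)
    return bool(parsed.scheme and parsed.netloc)


def infer_type(series: pd.Series) -> str:
    """Hypothetical heuristic: map a column to one of the type names above."""
    # Check booleans first: pandas also treats bool columns as numeric.
    if ptypes.is_bool_dtype(series):
        return "Boolean"
    if ptypes.is_numeric_dtype(series):
        return "Numerical"
    if ptypes.is_datetime64_any_dtype(series):
        return "Date"
    non_null = series.dropna()
    if len(non_null) > 0 and non_null.map(looks_like_url).all():
        return "URL"
    # Anything else is treated as categorical in this sketch.
    return "Categorical"


df = pd.DataFrame(
    {
        "flag": [True, False],
        "amount": [1.5, 2.0],
        "when": pd.to_datetime(["2020-04-14", "2020-04-15"]),
        "site": ["https://example.com", "http://example.org/a"],
        "color": ["red", "blue"],
    }
)
inferred = {column: infer_type(df[column]) for column in df.columns}
```

Both "site" and "color" hold plain strings as far as pandas is concerned; the distinction between URL and Categorical only appears once the values themselves are inspected.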
The package is actively maintained and developed as open-source software.
If pandas-profiling was helpful or interesting to you, you might want to get involved.
There are several ways of contributing and helping our thousands of users.
If you would like to be an industry partner or sponsor, please drop us a line.
The documentation is generated using pdoc3.
If you are contributing to this project, you can rebuild the documentation using:
make docs
or on Windows:
make.bat docs
Read more on getting involved in the Contribution Guide.
1. Install pandas-profiling via the instructions above
2. Locate your pandas-profiling executable.
   On macOS / Linux / BSD:
   $ which pandas_profiling
   (example) /usr/local/bin/pandas_profiling
   On Windows:
   $ where pandas_profiling
   (example) C:\ProgramData\Anaconda3\Scripts\pandas_profiling.exe
3. In PyCharm, go to Settings (or Preferences on macOS) > Tools > External tools
4. Click the + icon to add a new external tool
5. Insert the following values
   - Name: Pandas Profiling
   - Program: The location obtained in step 2
   - Arguments: "$FilePath$" "$FileDir$/$FileNameWithoutAllExtensions$_report.html"
   - Working Directory: $ProjectFileDir$
To use the PyCharm integration, right-click on any dataset file: External Tools > Pandas Profiling.
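The Arguments value in the external tool settings above pairs the selected file with an output path built from PyCharm macros. A rough sketch of the resulting report location, assuming that $FileNameWithoutAllExtensions$ means "the file name up to its first dot" and using a hypothetical helper `report_path`:

```python
from pathlib import PurePosixPath


def report_path(file_path: str) -> str:
    """Mimic "$FileDir$/$FileNameWithoutAllExtensions$_report.html".

    Hypothetical helper; the macro semantics are an assumption.
    """
    path = PurePosixPath(file_path)
    # Assumption: dropping "all extensions" keeps the text before the first dot.
    stem = path.name.split(".", 1)[0]
    return str(path.parent / f"{stem}_report.html")


example = report_path("/data/titanic.csv.gz")
```

Under these assumptions the report is written next to the dataset, so /data/titanic.csv.gz would produce /data/titanic_report.html.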
Other editor integrations may be contributed via pull requests.
The profile report is written in HTML and CSS, which means pandas-profiling requires a modern browser.
You need Python 3 to run this package. Other dependencies can be found in the requirements files:
Filename | Requirements |
---|---|
requirements.txt | Package requirements |
requirements-dev.txt | Requirements for development |
requirements-test.txt | Requirements for testing |
setup.py | Requirements for Widgets etc. |