An investigation into logging data operations performed using the Pandas data analysis Python library.
This project is an investigation into potential approaches for logging data operations conducted using the Python pandas
library.
I often find the need to document and review data pipelines used to cleanse data or engineer features used in my analyses. However:
- Short of reviewing the actual code used to perform those data operations, actions performed using ``pandas`` leave no record;
- Because ``pandas`` does not natively implement any logging functionality, changes made to data operations are not readily available for later review unless care was taken to document them manually.
A log stream saved to file would provide an auditable and portable record of actions performed on the data under analysis. It could also provide a human-readable record for non-developer analysts who want to replicate the analysis using GUI analytics software such as Excel or Tableau.
Therefore, this project is an investigation into potential approaches to achieve this sort of logging functionality in a repeatable manner.
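As a rough illustration of the kind of record described above, the following is a minimal sketch only, not a finding of this investigation. It wraps a data operation in a decorator that logs the shape of the DataFrame before and after the step, using only the standard ``logging`` module and ``DataFrame.pipe``. The names ``log_step``, ``drop_missing`` and the logger name ``pandas_pipeline`` are hypothetical and chosen purely for this example.

```python
import functools
import io
import logging

import pandas as pd


def log_step(logger):
    """Hypothetical decorator: log DataFrame shape before and after a step."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(df, *args, **kwargs):
            result = func(df, *args, **kwargs)
            # Record the step name and how the row/column counts changed.
            logger.info(
                "%s: rows %d -> %d, columns %d -> %d",
                func.__name__,
                df.shape[0], result.shape[0],
                df.shape[1], result.shape[1],
            )
            return result
        return wrapper
    return decorator


# An in-memory stream keeps this sketch free of side effects; swapping in
# logging.FileHandler with a file path would persist the record to disk.
stream = io.StringIO()
logger = logging.getLogger("pandas_pipeline")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler(stream))
logger.propagate = False


@log_step(logger)
def drop_missing(df):
    """Example cleansing step: drop rows containing missing values."""
    return df.dropna()


raw = pd.DataFrame({"a": [1.0, None, 3.0], "b": ["x", "y", "z"]})
clean = raw.pipe(drop_missing)
log_text = stream.getvalue()
```

Each decorated step then leaves one line in the log describing what it did to the data. This approach assumes every operation is written as a function that takes and returns a DataFrame, and it would not capture ad hoc method calls made outside such functions.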
.. note::

   - This project starts from a point of personal exploration, and the degree to which I codify my findings into something useful for others has yet to be determined.
   - Therefore, documentation related to this investigation will likely be sparse unless/until I happen upon meaningful findings or an approach of real use.
As a starting point, this project contains the source code for two separate Open Source Python libraries written by authors other than myself:
- ``pdlog``, written by Wasim Lorgat and hosted by the DataProphet GitHub organization: https://github.com/DataProphet/pdlog
- ``pandas-log``, written by Eyal Trabelsi and located at: https://github.com/eyaltrabelsi/pandas-log
My initial intent is to:
- Explore, in-depth, the methods employed by these two libraries,
- Learn from the efforts of these two library authors,
- Consider their methods relative to approaches I have attempted in previous projects of my own.
.. note::

   All sections below are still just boilerplate. Thus, they do not reflect any specifics having to do with this project.
The analysis and findings associated with this project can be found here:
https://sedelmeyer.github.io/pandas-logging
Documentation for the Python modules built specifically for this analysis (i.e. modules located in the ``./src/`` directory of this project) can be found here:
https://sedelmeyer.github.io/pandas-logging/modules.html
In order to replicate this analysis and run the Python code available in this analysis locally, follow these steps:
.. todo::

   * Below is a placeholder template containing typical steps required to replicate a PyData project.
   * Content must be added to each section, outlining requirements and explaining how to replicate the analysis locally.
If you'd like to build off of this project to explore additional methods or to practice your own data science and development skills, below are some important notes regarding the configuration of this project.
.. todo::

   * Below are placeholder sections for explaining important characteristics of this project's configuration.
   * This section should contain all details required for someone else to easily begin adding additional development and analyses to this project.
.. todo::

   * Add details on the best method for others to reach you regarding questions they might have or issues they identify related to this project.

.. todo::

   * Add links to further reading and/or important resources related to this project.