The `epana` package provides some general functions for exploratory data analysis and descriptive statistics. It is not meant to provide a complete set of tools but rather to capture functions, algorithms, and processes that I've found myself repeating even while using openly available high-level libraries. The project additionally contains scripts and notebooks related to very specific data handling and analysis.
Currently, the packaged utility functions cover the following general areas:

- Logging -- `logutils.py`
- Reading Encrypted Files -- `cryptic.py`
- Characters and Encoding -- `scrubdub.py`
- Tabular Data -- `tabular.py`
- Relational Data -- `crosstabular.py`
All of these modules need to be refactored to operate on file-like objects (i.e., like the file object returned by `open`; e.g., `StringIO` or `BytesIO`) instead of on file names. This will allow more graceful piping and process chaining. Alternatively, and more permissively, they could operate on any object with a `read()` method (as in the approach of `pandas.read_csv`).
The above modules make direct, heavy use of the following third-party packages, which are build requirements:

- `pandas`
- `numpy`
- `cchardet`
- `ftfy`

The `cryptic` module additionally uses `paramiko` and `python-gnupg`, but they are not listed as build requirements.
## Logging -- `logutils.py`

Provides a wrapper for Python's `logging` module.
- Defines five log levels: `debug`, `info`, `warning`, `error`, and `critical`.
- The function `get_logger` configures basic logging and takes only a `logname`, which will often be of function-level specificity.
- The function `mstime` returns the current time in milliseconds.
TODO: Create a log decorator that automatically sets the log name based on module and function name, gets the logger, and logs entry/exit debug (or info) messages with timings.
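Such a decorator could look like the following sketch. It uses the stdlib `logging` directly; the decorator name `logged` and the message format are assumptions, and the real version would presumably call `get_logger` instead:

```python
import functools
import logging
import time

def logged(func):
    """Log entry/exit at DEBUG level, with elapsed wall time in milliseconds.

    Sketch of the TODO above; name and format are illustrative only.
    """
    logname = f"{func.__module__}.{func.__qualname__}"
    logger = logging.getLogger(logname)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logger.debug("enter %s", logname)
        start = time.monotonic()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed_ms = (time.monotonic() - start) * 1000.0
            logger.debug("exit %s after %.1f ms", logname, elapsed_ms)
    return wrapper

@logged
def square(x):
    return x * x
```

`functools.wraps` preserves the wrapped function's name and docstring, and the `try`/`finally` ensures the exit message (with timing) is logged even when the function raises.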
## Reading Encrypted Files -- `cryptic.py`

Provides functions for reading PGP-encrypted files and, insulting principles of coherence, also provides functions for getting remote files via SSH.

This module is not very portable: it assumes Linux, GPG2, and `known_hosts` locations.
- The function `head` decrypts the first compression block of a file, remote or local, and displays `N` lines or bytes.
- The function `fopen` is a context manager for opening files in bytes mode; it can decrypt inline and can take a local file handle or a remote URL in the form `user@server:path`.
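Distinguishing the `user@server:path` form from a local path can be done with a small parser. A sketch, where the function name `parse_remote` is hypothetical and not part of the module's API:

```python
import re

def parse_remote(url):
    """Return {'user', 'server', 'path'} for a user@server:path URL,
    or None when the argument looks like a local path.

    Hypothetical helper illustrating the URL form described above.
    """
    m = re.match(r"^(?P<user>[^@:/]+)@(?P<server>[^@:/]+):(?P<path>.+)$", url)
    return m.groupdict() if m else None
```

A local path such as `/data/file.gpg` fails the match (no `user@server:` prefix) and falls through to `None`, so callers can branch on the result.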
## Characters and Encoding -- `scrubdub.py`

Provides functions that operate on byte or character arrays, streams, or buffers for the purpose of character, string, and data-type classification.
- Guess and fix file encoding with the functions `guess_encoding` and `fix_unicode_and_copy`.
- Count characters and extract patterns with the functions `count_chars`, `count_charclasses`, `tag_chrs`, and `chunk_chrs`.
- And other stuff.
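Character-class counting of this kind can be sketched with the stdlib `unicodedata` module; the actual signature and categories used by the package's `count_charclasses` may differ:

```python
import unicodedata
from collections import Counter

def count_charclasses(text):
    """Count characters by Unicode general category, e.g. 'Lu' (uppercase
    letter), 'Ll' (lowercase letter), 'Nd' (decimal digit), 'Zs' (space).

    Sketch only; the package's own function may use different classes.
    """
    return Counter(unicodedata.category(ch) for ch in text)
```

For example, `count_charclasses("Ab1 ")` yields one each of `Lu`, `Ll`, `Nd`, and `Zs`.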
TODO: Add smart data-type and semantic classification (like determining if a string is a valid identifier in different coding schemes) based on character patterns.
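One shape this classification could take, sketched with hypothetical tag names (`python_identifier`, `digits`, and `numeric` are illustrative, not a committed scheme):

```python
def classify_token(s):
    """Return rough semantic tags for a string based on its pattern.

    Hypothetical sketch of the TODO above; tags are illustrative only.
    """
    tags = []
    if s.isidentifier():        # valid Python identifier
        tags.append("python_identifier")
    if s.isdigit():             # all decimal digits
        tags.append("digits")
    try:                        # parseable as a number
        float(s)
        tags.append("numeric")
    except ValueError:
        pass
    return tags
```

A fuller version would consult the character-pattern functions above (`tag_chrs`, `chunk_chrs`) rather than ad-hoc string predicates.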
## Tabular Data -- `tabular.py`

Loads, reduces, and summarizes tabular data. Currently does not support fixed-width-field data.
- Guess the field-separator dialect with `guess_dialect`.
- Load tabular data into Pandas `DataFrame`s using `get_df_raw`, `load_files`, or `df_from_sql`.
- Operate on Pandas `Series` with various aggregating functions.
- Inspect the in-memory footprint of data and reduce the size of dataframes using `get_mem_usage`, `get_reduced_dtypes`, and `shrink_df`.
- Summarize the content of dataframes using `freq` and `get_summary`.
- And more.
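Dialect guessing along these lines is available in the stdlib. A minimal sketch using `csv.Sniffer`; the package's own `guess_dialect` may be implemented differently:

```python
import csv

def guess_dialect(sample):
    """Infer delimiter and quoting conventions from a text sample.

    Sketch only; wraps the stdlib csv.Sniffer.
    """
    return csv.Sniffer().sniff(sample)

# The sniffer detects the semicolon delimiter from a small sample:
dialect = guess_dialect("a;b;c\n1;2;3\n")
```

The returned dialect object can be passed straight to `csv.reader` (or to `pandas.read_csv` via its `sep`/`quotechar` parameters).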
## Relational Data -- `crosstabular.py`

Counts relational patterns between different tables of data. This is valuable when receiving relational data in ways that do not enforce relational integrity.
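As an illustration of the kind of pattern counting involved, here is a hedged sketch with a hypothetical helper (not the module's actual API):

```python
from collections import Counter

def fk_pattern_counts(child_keys, parent_keys):
    """For each child row's foreign key, count how many parent rows it
    matches, then tally those match counts (0 = orphan, >1 = duplicate).

    Hypothetical helper illustrating the idea; not crosstabular's API.
    """
    parent_counts = Counter(parent_keys)
    return Counter(parent_counts.get(k, 0) for k in child_keys)

# Child keys 2 and 2 each match exactly one parent; key 1 matches two
# parents (a duplicated parent key); key 3 matches none (an orphan):
counts = fk_pattern_counts([1, 2, 2, 3], [1, 1, 2])
```

A clean one-to-many relationship would show only the match count `1`; any mass at `0` or above `1` flags the integrity problems described above.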
TODO: Add modules for dimensional reduction, especially for relational data under the assumption that central entities are specified.
TODO: Add stats modules.