Python 3 library to identify high-dimensional statistical relationships in any data set.

Primary language: Python. License: GNU General Public License v3.0 (GPL-3.0).

Associations

Associations is a Python 3 library that identifies high-dimensional statistical relationships in any data set. It is useful as a first-pass data analysis tool that provides:

  • A high-level overview of every potential relationship in a data set.
  • Support for arbitrary dimensionality, identifying the relationships between every combination of elements.

This library assigns a relative "association" score for every n*m dimensional relationship of elements in the data set, and optionally outputs some graphs to illustrate those scores. You may use other libraries to apply more advanced or formalized statistical models to understand the finer details of those relationships.

This library has not had any changes since 2017, so it is due for an upgrade. Please open any issues or pull requests to make improvements. In the future, I may make significant breaking changes or fully replace much of the library functionality.

Installation

The latest release can be found here. For a direct download of the development version (latest revision, not latest release), click here.

  • Dependencies: NumPy, matplotlib, multiprocessing (part of the Python standard library)
  • Build Dependencies: git (only for Arch Linux python-associations-git package)

This is not compatible with Python 2.

Universal

Run this if you would like to install associations directly into Python without the use of a package manager. This should be compatible with any system.

$ python setup.py sdist
# python setup.py install

Arch Linux

If you would like to install the latest development version (latest revision), you should install the python-associations-git package. All you have to do is download the PKGBUILD and the ABS will automatically download the source and install the package. You can keep reusing the same PKGBUILD. It will automatically update the version number based on the revision.

To install python-associations-git, run this in an empty directory:

$ wget https://raw.githubusercontent.com/dnut/PKGBUILDs/master/assocations/python-associations-git/PKGBUILD
$ makepkg -si

If you would like to install the latest release, you should install the python-associations package. You can download it here and install using the included PKGBUILD. To update this package, you will need to download the release from that page.

To install python-associations, cd into the python-associations directory and run this command:

$ makepkg -si

You can also download the source code for the latest revision manually and install using the included PKGBUILD. I only recommend this if you are either contributing to development or forking your own local version of the package.

Overview

The library counts occurrences with a histogram, finds associations between different fields, and provides tools that aid in analyzing the resulting data.

libassoc.py

This file contains the most generic procedures that do not belong in any created classes. They are convenient procedures for Python's fundamental data structures.

histogram.py - Histogram()

The primary job of a Histogram() object is to traverse a CSV file and create a NumPy array with as many dimensions as fields we wish to record and to fill that array with the count for every possible occurrence. This is accomplished with the count() method. Access to the internal data structure is provided via the get() method.
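The counting step described above can be sketched roughly as follows. This is a minimal illustration and not the library's count() implementation: the function name count_occurrences, the sample CSV text, and the field values are all invented for the example, and the lookup lists only mirror the idea behind valists and valdicts.

```python
import csv
import io

import numpy as np

# Hypothetical sample data; the field names and values are invented.
CSV_TEXT = """sex,day,diagnosis
male,Tuesday,fracture
female,Tuesday,burn
male,Monday,fracture
male,Tuesday,fracture
"""


def count_occurrences(csv_file, fields):
    """Count every combination of values for the chosen fields."""
    rows = list(csv.DictReader(csv_file))
    # One sorted list of distinct values per field (compare: valists).
    valists = [sorted({row[f] for row in rows}) for f in fields]
    # Inverted lookup, value string -> axis index (compare: valdicts).
    valdicts = [{v: i for i, v in enumerate(vals)} for vals in valists]
    # One array dimension per field, one slot per distinct value.
    hist = np.zeros([len(vals) for vals in valists], dtype=int)
    for row in rows:
        index = tuple(d[row[f]] for f, d in zip(fields, valdicts))
        hist[index] += 1
    return hist, valists, valdicts


hist, valists, valdicts = count_occurrences(io.StringIO(CSV_TEXT), ["sex", "day"])
```

Here the array has two dimensions because two fields were requested, and the count for any value combination is read by translating each string to its index first.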

Attribute        Description
fields           Table fields that we want to measure.
histogram        NumPy array containing counts.
valists          List of lists containing strings of each field's values.
valdicts         List of dicts, inverted valists (key = string, val = int)
valists_dict     Dict of valists keyed by field names.
valdicts_dict    Dict of valdicts keyed by field names.
field_index      Keys field values to field names.
field_index_int  Keys field values to valists/valdicts index (int).
nonzero_indices  Indices for all nonzero values in the histogram.

Method           Description
count()          Count all occurrences for every possible situation
useful_stuff()   Expose the string values for quantitative internal data structure.
reduce()         Return new Histogram() with provided numpy array. Used by simplify() and slice().
simplify()       Return new Histogram() with fewer dimensions by summing undesired dimensions. For example, create a histogram that drops the sex dimension. All remaining fields have combined value for both male and female.
slice()          Return new Histogram() with fewer dimensions by isolating a specific situation. For example, create a histogram representing only males with no data for females.
nonzeros()       Generator function that iterates through every nonzero element, optionally providing string representations.
get()            Retrieve count for any field value combination.
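The difference between simplify() and slice() can be illustrated on a bare NumPy array. The sketch below is an assumption about the underlying array operations, not the library's code: the functions take and return plain arrays rather than Histogram() objects, slice_field is a hypothetical name, and the counts are invented.

```python
import numpy as np

# Hypothetical 3-D counts with axes (sex, day, diagnosis); numbers are invented.
fields = ["sex", "day", "diagnosis"]
counts = np.arange(2 * 3 * 2).reshape(2, 3, 2)


def simplify(counts, fields, drop):
    """Sum away one field so the remaining counts cover all of its values."""
    axis = fields.index(drop)
    kept = [f for f in fields if f != drop]
    return counts.sum(axis=axis), kept


def slice_field(counts, fields, field, value_index):
    """Keep only the entries where `field` takes one specific value."""
    axis = fields.index(field)
    kept = [f for f in fields if f != field]
    return np.take(counts, value_index, axis=axis), kept


# Drop the sex dimension: each cell now combines both sexes.
no_sex, kept = simplify(counts, fields, "sex")
# Isolate the first sex value: the other value's data is left out.
only_first_sex, _ = slice_field(counts, fields, "sex", 0)
```

Both results have one dimension fewer than the input, but only the summed version still accounts for every recorded occurrence.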

associations.py

Contains two classes that serve to identify associations in a Histogram(). Associator() finds associations for a specific field combination and Associations() uses Associator() objects to find all associations.

Associator() is a distinct class rather than integrating its methods into Associations() because Associations() uses multiprocessing to dramatically improve execution time on multi-core systems, and it needs relatively isolated objects to be passed to subprocesses. This implementation is intended to be superior to the redundancy of many Associations() objects or the complexity of queues and pipes without hurting code legibility or efficiency.

Associator()

The Associator() object identifies associations between different field values (e.g., fatalities and amputations) by comparing one group to a larger group that encompasses it.

Knowing that white males are injured on Tuesday more frequently than black males is not very useful information because it is likely caused by there being more white males than black males. Furthermore, knowing that white males are injured more often on Tuesday than on other days doesn't tell us whether white males and Tuesday are associated, because Tuesdays may simply have more injuries overall. Therefore, we must establish a standardized numerical value that represents the actual association between two fields by taking into consideration the overall populations we are sampling from.

As another example, if we want to find the association between amputations and fatalities (diagnosis and disposition), we need to take the same approach. While the likelihood that an amputation is fatal is valuable information, we are more interested in the relative fatality of different diagnoses. Amputations may have a very low likelihood of fatality, but we must compare it to the likelihood that any other diagnosis leads to fatality before we discover whether amputations are relatively likely to be fatal. Therefore, we must take into consideration the extreme infrequency of fatalities in general to get a standardized numerical representation of how associated each field is.

There are two mathematically equivalent approaches to resolve this dilemma. One approach is to divide the number of fatal amputations by the number of amputations with any disposition, which yields the likelihood that an amputation is fatal. We then normalize this likelihood by dividing it by the overall likelihood of fatality (total fatalities / total of everything). The result is the association ratio between amputation and fatality.

Identical results would be reached by first dividing fatal amputations by all fatalities (likelihood that a fatality is caused by amputation) and then dividing that by the average likelihood that an amputation is the cause of any disposition (total amputations / total of everything). This results in the exact same association ratio as the first approach.

Both approaches are the same algorithm run in opposite directions. They are also mathematically equivalent since they both result in the same calculation:

association between amputations and fatalities = [(fatal amputations) * (total of everything)] / [(fatalities) * (amputations)]
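As a quick numerical check of the equivalence, the sketch below computes the ratio by both approaches and by the general formula. All counts are invented for illustration, and exact fractions are used so the comparison is not affected by floating-point rounding.

```python
from fractions import Fraction

# Invented counts, for illustration only.
total = 10_000          # every recorded case
amputations = 200       # amputations with any disposition
fatalities = 50         # fatalities with any diagnosis
fatal_amputations = 4   # cases that are both

# Approach 1: likelihood that an amputation is fatal,
# divided by the overall likelihood of fatality.
first = Fraction(fatal_amputations, amputations) / Fraction(fatalities, total)

# Approach 2: likelihood that a fatality is an amputation,
# divided by the overall likelihood of amputation.
second = Fraction(fatal_amputations, fatalities) / Fraction(amputations, total)

# General formula: (fatal amputations * total) / (fatalities * amputations).
general = Fraction(fatal_amputations * total, fatalities * amputations)

print(float(general))  # prints 4.0
```

With these numbers all three values equal 4, meaning an amputation is four times as likely to be fatal as a case chosen without regard to diagnosis. A ratio of exactly 1 would mean the two values occur together exactly as often as their overall frequencies predict.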

Originally, for efficiency, I used a specialized version of the aforementioned algorithm (calculate likelihoods then divide) in order to naturally cache totals and subtotals for multiple situations. Unfortunately, this led to a very complex and confusing algorithm.

To keep the algorithm simple, I have written a new one optimized to use the general formula as efficiently as possible. I have actually gotten it to be more efficient than the original algorithm. This algorithm is significantly less complex. It is more maintainable and easier to understand and use, so it is favored.

I still see some potential to optimize a few places in the algorithm to improve efficiency even further, but this would require a lot of benchmarking and will probably not be a huge improvement, so it is not my top priority.

Attribute    Description
notable      Minimum association ratio (or inverse) to be included.
significant  Minimum number of occurrences (statistical significance).
assoc        Associations organized by association then subgroup.
subpops      Associations organized by subgroup then association.
hist         Histogram() object to extract data from.

Method       Description
add()        Save association ratio.
find()       Find the association ratio for every field value combination among a specific field name combination.
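One plausible reading of the notable and significant thresholds is sketched below. This is an interpretation of the attribute descriptions, not code taken from the library: the function name is_reportable and the default values are hypothetical, and "or inverse" is assumed to mean that a ratio at or below 1 / notable also qualifies.

```python
def is_reportable(ratio, occurrences, notable=2.0, significant=10):
    """Decide whether an association ratio is worth keeping.

    Assumed rule: require at least `significant` occurrences, then keep
    ratios that are at least `notable` (strong positive association) or
    at most 1 / notable (strong negative association).
    """
    if occurrences < significant:
        return False
    return ratio >= notable or ratio <= 1 / notable


kept = [
    (ratio, n)
    for ratio, n in [(4.0, 25), (0.25, 25), (1.2, 25), (4.0, 3)]
    if is_reportable(ratio, n)
]
```

Under this reading, a ratio close to 1 is dropped however many occurrences support it, and an extreme ratio is dropped when it rests on too few occurrences.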

Associations()

Attributes: self.pairs and self.subpops contain all association ratios.

>>> self.pairs
{
	pair_type: {
		frozenset(association_pair): {
			frozenset(subgroup/subpopulation): association_ratio
		}
	}
}
>>> self.subpops
{
	subgroup_type: {
		frozenset(subgroup/subpopulation): {
			frozenset(association_pair): association_ratio
		}
	}
}
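The two layouts hold the same ratios under a different key order. The sketch below fills both with one invented entry. It assumes that pair_type and subgroup_type are frozensets of field names, which the schematic above does not state, and every value shown is hypothetical.

```python
from collections import defaultdict

# Invented entry: within the subgroup {male}, the pair {amputation, fatality}.
pair_type = frozenset({"diagnosis", "disposition"})  # assumed: field names
subgroup_type = frozenset({"sex"})                   # assumed: field names
pair = frozenset({"amputation", "fatality"})
subgroup = frozenset({"male"})
ratio = 4.0

# Three levels of nesting: type -> first key -> second key -> ratio.
pairs = defaultdict(lambda: defaultdict(dict))
subpops = defaultdict(lambda: defaultdict(dict))

pairs[pair_type][pair][subgroup] = ratio
subpops[subgroup_type][subgroup][pair] = ratio

# frozensets are hashable and unordered, so the lookup does not depend
# on the order in which the pair's members are written.
same = pairs[pair_type][frozenset({"fatality", "amputation"})][subgroup]
```

Keeping both layouts trades memory for lookup convenience: one answers "which subgroups show this pair?" and the other answers "which pairs stand out in this subgroup?".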

Method             Description
find_all()         Use multiprocessing pool to test every field name combination using Associator().find().
helper()           Runs Associator().find(). Needed for multiprocessing.
add()              Add entire Associator()'s data structures to Associations() object using merge().
merge()            Lower level dictionary processor than add().
report()           Report associations between two fields.
subgroup_report()  Report associations for any pairs within a subgroup/subpopulation.
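The division of labor between find_all() and helper() follows a common multiprocessing pattern, sketched below with standard library calls only. This is not the library's code: the worker here is a placeholder that does no statistics, field_combinations is a hypothetical helper, and restricting the combinations to pairs is an assumption.

```python
from itertools import combinations
from multiprocessing import Pool


def helper(field_combo):
    """Stand-in for the work done in a subprocess.

    A real worker would run Associator().find() for this combination;
    this placeholder only returns the combination and its size.
    """
    return field_combo, len(field_combo)


def field_combinations(fields, size=2):
    """List every combination of field names of the given size."""
    return list(combinations(fields, size))


def find_all(fields, processes=None):
    """Hand each field name combination to a pool of worker processes."""
    combos = field_combinations(fields)
    # helper is a module-level function so it can be pickled and sent
    # to the workers; processes=None lets Pool pick a worker count.
    with Pool(processes) as pool:
        return dict(pool.map(helper, combos))
```

When calling find_all(["sex", "day", "diagnosis"]) from a script, place the call under an `if __name__ == "__main__":` guard, which multiprocessing requires on platforms that spawn worker processes instead of forking them.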

analysis.py

Contains two classes, Analysis() and AsciiTable()

Analysis()

Analyze data from Histogram() and Associations().

Attribute          Description
hist               Histogram()
assoc              Associations()
gen_assoc          Average association ratios for combo types.
maxes and mins     Max and min association ratios for combo types.

Method             Description
make_hist()        Create data structure for a histogram plot.
prep_hist()        Used by make_hist() to include only notable data.
plot_hist()        Use data from make_hist() to create an actual plot.
plot_assoc()       Use make_hist() and plot_hist() for specific purpose of plotting association ratios between two field names.
nice_plot_assoc()  Try plot_assoc() with various notable values to create a legible plot containing meaningful data.
plot_all()         Run nice_plot_assoc() for every field combination.
max_helper()       Find mins and maxes while making hists.
most_common()      Most common occurrences.
most_assoc()       Most associated occurrences.
extremes()         Most associated occurrences (broader).

AsciiTable()

Attribute        Description
tables           List of table strings.

Method           Description
table()          Draw ascii table.
table_section()  Format data into a section to be interpreted by table().