Caspailleur is a Python package for mining concepts and implications in binary data within the Formal Concept Analysis (FCA) framework. It is part of the SmartFCA ANR project.
The stable version of the package can be installed from PyPI with:
pip install caspailleur
and the latest version can be installed from the GitHub repository:
pip install caspailleur@git+https://github.com/EgorDudyrev/caspailleur
The field of Formal Concept Analysis has many mathematical terms and some conflicting notation traditions.
Here is the glossary used throughout the caspailleur package: Glossary.md.
Let us study the "Famous Animals" dataset from the FCA repository:
import caspailleur as csp
df, meta = csp.io.from_fca_repo('famous_animals_en')
print(meta)
print(df.replace({True: 'X', False: ''}))
{ 'title': 'Famous Animals',
'source': 'Priss, U. (2006), Formal concept analysis in information science. Ann. Rev. Info. Sci. Tech., 40: 521-543. p.525',
'size': {'objects': 5, 'attributes': 6},
'language': 'English',
'description': 'famous animals and their characteristics' }
 | cartoon | real | tortoise | dog | cat | mammal |
---|---|---|---|---|---|---|
Garfield | X | | | | X | X |
Snoopy | X | | | X | | X |
Socks | | X | | | X | X |
Greyfriar's Bobby | | X | | X | | X |
Harriet | | X | X | | | |
The resulting df object is a pandas DataFrame.
Other supported context data types can be found in the Supported data types section.
Caspailleur works fast enough with datasets of hundreds of objects and dozens of attributes.
Here, we choose a tiny dataset for illustrative purposes.
Tip
The Caspailleur package only works with binary data (and is optimised for it). For complex non-binary data, consult the Paspailleur package, which extends Caspailleur's functionality.
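If your data is not binary to begin with, one common workaround is to one-hot encode it before mining. Below is a minimal sketch using plain pandas (the toy table is hypothetical and not part of the dataset above):
import pandas as pd
import caspailleur as csp

# A toy non-binary table (hypothetical, for illustration only)
raw = pd.DataFrame(
    {'kind': ['cat', 'dog', 'tortoise'], 'origin': ['cartoon', 'cartoon', 'real']},
    index=['Garfield', 'Snoopy', 'Harriet']
)

# One-hot encode every column into boolean attributes like "kind=cat"
binary = pd.get_dummies(raw, prefix_sep='=').astype(bool)
toy_concepts = csp.mine_concepts(binary)  # now a valid binary context for caspailleur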
Now we can find all concepts in the data:
concepts_df = csp.mine_concepts(df)
print(concepts_df[['extent', 'intent']].map(', '.join))
Concepts table (13 rows)
concept_id | extent | intent |
---|---|---|
0 | Snoopy, Socks, Harriet, Greyfriar's Bobby, Garfield | |
1 | Greyfriar's Bobby, Socks, Harriet | real |
2 | Garfield, Greyfriar's Bobby, Snoopy, Socks | mammal |
3 | Garfield, Snoopy | cartoon, mammal |
4 | Harriet | real, tortoise |
5 | Greyfriar's Bobby, Socks | real, mammal |
6 | Greyfriar's Bobby, Snoopy | dog, mammal |
7 | Garfield, Socks | mammal, cat |
8 | Snoopy | dog, cartoon, mammal |
9 | Garfield | cartoon, mammal, cat |
10 | Greyfriar's Bobby | real, mammal, dog |
11 | Socks | real, mammal, cat |
12 | | dog, cat, real, mammal, tortoise, cartoon |
The number of concepts can grow exponentially with the number of objects and attributes in the data. To find only the most interesting concepts, specify the min_support, min_delta_stability and/or n_stable_concepts parameters:
concepts_df = csp.mine_concepts(
    df, min_support=3, min_delta_stability=1,
    to_compute=['intent', 'keys', 'support', 'delta_stability', 'sub_concepts']
)
print(concepts_df)
concept_id | intent | keys | support | delta_stability | sub_concepts |
---|---|---|---|---|---|
0 | set() | [set()] | 5 | 1 | {1, 2} |
1 | { real } | [{ real }] | 3 | 1 | set() |
2 | { mammal } | [{ mammal }] | 4 | 2 | set() |
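The n_stable_concepts parameter mentioned above can also be used on its own. A small sketch (the parameter name comes from the text above; its exact semantics are assumed here, so consult the docstring of csp.mine_concepts):
# Keep only the few most stable concepts
# (assumption: n_stable_concepts limits the output by delta-stability, as its name suggests)
top_concepts_df = csp.mine_concepts(df, n_stable_concepts=5)
print(len(top_concepts_df))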
For many datasets, the number of concepts is too large to read by hand. Luckily, relationships between attributes can be described via implication bases, which usually contain far fewer elements (although many implications may select no objects).
implications_df = csp.mine_implications(df)
print(implications_df[['premise', 'conclusion', 'support']])
Implications table (10 rows)
implication_id | premise | conclusion | support |
---|---|---|---|
0 | {cartoon} | {mammal} | 2 |
1 | {tortoise} | {real} | 1 |
2 | {dog} | {mammal} | 2 |
3 | {cat} | {mammal} | 2 |
4 | {dog, cat} | {tortoise, real, cartoon} | 0 |
5 | {tortoise, mammal} | {dog, cat, cartoon} | 0 |
6 | {tortoise, cat} | {dog, cartoon} | 0 |
7 | {dog, tortoise} | {cat, cartoon} | 0 |
8 | {tortoise, cartoon} | {dog, cat} | 0 |
9 | {real, cartoon} | {dog, tortoise, cat} | 0 |
We can read the implications in the table to find dependencies in the data. For example:
- every famous cartoon animal is a mammal (from impl. 0: cartoon -> mammal);
- one can find famous tortoises only in real life (from impl. 1: tortoise -> real);
- nobody is a dog and a cat at the same time (from impl. 4: dog, cat -> ... with support 0).
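Incompatibilities like the last example can also be extracted programmatically. A minimal pandas sketch over the implications_df computed above (not a dedicated caspailleur function):
# Implications with zero support describe attribute combinations that never occur together
incompatible = implications_df[implications_df['support'] == 0]
print(incompatible[['premise', 'conclusion']])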
If finding the full implication basis takes too much time, one can mine only a subset of the columns and implications:
implications_df = csp.mine_implications(
    df, basis_name='Canonical', unit_base=True,
    to_compute=['premise', 'conclusion', 'extent'],
    min_support=1, min_delta_stability=1
)
print(implications_df)
implication_id | premise | conclusion | extent |
---|---|---|---|
0 | {cat} | mammal | {Socks, Garfield} |
1 | {dog} | mammal | {Greyfriar's Bobby, Snoopy} |
2 | {tortoise} | real | {Harriet} |
3 | {cartoon} | mammal | {Snoopy, Garfield} |
The supported bases are the Canonical basis (a.k.a. the Pseudo-Intent or Duquenne-Guigues basis) and the Canonical Direct basis (a.k.a. the Proper Premise or Karell basis). Every basis can also be transformed into a unit base where every conclusion consists of a single attribute.
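For example, one can rerun the miner with a different basis (the exact spelling of the basis_name value below is an assumption; check the docstring of csp.mine_implications):
# Proper-premise (Canonical Direct) basis with one-attribute conclusions
direct_df = csp.mine_implications(df, basis_name='Canonical Direct', unit_base=True)
print(direct_df[['premise', 'conclusion']])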
Finally, Caspailleur can output all descriptions in the data and their characteristics.
Note, however, that the number of descriptions equals 2^(number of attributes).
descriptions_df = csp.mine_descriptions(df)
print('__n. attributes:__', df.shape[1])
print('__n. descriptions:__', len(descriptions_df))
print('__columns:__', ', '.join(descriptions_df.columns))
print(descriptions_df[['description', 'support', 'is_key']].head(3))
n. attributes: 6
n. descriptions: 64
columns: description, extent, intent, support, delta_stability, is_closed, is_key, is_passkey, is_proper_premise, is_pseudo_intent
description_id | description | support | is_key |
---|---|---|---|
0 | set() | 5 | True |
1 | {cartoon} | 2 | True |
2 | {real} | 3 | True |
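The boolean columns make it easy to slice the table. For instance, a small pandas sketch that keeps only the closed descriptions (i.e. the intents) or only the keys:
# Closed descriptions are exactly the intents of the concepts
closed_df = descriptions_df[descriptions_df['is_closed']]
# Keys are the minimal descriptions among those selecting the same objects
keys_df = descriptions_df[descriptions_df['is_key']]
print(len(closed_df), 'closed descriptions;', len(keys_df), 'keys')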
The Caspailleur package does not support concept lattice visualisation (this task deserves its own package).
For a basic concept lattice visualisation, one can produce Mermaid diagram code.
Mermaid diagrams can be visualised via the https://mermaid.live/ service or embedded in GitHub-flavored Markdown.
concepts_df = csp.mine_concepts(df, min_support=2)
node_labels = concepts_df['intent'].map(', '.join) + '<br><br>'+concepts_df['extent'].map(', '.join)
diagram_code = csp.io.to_mermaid_diagram(node_labels, concepts_df['previous_concepts'])
print(diagram_code)
flowchart TD
A["<br><br>Snoopy, Greyfriar's Bobby, Garfield, Socks, Harriet"];
B["real<br><br>Greyfriar's Bobby, Socks, Harriet"];
C["mammal<br><br>Greyfriar's Bobby, Garfield, Socks, Snoopy"];
D["mammal, cartoon<br><br>Garfield, Snoopy"];
E["mammal, real<br><br>Greyfriar's Bobby, Socks"];
F["mammal, dog<br><br>Greyfriar's Bobby, Snoopy"];
G["cat, mammal<br><br>Garfield, Socks"];
A --- B;
A --- C;
B --- E;
C --- D;
C --- E;
C --- F;
C --- G;
If you see the diagram's source code above, visit the GitHub version of this README to view the rendered diagram. If you see the rendered diagram, consult the README's source code for the diagram code.
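To embed the generated diagram into a Markdown file of your own, wrap the code in a mermaid fence. A plain-Python sketch (the file name is arbitrary):
# Save the diagram as a GitHub-flavored Markdown snippet
with open('lattice.md', 'w') as file:
    file.write('```mermaid\n' + diagram_code + '\n```\n')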
A formal context can be defined using many data types.
Below is the list of context types, with examples, accepted by the high-level caspailleur functions:
Supported data types
A binary pandas DataFrame.
Can be obtained via the csp.io.to_pandas function.
Example:
print(csp.io.to_pandas(df))
 | cartoon | real | tortoise | dog | cat | mammal |
---|---|---|---|---|---|---|
Garfield | True | False | False | False | True | True |
Snoopy | True | False | False | True | False | True |
Socks | False | True | False | False | True | True |
Greyfriar's Bobby | False | True | False | True | False | True |
Harriet | False | True | True | False | False | False |
A list of sets of indices of True columns in the data.
Can be obtained via the csp.io.to_itemsets function.
Example:
print(*csp.io.to_itemsets(df), sep='\n')
{0, 4, 5}
{0, 3, 5}
{1, 4, 5}
{1, 3, 5}
{1, 2}
A triplet: (ItemsetContextType, object names, attribute names).
Can be obtained via the csp.io.to_named_itemsets function.
Example:
print(*csp.io.to_named_itemsets(df), sep='\n')
[{0, 4, 5}, {0, 3, 5}, {1, 4, 5}, {1, 3, 5}, {1, 2}]
['Garfield', 'Snoopy', 'Socks', "Greyfriar's Bobby", 'Harriet']
['cartoon', 'real', 'tortoise', 'dog', 'cat', 'mammal']
A list of bitarrays where every bitarray represents the "active" attributes in an object's description.
Can be obtained via the csp.io.to_bitarrays function.
Example:
print(*csp.io.to_bitarrays(df), sep='\n')
bitarray('100011')
bitarray('100101')
bitarray('010011')
bitarray('010101')
bitarray('011000')
A triplet: (BitarrayContextType, object names, attribute names).
Can be obtained via the csp.io.to_named_bitarrays function.
Example:
print(*csp.io.to_named_bitarrays(df), sep='\n')
[bitarray('100011'), bitarray('100101'), bitarray('010011'), bitarray('010101'), bitarray('011000')]
['Garfield', 'Snoopy', 'Socks', "Greyfriar's Bobby", 'Harriet']
['cartoon', 'real', 'tortoise', 'dog', 'cat', 'mammal']
A list of objects' descriptions where every description is a list of bool values.
Can be obtained via the csp.io.to_bools function.
Example:
print(*csp.io.to_bools(df), sep='\n')
[True, False, False, False, True, True]
[True, False, False, True, False, True]
[False, True, False, False, True, True]
[False, True, False, True, False, True]
[False, True, True, False, False, False]
A triplet: (BoolContextType, object names, attribute names).
Can be obtained via the csp.io.to_named_bools function.
Example:
print(*csp.io.to_named_bools(df), sep='\n')
[[True, False, False, False, True, True], [True, False, False, True, False, True], [False, True, False, False, True, True], [False, True, False, True, False, True], [False, True, True, False, False, False]]
['Garfield', 'Snoopy', 'Socks', "Greyfriar's Bobby", 'Harriet']
['cartoon', 'real', 'tortoise', 'dog', 'cat', 'mammal']
A dictionary where every key is an object's name and every value is the object's description represented as a set of attribute names.
Can be obtained via the csp.io.to_dictionary function.
Example:
print(csp.io.to_dictionary(df))
{'Garfield': {'cartoon', 'mammal', 'cat'},
'Snoopy': {'dog', 'cartoon', 'mammal'},
'Socks': {'real', 'mammal', 'cat'},
"Greyfriar's Bobby": {'real', 'mammal', 'dog'},
'Harriet': {'real', 'tortoise'}
}
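Any of the formats above can be fed back into the high-level miners. A quick sketch with the dictionary form (relying on the statement above that the high-level functions accept every listed context type):
# High-level functions accept every context type listed above
context_dict = csp.io.to_dictionary(df)
concepts_from_dict = csp.mine_concepts(context_dict)
print(len(concepts_from_dict))  # expected: the same 13 concepts as before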
A formal context can also be saved to and loaded from a .cxt formatted file or a string:
with open('context.cxt', 'w') as file:
    csp.io.write_cxt(df, file)
with open('context.cxt', 'r') as file:
    df_loaded = csp.io.read_cxt(file)
assert (df == df_loaded).all(None)
Caspailleur does three things to speed up the computations:
- It exploits the connections between characteristic attribute sets. E.g. a function to compute proper premises takes intents and keys as inputs, not the original binary data.
- The set of intents is computed by the LCM algorithm, well implemented in the scikit-mine package: https://pypi.org/project/scikit-mine/
- All intrinsic computations are performed with bitwise operations provided by the bitarray package (see the small sketch after this list): https://pypi.org/project/bitarray/
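As a rough illustration of the last point, here is a standalone bitarray sketch (not caspailleur internals) showing how descriptions can be intersected with a single bitwise operation:
from bitarray import bitarray

# Descriptions of Garfield and Snoopy over (cartoon, real, tortoise, dog, cat, mammal)
garfield = bitarray('100011')
snoopy = bitarray('100101')

common = garfield & snoopy      # attributes shared by both objects: cartoon and mammal
print(common, common.count())   # bitarray('100001') 2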
The diagram below presents dependencies between the characteristic attribute sets. For example, the arrow "intents -> keys" means that the set of intents is required to compute the set of keys.
graph TD;
S["<b>itemsets</b><br><small><tt>csp.np2bas(...)</tt></small>"];
A["<b>intents</b><br><small><tt>csp.list_intents_via_LCM(...)</tt></small>"];
B["<b>keys</b><br><small><tt>csp.list_keys(...)</tt></small>"];
C["<b>passkeys</b><br><small><tt>csp.list_passkeys(...)</tt></small>"];
D["<b>intents ordering</b><br><small><tt>csp.sort_intents_inclusion(...)</tt></small>"];
E["<b>pseudo-intents</b><br><small><tt>csp.list_pseudo_intents_via_keys(...)</tt></small>"];
F["<b>proper premises</b><br><small><tt>csp.iter_proper_premises_via_keys(...)</tt></small>"];
G["<b>linearity index</b><br><small><tt>csp.linearity_index(...)</tt></small>"];
H["<b>distributivity index</b><br><small><tt>csp.distributivity_index(...)</tt></small>"];
S --> A
A --> B;
A --> C;
A --> D;
A --> E;
B --> E;
B --> F; A --> F;
A --> G; D --> G;
D --> H; A --> H;
In case the diagram does not render, visit the GitHub version of the README: https://github.com/EgorDudyrev/caspailleur
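Following the arrows, the low-level functions can also be chained by hand. The sketch below only illustrates the call order suggested by the diagram; the exact signatures are assumptions, so consult each function's docstring:
# itemsets -> intents -> keys, following the dependency diagram above
itemsets = csp.np2bas(df.values)              # df.values is a boolean numpy array
intents = csp.list_intents_via_LCM(itemsets)  # assumed signature: itemsets in, intents out
keys = csp.list_keys(intents)                 # assumed signature: intents in, keys out
print(len(intents), 'intents;', len(keys), 'keys')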
Note
Although the caspailleur package implements many optimisations to speed up the computations, we do not claim that it is the fastest FCA package in existence.
For example, our algorithm for computing the pseudo-intent basis is far from the state of the art.
Knowing that, we still find caspailleur fast enough for comfortable everyday use.
There are no papers written about caspailleur (yet), so you can cite the package itself:
@misc{caspailleur,
title={caspailleur},
author={Dudyrev, Egor},
year={2023},
howpublished={\url{https://www.smartfca.org/software}},
}
The package development is supported by ANR project SmartFCA (ANR-21-CE23-0023).
SmartFCA (https://www.smartfca.org/) is a big platform that will contain many extensions of Formal Concept Analysis, including pattern structures, Relational Concept Analysis, Graph-FCA and others, while caspailleur is a small Python package that covers only the basic notions of FCA.