tommyod/Efficient-Apriori

Input dataset format

benpowis opened this issue · 4 comments

Hi there - love your work on this package! I have a question regarding input datasets. In your example the input is a list of tuples, but is it possible to work with DataFrames too? What are the restrictions on input data?

Many thanks,
Ben

No problem at all.

import pandas as pd

# Original data
transactions = [('eggs', 'bacon', 'soup'),
                ('eggs', 'bacon', 'apple'),
                ('soup', 'bacon', 'banana')]

# Convert to pandas.DataFrame
df = pd.DataFrame(transactions)

# Convert back to list of tuples
transactions_from_df = [tuple(row) for row in df.values.tolist()]

# They are equal, so this evaluates to True
assert transactions == transactions_from_df

A list of lists will also work; it doesn't have to be a list of tuples.

Thank you @tommyod this looks great - how would you suggest dealing with NaN values? When feeding my df directly to apriori() I get the error:
TypeError: object of type 'int' has no len()

I can use your code above to transform the df into a list, but my data has a couple of huge baskets, which leads to many 'nan' values in the lists for the smaller ones. Will these have an adverse effect on the results?

NaN likely represents nothing, so convert ('bread', nan, 'milk', nan) to ('bread', 'milk'). It really depends on your problem at hand. Each tuple should represent a transaction, and having "none-tokens" in a transaction is a no-no. The values in the tuples should be strings.
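As a minimal sketch of that conversion, the snippet below strips missing entries from each row and returns plain tuples. The helper name drop_missing is hypothetical and not part of the package; treating None as missing alongside NaN is also an assumption, added because pandas can use either as a placeholder depending on how the frame was built.

```python
from math import isnan


def drop_missing(rows):
    """Return one tuple per row with None and NaN entries removed.

    Hypothetical helper for illustration; not part of efficient-apriori.
    """
    def is_missing(value):
        # NaN is a float that fails isnan(); None is treated as missing too
        # (an assumption about how the placeholder values look).
        return value is None or (isinstance(value, float) and isnan(value))

    return [tuple(v for v in row if not is_missing(v)) for row in rows]


nan = float("nan")
rows = [("bread", nan, "milk", nan), ("eggs", "bacon", None)]
cleaned = drop_missing(rows)
print(cleaned)  # [('bread', 'milk'), ('eggs', 'bacon')]
```

With a DataFrame, the rows could be supplied as df.itertuples(index=False, name=None), which yields one plain tuple per row. Passing the DataFrame itself iterates over its column labels rather than its rows, which is presumably where the 'int' has no len() error comes from.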

Cool, thank you - in case this helps anyone else in the future, here is the method I used to remove NaNs from lists of varying sizes:

from math import isnan

# Drop NaN entries from every transaction, leaving all other values untouched
for y in range(len(transactions_from_df)):
    transactions_from_df[y] = [
        x for x in transactions_from_df[y]
        # only floats can be NaN, so check the type before calling isnan
        if not (isinstance(x, float) and isnan(x))
    ]