Input dataset format

Question

benpowis opened this issue 5 years ago · 4 comments

Hi there - love your work on this package! I have a question regarding input datasets: in your example the input is a list of tuples, but is it possible to work with DataFrames too? What are the restrictions on input data?

Many thanks,
Ben

Answer 1 · 2019-08-29T12:01:07.000Z

No problem at all.

import pandas as pd

# Original data
transactions = [('eggs', 'bacon', 'soup'),
                ('eggs', 'bacon', 'apple'),
                ('soup', 'bacon', 'banana')]

# Convert to pandas.DataFrame
df = pd.DataFrame(transactions)

# Convert back to list of tuples
transactions_from_df = [tuple(row) for row in df.values.tolist()]

# They are equal, so this evaluates to True
assert transactions == transactions_from_df

A list of lists will also work; it doesn't have to be a list of tuples.

Answer 2 · 2019-08-29T12:13:56.000Z

Thank you @tommyod this looks great - how would you suggest dealing with NaN values? When feeding my df directly to apriori() I get the error:
TypeError: object of type 'int' has no len()

I can use your code above to transform the DataFrame into a list, but my data has a couple of huge baskets, which leads to many 'nan' values in the lists. Will these have an adverse effect on the results?
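A likely explanation for the TypeError above, assuming apriori() iterates over its input and calls len() on each element (an assumption about the library's internals, not something confirmed in this thread): iterating over a DataFrame yields its column labels, not its rows. A minimal sketch of that behavior:

```python
import pandas as pd

transactions = [('eggs', 'bacon', 'soup'),
                ('eggs', 'bacon', 'apple'),
                ('soup', 'bacon', 'banana')]
df = pd.DataFrame(transactions)

# Iterating over a DataFrame yields the column labels (0, 1, 2 here), not the rows
column_labels = list(df)

# Calling len() on an integer label fails with a TypeError
try:
    len(column_labels[0])
except TypeError as error:
    message = str(error)
```

If that is what happens, converting the DataFrame back to a list of rows before calling apriori() avoids the problem.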

Answer 3 · 2019-08-29T12:28:35.000Z

NaN likely represents a missing item, so convert ('bread', nan, 'milk', nan) to ('bread', 'milk'). It really depends on your problem at hand. Each tuple should represent one transaction, and having placeholder "none" tokens in a transaction is a no-no. The values in the tuples should be strings.
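The conversion described above could be sketched as follows. The helper names is_missing and clean_transaction are hypothetical, and both treating None as missing and casting every remaining item to str are assumptions made for illustration:

```python
from math import isnan

def is_missing(value):
    """Return True for None or a float NaN (hypothetical helper)."""
    return value is None or (isinstance(value, float) and isnan(value))

def clean_transaction(row):
    """Drop missing entries and return the remaining items as a tuple of strings."""
    return tuple(str(item) for item in row if not is_missing(item))

rows = [('bread', float('nan'), 'milk', float('nan')),
        ('eggs', None, 'bacon', 'soup')]

# Each cleaned tuple contains only the real items of its transaction
transactions = [clean_transaction(row) for row in rows]
```

The isinstance check runs before isnan() because isnan() raises a TypeError when given a string.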

Answer 4 · 2019-08-29T12:52:45.000Z

Cool, thank you - in case this helps anyone else in the future, here is the method I used to remove NaNs from lists of varying sizes:

from math import isnan

# Drop NaN entries from every transaction. Only floats can be NaN,
# so check the type first to keep strings away from isnan().
transactions_from_df = [
    [x for x in transaction if not (isinstance(x, float) and isnan(x))]
    for transaction in transactions_from_df
]