tommyod/Efficient-Apriori

run time

maxsombra opened this issue · 7 comments

In Python 3.7, with a file of 939 rows and 50 characteristics,
the algorithm takes more than 2 hours. I have seen this algorithm run much faster on more data. Do you know what the problem is?

These are the libraries I have installed:

[Screenshot of installed packages]

Try setting min_support to a high value (e.g. 0.99); then it should run fast. The downside is that few rules will be returned. Decrease the value gradually and re-run until you strike a balance between runtime and the number of rules returned. You can also experiment with min_confidence.

The problem is likely that there are many frequent itemsets in your particular data set, and has little to do with packages installed.
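
As a minimal sketch of that workflow (a toy transactions list stands in for your real data, and the thresholds are arbitrary):

from efficient_apriori import apriori

# Toy transactions standing in for the real data
transactions = [
    ("bread", "milk"),
    ("bread", "eggs", "milk"),
    ("milk", "eggs"),
    ("bread", "eggs"),
]

# Start strict: runs fast, but returns few (or no) rules
itemsets, rules = apriori(transactions, min_support=0.9, min_confidence=0.9)
print(len(rules))

# Relax gradually and re-run until the balance between runtime and
# the number of rules is acceptable
itemsets, rules = apriori(transactions, min_support=0.5, min_confidence=0.5)
print(len(rules))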

Please report back if it works! 👍

Hi Tommy,

We have been testing your library, iterating through different possible values for min_support and min_confidence to see if the runtime changes, without success. This was our last test: min_support=0.001, min_confidence=0.001, max_length=30.

Execution still takes a long time for a dataset of 930 transactions and 181 columns.

However, we are comparing against an old version of your library (0.4.5 according to pip) in a Python 3.4.3 virtualenv that was installed a year ago and left untouched until today, when we needed to run apriori again. With that install the library runs fine and fast (10 seconds tops).

So when trying to set up a new environment, after installing and testing each version one by one, execution is still pretty slow. We have tried Python 3.4, 3.6, and 3.7 without success.

Somehow only the version we installed back then still works well; all the others show significantly worse performance.

We have tested on various computers, the fastest being a fresh Ubuntu 18.04 install with 30 GB of RAM and a 12-core processor, and it still takes a long time on what seems to be the small dataset described above.

Here comes a strange finding: when installing efficient-apriori==0.4.5 today, execution is still slow; it does not behave like the old install. The code for version 0.4.5 at https://github.com/tommyod/Efficient-Apriori/tree/7ea0a1c2265249c8cb9a66d31055099f40231bf2/efficient_apriori looks completely different from what we have in our old install, and we can't seem to find that code in the repo's history.

Any thoughts?
Thanks!

Thank you for the detailed report. This is a bit strange, as I cannot recall any changes that would degrade performance. I'll look into it. Do you have the opportunity to share the data set? (Perhaps with values mapped to integers to obfuscate it.)

when installing efficient-apriori==0.4.5 today, execution is still slow, it does not behave exactly like the old install.

This is also very weird. Do you have the old install on a computer? Can you email the source code to me? My email is found here.

There are some choices to be made regarding missing data, and for making the algorithm run fast enough. Here are my thoughts.

Missing values

I took a look at your adult data, and there you do know all the variables for each transaction. But what if you don't actually know all the values?

You have two options, depending on the nature of the missing data:

  • In the traditional case (e.g., purchases from a supermarket), missing simply means "not present". In that case, only include the non-missing observations in each transaction.
  • If missing data means "present, but not observed", it depends on the situation. Option 1: do as above and simply drop the missing observations. This is sensible if the data is missing at random (a sketch of this encoding follows the list). Option 2: if "not present" is interesting in the particular analysis (or the data is not missing at random), you could include dummy columns. It's important that col1=missing is encoded differently from col2=missing, since they do not mean the same thing.
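
A minimal sketch of Option 1, assuming a CSV like the one in the example below (missing values are simply dropped from each transaction, keeping the column names attached to the observed values):

import pandas as pd
from efficient_apriori import apriori

df = pd.read_csv("data.csv")  # data set with missing values
columns = list(df.columns)

transactions = []
for i in range(len(df)):
    row = df.iloc[i, :]
    # Keep only the observed values; missing entries are left out entirely
    transaction = tuple(f"{column}={str(row[column]).strip()}"
                        for column in columns if pd.notna(row[column]))
    transactions.append(transaction)

itemsets, rules = apriori(transactions, min_support=0.8, min_confidence=0.5)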

Here's a way to encode the missing values as explained in Option 2:

import pandas as pd
from efficient_apriori import apriori

# The data set has missing values
df = pd.read_csv("data.csv") # Data has shape approximately (1000, 60)

# Get the columns as strings
columns = list(df.columns)

transactions = []  # Prepare a list of transactions

# Loop over every row in the DataFrame to collect transactions
for i in range(len(df)):
    
    # Encode values as e.g., 'col1=N/A' and 'col2=male'.
    # We include column names to differentiate between values such as
    # 'col1=N/A' and 'col2=N/A', since they might mean different things
    # in the analysis (and might not be missing at random)
    # Convert to string and strip whitespace (some columns may be numeric)
    values = [str(t).strip() for t in df.iloc[i, :].fillna("N/A").values]
    transaction = tuple(f"{column}={value}" for column, value in zip(columns, values))

    # Add transaction to transactions
    transactions.append(transaction)

itemsets, rules = apriori(transactions, min_support=0.8,
                          min_confidence=0.5, max_length=6, verbosity=1)

Running time

This data has many columns, and by encoding every missing value as above, each transaction has 60 observations. This results in a combinatorial explosion in the algorithm, so it's very important that min_support is high enough to prune the search space as the algorithm works. (Above I wrote that min_support should be low to make the algorithm run fast. This was wrong, and I edited my post.)

Here's a comparison of running time and min_support for this particular data set, which has approximately 1000 transactions with 60 observations each. I set min_confidence=0.5 and max_length=6 in the plot below.

[Plot: running time vs. min_support]
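
A rough sketch of how a timing comparison like this could be produced (it reuses the apriori import and the transactions list built in the code above; the grid of min_support values is arbitrary):

import time

# `transactions` and `apriori` come from the example above
for min_support in (0.9, 0.8, 0.7, 0.6, 0.5):
    start = time.perf_counter()
    itemsets, rules = apriori(transactions, min_support=min_support,
                              min_confidence=0.5, max_length=6)
    elapsed = time.perf_counter() - start
    print(f"min_support={min_support}: {elapsed:.1f} s, {len(rules)} rules")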

If the algorithm runs for a long time, it will return a lot of junk. To be able to interpret the results and extract meaningful insights you want a few, meaningful rules. This typically means setting min_support and min_confidence to reasonably high values, which also means a shorter running time.
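
One way to narrow a large result down to a few meaningful rules is to filter and sort them by their quality measures; here is a sketch using the rules returned by the call above:

# Keep rules with a single item on the right-hand side,
# then sort by lift so the most interesting rules come first
single_rhs = [rule for rule in rules if len(rule.rhs) == 1]
for rule in sorted(single_rhs, key=lambda rule: rule.lift, reverse=True)[:10]:
    print(rule)  # each rule also exposes rule.support and rule.confidence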

I hope this helps. Let me know if you have any further questions @arnaldoanez @maxsombra .

We were in fact testing with only the non-missing observations included, but the values we tried for min_support were just too low (less than 0.05), which is why it was taking so long; we hadn't noticed that was simply too broad.

Thanks for the clarification, we managed to run it successfully now.

Cheers

Et9797 commented

I've run efficient-apriori to analyze millions of customer orders containing shop products, using min_support and min_confidence values as low as 0.0005 and 0.01, respectively. Generally it didn't take longer than 10 minutes to run. A min_support of 0.0001, however, seems to be the lower limit, as the Python process always gets killed at that point for some reason. 32 GB RAM, Ubuntu 22.