outbrain-inc/outrank

Adding promising pairwise feature to original dataframe

Closed this issue · 5 comments

I successfully generated pairwise ranks (pairwise_ranks.tsv) for my data using the template provided at https://github.com/outbrain/outrank/blob/main/examples/run_ranking_pairwise.sh.

I have a question: is it possible to add the most promising features to the original data using Outrank, or to obtain only the pairwise features so that I can merge them manually with the original data? By 'most promising' I mean the top-k features, where k could be specified by the user. While exploring the --help options, I noticed the output_synthetic_df_name argument, but it doesn't seem to produce any output (I'm not sure whether it is related). Could you provide guidance on this?

SkBlaz commented

Hi @mglowacki100! Very nice to see this operational for you. What exactly do you mean by the "most promising features"? You computed mutual redundancies (the pairwise matrix); I'd imagine what you're actually after are supervised pairwise interactions (hashes of feature n-tuples), perhaps as in https://github.com/outbrain/outrank/blob/main/examples/run_ranking_combinations.sh?

To answer the question, no, atm there is no functionality that would add these features to a given data set (mostly as data sets are batch-processed, so this would require some thought). However, an option would be to generate a set of functions (Python source) that takes data as input and generates the columns based on found interactions. Would that be of use for your use case?

So, e.g., outrank --generate_combinations_source --input_file {results of your ranking} would generate a simple script that you can integrate into your flows.
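
The idea of emitting Python source from found interactions can be sketched as follows. This is only an illustration, not OutRank functionality: `render_interaction_source` and the generated `add_interactions` function are hypothetical names, and the string-concatenation encoding is an assumption rather than the hashing OutRank uses internally.

```python
import pandas as pd


def render_interaction_source(interactions, func_name="add_interactions"):
    """Return Python source for a function that adds one column per interaction.

    `interactions` is an iterable of tuples of column names (hypothetical input;
    in practice it would be derived from a ranking output).
    """
    lines = [f"def {func_name}(df):", "    df = df.copy()"]
    for features in interactions:
        col = "_AND_".join(features)
        # Build an expression that casts each column to string and joins with '_'.
        expr = " + '_' + ".join(f"df[{name!r}].astype(str)" for name in features)
        lines.append(f"    df[{col!r}] = {expr}")
    lines.append("    return df")
    return "\n".join(lines) + "\n"


source = render_interaction_source([("f4", "f2"), ("f5", "f6")])
print(source)

# Load the generated function and apply it to a toy frame.
namespace = {}
exec(source, namespace)
demo = pd.DataFrame({"f2": [1, 2], "f4": ["a", "b"], "f5": [0, 1], "f6": ["x", "y"]})
result = namespace["add_interactions"](demo)
```

Writing `source` to a file would give a small standalone module that can be imported into an existing data flow, independently of how the data set is batched.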

Hi @SkBlaz
I've managed to do it by modifying your example: https://github.com/outbrain/outrank/blob/main/scripts/run_minimal.sh
!outrank --data_path avazu --data_source csv-raw --subsampling 1 --task all --heuristic MI-numba-3mr --target_ranking_only True --interaction_order 2 --output_folder ./ranking_outputs --minibatch_size 100 --label click;
Subsequently, I look at 3mr_ranks.tsv in the output folder and parse Feature 3mr_ranking to get the feature names joined with AND. 'Most promising' is a heuristic; the simplest one is just the top-k interactions. I create each new feature by casting both columns to string and concatenating them, something like this:

top_k_interactions = [('f4', 'f2'), ('f5', 'f6')]  # k=2
for f1, f2 in top_k_interactions:
    # Cast both columns to string and join them with an underscore.
    df[f'{f1}_AND_{f2}'] = df[f1].astype(str) + '_' + df[f2].astype(str)
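
The parsing step described above, going from a ranking table to the top-k interaction columns, might look roughly like the sketch below. Several things here are assumptions that should be checked against the actual 3mr_ranks.tsv: the column names `Feature` and `3mr_ranking`, the ` AND ` separator inside interaction names, and that a higher score means a more promising interaction. `add_top_k_interactions` is a hypothetical helper, not part of OutRank.

```python
import pandas as pd


def add_top_k_interactions(df, ranking, k,
                           name_col="Feature",      # assumed column name
                           score_col="3mr_ranking",  # assumed column name
                           sep=" AND "):             # assumed separator
    """Return a copy of df with the k best-ranked interactions added as columns."""
    # Keep only rows that describe an interaction (name contains the separator).
    pairs = ranking[ranking[name_col].str.contains(sep, regex=False)]
    # Assumes a higher score is better.
    top = pairs.nlargest(k, score_col)

    out = df.copy()
    for name in top[name_col]:
        parts = [p.strip() for p in name.split(sep)]
        # Skip interactions that refer to columns not present in df.
        if not all(p in out.columns for p in parts):
            continue
        combined = out[parts[0]].astype(str)
        for p in parts[1:]:
            combined = combined + "_" + out[p].astype(str)
        out["_AND_".join(parts)] = combined
    return out


# Toy data standing in for the real data set and ranking output.
df = pd.DataFrame({"f2": [1, 2], "f4": ["a", "b"], "f5": [0, 1], "f6": ["x", "y"]})
ranking = pd.DataFrame({
    "Feature": ["f4 AND f2", "f5 AND f6", "f2", "f2 AND f6"],
    "3mr_ranking": [0.9, 0.8, 0.95, 0.1],
})
result = add_top_k_interactions(df, ranking, k=2)
```

With real output, `ranking` would instead be loaded with something like `pd.read_csv("ranking_outputs/3mr_ranks.tsv", sep="\t")`, adjusting the column arguments to whatever the file actually contains.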

SkBlaz commented

@mglowacki100 very nice, yes, this is for sure one way to get the interactions out. Feel free to add an example of your logic under ./examples; it might also be interesting for other users.

SkBlaz commented

@mglowacki100 is this still relevant?

@SkBlaz I've found a library, https://github.com/IIIS-Li-Group/OpenFE/tree/master, that better suits my use case, but thanks for your help and dedication.