weiyinwei/MMGCN

Data Pre-Processing Code

harshgeek4coder opened this issue · 10 comments

Hey there, @weiyinwei
Thanks for your paper and the approach to tackle multi-modal deep learning here.
My question is: I can see and obtain the datasets mentioned, but there appears to be no code for pre-processing them.
Could you kindly provide the pre-processing code - the steps through which you built the graph files such as train.npy?

Also, @weiyinwei - A small request, can you kindly send or push a sample of these DATA files ?
If possible, it would be great
Thanks

I tried to find dataset clues across the author's whole GitHub repository, but I couldn't find any files about it (even processing scripts). It seems the author can't share the data with us, possibly due to copyright issues. I tried to follow the methods described in the paper, but I ended up collecting 6,184,294 records from MovieLens-10M, about five times more than their 1,239,508.

Hey @rohnson1999, I appreciate your input here, and I agree. But I'm also concerned with how the author built the graphs - for example, the .npy files. It's also necessary to know how to process and build those .npy files as graphs, so they can be passed to the data loader and eventually to the graph network.
@weiyinwei, I'd really appreciate your input here.

Thanks

I'm also curious why their user count for MovieLens is 55,485, because if you read ratings.dat (https://grouplens.org/datasets/movielens/10m/) in Python, you will see there are 69,878 unique users. It's reasonable to cut the number of items, since some movies' trailers and descriptions are missing, but I don't understand why they cut the number of users.
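For reference, the unique-user count can be checked with a few lines of Python. This is a sketch assuming the standard `::`-separated ratings.dat format; the inline sample below stands in for the real 10M-row file.

```python
# Sketch: counting unique users in MovieLens-10M style ratings.
# Real ratings.dat lines look like "UserID::MovieID::Rating::Timestamp";
# this small inline sample stands in for the actual file.
sample = """1::122::5::838985046
1::185::5::838983525
2::122::3::838984885
3::316::4::838983392"""

user_ids = {line.split("::")[0] for line in sample.splitlines()}
print(len(user_ids))  # 3 unique users in this sample
```

Run on the real file (e.g. `open("ratings.dat")` instead of the sample), this reproduces the 69,878 unique users mentioned above.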

Based on my understanding, the .npy files for the three features (audio, text, keyframe pictures) are NumPy array files. You can use VGGish, Sentence2Vec, and ResNet50 to extract the respective features and eventually get these .npy files. But in general, you first have to crawl the corresponding movie trailers and descriptions from IMDb.com, and then run deep neural models to get the features.
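To make the expected file layout concrete, here is a minimal sketch of how the per-modality feature matrices might be assembled and saved. The random vectors stand in for real extractor outputs, the dimensions (2048 for ResNet50, 128 for VGGish, 300 for text) are typical for those models rather than confirmed by the paper, and the file names are illustrative, not taken from the repo.

```python
import numpy as np

# Sketch: assembling one feature matrix per modality (rows = items)
# and saving them as the .npy files the model would load.
# Random vectors stand in for real VGGish / ResNet50 / text embeddings.
num_items = 5
visual = np.random.rand(num_items, 2048).astype(np.float32)   # ResNet50 keyframe features
acoustic = np.random.rand(num_items, 128).astype(np.float32)  # VGGish audio features
textual = np.random.rand(num_items, 300).astype(np.float32)   # sentence embeddings

np.save("v_feat.npy", visual)   # file names are hypothetical
np.save("a_feat.npy", acoustic)
np.save("t_feat.npy", textual)

print(np.load("v_feat.npy").shape)  # (5, 2048)
```

The key point is only that row i of every matrix must describe the same item i, so the data loader can index all three modalities with one item ID.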

I spent two months trying to build my own MovieLens dataset, but there is just too little information.

Hi @rohnson1999,
were you able to get or create code for pre-processing the data into the format this MMGCN paper uses - positive interactions of users with items?

@rohnson1999 Yes, we should remove the items without features in all three modalities. After removing such items, some users may be left with an empty interaction history, so we remove those users as well.
@harshgeek4coder The .npy files store the user-item pairs that correspond to the edges, i.e., <head_node, tail_node>, in the graph.
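Putting the two points above together, a minimal sketch of the filtering and of saving the surviving pairs as train.npy could look like this. The toy pairs, the modality set, and the file name are illustrative; the actual repo's conventions may differ.

```python
import numpy as np

# Sketch of the filtering described above: drop items missing any
# modality, which automatically drops users left with no interactions,
# then store the surviving (user, item) pairs as <head_node, tail_node> rows.
interactions = [(0, 10), (0, 11), (1, 12), (2, 10)]   # toy (user, item) pairs
items_with_all_modalities = {10, 11}                   # e.g. item 12 lacks a trailer

edges = [(u, i) for u, i in interactions if i in items_with_all_modalities]
# user 1 only interacted with item 12, so that user disappears as well
np.save("train.npy", np.array(edges))

print(np.load("train.npy").shape)  # (3, 2)
```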

Hi @weiyinwei ,
Thanks for the reply.
I had one question - is this edge (the positive interaction) unidirectional or bidirectional?
For graph neural networks, building the edges and passing them on to further layers may require edges in both directions.
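For what it's worth, a common convention in GCN-based recommenders (not confirmed for this repo) is to store each pair once and make it bidirectional at load time: offset item IDs by the number of users so users and items share one node ID space, then append the reversed edges. A minimal sketch:

```python
import numpy as np

# Sketch: turning directed (user, item) pairs into a bidirectional edge list.
# Items are offset by num_users so both node types live in one ID space;
# appending the reversed edges lets messages flow in both directions.
num_users = 3
pairs = np.array([[0, 0], [1, 2], [2, 1]])  # toy (user, item) pairs

heads = pairs[:, 0]
tails = pairs[:, 1] + num_users             # item node IDs start after user IDs
edge_index = np.stack([np.concatenate([heads, tails]),
                       np.concatenate([tails, heads])])

print(edge_index.shape)  # (2, 6): each pair appears in both directions
```

Whether MMGCN itself does this in the data loader or expects both directions already in the .npy file is exactly the question above.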

If possible, can you provide any sample pre-processing code for movielens dataset?
Thanks a ton!

The author mentioned it only briefly in the paper; I guess the data provider transferred the raw videos into visual/textual data. I'm also curious about that extraction process.