yaochenzhu/LLM4Rec

Can you elaborate the steps for setting up Hadoop?

Opened this issue · 2 comments

I want to try out this algorithm on my dataset but I'm unfamiliar with Hadoop. Could you list the required steps, from scratch, to run this code? It would also be great if you could describe the required directory structure and where to download the pretrained weights from.
Also, the variable gpt2_server_root in training.py is not defined anywhere.

Hi there,

Thanks for your interest in our work.

Actually, I use Hadoop only because LinkedIn stores data on an HDFS server, which has a different file system from the pod that executes the program. Therefore, when the program runs, the data needs to be copied from the remote server to the local pod. The gpt2_server_root is just the remote directory that stores the pretrained weights and dataset, whereas local_root is the directory to which they are copied.

To run the program locally, you can simply comment out all the remote-related and copy-related code, and store the weights and data directly in the local_root specified in the program.
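In other words, the HDFS copy step can be reduced to a no-op when everything already lives on the local disk. A minimal sketch of that idea (the function name and directory names here are my own illustration, not the repo's actual API):

```python
import os
import shutil


def prepare_local_root(local_root, server_root=None):
    """Ensure pretrained weights and data are available under local_root.

    In the original LinkedIn setup, server_root points at a remote (HDFS)
    directory and its files are copied to the local pod before training.
    When running locally, pass server_root=None and place the files in
    local_root yourself; the copy step is then skipped entirely.
    """
    os.makedirs(local_root, exist_ok=True)
    if server_root is not None:
        # Stand-in for the HDFS-to-pod copy in the original pipeline.
        for name in os.listdir(server_root):
            shutil.copy(os.path.join(server_root, name), local_root)
    return local_root
```

Called as `prepare_local_root("local", None)`, this simply ensures the directory exists and assumes you have already put the weights and dataset there.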

As for the variable gpt2_server_root, it is just server_root; the mismatch is a typo I introduced while removing LinkedIn-sensitive information from the code. I have fixed it. Thank you so much for your feedback.

Wish you success in your work.

Best,
Yaochen

Also, I have added the Hugging Face link where the original GPT-2 weights and tokenizer can be downloaded. Thank you so much for your valuable suggestion.
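For reference, the tokenizer and config files are hosted on the Hugging Face model hub (https://huggingface.co/gpt2) and can be fetched directly. A stdlib-only sketch; the `local_root/gpt2` directory layout is a hypothetical example, not necessarily what this repo expects:

```python
import json
import os
import urllib.request

# Hypothetical local directory where the program would look for GPT-2 files.
local_gpt2 = os.path.join("local_root", "gpt2")
os.makedirs(local_gpt2, exist_ok=True)

# Small files from the original GPT-2 repo on the Hugging Face hub.
# The weight file (pytorch_model.bin, ~500 MB) can be fetched the same way.
base = "https://huggingface.co/gpt2/resolve/main/"
for name in ("config.json", "vocab.json", "merges.txt"):
    urllib.request.urlretrieve(base + name, os.path.join(local_gpt2, name))

# Sanity check: the downloaded config should describe a GPT-2 model.
with open(os.path.join(local_gpt2, "config.json")) as f:
    config = json.load(f)
```

Equivalently, `GPT2Tokenizer.from_pretrained("gpt2")` and `save_pretrained(...)` from the transformers library will download and store the same files.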