
Bash scripts and a PyTorch implementation of Word2vec models



Advanced-Data-analytics

Aim and motivation

* If we can predict homeless people’s behaviour, we will be able to provide help and services to these people.
* Among available methods, machine learning techniques have been shown to improve decision making in the healthcare sector (Chen et al., 2019).
* Session-based recommenders are useful when user interaction history is available, because they can learn from short-term interactions (Wang et al., 2022). These methods are emerging in healthcare systems for next-treatment recommendation (Haas, n.d.).
* Our aim is to predict the next event within a session.
* We used the Word2vec model (Rong, 2016), which captures semantic similarities between events, to predict the next event.

Data set

* In this work we used the public MLB dataset as a stand-in for medical data.
* The features in this dataset are correlated with the features we would see in a real dataset, which is why it can represent a healthcare dataset.
* The data contains a series of discrete events, including medical tests that can come back with good or bad results, and vital crashes that need emergency or intensive medical aid.
* Other events in our dataset stretch over a period of time; these events have a starting and an ending point.
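Before any model can be trained, sessions of named events like these have to be mapped to integer tokens. The sketch below is a generic illustration of that step; the event names are hypothetical, not the actual MLB dataset schema.

```python
# Hypothetical sketch: encoding sessions of discrete events as integer tokens,
# the usual first step before feeding sequences to a Word2vec-style model.
# Event names below are illustrative, not the actual dataset schema.

def build_vocab(sessions):
    """Map each distinct event name to an integer index."""
    vocab = {}
    for session in sessions:
        for event in session:
            if event not in vocab:
                vocab[event] = len(vocab)
    return vocab

def encode(sessions, vocab):
    """Replace event names with their vocabulary indices."""
    return [[vocab[e] for e in session] for session in sessions]

sessions = [
    ["test_good", "test_bad", "vital_crash"],           # discrete events
    ["treatment_start", "test_good", "treatment_end"],  # a period event as start/end markers
]
vocab = build_vocab(sessions)
encoded = encode(sessions, vocab)
```

Period events are represented here by separate start and end markers, so every session remains a flat sequence of tokens.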

Preprocessing Method


To deal with imbalanced classes, we used a weighted random sampler.
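A minimal sketch of this idea with PyTorch's `WeightedRandomSampler`: each sample is weighted by the inverse frequency of its class, so minority-class samples are drawn more often. The toy labels and batch size are illustrative, not the project's actual settings.

```python
import torch
from torch.utils.data import WeightedRandomSampler, DataLoader, TensorDataset

# Toy imbalanced labels: class 0 is much more frequent than class 1.
labels = torch.tensor([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
features = torch.arange(len(labels)).float().unsqueeze(1)

# Weight each sample by the inverse frequency of its class,
# so minority-class samples are drawn more often.
class_counts = torch.bincount(labels)
sample_weights = 1.0 / class_counts[labels].float()

sampler = WeightedRandomSampler(sample_weights,
                                num_samples=len(labels),
                                replacement=True)
loader = DataLoader(TensorDataset(features, labels),
                    batch_size=5, sampler=sampler)
```

Because sampling is with replacement, an epoch still sees `len(labels)` samples, but the class mix in each batch is approximately balanced in expectation.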


Machine learning Method

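The Word2vec approach named above can be sketched as a skip-gram model in PyTorch: an embedding layer for the center event and a linear projection scoring every event in the vocabulary as a possible context. The dimensions and the single training pair below are illustrative assumptions, not the repository's settings.

```python
import torch
import torch.nn as nn

# Minimal skip-gram Word2vec sketch (Rong, 2016) over event tokens.
# Vocabulary size and embedding dimension are illustrative, not the repo's settings.
class SkipGram(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.in_embed = nn.Embedding(vocab_size, embed_dim)  # center-event vectors
        self.out_proj = nn.Linear(embed_dim, vocab_size)     # scores over context events

    def forward(self, center):
        return self.out_proj(self.in_embed(center))          # logits per vocabulary event

vocab_size, embed_dim = 50, 16
model = SkipGram(vocab_size, embed_dim)
loss_fn = nn.CrossEntropyLoss()
optim = torch.optim.Adam(model.parameters(), lr=1e-3)

# One (center, context) training pair drawn from a session.
center = torch.tensor([3])
context = torch.tensor([7])
logits = model(center)
loss = loss_fn(logits, context)
loss.backward()
optim.step()
```

At prediction time, the logits for a given center event rank every vocabulary event, so the top-scoring event serves as the next-event prediction within a session.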

Results and cluster performance

By using job arrays and a loop in the shell script, we:
* Submitted several jobs on the GPU partition.
* Gave each job a unique input for hyperparameter optimization.
* Successfully received the results for about 200 jobs.
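One common way to generate a unique input per job is to enumerate a hyperparameter grid and let each array task pick its own line. The parameter names and values below are assumptions for illustration, chosen so the grid has about 200 combinations to match the job count above.

```python
from itertools import product

# Hypothetical hyperparameter grid; names and values are assumptions,
# sized so the full grid has about 200 combinations.
grid = {
    "embed_dim": [16, 32, 64, 128],
    "lr": [1e-1, 1e-2, 1e-3, 1e-4, 1e-5],
    "window": [2, 3, 5, 7, 10],
    "batch_size": [64, 128],
}

# One config per job: a SLURM array task can pick the line whose index
# equals its SLURM_ARRAY_TASK_ID to get its unique input.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]

lines = [
    f"{i} " + " ".join(f"{k}={v}" for k, v in cfg.items())
    for i, cfg in enumerate(configs)
]
# Each line would be written to a file that the shell loop / job array reads.
```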


Conclusion and Reflection

* Submitted several jobs on the GPU partition.
* In each job we trained the model for 1500 epochs.
* Each job took about 40 minutes on the GPU (about 6 hours on the CPU partition).
* Observing the jobs on the cluster, 12 jobs were running in parallel at any given time.
* The whole set of experiments took about 5 days on the cluster, which is almost equal to 60 days on a personal laptop.
* The cluster let us find the best parameters almost 10 times faster than using our own resources.
* We found the cluster resources very useful and time-saving for running experiments.