I am making my project, along with the code, available online in case it is useful for other students or anyone getting interested in Machine Learning. Questions and suggestions are welcome.
Every year hundreds of thousands of international workers apply for H-1B non-immigrant visas in the United States. To qualify for an H-1B visa, a person needs a job offer from a U.S.-based company; it is also the visa most commonly requested by international students pursuing higher education in the country. This study trains a classifier on features of the dataset to predict whether a given application would be granted eligibility for the H-1B program. Given the number of people requesting visas every year, and the likely increase over the coming years despite political pressure, it is worth analyzing the existing data and providing a model that helps distinguish successful from unsuccessful applications. The task is handled as a multi-class classification problem, since we must identify one among several possible outcomes; however, the number of output classes used is discussed, because some outcomes are driven mostly by external factors and have little impact on the results. Three different classifiers are trained and compared, and the five most important features are identified. While prevailing wage is, as expected, the highest-weighted feature, part-time positions weigh more heavily than expected, and the worksite does not necessarily affect the outcome of the application. Finally, a Logistic Regression classifier proved to be the best option among those analyzed, considering the time needed to train and predict as well as the output produced.
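A minimal sketch of the workflow described above, for orientation only: the real notebook uses the Kaggle H-1B disclosure dataset, so synthetic data stands in here to keep the example self-contained, and the classifier choices mirror the write-up only loosely.

```python
# Sketch: compare classifiers on a multi-class problem and rank features.
# Synthetic data is a stand-in for the H-1B dataset (an assumption here).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in for the visa data: 3 outcome classes, 10 numeric features.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = {}
for clf in (LogisticRegression(max_iter=1000),
            RandomForestClassifier(n_estimators=100, random_state=0)):
    clf.fit(X_train, y_train)
    scores[type(clf).__name__] = accuracy_score(y_test, clf.predict(X_test))

# A tree ensemble also gives a quick feature-importance ranking -- one way
# to extract the "five most important features" mentioned above.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
top5 = rf.feature_importances_.argsort()[::-1][:5]
print(scores, "top-5 feature indices:", top5)
```

Comparing accuracy alone is not enough for this kind of study; as noted above, training and prediction time also matter when choosing the final model.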
https://www.kaggle.com/elraphabr/predicting-outcome-for-h-1b-eligibility-in-the-us
If you need more processing power to run your model, using cloud infrastructure can be a good option. During my project I gave it a try on AWS. Although I ended up training the model locally, having a quick guide on how to run the notebook on AWS proved helpful. Make sure you use a powerful enough EC2 instance: I wasn't able to run the full dataset using only the free tier, so I had to request a limit increase for p2.xlarge, which is better suited to ML workloads. After finishing, remember to stop all spot instances, cancel the requests, and delete any snapshots and volumes; otherwise they will keep generating extra charges.
SSH              TCP  22    0.0.0.0/0
Custom TCP Rule  TCP  8888  0.0.0.0/0
AWS_private_key.pem
chmod 400 AWS_private_key.pem
ssh-add AWS_private_key.pem
ssh -i "AWS_private_key.pem" ec2-user@ec2-[IP].eu-central-1.compute.amazonaws.com
sudo yum update
wget https://repo.continuum.io/archive/Anaconda2-5.0.1-Linux-x86_64.sh
sh Anaconda2-5.0.1-Linux-x86_64.sh
export PATH="/home/ec2-user/anaconda2/bin:$PATH"
sudo yum install gcc
sudo yum install python-pip
sudo -H pip install jupyter
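The notebook server needs a hashed password for its config file. A minimal sketch of the `sha1:<salt>:<digest>` scheme the classic Jupyter Notebook accepts in `c.NotebookApp.password` (in practice you would just call `notebook.auth.passwd()`; the salt length and hashing rule below are my reconstruction of that format):

```python
# Build a password hash in the format used by c.NotebookApp.password.
import hashlib
import random

def notebook_password_hash(passphrase, salt=None):
    """Return 'sha1:<salt>:<digest>' for the classic Notebook password check."""
    if salt is None:
        salt = "%012x" % random.getrandbits(48)  # 12 hex chars, as passwd() uses
    # The server hashes the passphrase bytes followed by the salt bytes.
    digest = hashlib.sha1(passphrase.encode("utf-8") + salt.encode("ascii")).hexdigest()
    return "sha1:%s:%s" % (salt, digest)

print(notebook_password_hash("this is my password"))
```

Paste the resulting string into the config file shown further down; the plain-text password itself never goes into the config.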
Pass: this is my password
Sha1: sha1:18dfaae5d0f8:cdd4683f2918e8482311e24adc22166d29489204
jupyter notebook --generate-config
mkdir certs
cd certs
sudo openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mycert.pem -out mycert.pem
vim ~/.jupyter/jupyter_notebook_config.py
c = get_config()
c.IPKernelApp.pylab = 'inline'
c.NotebookApp.certfile = u'/home/ec2-user/certs/mycert.pem'
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.password = u'sha1:18dfaae5d0f8:cdd4683f2918e8482311e24adc22166d29489204'
c.NotebookApp.port = 8888
jupyter notebook &
https://[IP]:8888/tree/
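The cleanup advice from earlier (stop spot instances, cancel the requests, delete volumes and snapshots) can also be done from the AWS CLI instead of the console. A sketch, assuming the CLI is installed and configured; all resource IDs below are placeholders you must replace with your own:

```shell
# Placeholder IDs -- look up your own with the matching `aws ec2 describe-*` commands.
aws ec2 cancel-spot-instance-requests --spot-instance-request-ids sir-xxxxxxxx
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0
aws ec2 delete-volume --volume-id vol-0123456789abcdef0
aws ec2 delete-snapshot --snapshot-id snap-0123456789abcdef0
```

Double-check in the EC2 console afterwards that nothing billable is left running.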