A newspaper editor was researching immigration data trends in H1B (H-1B, H-1B1, E-3) visa application processing over the past years, trying to identify the occupations and states with the most approved H1B visas. She found statistics available from the US Department of Labor and its Office of Foreign Labor Certification Performance Data. But while there are ready-made reports for 2018 and 2017, the site doesn’t have them for earlier years.
The goal of this project is to create a mechanism to analyze past years' data, specifically to calculate two metrics: Top 10 Occupations and Top 10 States for certified visa applications.
Using argparse, define and parse the required and optional arguments.
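The argument parsing step can be sketched as follows. This is a minimal illustration, not the project's actual code: the flags `-i`, `-o1` and `-o2` come from the run command in this README, while the function name `build_parser` and the `dest` names are assumptions made for the example.

```python
import argparse


def build_parser():
    """Build a parser for the three path arguments (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Count certified H1B applications by occupation and state."
    )
    # The dest names below are illustrative assumptions, not the project's own.
    parser.add_argument("-i", dest="input_path", required=True,
                        help="path to the input CSV file")
    parser.add_argument("-o1", dest="occupations_path", required=True,
                        help="path for the top occupations output file")
    parser.add_argument("-o2", dest="states_path", required=True,
                        help="path for the top states output file")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.input_path, args.occupations_path, args.states_path)
```

Marking the three paths as `required=True` makes argparse exit with a usage message when one of them is missing.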
- Read and process each row to get the occupation and state columns for applications with certified case status (e.g. CASE_STATUS = 'CERTIFIED').
- Maintain the information about occupations and states in two dictionaries:
  - occupation_status_count dictionary:
    - key: occupation
    - value: count
  - state_status_count dictionary:
    - key: state
    - value: count
- Because every application must be read before the dictionaries are returned, the running time of this method is O(N), and the space used is O(N) in the worst case, where N is the number of applications.
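The counting pass described above might look roughly like this. It is a sketch under stated assumptions: `CASE_STATUS` is named in this README, but the column names `SOC_NAME` and `WORKSITE_STATE`, the `;` delimiter, and the function name `count_certified` are assumptions for illustration and may not match the actual input file.

```python
import csv
from collections import defaultdict


def count_certified(lines, occupation_col="SOC_NAME",
                    state_col="WORKSITE_STATE", delimiter=";"):
    """Single pass over the rows, counting certified applications.

    `lines` is any iterable of text lines (an open file works).
    The column names and delimiter defaults are assumptions.
    """
    occupation_status_count = defaultdict(int)
    state_status_count = defaultdict(int)
    for row in csv.DictReader(lines, delimiter=delimiter):
        # Skip every application that is not certified.
        if row.get("CASE_STATUS") != "CERTIFIED":
            continue
        occupation_status_count[row[occupation_col]] += 1
        state_status_count[row[state_col]] += 1
    return dict(occupation_status_count), dict(state_status_count)
```

Each row is visited once, which is where the O(N) running time comes from; the dictionaries grow only with the number of distinct occupations and states.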
This step is taken care of by a helper method, get_topk_metrics.
get_topk_metrics(dictionary with aggregated values, k)
: Given a dictionary (one of the two above) with (string, integer) key-value pairs and an integer k, return a list of the k most frequent key-value pairs sorted by value in descending order. Ties are broken alphabetically.
- I used a min-heap to get the top k elements from the occupation and state counter dictionaries.
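One way to express this selection in Python is with the standard `heapq` module. The sketch below is not the project's get_topk_metrics implementation: it uses `heapq.nsmallest` with a negated count as the sort key instead of a hand-maintained min-heap, and the function name `top_k_by_count` is hypothetical.

```python
import heapq


def top_k_by_count(counts, k):
    """Return the k (key, count) pairs with the highest counts.

    Sorting key (-count, key) gives descending counts and breaks
    ties alphabetically by key.
    """
    return heapq.nsmallest(k, counts.items(),
                           key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    sample = {"b": 3, "a": 3, "c": 1, "d": 5}
    print(top_k_by_count(sample, 3))
```

For a dictionary with M distinct keys, this kind of heap-based selection takes roughly O(M log k) time rather than the O(M log M) of a full sort.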
In this step, the output for the top 10 occupations and top 10 states is written to the respective files.
output_data(args, output_file_path, top_k_results, total_status_count, output_columns)
: Write the given list of tuples to the given output file path, with the given output_columns as column names.
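A simplified sketch of the writing step is shown below. The exact output format is not specified in this README, so several things here are assumptions: the `;` delimiter, the header row built from output_columns, and the percentage column computed from total_status_count. The helper names `format_top_k` and `write_top_k` are hypothetical, and the `args` parameter of output_data is left out.

```python
def format_top_k(top_k_results, total_status_count, output_columns,
                 delimiter=";"):
    """Build the output lines: a header, then one line per (name, count)."""
    lines = [delimiter.join(output_columns)]
    for name, count in top_k_results:
        # Assumed extra column: share of all certified applications.
        if total_status_count:
            pct = 100.0 * count / total_status_count
        else:
            pct = 0.0
        lines.append(f"{name}{delimiter}{count}{delimiter}{pct:.1f}%")
    return lines


def write_top_k(output_file_path, top_k_results, total_status_count,
                output_columns):
    """Write the formatted lines to the given path, one per line."""
    lines = format_top_k(top_k_results, total_status_count, output_columns)
    with open(output_file_path, "w", encoding="utf-8") as out:
        out.write("\n".join(lines) + "\n")
```

Keeping the formatting separate from the file writing makes the format easy to check without touching the file system.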
- Place the input file in the input folder and name it h1b_input.csv
- Place the required tests in the insight_testsuite folder
- Run the ./run.sh command to start the program
- The program takes four mandatory arguments: program name, input file path and two output file paths.
To run:
python3 ./src/h1b_counting.py -i inputfilepath -o1 output1filepath -o2 output2filepath
Example:
python3 ./src/h1b_counting.py -i ./input/h1b_input.csv -o1 ./output/top_10_occupations.txt -o2 ./output/top_10_states.txt
- There will be two files in the output folder:
  - top_10_occupations.txt - the file containing the top 10 occupations for certified visa applications
  - top_10_states.txt - the file containing the top 10 states for certified visa applications