Step 2 (source 1): Run MapReduce from the code in folder step1-clean/ (make file, run file, check file included) on h1b.txt. The result is a tab-separated 12-column file, written to HDFS at /user/my1843/project/clean/part-r-00000 -> rename it to /user/my1843/project/clean/part1.
Step 2 (source 2): Run MapReduce from the same step1-clean/ code on h1b.txt. The result is a tab-separated 12-column file, written to HDFS at /user/yb1059/project/cleanOutputUpdate -> rename and move it to /user/my1843/project/clean/part2. Both renames are sketched below.
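The renames are plain HDFS moves. A minimal sketch, assuming the reducer output under cleanOutputUpdate is also named part-r-00000 (an assumption) and that the shared clean folder is writable from both accounts:

    # source 1: rename the reducer output in place
    hdfs dfs -mv /user/my1843/project/clean/part-r-00000 \
                 /user/my1843/project/clean/part1
    # source 2: move across user folders into the shared clean folder
    # (part-r-00000 under cleanOutputUpdate is an assumed file name)
    hdfs dfs -mv /user/yb1059/project/cleanOutputUpdate/part-r-00000 \
                 /user/my1843/project/clean/part2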
Unit 2 Profile
Minkang Yang - under folder unit2-profile-source1
the result is reasonable
Yubing Bai - under folder unit2-profile-source2
the result is reasonable
Unit 3 Merge 2 Sources
step 1: put the two files (part1 and part2) under the same HDFS folder, /user/my1843/project/clean/
step 2: run a Hive command to merge the two sources into one table -> name it data20. See the command under folder unit3-merge2source/ for details; a hedged sketch follows.
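A minimal sketch of that merge, assuming an external table pointed at the shared folder; the 12 column names below are guesses inferred from the later units (the authoritative DDL is in unit3-merge2source/):

    # both cleaned parts now sit under one folder, so an external Hive
    # table over that folder reads the two sources as a single table
    hdfs dfs -ls /user/my1843/project/clean/   # expect part1 and part2
    hive -e "
    CREATE EXTERNAL TABLE data20 (
      case_status    STRING,
      employer_name  STRING,
      job_title      STRING,
      occupation     STRING,
      state          STRING,
      wage_rate      DOUBLE,
      submit_date    STRING,
      decision_date  STRING,
      start_date     STRING,
      end_date       STRING,
      decision_time  INT,
      employing_time INT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE
    LOCATION '/user/my1843/project/clean';"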
Unit 4 Industry Categorize
step 1: create a table from data20 selecting only employer_name and job_title -> name it employer_name_job_title.
step 2: export the distinct employer_name values to distinct_employer_name.tsv.
step 3: Run MapReduce from the code in folder unit3-industry-categoty/key-word/ (make file, run file, check file included) on distinct_employer_name.tsv -> the result lands in /user/my1843/project/wordCount.
step 4: order the word counts by frequency -> name it employer_name_count_order.
step 5: from employer_name_count_order, find the key words and sort them into categories, namely Administrative, Architect and Engineer, Education, Medical, Computer, Finance and Business, Entertainment.
step 6: Run MapReduce from the code in unit3-industry-categoty/add-one-column/ (make file, run file, check file included) -> the result overwrites /user/my1843/project/clean.
NOW /user/my1843/project/clean holds the newest tab-separated file, with 13 columns (the original 12 plus an industry category).
create a table with those 13 columns, call it data30. A hedged Hive sketch of this unit follows.
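A minimal sketch of the Hive side of this unit (steps 1, 2, and the final data30 table); the column names other than industry are the same assumptions as in the data20 sketch:

    # step 1: narrow data20 down to the two columns used for key-wording
    hive -e "
    CREATE TABLE employer_name_job_title AS
    SELECT employer_name, job_title FROM data20;"

    # step 2: one distinct employer name per line for the word-count job
    hive -e "SELECT DISTINCT employer_name FROM employer_name_job_title;" \
      > distinct_employer_name.tsv

    # after the add-one-column job overwrites /user/my1843/project/clean,
    # redefine the external table with the 13th (industry) column
    hive -e "
    CREATE EXTERNAL TABLE data30 (
      case_status    STRING,
      employer_name  STRING,
      job_title      STRING,
      occupation     STRING,
      state          STRING,
      wage_rate      DOUBLE,
      submit_date    STRING,
      decision_date  STRING,
      start_date     STRING,
      end_date       STRING,
      decision_time  INT,
      employing_time INT,
      industry       STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE
    LOCATION '/user/my1843/project/clean';"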
Unit 5 Analyze in Hive
Step 1: analyze processing time
see code in HiveCommand under section analyze processing time
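For illustration, a minimal query of the kind that section runs; treating decision_time as days from submission to decision is an assumption:

    hive -e "
    SELECT decision_time, COUNT(*) AS cnt
    FROM data30
    GROUP BY decision_time
    ORDER BY decision_time;"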
Step 2: analyze employing time
see code in HiveCommand under section analyze employing time
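Same shape for employing time (employing_time as an integer duration is an assumption):

    hive -e "
    SELECT employing_time, COUNT(*) AS cnt
    FROM data30
    GROUP BY employing_time
    ORDER BY employing_time;"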
Step 3: analyze states distribution
see code in HiveCommand under section analyze states distribution
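A sketch of the per-state breakdown (state as the worksite-state column is an assumption):

    hive -e "
    SELECT state, COUNT(*) AS cnt
    FROM data30
    GROUP BY state
    ORDER BY cnt DESC;"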
Step 4: analyze wage distribution
see code in HiveCommand under section analyze wage distribution
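A sketch that buckets wages into $10k bands (wage_rate is an assumed column name; Hive wants the full expression repeated in GROUP BY):

    hive -e "
    SELECT FLOOR(wage_rate / 10000) * 10000 AS wage_band, COUNT(*) AS cnt
    FROM data30
    GROUP BY FLOOR(wage_rate / 10000) * 10000
    ORDER BY wage_band;"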
Step 5: analyze occupation distribution
see code in HiveCommand under section analyze occupation distribution
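A sketch of the occupation breakdown (occupation is an assumed column name):

    hive -e "
    SELECT occupation, COUNT(*) AS cnt
    FROM data30
    GROUP BY occupation
    ORDER BY cnt DESC
    LIMIT 20;"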
Unit 6 Analyze in MLlib
step 1: generate the vector [case_status, wage_rate, decision_time, employing_time, state, occupation]; see code under section generate vector.
step 2: export the vector file from beeline to dumbo, then to HDFS.
step 3: Run MapReduce to create data vectors for later training in MLlib.
step 4: export the data from the last step to the local folder unit6-AnalyzeInMlLib/, then open jupyter-notebook and run ML_Analysis.ipynb to get the prediction model. A hedged sketch of steps 2-4 follows.
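A minimal sketch of the export/import hops in steps 2-4; the JDBC URL, file names, and the jar/class in step 3 are placeholders, not the project's actual names:

    # step 2: dump the vector columns from beeline as TSV, then push to HDFS
    # (HIVE_JDBC_URL is a placeholder for the cluster's HiveServer2 endpoint)
    beeline -u "$HIVE_JDBC_URL" --outputformat=tsv2 -e \
      "SELECT case_status, wage_rate, decision_time, employing_time, state, occupation FROM data30;" \
      > vector.tsv
    hdfs dfs -mkdir -p /user/my1843/project/vector
    hdfs dfs -put vector.tsv /user/my1843/project/vector/

    # step 3: MapReduce job that encodes the raw fields as numeric features
    # (vectorize.jar and VectorJob are placeholder names)
    hadoop jar vectorize.jar VectorJob \
      /user/my1843/project/vector /user/my1843/project/vectorOut

    # step 4: pull the vectors down next to the notebook and run it
    hdfs dfs -get /user/my1843/project/vectorOut/part-r-00000 unit6-AnalyzeInMlLib/
    cd unit6-AnalyzeInMlLib && jupyter notebook ML_Analysis.ipynb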