UW-Madison Stat605 final group project.
- Jingshan Huang
- Xiangyu Wang
- Zijin Wang
- Yicen Liu
- Yuan Cao
Analyse Yelp dataset
1.convert the raw dataset files to .tsv with code/clean_review_tsv.sh and code/business_to_tsv.sh
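2.split review_clean.tsv into 3 subfiles with: split -d -n l/3 review_clean.tsv review_ (l/3 splits on line boundaries so no review is cut in half, and the review_ prefix names the pieces review_00, review_01, review_02)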
3.for each subfile review_0x (x = 0, 1, 2), run in bash: join -j 1 -t $'\t' <(sort review_0x) <(sort business_use.tsv) > review_join_0x
4.cat review_join_00 review_join_01 review_join_02 > review_join.tsv
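5.recode stars into 3 levels (1-2 -> 1, 3 -> 2, 4-5 -> 3): awk 'BEGIN{FS=OFS="\t"} {$2 = ($2 < 3) ? 1 : ($2 == 3) ? 2 : 3} 1' review_join.tsv > review_join2.tsv
">" has to redirect to a new file, not the file being read. OFS must be set to a tab as well, otherwise awk rewrites every modified line with spaces between fields.
6.keep only the year of the quoted date field: sed -E 's/\t"([[:digit:]]{4})-[^"]*"\t/\t\1\t/g' review_join2.tsv > review_join.tsv
-E is needed for the ( ) and {4} syntax, and [^"]* stops the match at the closing quote of the date instead of running into later fields.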
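
The join, star recoding and year extraction steps above can be tried end to end on a tiny sample. This is only a sketch: the helper names (join_on_business, bucket_stars, year_only) and the two sample rows are invented for illustration, and it assumes bash with GNU join, awk and sed.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and sample data are hypothetical, not from the project.

# Join reviews to businesses on column 1 (business_id).
# join requires both inputs to be sorted on the join key.
join_on_business() {
  join -j 1 -t $'\t' <(sort -t $'\t' -k1,1 "$1") <(sort -t $'\t' -k1,1 "$2")
}

# Recode stars in column 2: 1-2 -> 1, 3 -> 2, 4-5 -> 3.
# FS and OFS are both tabs so rewritten lines stay tab-separated.
bucket_stars() {
  awk 'BEGIN { FS = OFS = "\t" }
       { $2 = ($2 < 3) ? 1 : ($2 == 3) ? 2 : 3 } 1'
}

# Replace a quoted date field such as "2016-03-09" with its four-digit year.
# Relies on GNU sed understanding \t in the pattern.
year_only() {
  sed -E 's/\t"([0-9]{4})-[^"]*"/\t\1/'
}

# Invented sample data: two reviews and two businesses.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
printf 'b2\t1\t"Slow service"\t"2015-07-21"\nb1\t5\t"Great food"\t"2016-03-09"\n' > "$tmp/review.tsv"
printf 'b1\tPhoenix\nb2\tTempe\n' > "$tmp/business.tsv"

join_on_business "$tmp/review.tsv" "$tmp/business.tsv" | bucket_stars | year_only
```

Doing the recoding in a single awk call avoids the three-stage pipe, and each stage reads from a pipe or a different file, so no input file is overwritten while it is still being read.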
usage: ./clean_review.sh
converts review.json to review_clean.csv with 4 columns: business_id, stars, text, date
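
The contents of clean_review.sh are not reproduced here. As a rough illustration of this kind of conversion, the sketch below assumes that jq is installed and that review.json holds one JSON object per line; the function name reviews_to_csv is hypothetical and is not taken from the script.

```shell
#!/usr/bin/env bash
# Sketch only: reviews_to_csv is a hypothetical helper, not the project's script.
# Assumes jq is available and the input has one JSON object per line.

# Print the four columns business_id, stars, text, date as CSV rows.
# jq's @csv quotes string fields and escapes embedded double quotes.
reviews_to_csv() {
  jq -r '[.business_id, .stars, .text, .date] | @csv' "$1"
}

# Example call (paths as named in this README):
#   reviews_to_csv review.json > review_clean.csv
```

Letting jq do the CSV quoting keeps commas and quotes inside the review text from breaking the column layout.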