In this project, we apply big data analytics techniques and algorithms to study the YouTube-8M dataset, which contains machine-generated labels, RGB features and audio features for each of its YouTube videos. We identify the dominant video categories, mine frequent itemsets of video categories, and group videos into clusters using the available features.
The YouTube-8M video-level features dataset is used in this project. Video-level features are stored as tensorflow.Example protocol buffers. The structure of one tensorflow.Example proto, shown in text format, is:
features: {
  feature: {
    key: "id"
    value: {
      bytes_list: {
        value: (Video id)
      }
    }
  }
  feature: {
    key: "labels"
    value: {
      int64_list: {
        value: [1, 522, 11, 172]  # label list
      }
    }
  }
  feature: {
    # Average of all 'rgb' features for the video
    key: "mean_rgb"
    value: {
      float_list: {
        value: [1024 float features]
      }
    }
  }
  feature: {
    # Average of all 'audio' features for the video
    key: "mean_audio"
    value: {
      float_list: {
        value: [128 float features]
      }
    }
  }
}
- Dominant categories on YouTube
- K-th Frequent Itemsets of video categories
- Group videos into clusters according to audio features
- Count video categories under the MapReduce framework, using the category id as the key
- Implement the Apriori algorithm under the MapReduce framework
- Implement the K-Means algorithm
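
To make the frequent-itemset task concrete, here is a minimal single-process sketch of Apriori over label lists. It is not the project's MapReduce implementation: the function names, the sample transactions and the `min_support` value are all assumptions made for illustration. In a real MapReduce job, the counting step would likely be split into a mapper that emits `(candidate, 1)` for each candidate contained in a video's label list and a reducer that sums the counts per candidate.

```python
from collections import Counter
from itertools import combinations


def count_candidates(transactions, candidates):
    """Count how many transactions contain each candidate itemset."""
    counts = Counter()
    for labels in transactions:        # "map": look at one label list at a time
        label_set = set(labels)
        for cand in candidates:
            if cand <= label_set:      # candidate is a subset of this label list
                counts[cand] += 1      # "reduce": sum per candidate key
    return counts


def apriori(transactions, min_support):
    """Return {itemset: support count} for every frequent itemset."""
    singletons = {frozenset([x]) for t in transactions for x in t}
    counted = count_candidates(transactions, singletons)
    current = {c: n for c, n in counted.items() if n >= min_support}
    frequent = {}
    k = 1
    while current:
        frequent.update(current)
        k += 1
        # Join step: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        counted = count_candidates(transactions, candidates)
        current = {c: n for c, n in counted.items() if n >= min_support}
    return frequent


# Hypothetical label lists (one per video); only the first mirrors the proto example.
transactions = [[1, 522, 11, 172], [1, 11], [1, 522], [11, 172]]
result = apriori(transactions, min_support=2)
for itemset, support in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

Each pass over the data handles one itemset size, which is why the algorithm maps naturally onto one MapReduce job per value of k.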
# Execute MapReduce job
yarn jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.1.3.jar \
-files hdfs:///yt8m-analysis/task1/mapper.py,hdfs:///yt8m-analysis/task1/reducer.py \
-mapper 'python3 mapper.py' \
-reducer 'python3 reducer.py' \
-input /preprocessed_data/category.txt \
-output /yt8m-analysis/task1/output
# View output
hadoop fs -text /yt8m-analysis/task1/output/*
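
The contents of `mapper.py` and `reducer.py` are not shown here, so the following is only a sketch of what a category-counting pair could look like, written as plain functions so it can be tried locally without Hadoop. It assumes each input line holds the category ids of one video, separated by spaces or commas; the actual layout of `category.txt` may differ. The function names and sample lines are hypothetical.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper logic: emit (category_id, 1) for every category id on every line."""
    for line in lines:
        for category_id in line.replace(",", " ").split():
            yield category_id, 1


def reduce_pairs(sorted_pairs):
    """Reducer logic: sum the counts per key; input must be sorted by key."""
    for category_id, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield category_id, sum(count for _, count in group)


# Local simulation of map -> shuffle/sort -> reduce on made-up sample lines.
sample_lines = ["1 522 11 172", "1,11", "522"]
counts = dict(reduce_pairs(sorted(map_lines(sample_lines))))
for category_id, total in counts.items():
    print(f"{category_id}\t{total}")
```

In a Hadoop Streaming job, the mapper would read lines from standard input and print tab-separated `key<TAB>value` pairs, and the framework would sort by key before the reducer sees them; the `sorted(...)` call above stands in for that shuffle step.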