Apply big data analytics methods and algorithms to analyze and compare famous rappers based on their lyrics. The comparison methods rely on term frequency and vocabulary diversity (the number of unique words). We use classical word-count algorithms, TF.IDF scores, cosine similarity, statistics, and clustering.
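As a rough illustration of the measures named above, here is a minimal plain-Python sketch of diversity, TF.IDF weighting, and cosine similarity. It is not the project's code: the function names, the toy lyrics, and the particular TF.IDF variant (relative term frequency times the natural log of inverse document frequency) are assumptions made for the example.

```python
import math
from collections import Counter


def diversity(tokens):
    """Diversity as described above: the number of unique words."""
    return len(set(tokens))


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy "lyrics", one token list per artist.
lyrics = {
    "artist_a": "money money cars city".split(),
    "artist_b": "money city dreams dreams".split(),
    "artist_c": "love heart love night".split(),
}
names = list(lyrics)
vectors = dict(zip(names, tf_idf_vectors([lyrics[n] for n in names])))
sim_ab = cosine_similarity(vectors["artist_a"], vectors["artist_b"])
sim_ac = cosine_similarity(vectors["artist_a"], vectors["artist_c"])
```

In this toy data, artist_a and artist_b share two terms and so get a positive similarity, while artist_a and artist_c share none and get a similarity of zero.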
Requires BeautifulSoup
- Run
python crawler.py [url]
- Example:
python crawler.py http://www.azlyrics.com/k/kendricklamar.html
- See lyrics in data/azlyrics/
Format: python filter.py [filter]
Usage:
python filter.py L
(L = lemmatization)
- Requires NLTK's wordnet package to be installed.
- In the Python environment, import nltk and run
nltk.download()
The NLTK downloader will open. Under the Packages tab, select the wordnet package for installation.
python filter.py P
(P = profanity)
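The behavior of the profanity filter is not described here, so the following is only a sketch of one plausible reading: lyrics are tokenized and every token found in a blocklist is dropped. The blocklist contents, the tokenizer, and the function names are all hypothetical and are not taken from filter.py.

```python
import re

# Hypothetical blocklist; the word list used by filter.py is not shown here.
BLOCKLIST = {"darn", "heck"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def remove_blocked(tokens, blocklist=BLOCKLIST):
    """Keep only the tokens that are not in the blocklist."""
    return [token for token in tokens if token not in blocklist]


filtered = remove_blocked(tokenize("Well, darn it. Heck of a day!"))
print(filtered)  # ['well', 'it', 'of', 'a', 'day']
```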
This section uses several libraries, including Spark. In addition to installing the libraries, the following setup is required:
- Set the SPARK environment variable to the Spark folder location: SPARK= (for Linux)
- Add the following path to the PATH environment variable: export PATH=$PATH:${SPARK}/bin: (for Linux)
- Add Spark's Python libraries to the Python path: PYTHONPATH=${SPARK}/python (for Linux)
Running the different analyses:
- Run wordcount.py from the wordcount package first. The results are written to the output subfolder of the package. Those results are required by the other analyses.
- For 2D analysis/clustering of the songs (the dimensions are diversity and size), you can run any of cureclustersongvectors.py, diversitystats.py, or kmeansclustersongvectors.py from the clustering package. Refer to each module's documentation for more details.
- For vocabulary-set-based classification, run classification.py from the classification package. Refer to the module's documentation for more information.
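To make the 2D song representation concrete, the sketch below maps each song to a (size, diversity) point and groups the points with a small k-means loop. It is a plain-Python stand-in written for illustration, not the Spark-based code in the clustering package; the function names, the sample points, and the choice of k are assumptions.

```python
import random


def song_vector(tokens):
    """Map a tokenized song to a 2D point: (size, diversity)."""
    return (len(tokens), len(set(tokens)))


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20, seed=0):
    """Basic k-means: assign points to the nearest centroid, then re-average."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct input points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(
                range(k), key=lambda i: squared_distance(point, centroids[i])
            )
            clusters[nearest].append(point)
        # Recompute each centroid; keep the old one if its cluster is empty.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical (size, diversity) points for six songs.
points = [
    (100, 40), (110, 45), (105, 42),     # short songs, small vocabulary
    (400, 200), (410, 210), (395, 190),  # long songs, large vocabulary
]
centroids, clusters = kmeans(points, k=2)
```

With two well-separated groups like these, the loop settles on one cluster per group. Real song vectors overlap far more, so the result then depends on the initial centroids and on the value of k.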