Summary

Code documentation

  1. The required packages were installed in a Python virtual environment; they are listed in requirements.txt.

  2. The code is structured as a class named NewsSentimentAnalysis, with a separate function for each phase of the task:
          • The first function, fetchArticles, fetches the page at the given URL (https://www.aljazeera.com/where/mozambique/) and extracts the top 10 news articles and their texts.
          • The saveAsJSON function stores the collected data in JSON form.
          • The article texts are then cleaned in the cleanArticles function.
          • Sentences are extracted from the articles in the extractSentences function.
          • Two separate functions exist for analysing the sentiment of the sentences in the articles.
                a. The textblob_analysis function uses the textblob library to compute the polarity of each sentence; the average polarity of all the sentences in an article is returned as the polarity of the article.
                b. The flair_analysis function uses the flair library and its pretrained models to get the sentiment of each sentence and its confidence score. The overall sentiment of each article is obtained by averaging the scores of its sentences.
          • The main function inside the class calls all the functions mentioned above in the right order.
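The sentence extraction and per-article averaging steps described above can be sketched roughly as follows. This is a minimal illustration, not the code in sentiment_analysis.py: the helper names (extract_sentences, article_polarity, article_sentiment) are hypothetical, the regex splitter is a naive stand-in for whatever extractSentences does, and treating NEGATIVE confidences as negative numbers before averaging is an assumption about how the flair scores are combined.

```python
import re
from statistics import mean
from typing import Callable, List, Tuple


def extract_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def article_polarity(sentences: List[str],
                     score_sentence: Callable[[str], float]) -> float:
    """Average a per-sentence polarity score over one article."""
    scores = [score_sentence(s) for s in sentences]
    return mean(scores) if scores else 0.0


def article_sentiment(predictions: List[Tuple[str, float]]) -> Tuple[str, float]:
    """Combine (label, confidence) pairs into one label and score.

    Assumption: a NEGATIVE prediction counts as minus its confidence,
    so the sign of the average decides the article's label.
    """
    signed = [c if label == "POSITIVE" else -c for label, c in predictions]
    avg = mean(signed) if signed else 0.0
    return ("POSITIVE" if avg >= 0 else "NEGATIVE", abs(avg))
```

With textblob, score_sentence could be something like lambda s: TextBlob(s).sentiment.polarity. With flair, the (label, confidence) pairs would come from the value and score of each sentence's predicted label after calling the classifier's predict method.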

  3. Running the code
    • To run the code, simply use the command
    python3 sentiment_analysis.py
    This command returns the analysis with flair
    CPU run time (user + sys): 9.484s

    • To see the results of the textblob analysis instead, run
    python3 sentiment_analysis.py --textblob
    CPU run time (user + sys): 5.757s
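
A switch like --textblob can be handled with the standard argparse module. The sketch below shows one way to do it, with flair as the default; the function names parse_args and choose_backend are hypothetical, and the actual script may parse its arguments differently.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; argv=None means 'use sys.argv'."""
    parser = argparse.ArgumentParser(
        description="Sentiment analysis of news articles")
    # store_true makes the flag False unless it is given on the command line.
    parser.add_argument("--textblob", action="store_true",
                        help="use textblob instead of the default flair analysis")
    return parser.parse_args(argv)


def choose_backend(argv=None) -> str:
    """Return the name of the analysis to run."""
    args = parse_args(argv)
    return "textblob" if args.textblob else "flair"
```

Calling choose_backend([]) selects flair, and choose_backend(["--textblob"]) selects textblob, which matches the two commands above.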

Results

I have used two approaches to analyse the sentiment of the articles and compared the results. The first uses TextBlob, a lexicon-based approach that estimates the sentiment of a sentence from the semantic orientation and intensity of the individual words in it. The drawback of this approach is that it does not model the relationships between the words in a sentence or their context, so it struggles with complicated sentences that contain many neutral words. The second approach, Flair, uses neural-network-based transformer models such as BERT, which allows it to take the context of each sentence into account when analysing the sentiment. Its main drawback is that it usually takes much longer to analyse the sentences.
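
To make the limitation of word-level scoring concrete, here is a toy lexicon scorer. It is not TextBlob's implementation (TextBlob's lexicon and rules are considerably richer); the lexicon TOY_LEXICON, its values, and the function lexicon_polarity are invented purely for illustration.

```python
import re
from statistics import mean

# Hypothetical word scores in [-1, 1]; not taken from any real lexicon.
TOY_LEXICON = {"peace": 0.6, "hope": 0.5, "attack": -0.7, "crisis": -0.6}


def lexicon_polarity(sentence: str) -> float:
    """Average the scores of known words; unknown words are ignored."""
    words = re.findall(r"[a-z']+", sentence.lower())
    hits = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    return mean(hits) if hits else 0.0


# "Hope for peace has collapsed" is a negative statement, but the scorer
# only sees the positive words "hope" and "peace" and returns a positive value.
misleading = lexicon_polarity("Hope for peace has collapsed")

# A sentence with no lexicon words comes out exactly neutral.
neutral = lexicon_polarity("Officials met on Tuesday")
```

A model that reads the whole sentence can, in principle, pick up that "collapsed" reverses the meaning, which is the advantage attributed to Flair above.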

Most of the retrieved articles have a negative sentiment, and the analysis using flair captures this trend fairly accurately, whereas the textblob approach assigns an almost neutral or only slightly polarized sentiment to all the articles. Hence, flair gives the more accurate results on this data.