Tales_from_the_Cryptos_NLP: A Jupyter Notebook repository from maitree7

Tales from the Crypto - Natural Language Processing

Background

Cryptocurrency has been getting a lot of hype in the news lately, so we want to take stock, so to speak, of the latest news headlines regarding Bitcoin & Ethereum to get a better feel for the current public sentiment around each coin.

We use fundamental NLP techniques to understand the sentiment in the latest news articles featuring Bitcoin & Ethereum, along with other factors that may relate to the coin prices, such as common words & phrases and the organizations & entities mentioned in the articles.

Packages Used:

1. Sentiment Analysis

Use of Vader Sentiment Analysis

    from nltk.sentiment.vader import SentimentIntensityAnalyzer
    analyzer = SentimentIntensityAnalyzer()

2. Natural Language Processing

Natural Language Toolkit: NLTK
Tokenizing Words & Sentences with NLTK: Tokenizing
Generate N-grams: N-grams
Word Cloud: Word_Cloud

3. Named Entity Recognition

NER: spaCy
Visualizer: displaCy

Files

Starter Notebook

Sentiment Analysis

1. Use of newsapi to pull the latest news articles for Bitcoin and Ethereum

    btc_articles = newsapi.get_everything(q='bitcoin', language='en', sort_by='relevancy')

2. Creation of Dataframe of Sentiment Scores for each coin

Bitcoin	Ethereum
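
As a rough illustration of step 2, the sketch below turns a list of article dictionaries into one row of scores per article. `build_sentiment_rows` and `stub_scorer` are hypothetical names introduced here, not taken from the notebook. The sketch assumes each article has a `content` field, as News API responses do, and that the scorer returns VADER-style keys (`compound`, `pos`, `neu`, `neg`).

```python
def build_sentiment_rows(articles, score_fn):
    """Return one dict of sentiment scores per article that has text content.

    `articles` is a list of dicts with a "content" key (shaped like the
    "articles" list of a News API response); `score_fn` maps text to a dict
    with "compound", "pos", "neu" and "neg" keys.
    """
    rows = []
    for article in articles:
        text = article.get("content")
        if not text:  # skip articles that come back without content
            continue
        scores = score_fn(text)
        rows.append({
            "text": text,
            "compound": scores["compound"],
            "positive": scores["pos"],
            "neutral": scores["neu"],
            "negative": scores["neg"],
        })
    return rows


# Stub scorer so the sketch runs without NLTK; it is NOT a real sentiment model.
def stub_scorer(text):
    return {"compound": 0.0, "pos": 0.0, "neu": 1.0, "neg": 0.0}


# Hypothetical sample input for illustration only
sample_articles = [{"content": "Bitcoin rallies."}, {"content": None}]
rows = build_sentiment_rows(sample_articles, stub_scorer)
```

With the objects shown earlier, this would presumably be called as `build_sentiment_rows(btc_articles["articles"], analyzer.polarity_scores)`, and the resulting rows passed to `pd.DataFrame(rows)`.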

3. Descriptive statistics

Bitcoin	Ethereum

Which coin had the highest mean positive score?
```
   Bitcoin - 0.07
```
Which coin had the highest negative score?
```
   Ethereum - 0.025 
```
Which coin had the highest positive score?
```
   Ethereum - 0.9198 
```
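
Answers like these come from comparing summary statistics across the two sentiment DataFrames (in pandas, typically via `describe()`). A dependency-free sketch of that comparison follows; the score lists are made-up placeholders, not the notebook's data, and `compare` is a hypothetical helper.

```python
from statistics import mean

# Placeholder per-article positive scores (NOT the notebook's results)
positive_scores = {
    "Bitcoin": [0.10, 0.10, 0.10],
    "Ethereum": [0.00, 0.25, 0.00],
}


def compare(scores_by_coin, stat):
    """Return (coin, value) for the coin whose scores give the largest `stat`."""
    values = {coin: stat(scores) for coin, scores in scores_by_coin.items()}
    best = max(values, key=values.get)
    return best, values[best]


highest_mean = compare(positive_scores, mean)  # coin with the highest mean score
highest_max = compare(positive_scores, max)    # coin with the single highest score
```

The same helper can be pointed at the negative or compound scores to answer the other questions.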

Natural Language Processing

1. Import the following from nltk and the Python standard library:

```python
    from nltk.tokenize import word_tokenize, sent_tokenize
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer, PorterStemmer
    from string import punctuation
    import re
```

2. Use NLTK and Python to tokenize the text for each coin

Remove punctuation

    regex = re.compile("[^a-zA-Z0-9 ]")
    re_clean = regex.sub('', text)

Lowercase each word

    words = word_tokenize(re_clean.lower())

Remove stop words

    sw = set(stopwords.words('english'))

Lemmatize Words into Root words

    lemmatizer = WordNetLemmatizer()
    lem = [lemmatizer.lemmatize(word) for word in words]
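
Put together, the four steps might look like the function below. It is a sketch, not the notebook's exact tokenizer: `str.split` stands in for `word_tokenize`, the tiny stop-word set stands in for `stopwords.words('english')`, and the `lemmatize` argument defaults to a no-op so the example runs without NLTK data. Passing `WordNetLemmatizer().lemmatize` and the NLTK stop-word set should bring it closer to the steps above.

```python
import re

# Tiny illustrative stop-word set; NLTK's English list is much longer.
DEMO_STOP_WORDS = frozenset({"the", "a", "is", "of", "and", "to", "in"})


def tokenizer(text, stop_words=DEMO_STOP_WORDS, lemmatize=lambda word: word):
    """Clean, lowercase, split, lemmatize and filter one piece of text."""
    # Remove everything except letters, digits and spaces
    re_clean = re.compile("[^a-zA-Z0-9 ]").sub("", text)
    # Lowercase and split on whitespace (stand-in for word_tokenize)
    words = re_clean.lower().split()
    # Reduce each word to its root form (no-op unless a lemmatizer is passed)
    lem = [lemmatize(word) for word in words]
    # Drop stop words
    return [word for word in lem if word not in stop_words]


tokens = tokenizer("The price of Bitcoin is rising, and fast!")
```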

3. Look at the ngrams and word frequency for each coin

Use NLTK to produce the ngrams for N = 2

    # Collect the tokens from every article into one flat list
    def get_token(df):
        tokens = []
        for i in df['tokens']:
            tokens.extend(i)
        return tokens

    btc_tokens = get_token(btc_sentiment_df)
    eth_tokens = get_token(eth_sentiment_df)

    from collections import Counter
    from nltk import ngrams

    # Generate the Bitcoin N-grams where N=2
    def bigram_counter(tokens, N):
        words_count = dict(Counter(ngrams(tokens, n=N)))
        return words_count

    bigram_btc = bigram_counter(btc_tokens, 2)

List the top 10 words for each coin

    # Use the token_count function to generate the top 10 words from each coin
    def token_count(tokens, N=10):
        """Returns the top N tokens from the frequency count"""
        return Counter(tokens).most_common(N)
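
For a quick check of what this kind of counting returns, the sketch below runs it on a made-up token list. It pairs adjacent tokens with `zip`, which yields the same tuples as `ngrams(tokens, 2)`, so it needs only the standard library; the token list is illustrative and not taken from the news data.

```python
from collections import Counter

# Made-up tokens for illustration only
tokens = ["bitcoin", "price", "bitcoin", "price", "rally", "bitcoin"]

# Adjacent pairs; equivalent to nltk's ngrams(tokens, 2)
bigrams = list(zip(tokens, tokens[1:]))

# Most frequent single words and most frequent bigram
top_words = Counter(tokens).most_common(2)
top_bigrams = Counter(bigrams).most_common(1)
```

Each result is a list of `(item, count)` pairs, sorted from most to least frequent.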

Bitcoin	Ethereum

Generate word clouds for each coin to summarize the news for each coin.

    from wordcloud import WordCloud
    import matplotlib.pyplot as plt
    plt.style.use('seaborn-whitegrid')
    import matplotlib as mpl
    mpl.rcParams['figure.figsize'] = [20.0, 10.0]
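
A word cloud is typically built from one space-joined string of tokens. The sketch below assumes the `wordcloud` and `matplotlib` packages are installed; `cloud_text` and `show_word_cloud` are hypothetical helpers, and the plotting imports are kept local so the text-preparation step can run on its own.

```python
def cloud_text(tokens):
    """Join tokens into the single string that WordCloud.generate expects."""
    return " ".join(tokens)


def show_word_cloud(tokens, title):
    """Render a word cloud for one coin's tokens (needs wordcloud + matplotlib)."""
    from wordcloud import WordCloud
    import matplotlib.pyplot as plt

    wc = WordCloud(width=800, height=400).generate(cloud_text(tokens))
    plt.imshow(wc)
    plt.axis("off")
    plt.title(title)
    plt.show()


# Illustrative tokens only
text = cloud_text(["bitcoin", "price", "rally"])
```

A call such as `show_word_cloud(btc_tokens, "Bitcoin Word Cloud")` would then draw the figure for one coin.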

Named Entity Recognition

1. Import SpaCy and displacy

    import spacy
    from spacy import displacy

    # Load the spaCy model
    nlp = spacy.load('en_core_web_sm')

2. Build a named entity recognition model for both coins

    # Run the NER processor on all of the text
    doc = nlp(btc_content)

    # Add a title to the document
    doc.user_data["title"] = "BITCOIN NER"

3. Visualize the tags using SpaCy

    displacy.render(doc, style='ent')

4. List all Entities

    for ent in doc.ents:
        print('{} {}'.format(ent.text, ent.label_))
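
Printing every entity can produce a long, repetitive list. One option, sketched below, is to group the unique entity texts by label. `group_entities` is a hypothetical helper; it relies only on each entity having `text` and `label_` attributes, as the spans in spaCy's `doc.ents` do, so the example uses a small stand-in type instead of loading a model. The labels in the sample are arbitrary and do not reflect actual spaCy output.

```python
from collections import defaultdict, namedtuple


def group_entities(ents):
    """Map each entity label to a sorted list of unique entity texts."""
    grouped = defaultdict(set)
    for ent in ents:
        grouped[ent.label_].add(ent.text)
    return {label: sorted(texts) for label, texts in grouped.items()}


# Stand-in for spaCy entity spans, for illustration only
Ent = namedtuple("Ent", ["text", "label_"])
sample = [
    Ent("Bitcoin", "ORG"),
    Ent("Tesla", "ORG"),
    Ent("Bitcoin", "ORG"),
    Ent("2021", "DATE"),
]
grouped = group_entities(sample)
```

With a real document, the call would presumably be `group_entities(doc.ents)`.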

Bitcoin	Ethereum
