This challenge is to implement two features:
- Clean and extract the text from the raw JSON tweets that come from the Twitter Streaming API, and track the number of tweets that contain unicode.
- Calculate the average degree of a vertex in a Twitter hashtag graph for the last 60 seconds, and update this each time a new tweet appears.
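To illustrate the second feature (our example, not from the challenge statement): hashtags are the vertices and every pair of hashtags that co-occur in a tweet forms an edge, so a single tweet with hashtags #Apache, #Hadoop and #Storm yields 3 vertices and 3 edges, giving an average degree of 2 * |E| / |V| = 2 * 3 / 3 = 2.0.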
```
├── README.md
├── run.sh
├── src
│   ├── average_degree.py
│   └── tweets_cleaned.py
├── tweet_input
│   └── tweets.txt
└── tweet_output
    ├── ft1.txt
    └── ft2.txt
```
- Feature 1
- Feature 2
- Run the scripts
- For challenge 1, the script can be found at `src/tweets_cleaned.py`.
- The following libraries were used:
- Explanation of the algorithm (a minimal sketch follows this list):
  - Read from the source file one line (tweet) at a time
  - Parse the line as JSON
  - Some lines may not have the necessary keys, so check for that:
    - If the line does NOT contain the keys we are interested in, skip it (take another line)
    - If the line contains the keys we are interested in, retrieve the values at `created_at` (the timestamp) and `text` (the content of the tweet)
  - Clean the contents using the `cleanString` function
  - Check whether the string returned by the previous step contains non-ASCII characters:
    - If the extracted text (from key `text`) contains only ASCII characters, remove escape characters and newlines; this is done by the `transformString` function
    - If the extracted text (from key `text`) contains characters other than ASCII, remove those characters together with escapes and newlines (again via `transformString`), and increment the unicode tweets counter
  - Append the cleaned line to the text file with clean tweets
  - Once all lines have been curated and cleaned, write the number of tweets that contained unicode at the end of the file
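A minimal sketch of this flow is below. `cleanString` and `transformString` are the script's function names from the description above, but their bodies here, along with the exact output line format, are our assumptions for illustration, not the script's actual code:

```python
import json
import sys

def cleanString(text):
    # Assumption: drop any non-ASCII characters from the tweet text.
    return ''.join(ch for ch in text if ord(ch) < 128)

def transformString(text):
    # Assumption: collapse newlines and tab escapes into spaces.
    return text.replace('\n', ' ').replace('\t', ' ')

def main(infile, outfile):
    unicode_count = 0
    with open(infile) as src, open(outfile, 'w') as dst:
        for line in src:
            try:
                tweet = json.loads(line)
            except ValueError:
                continue  # not valid JSON, take another line
            if 'created_at' not in tweet or 'text' not in tweet:
                continue  # e.g. rate-limit messages lack these keys
            text = tweet['text']
            cleaned = cleanString(text)
            if cleaned != text:
                unicode_count += 1  # the tweet contained non-ASCII characters
            # Output format is an assumption; the actual script may differ.
            dst.write('%s (timestamp: %s)\n' % (transformString(cleaned),
                                                tweet['created_at']))
        dst.write('\n%d tweets contained unicode.\n' % unicode_count)

if __name__ == '__main__':
    main(sys.argv[1], sys.argv[2])
```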
- How to use the script independently:

  ```
  # cd src/
  # python tweets_cleaned.py ../tweet_input/tweets.txt ../tweet_output/ft1.txt
  ```

  The output will be in the `tweet_output/ft1.txt` file.
- For challenge 2, the script can be found at `src/average_degree.py`.
- The following libraries were used:
- Explanation of the algorithm (a minimal sketch follows this list):
  - Read from the source file one line (tweet) at a time
  - The line (tweet) is curated by the `cleanData` function
  - A curated line (tweet) with more than one hashtag is returned as a dictionary containing the list of hashtags, the timestamp, and the list of edges formed between those hashtags
  - The line is appended to a master stack list (`L`); this stack holds 60 seconds' worth of tweets
  - Check whether any tweets are older than 60 seconds compared with the latest tweet:
    - If there are old tweets, pop the first entry in the master list
    - Otherwise, check whether the new line (tweet) contributes any new vertices (hashtags) to the master vertices list (`V`)
  - Extend the master vertices list (`V`) with the new unique hashtags
  - Create an ad hoc list (`E`) of possible edges from all the tweets in the master list (`L`); this list is emptied every time a new line (tweet) is parsed
  - The `createGraph` function builds a new graph from the edges (`E`) and vertices (`V`) and returns the average degree of the graph
  - The return value of `createGraph` is written to the output file
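A minimal sketch of the windowing and average-degree logic is below. The names `L`, `V`, `E` and `createGraph` come from the description above; the tweet representation and the rebuild-per-tweet bookkeeping are our simplifications, not the script's actual code:

```python
from datetime import timedelta
from itertools import combinations

def createGraph(E, V):
    # Average degree of an undirected graph: each edge adds 2 to the
    # total degree, so the mean over vertices is 2 * |E| / |V|.
    return 2.0 * len(E) / len(V) if V else 0.0

def rolling_average_degree(tweets, out):
    """tweets: iterable of (datetime, set_of_hashtags), already cleaned."""
    L = []  # master stack holding the last 60 seconds' worth of tweets
    for ts, tags in tweets:
        if len(tags) < 2:
            continue  # a tweet needs at least 2 hashtags to form edges
        L.append((ts, tags))
        latest = max(t for t, _ in L)
        # Evict tweets older than 60 seconds relative to the latest tweet.
        while L and latest - L[0][0] > timedelta(seconds=60):
            L.pop(0)
        # Rebuild the vertices (V) and edges (E) from the current window.
        V, E = set(), set()
        for _, tset in L:
            V.update(tset)
            E.update(frozenset(p) for p in combinations(tset, 2))
        out.write('%.2f\n' % createGraph(E, V))
```

Rebuilding `V` and `E` from the window on every tweet is the simplest correct bookkeeping, at the cost of some repeated work.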
- How to use the script independently:

  ```
  # cd src/
  # python average_degree.py ../tweet_output/ft1.txt ../tweet_output/ft2.txt
  ```

  NOTE: the input for this script is the file with all the cleaned tweets produced by the previous script.

  The output will be in the `tweet_output/ft2.txt` file.
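For illustration (our numbers, not from the challenge statement): each line of `ft2.txt` holds the rolling average after one more tweet is processed. If the 60-second window contains two tweets with hashtag sets {#A, #B} and {#B, #C}, then V = {A, B, C}, E = {A-B, B-C}, and the value written out is 2 * 2 / 3 ≈ 1.33.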
Alternatively, you can run both scripts using the `run.sh` script in this folder:

```
# ./run.sh
```
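The contents of `run.sh` are not shown here, but for this layout it plausibly just chains the two invocations above:

```
#!/usr/bin/env bash
# Hypothetical run.sh matching the commands in this README.
python ./src/tweets_cleaned.py ./tweet_input/tweets.txt ./tweet_output/ft1.txt
python ./src/average_degree.py ./tweet_output/ft1.txt ./tweet_output/ft2.txt
```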
ENJOY!!!