The #datacleaningchallenge is an twitter event aimed at promoting best practices in data cleaning. The challenge encourages participants to share their experiences, tips, and tricks in data cleaning by working on a dirty data. It also served as a medium for enthusiast to get a feel of what data cleaning is all about while receiving mentorship from the organizers. The online event launched on the 9th of March 2023 via a twitter space and the event lasted for the whole or March.
This project is a Python-based analysis of tweets related to the #datacleaningchallenge. The Data Visualization was done using Power BI. I wrote a Medium post explaining my insights in more details. Check it out HERE
The Organizers of the #datacleaningchallenge that took place in March are looking to start another challenge in April. I aim to provide insights on the data gotten from the challenge, how people perceive data cleaning, the most talked about tools which could give a hint on the tools the participants used and the strategies on how to make the next challenge even bigger.
I gathered my data by using python Snscrape library. I scraped my twitter data using the #DataCleaningChallenge hashtagg from 1st of March to 31st of March 2023. The function I used to scrape the data is shown beloow
# to define a function that scrapes tweet from tweet
def scrape_hashtag_tweets(hashtag, start_date, end_date):
"""
Scrapes all tweets containing a certain hashtag within a specified time frame,
ignoring case sensitivity.
Args:
hashtag (str): the hashtag to scrape, without the "#" symbol
start_date (str): the start date in "YYYY-MM-DD" format
end_date (str): the end date in "YYYY-MM-DD" format
"""
# Convert the start and end dates to datetime objects
start_dt = dt.datetime.strptime(start_date, "%Y-%m-%d")
end_dt = dt.datetime.strptime(end_date, "%Y-%m-%d")
# Create a list to store the scraped tweets
tweets = []
# Iterate over all tweets containing the specified hashtag
for tweet in sntwitter.TwitterSearchScraper(f"#{hashtag} since:{start_date} until:{end_date}").get_items():
# Ignore tweets that don't match the hashtag (ignoring case sensitivity)
if hashtag.lower() not in tweet.content.lower():
continue
# Add the relevant information about the tweet to the list
tweets.append({
"id": tweet.id, #
"content": tweet.content,
"timestamp": tweet.date,
"username": tweet.user.username,
"userdisplayname": tweet.user.displayname,
"userlocation": tweet.user.location,
"retweetCount": tweet.retweetCount,
"likeCount": tweet.likeCount,
"language": tweet.lang,
"source": tweet.source
})
return tweets
return go(f, seed, [])
}
In total, I scraped 922 tweets and 11 columns.
Data Cleaning and Preprocessing was carried out using python. The few steps I took to achieve this are
- Ensuring no duplicate tweet id
- Checking for null values and handling them if exists
- Ensuring the data types are consistent
- Extracting the content from the
source
column which appeared in html tag format
# python function extract the content between the HTML tags using str.extract()
def extract_html_tags(df, column_name):
content = df[column_name].str.extract(r'>(.*?)<')
return content
# apply the extract_html_tags function to the 'source' column of the tweets dataframe
tweets['platform'] = extract_html_tags(tweets, 'source')
- Presnece of three letter language code in the
language
column. Language code should not exceed two letters except from Undefined (und) - I also extracted the date (yyyy/mm/dd) from the
timestamp
column.
The cleaned data set was Explored using python and then visualized using Power BI. There is no data model except the relationship between my Calendat table and my cleaned dataset. The Report I made is shown below
I carried out sentiment analysis which would tell how people perceived the data cleaning challenge. Positive sentiment could indicate that users are finding the challenge to be engaging, informative or useful, while negative sentiment may suggest that users are not enjoying the challenge or having difficulty with it.
I then created a word cloud to visualize the qualitative data in order find out the most commonly used words. The word cloud was done using Word Art
For the data cleaning challenge, Participants were allowed to use the popular data analytics tools such as Power BI, Python, Excel, SQL, R and Tableau. The number of times data cleaning tools are mentioned in tweets can provide you with valuable insights into which tools are popular among Twitter users, preferred by the users for carrying data cleaning tasks and and trending tools in the data cleaning community
The insights I got from the Analysis and visualization are documented in my GitHub Repository and my Medium Post
Before I make recommendations, I would like to congratulate the Organizers and the speakers for a job well done. On the 11th of March 2023, #DataCleaningChallenge was one of the trending topics in Nigeria.
Some recommendations I can make are
- For wider reach and more online presence, I recommend that the organizers encourages the participants to write their experience, the things they learnt and the struggles they faced when participating in subsequent challenges organized. This way it gives the Organizers the chance to help them and improve on the sentiments from the participants.
- Looking at the most talked about tool, we can see that the users have preference tending towards Excel, SQL and Python for data cleaning. For subsequent data cleaning challenge, I recommend teaching sessions be held on how to effectively use these three tools and take their skillset to the next level. This should be done before commencement of subsequent challenges so that the participants would not be stuck when the challenges begin.
- Concerning tools that are the least frequently talked about, it could be related to the fact that the users find them difficult to use. So to help individuals get familiar with more data analytics tool, The organizers could fix session to train all the participants on one tool at a time, maybe Power BI since it has similarities to Excel.
- I analyzed for the top mentioned Twitter handles and the most common hashtags. For future challenges and to foster collaboration, I would recommend the Organizers collaborate with anyone from the names below. This could help for wider reach and bring in more users to partake in the challenge.
- I lastly recommend that a broadcasting strategy is put in place to make this challenge known to more users. For example, a flier could be made as a means of making the challenge look more official. You can even plead with Data Influencers to spread the word and encourage those who participated in the previous challenge to spread the word as well.
Generally, I would say the challenge was a success and it did well in terms of social media outreach but there is room for improvement.
For subsequent challenges, I recommend a target to be set on the number of users tweeting about the challenge. A total of 502 twitter users were discovered for this challenge, 1000 could be the target for the next challenge.
Thank you for reading!