This repository contains the solution for the skill set assessment task. The task involves creating an analytics engine to produce various reports, insights, and analyses using the provided datasets. The primary goal was to develop a recommendation system based on system scores for multiple users across different categories such as content, contacts, and events.
The datasets provided include:
- Users
- Organizations
- Contents
- Contacts
- Events
- Recommendations
- Kaggle Platform
- Python
- NumPy
- Pandas
- Users Dataset
  - Removed empty columns: `city`, `country`, `state`, `phone_number`, `linkedin_url`, `description`.
- Organizations Dataset
  - Removed empty columns: `email`, `year_founded`, `phone_number`, `linkedin_url`.
- Contents Dataset
  - Removed empty columns: `organisation_id`, `creator_id`.
  - Removed unhelpful column: `content_type` (all records had the same value).
  - Cast `id` column from float to integer.
  - Removed leading spaces from column names.
  - Dropped records without `id` values.
- Events Dataset
  - Removed empty column: `organisation_id`.
  - Corrected `location` value for `id` 854 from ',' to 'online'.
  - Processed the row with `id` 438, which had 68 concatenated records in the `title` field: identified the erroneous row, parsed the concatenated string into individual records, split them into 68 distinct rows, reinserted these rows into the dataset, and verified the consistency and accuracy of all other fields. As seen in the figure below, all 68 records were originally placed in one row in the `title` field.
  - Reformatted `Price` column: converted 'Free' to 0 and string values to float.
  - Reformatted `location` into three columns: `meeting`, `state`, `city`.
- Contacts Dataset
  - Removed empty columns: `organisation_id`, `picture_name`, `position`, `gender`, `phone_number`.
  - Removed unhelpful column: `role_id` (all records had the same value).
- Recommendations Dataset
  - Removed empty column: `user_score`.
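Several of the cleaning steps listed above follow a common pandas pattern. The sketch below illustrates them on small made-up frames; it is not the code from the Kaggle notebook. The sample rows, the function names (`clean_contents`, `price_to_float`, `split_location`, `clean_events`), the assumed `city, state` layout of `location`, the `in-person` label, and the choice to drop the original `location` column are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for the Contents and Events datasets.
contents = pd.DataFrame({
    " id": [1.0, 2.0, np.nan],
    " title": ["Intro", "Guide", "Orphan"],
    "content_type": ["article", "article", "article"],
    "organisation_id": [np.nan, np.nan, np.nan],
})
events = pd.DataFrame({
    "id": [1, 2, 3],
    "Price": ["Free", "25", "12.50"],
    "location": ["online", "Austin, TX", "Denver, CO"],
})


def clean_contents(df):
    df = df.copy()
    # Remove leading/trailing spaces from column names.
    df.columns = df.columns.str.strip()
    # Drop columns in which every value is missing.
    df = df.dropna(axis="columns", how="all")
    # Drop columns that hold one repeated value (e.g. content_type).
    constant = [c for c in df.columns if df[c].nunique(dropna=False) == 1]
    df = df.drop(columns=constant)
    # Drop records without an id, then cast id from float to integer.
    df = df.dropna(subset=["id"]).copy()
    df["id"] = df["id"].astype(int)
    return df.reset_index(drop=True)


def price_to_float(value):
    # 'Free' becomes 0; any other string is parsed as a float.
    text = str(value).strip()
    return 0.0 if text.lower() == "free" else float(text)


def split_location(value):
    # Assumed layout: either 'online' or 'city, state'.
    text = str(value).strip()
    if text.lower() == "online":
        return {"meeting": "online", "state": None, "city": None}
    city, _, state = text.partition(",")
    return {"meeting": "in-person", "state": state.strip(), "city": city.strip()}


def clean_events(df):
    df = df.copy()
    df["Price"] = df["Price"].map(price_to_float)
    parts = pd.DataFrame([split_location(v) for v in df["location"]], index=df.index)
    # Assumption: the original location column is replaced by the three new ones.
    return pd.concat([df.drop(columns=["location"]), parts], axis=1)


cleaned_contents = clean_contents(contents)
cleaned_events = clean_events(events)
```

The single-value check only makes sense on a full dataset; on a frame with one row it would flag every column.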
The recommendations table was divided into three categories based on `asset_type`:
- Content Recommendations
  - Renamed `asset_id` to `content_id`.
- Event Recommendations
  - Renamed `asset_id` to `event_id`.
- Contact Recommendations
  - Renamed `asset_id` to `contact_id`.
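The split and rename described above could look roughly like the following. This is a sketch under assumptions: the `asset_type` values are taken to be `content`, `event`, and `contact`, and the sample rows and the helper name `split_by_asset_type` are made up for illustration.

```python
import pandas as pd

# Hypothetical rows; the asset_type labels are assumed, not taken from the data.
recommendations = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "asset_type": ["content", "event", "contact", "content"],
    "asset_id": [10, 20, 30, 11],
    "system_score": [0.9, 0.4, 0.7, 0.5],
})


def split_by_asset_type(df):
    """Return one table per asset_type, with asset_id renamed to <type>_id."""
    tables = {}
    for asset_type, group in df.groupby("asset_type"):
        tables[asset_type] = (
            group.rename(columns={"asset_id": f"{asset_type}_id"})
            .reset_index(drop=True)
        )
    return tables


tables = split_by_asset_type(recommendations)
content_recs = tables["content"]  # has a content_id column instead of asset_id
```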
The cleaned datasets were exported for further use in building the recommendation system and analytics engine.
- Kaggle Platform
- Python
- NumPy
- Pandas
- Scikit-learn
- SciPy
A collaborative filtering recommendation system was developed for content recommendations based on `user_id`, `content_id`, and `system_score`. The `system_score` is assumed to reflect user interactions such as clicks or time spent on content. The same algorithm can be applied to events and contacts.
- User-based Recommendations
  - Input: `user_id`
  - Output: Top 5 recommended contents for the user.
- Content-based Recommendations
  - Input: `content_id`
  - Output: Top 5 contents similar to the provided content.
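To make the two entry points above concrete, here is a minimal sketch of one way to get both from a user-by-content matrix of `system_score` values. It uses item-item cosine similarity computed with NumPy, which may differ from the method in the notebook. The sample scores and the function names (`cosine_similarity`, `similar_contents`, `recommend_for_user`) are assumptions, and missing interactions are treated as a score of 0.

```python
import numpy as np
import pandas as pd

# Hypothetical interaction scores.
ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3, 3],
    "content_id": [10, 11, 10, 12, 11, 12, 13],
    "system_score": [5.0, 3.0, 4.0, 2.0, 4.0, 5.0, 1.0],
})

# Rows are users, columns are contents; missing interactions count as 0.
matrix = ratings.pivot_table(
    index="user_id", columns="content_id", values="system_score", fill_value=0.0
)


def cosine_similarity(m):
    """Pairwise cosine similarity between the rows of m."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty rows
    unit = m / norms
    return unit @ unit.T


# Content-to-content similarity, based on which users scored each content.
item_sim = pd.DataFrame(
    cosine_similarity(matrix.to_numpy(dtype=float).T),
    index=matrix.columns,
    columns=matrix.columns,
)


def similar_contents(content_id, n=5):
    """Top n contents most similar to the given content."""
    scores = item_sim[content_id].drop(content_id)
    return scores.sort_values(ascending=False).head(n).index.tolist()


def recommend_for_user(user_id, n=5):
    """Top n unseen contents, ranked by score-weighted similarity to seen ones."""
    seen = matrix.loc[user_id]
    scores = pd.Series(item_sim.to_numpy() @ seen.to_numpy(), index=matrix.columns)
    unseen = scores[seen == 0]
    return unseen.sort_values(ascending=False).head(n).index.tolist()


top_for_user = recommend_for_user(1)
top_similar = similar_contents(10)
```

With only four contents in the sample, fewer than five results come back; on the real data `n=5` would give the top 5 lists described above. Swapping in the event or contact tables only requires changing the id column name.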
- Power BI Desktop
Various reports covering all datasets were created and visualized in Power BI:
- Data Wrangling Code: Kaggle Notebook
- Content Recommendations System Code: Kaggle Notebook
- Power BI Reports PDF: Available in the `reports` directory.
- Analytics Engine Files: Power BI files (.pbit, .pbix) for interaction and visualization, available in the repository.
- GitHub Repository: Recommendation System