Certified Data Scientist with 16+ years of cumulative experience, eager to apply machine learning, artificial intelligence, and data science skills.
Pinned Repositories
Assignment-04-Simple-Linear-Regression-2. Q2) Salary_hike -> Build a prediction model for Salary_hike. Build a simple linear regression model by performing EDA, apply the necessary transformations, and select the best model using R or Python. EDA and Data Visualization. Correlation Analysis. Model Building. Model Testing. Model Predictions.
Assignment-05-Multiple-Linear-Regression-2. Prepare a prediction model for profit of 50_startups data. Do transformations for getting better predictions of profit and make a table containing the R^2 value for each prepared model. R&D Spend: research and development spend in the past few years. Administration: spend on administration in the past few years. Marketing Spend: spend on marketing in the past few years. State: states from which data is collected. Profit: profit of each state in the past few years.
Q 24) A Government company claims that an average light bulb lasts 270 days. A researcher randomly selects 18 bulbs for testing. The sampled bulbs last an average of 260 days, with a standard deviation of 90 days. If the company's claim were true, what is the probability that 18 randomly selected bulbs would have an average life of no more than 260 days?
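A minimal sketch of the calculation this question asks for, assuming the sample mean follows a Student's t distribution with n - 1 = 17 degrees of freedom. The helper names `t_pdf` and `t_cdf` are illustrative, and the CDF is approximated with Simpson's rule so that only the standard library is needed.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=10_000):
    """P(T <= x), using Simpson's rule on [0, |x|] plus symmetry around 0."""
    a = abs(x)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # P(0 <= T <= |x|)
    return 0.5 + area if x >= 0 else 0.5 - area

# Values taken from the question text.
claimed_mean, sample_mean, sample_sd, n = 270, 260, 90, 18

# t statistic: distance of the sample mean from the claim, in standard errors.
t_stat = (sample_mean - claimed_mean) / (sample_sd / math.sqrt(n))
probability = t_cdf(t_stat, df=n - 1)
print(round(t_stat, 3), round(probability, 2))
```

Under these assumptions the t statistic is about -0.471 and the probability comes out close to 0.32.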
Assignment-11-Text-Mining-01-Elon-Musk. Perform sentiment analysis on the Elon Musk tweets (Exlon-musk.csv). Text Preprocessing: remove leading and trailing characters, remove empty strings (Python treats them as False), join the list into one string, remove Twitter username handles (@usernames), join the list into one string again, remove punctuation, remove https links or URLs within the text. Converting into Text Tokens: Tokenization, Remove Stopwords, Normalize the data, Stemming (Optional), Lemmatization. Feature Extraction: BoW CountVectorizer, CountVectorizer with N-grams (Bigrams & Trigrams), TF-IDF Vectorizer. Generate Word Cloud, Named Entity Recognition (NER), Emotion Mining - Sentiment Analysis.
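The preprocessing and bag-of-words steps listed above can be sketched with the standard library alone. This is an illustration, not the repository's code: the two sample tweets, the tiny `STOPWORDS` set, and the helper names `preprocess` and `ngrams` are all made up for the example, and `Counter` stands in for a CountVectorizer.

```python
import re
import string
from collections import Counter

# Tiny illustrative stopword list; a real run would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "to", "of", "and", "in", "on", "for", "at"}

def preprocess(tweet):
    """Strip handles, URLs and punctuation, then tokenize and drop stopwords."""
    text = tweet.strip()                        # leading/trailing whitespace
    text = re.sub(r"@\w+", "", text)            # remove @usernames
    text = re.sub(r"https?://\S+", "", text)    # remove URLs
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.lower().split()               # normalize case and tokenize
    return [t for t in tokens if t not in STOPWORDS]

def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical tweets, not taken from the dataset.
tweets = [
    "  @nasa The rocket launch is on schedule! https://example.com/launch ",
    "@someone Rocket launch delayed, new launch window soon.",
]

token_lists = [preprocess(t) for t in tweets]
unigram_counts = Counter(tok for toks in token_lists for tok in toks)
bigram_counts = Counter(bg for toks in token_lists for bg in ngrams(toks, 2))
print(unigram_counts.most_common(2))
print(bigram_counts.most_common(1))
```

Stemming, lemmatization, TF-IDF weighting and NER are left out here because they normally rely on third-party libraries.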
Splitting data into Linear Model, Exponential, Quadratic, Additive Seasonality, Additive Seasonality Quadratic, Multiplicative Seasonality, Multiplicative Additive Seasonality. Prediction for new time period.
Tuning of Hyperparameters: Batch Size and Epochs. Tuning of Hyperparameters: Learning Rate and Dropout Rate. Tuning of Hyperparameters: Activation Function and Kernel Initializer. Tuning of Hyperparameters: Number of Neurons in activation layer. Training model with optimum values of Hyperparameters.
vaitybharati's Repositories
Assignment-07-Clustering-Hierarchical-Airlines. Perform clustering (hierarchical) for the airlines data to obtain the optimum number of clusters. Draw inferences from the clusters obtained. Data Description: The file EastWestAirlines contains information on passengers who belong to an airline’s frequent flier program. For each passenger the data include information on their mileage history and on different ways they accrued or spent miles in the last year. The goal is to try to identify clusters of passengers that have similar characteristics for the purpose of targeting different segments for different types of mileage offers.
Dataset: Cars.csv. Calculate the probability of MPG of Cars for the cases below. MPG <- Cars$MPG a. P(MPG>38) b. P(MPG<40) c. P(20<MPG<50)
Supervised-ML---Multiple-Linear-Regression---Toyota-Cars. EDA, Correlation Analysis, Model Building, Model Testing, Model Validation Techniques, Collinearity Problem Check, Residual Analysis, Model Deletion Diagnostics (checking Outliers or Influencers) using two techniques: 1. Cook's Distance and 2. Leverage value, Improving the Model, Model - Re-build, Re-check and Re-improve - 2, Model - Re-build, Re-check and Re-improve - 3, Final Model, Model Predictions.
Supervised-ML-Decision-Tree-C5.0-Entropy-Iris-Flower-Using Entropy Criteria - Classification Model. Import Libraries and data set, EDA, Apply Label Encoding, Model Building - Building/Training Decision Tree Classifier (C5.0) using Entropy Criteria. Validation and Testing Decision Tree Classifier (C5.0) Model
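As a rough illustration of the entropy criterion named above, the sketch below computes Shannon entropy and the information gain of one candidate split. The label lists are hypothetical and the function names are chosen for this example; a decision tree library would evaluate many such splits and keep the one with the highest gain.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy reduction achieved by splitting parent into left and right."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Hypothetical node with two equally frequent classes.
parent = ["setosa"] * 4 + ["versicolor"] * 4

# A split that separates the two classes perfectly.
left, right = parent[:4], parent[4:]
gain = information_gain(parent, left, right)
print(entropy(parent), gain)
```

A perfectly mixed two-class node has entropy 1 bit, and a split that produces pure children recovers all of it as gain.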
Assignment-06-Logistic-Regression. Output variable -> y: whether the client has subscribed to a term deposit or not. Binomial ("yes" or "no").
Attribute information for the bank dataset.
Input variables:
# bank client data:
1 - age (numeric)
2 - job: type of job (categorical: "admin.","unknown","unemployed","management","housemaid","entrepreneur","student","blue-collar","self-employed","retired","technician","services")
3 - marital: marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or widowed)
4 - education (categorical: "unknown","secondary","primary","tertiary")
5 - default: has credit in default? (binary: "yes","no")
6 - balance: average yearly balance, in euros (numeric)
7 - housing: has housing loan? (binary: "yes","no")
8 - loan: has personal loan? (binary: "yes","no")
# related with the last contact of the current campaign:
9 - contact: contact communication type (categorical: "unknown","telephone","cellular")
10 - day: last contact day of the month (numeric)
11 - month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")
12 - duration: last contact duration, in seconds (numeric)
# other attributes:
13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
15 - previous: number of contacts performed before this campaign and for this client (numeric)
16 - poutcome: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")
Output variable (desired target):
17 - y: has the client subscribed to a term deposit? (binary: "yes","no")
Missing Attribute Values: None
Assignment-08-PCA-Data-Mining-Wine data. Perform principal component analysis, then perform clustering using the first 3 principal component scores (both hierarchical and k-means clustering, with a scree plot or elbow curve) to obtain the optimum number of clusters, and check whether this matches the number of clusters in the original data (the class column, ignored at the beginning, shows 3 clusters).
Unsupervised-ML-t-SNE-Data-Mining-Cancer. Import Libraries, Import Dataset, Convert data to array format, Separate array into input and output components, TSNE implementation, Cluster Visualization
Config files for my GitHub profile.
Assignment-07-DBSCAN-Clustering-Crimes. Perform Clustering for the crime data and identify the number of clusters formed and draw inferences.
Assignment-07-K-Means-Clustering-Airlines. Perform clustering (K-means clustering) for the airlines data to obtain the optimum number of clusters. Draw inferences from the clusters obtained. The file EastWestAirlines contains information on passengers who belong to an airline’s frequent flier program. For each passenger the data include information on their mileage history and on different ways they accrued or spent miles in the last year. The goal is to try to identify clusters of passengers that have similar characteristics for the purpose of targeting different segments for different types of mileage offers.
Association-Rules-Data-Mining-Books. Apriori Algorithm, Association rules with 10% Support and 70% confidence, Association rules with 20% Support and 60% confidence, Association rules with 5% Support and 80% confidence, visualization of obtained rule.
Association Rules Data Mining (Groceries). Converting the data frame into a list of lists, Using TransactionEncoder to transform this dataset into a logical data frame, Building the data frame: rows are logical and columns are the items that have been purchased, Print column names, Drop the nan column from the data frame, Most popular items, Top 10 popular items, Barplot visualization of popular items, Apriori Algorithm: Association rules with 5% Support and 70% confidence, Association rules with 1% Support and 80% confidence, Visualization of obtained rule.
Assignment-09-Association-Rules-Data-Mining-my_movies. Apriori Algorithm. Association rules with 10% Support and 70% confidence. Association rules with 5% Support and 90% confidence. Lift Ratio > 1 is a good influential rule in selecting the associated transactions. Visualization of obtained rule.
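To make the support, confidence and lift thresholds above concrete, here is a small sketch that scores a single rule by hand. The five transactions and the item names are hypothetical, and the helper names are chosen for this example; the Apriori algorithm itself, which searches for frequent itemsets, is not implemented here.

```python
# Hypothetical transactions: each set lists the items seen together.
transactions = [
    {"movie_a", "movie_b"},
    {"movie_a", "movie_b", "movie_c"},
    {"movie_c"},
    {"movie_a", "movie_b"},
    {"movie_c", "movie_d"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def rule_metrics(antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    sup = support(antecedent | consequent)
    conf = sup / support(antecedent)
    lift = conf / support(consequent)
    return sup, conf, lift

sup, conf, lift = rule_metrics({"movie_a"}, {"movie_b"})

# Thresholds from the description: 10% support and 70% confidence.
passes = sup >= 0.10 and conf >= 0.70
print(sup, conf, round(lift, 3), passes)
```

A lift above 1 means the antecedent and consequent occur together more often than they would if they were independent, which is why such rules are treated as influential.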
Assignment-10-Recommendation-System-Data-Mining-books. Recommend the best book based on the ratings: Sort by User IDs, number of unique users in the dataset, number of unique books in the dataset, converting long data into wide data using a pivot table, replacing the index values by unique user IDs, Impute those NaNs with 0 values, Calculating Cosine Similarity between Users on array data, Store the results in a dataframe format, Set the index and column names to user IDs, Nullifying diagonal values, Most Similar Users, extract the books which userId 162107 & 276726 have read, extract the books which userId 276729 & 276726 have read.
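The user-similarity steps above (fill missing ratings with 0, compute cosine similarity between users, nullify the diagonal, pick the most similar user) can be sketched as follows. The `ratings` dictionary, user names and item names are hypothetical; plain lists stand in for the pivot table and array.

```python
import math

# Hypothetical user -> {item: rating} data; missing ratings are treated as 0.
ratings = {
    "u1": {"b1": 5, "b2": 3},
    "u2": {"b1": 4, "b2": 2, "b3": 1},
    "u3": {"b3": 5},
}
items = sorted({item for user_ratings in ratings.values() for item in user_ratings})

def vector(user):
    """One row of the wide (pivoted) table, with NaNs imputed as 0."""
    return [ratings[user].get(item, 0) for item in items]

def cosine(u, v):
    """Cosine similarity between two equal-length rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

users = list(ratings)

# Similarity table with the diagonal set to 0 so a user never matches itself.
similarity = {
    u: {v: 0.0 if u == v else cosine(vector(u), vector(v)) for v in users}
    for u in users
}
most_similar = {u: max(similarity[u], key=similarity[u].get) for u in users}
print(most_similar)
```

Items rated by the most similar user but not by the target user are then natural candidates for recommendation.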
NLP: Sentiment Analysis or Emotion Mining on Amazon Product Reviews - Part-1. Let’s learn the NLP techniques to perform Sentiment Analysis or Emotion Mining on extracted Product Reviews from Amazon. Part-1 covers Text preprocessing and Feature extraction, the next part covers Sentiment Analysis or Emotion Mining on text corpus.
Text-Mining-Amazon-Reviews-using-Scrapy. Ever wondered how well your product performs and how people feel about it? Life would be easier with a way to find out. The solution: text mining techniques.
Supervised-ML---Multiple-Linear-Regression---Cars-dataset. Model MPG of a car based on other variables. EDA, Correlation Analysis, Model Building, Model Testing, Model Validation Techniques, Collinearity Problem Check, Residual Analysis, Model Deletion Diagnostics (checking Outliers or Influencers) using two techniques: 1. Cook's Distance and 2. Leverage value, Improving the Model, Model - Re-build, Re-check and Re-improve - 2, Model - Re-build, Re-check and Re-improve - 3, Final Model, Model Predictions.
Supervised-ML---Logistic-Regression---Appointing-Attorney-or-not. EDA, Model Building, Model Predictions, Testing Model Accuracy, ROC Curve plotting and finding AUC value.
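One way to read the AUC value mentioned above: it equals the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative one. The sketch below computes it that way on hypothetical labels and scores, rather than by tracing the ROC curve; the function name `roc_auc` is chosen for this example.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical true classes and predicted probabilities from a classifier.
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.4, 0.6, 0.7, 0.8]

auc = roc_auc(labels, scores)
print(round(auc, 3))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect separation of the two classes.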
Unsupervised-ML---Hierarchical-Clustering-University Data. Import libraries, Import dataset, Create Normalized data frame (considering only the numerical part of data), Create dendrograms, Create Clusters, Plot Clusters.
Unsupervised-ML---K-Means-Clustering-Non-Hierarchical-Clustering-Univ. Use Elbow Graph to find optimum number of clusters (K value) from K values range. The K-means algorithm aims to choose centroids that minimise the inertia, or within-cluster sum-of-squares criterion WCSS. Plot K values range vs WCSS to get Elbow graph for choosing K (no. of clusters)
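The elbow idea described above can be sketched with a bare-bones K-means that reports WCSS for each K. Everything here is an assumption made for illustration: the nine 2-D points are invented, and the starting centroids are picked at evenly spaced positions in the list, which is a simplification of the random or k-means++ initialization a library would use.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))

def kmeans_wcss(points, k, max_iter=100):
    """Run Lloyd's algorithm and return the within-cluster sum of squares."""
    step = len(points) // k
    centroids = points[::step][:k]  # simplified, deterministic initialization
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        updated = [mean_point(c) if c else centroids[j] for j, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

# Hypothetical data with three visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (1, 8), (2, 9), (1, 9)]

wcss = {k: kmeans_wcss(points, k) for k in range(1, 5)}
for k, value in wcss.items():
    print(k, round(value, 2))
```

Plotting K against these WCSS values gives the elbow graph: the drop is steep up to K = 3 for this toy data and flattens afterwards.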
Unsupervised-ML---DBSCAN-Clustering-Wholesale-Customers. Import Libraries, Import Dataset, Normalize heterogeneous numerical data by applying StandardScaler fit_transform to the dataset, DBSCAN Clustering, Noisy samples are given the label -1, Adding clusters to dataset.
Unsupervised-ML---Association-Rules-Data-Mining-Titanic. Data Preprocessing: As the data is in categorical format, we use One Hot Encoding to convert it into numerical format. Apriori Algorithm: frequent item sets & association rules. A leverage value of 0 indicates independence; its range is [-1, 1]. A high conviction value means that the consequent is highly dependent on the antecedent; its range is [0, inf]. Lift Ratio > 1 is a good influential rule in selecting the associated transactions.
Unsupervised-ML---PCA-Data-Mining-Univ. Import Dataset, Converting data to numpy array, Normalizing the numerical data, Applying PCA Fit Transform to dataset, PCA Components matrix or covariance Matrix, Variance of each PCA, Final Dataframe, Visualization of PCAs, Eigenvectors and eigenvalues for a given matrix.
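The leverage and conviction statements above follow from their usual definitions, sketched here on hypothetical support values. The function names and numbers are chosen for this example: leverage is the joint support minus the product of the individual supports, and conviction is (1 - support of consequent) / (1 - confidence).

```python
import math

def leverage(sup_both, sup_antecedent, sup_consequent):
    """Observed joint support minus the support expected under independence."""
    return sup_both - sup_antecedent * sup_consequent

def conviction(sup_both, sup_antecedent, sup_consequent):
    """How strongly the consequent depends on the antecedent; inf for a rule that never fails."""
    confidence = sup_both / sup_antecedent
    if confidence == 1:
        return math.inf
    return (1 - sup_consequent) / (1 - confidence)

# Hypothetical independent items: joint support equals the product of supports.
independent = (0.2, 0.5, 0.4)

# Hypothetical rule that always holds: every antecedent transaction has the consequent.
always_true = (0.4, 0.4, 0.5)

print(leverage(*independent), conviction(*independent))
print(leverage(*always_true), conviction(*always_true))
```

For the independent case leverage is 0 and conviction is 1, while the rule that always holds has positive leverage and infinite conviction.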
Unsupervised-ML-Recommendation-System-Data-Mining-Movies. Recommend movies based on the ratings: Sort by User IDs, number of unique users in the dataset, number of unique movies in the dataset, Impute those NaNs with 0 values, Calculating Cosine Similarity between Users on array data, Store the results in a dataframe format, Set the index and column names to user IDs, Slicing first 5 rows and first 5 columns, Nullifying diagonal values, Most Similar Users, extract the movies which userId 6 & 168 have watched.