Pinned Repositories
anagram_check
An anagram is a word or phrase formed by rearranging the letters of a different word or phrase. In other words, both strings must contain exactly the same letters with exactly the same frequencies. Write a Python script that reads two strings from the command line and determines whether they are anagrams. If they are not, the script should find and print the minimum number of character deletions required to make the two strings anagrams. Otherwise, it should just print that they are anagrams.

**Input Format**

- The first line contains a single string, **a**.
- The second line contains a single string, **b**.

Expected input and output:

```
$ python3 solution.py
a: Tom Marvolo Riddle
b: I Am Lord Voldemort
remove 7 characters from 'Tom Marvolo Riddle' and 8 characters from 'I Am Lord Voldemort'

$ python3 solution.py
a: tom marvolo riddle
b: i am lord voldemort
remove 0 characters from 'tom marvolo riddle' and 1 characters from 'i am lord voldemort'

$ python3 solution.py
a: tom marvolo riddle
b: i am lordvoldemort
they are anagrams

$ python3 solution.py
a: tom riddle
b: voldemort
remove 3 characters from 'tom riddle' and 2 characters from 'voldemort'
```
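One possible way to compute the deletion counts is to compare character frequencies. The sketch below is not taken from the repository; the function names `deletions_to_anagram` and `report` are hypothetical. It assumes, as the expected output suggests, that the comparison is case-sensitive and that spaces count as characters.

```python
from collections import Counter


def deletions_to_anagram(a: str, b: str) -> tuple[int, int]:
    """Return (deletions from a, deletions from b) needed to make them anagrams.

    Assumption: comparison is case-sensitive and spaces are ordinary characters.
    """
    count_a, count_b = Counter(a), Counter(b)
    # Counter subtraction keeps only positive counts, i.e. the surplus characters.
    surplus_a = sum((count_a - count_b).values())
    surplus_b = sum((count_b - count_a).values())
    return surplus_a, surplus_b


def report(a: str, b: str) -> str:
    """Format the result in the style of the expected output."""
    surplus_a, surplus_b = deletions_to_anagram(a, b)
    if surplus_a == 0 and surplus_b == 0:
        return "they are anagrams"
    return (
        f"remove {surplus_a} characters from '{a}' "
        f"and {surplus_b} characters from '{b}'"
    )
```

Reading the two strings with `input("a: ")` and `input("b: ")` and printing `report(a, b)` would give a transcript in the shape of the one shown above.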
bigquery_mysql_connect
Create an ETL job with Python. The Python file has to retrieve data from BigQuery piece by piece (10k, 100k, etc.). The data can be stored in any relational database (e.g. MySQL) on the local machine.

- The file takes two date parameters: 'batch' and 'realtime'. The 'batch' parameter should fetch the past data and write it to the database as fast as possible; measure its run time and improve the performance (hint: parallel processing). The 'realtime' parameter should fetch the last day's data.
- The file has to be robust in terms of logging and try-except handling (database connections, etc.).
clustering_categorical_data
Discover segments of sessions that differ from each other in their navigational patterns before a product is added to the basket. You are free to differentiate your segments based on the category id or domain name of the products if you feel it is necessary. Dimension reduction is also applied.
construct_sentence_with_string
Tests whether a given sentence can be constructed from the available strings.
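The description does not say how strings may be combined, so the following is only a sketch under stated assumptions: the sentence is split on whitespace, and each available string can be used at most once. The function name `can_construct` is hypothetical and not taken from the repository.

```python
from collections import Counter


def can_construct(sentence: str, available: list[str]) -> bool:
    """Check whether every word of the sentence is covered by the available strings.

    Assumptions: words are separated by whitespace, matching is exact,
    and each available string may be used at most once.
    """
    needed = Counter(sentence.split())
    have = Counter(available)
    # Every word must be available at least as often as the sentence needs it.
    return all(have[word] >= count for word, count in needed.items())
```

Under a different reading of the task, for example if strings may be reused or concatenated into words, the check would need a different approach.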
credit_fraud_catboost
A CatBoost model is applied to an imbalanced data set.
data_analysis
In this notebook, I applied statistical methods for imbalanced data analysis. It starts with the basics: a null check, data description, and handling of missing values. The numerical columns are right-skewed. Shapiro-Wilk and Anderson-Darling tests are applied to show that the data is not normally distributed. Outlier detection with the IQR (interquartile range) is applied to the numerical columns. A chi-square test is applied to the categorical columns to test whether their distributions differ across the target column. Correlation analysis for the imbalanced data set is carried out using undersampling methods.
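The outlier step can be illustrated with the common interquartile-range rule. This is a minimal sketch, not the notebook's code: the function name `iqr_outliers`, the fence multiplier of 1.5, and the use of the standard library's `statistics.quantiles` are all assumptions.

```python
import statistics


def iqr_outliers(values: list[float], k: float = 1.5) -> list[float]:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR].

    Assumption: k = 1.5 is the conventional fence multiplier; the notebook
    may use a different threshold or quantile method.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]
```

For example, in `[10, 12, 11, 13, 12, 11, 10, 95]` only the value 95 falls outside the fences.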
linkfire_data_analysis
Our goal is to understand this traffic better, in particular the volume and distribution of events, and to develop ideas for increasing the links' click rates.
navigation_pattern_estimation
Come up with a prescriptive model that is able to give directions on how to maximize the “Purchase Completed” probability of a session. For example: at which state of a session what kind of directions may be given to customers, which patterns contribute most to the “Purchase Completed” probability, etc.
python_hive_connection
Writes a pandas DataFrame to a Hive database using the PyHive library. Kerberos authentication is used to reach the cluster.
python_hive_sqlalchemy_connection
I will show how to connect to a Kerberized Hadoop cluster using the SQLAlchemy library. A connection engine is created and used to write a DataFrame to the database.
e181337's Repositories
e181337/data_enhancement
Data Quality: How would you improve the data quality of this data set, and what are your main conclusions about its quality? What interventions did you make on the data set before analysing it further? What did you learn?
e181337/e181337
Config files for my GitHub profile.
e181337/LSTM_binary_classification
Example LSTM structure for binary classification.
e181337/top_seller_class
Write a Python class using pandas that finds and prints, for a given date range:

- the top-selling n products (product name & quantity)
- the top-selling n stores (store name & quantity)
- the top-selling n brands (brand & quantity)
- the top-selling n cities (city & quantity)