ljinstat

France

Pinned Repositories

Adtech
Adtech is short for advertising technology. Statistical methods and machine learning are now widely used to predict how users interact with advertisements. This project is a private Kaggle competition for the Adtech course of the Master Data Science program at Université de Paris Saclay. The goal of the competition is to predict how long users will watch a video advertisement.
Language: Jupyter Notebook
Algorithmic-Trading
Language: Jupyter Notebook
chat-classification
Neural classifiers for chat classification problems.
Language: Python
coreference_resolution
Coreference resolution is a practical and challenging NLP task. It aims to find all words or phrases that refer to the same real-world entity. One of the current state-of-the-art results comes from End-to-End Coreference Resolution (Lee et al., 2017). This model is already well implemented in the Python library AllenNLP. However, some features, such as speaker and ELMo embeddings, are not included in the library. This repository provides complementary implementations for the coreference model in AllenNLP.
Crypto-token-analyses
Cryptocurrencies and crypto tokens have been widely discussed since the birth of Bitcoin in 2009. Unlike traditional currencies and commodities, they matter not only financially but also for their promising technical uses and their potential influence on the global economy. Because of their multidisciplinary origins and complicated characteristics, analyzing a cryptocurrency or a crypto token is a challenging task. I would like to offer a mixed perspective for understanding the potential of a cryptocurrency or a crypto token.
Emergency-visits-analyses
Language: Python
Fake_News_Prediction
Fake news is a type of yellow journalism or propaganda that consists of deliberate misinformation or hoaxes spread via traditional print and broadcast news media or online social media. --- Wikipedia. The data come from William Yang Wang, "Liar, Liar Pants on Fire": A New Benchmark Dataset for Fake News Detection. The dataset contains 12.8K short statements extracted from PolitiFact.com, which provides a detailed analysis report and links to source documents for each case. The statements are labeled with 6 classes, which makes this a multi-class classification problem.
Language: Jupyter Notebook
Optimization-for-Data-Science
Optimization for Data Science is a course in the Master Data Science program at Université de Paris Saclay. This repository contains three assignments and one final project.
Language: Jupyter Notebook
Predicting_Repeated_Buyers_Double11
Merchants sometimes run big promotions (e.g., discounts or cash coupons) on particular dates (e.g., Boxing Day sales, "Black Friday", or "Double 11" (Nov 11th)) in order to attract a large number of new buyers. Unfortunately, many of the attracted buyers are one-time deal hunters, and these promotions may have little long-lasting impact on sales. What's more, Tmall.com, the creator of the Chinese shopping carnival "Double 11", is threatened by other e-commerce companies such as Jingdong and Suning, which results in an increasingly high customer churn rate. As more and more customers take part in this shopping festival and more and more competitors appear in the market, Tmall.com has to reinforce user loyalty to avoid losing customers. It is well known that in the field of online advertising, customer targeting is extremely challenging, especially for fresh buyers. However, with the long-term user behavior log accumulated by Tmall.com, we may be able to solve this problem using machine learning models.
Language: Jupyter Notebook
Structured_Data_Random_Features_for_Large-Scale_Kernel_Machines
Kernel machines such as the Support Vector Machine are widely used to solve machine learning problems, since they can approximate any function or decision boundary arbitrarily well given enough training data. However, methods that operate on the kernel matrix (Gram matrix) of the data scale poorly with the size of the training set: the matrix has one entry per pair of samples, so computation and storage grow at least quadratically with the number of samples, and training takes a long time on large datasets. Specialized algorithms for linear Support Vector Machines, by contrast, run much faster when the dimensionality of the data is small, because they operate on the covariance matrix rather than the kernel matrix of the training data. The paper we have chosen proposes a way to combine the advantages of the linear and nonlinear approaches. The method transforms the training and evaluation of any kernel machine into the corresponding operations of a linear machine by mapping the input data to a randomized low-dimensional feature space. The randomized features are designed so that the inner products of the transformed data are approximately equal to those in the feature space of a user-specified shift-invariant kernel. The method gives results competitive with state-of-the-art kernel-based classification and regression algorithms, with similar or even better test error, and it avoids computing the full kernel matrix on large training sets.
Language: Jupyter Notebook
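The random-feature idea described above can be illustrated with a small sketch. This is not the repository's code: it assumes the Gaussian (RBF) kernel k(x, y) = exp(-gamma * ||x - y||^2) as the shift-invariant kernel, and the names `make_random_features`, `rbf_kernel`, `dot`, and the parameter `gamma` are hypothetical, chosen only for illustration. It uses the standard library only, so it favors readability over speed.

```python
import math
import random


def make_random_features(dim, n_features, gamma, seed=0):
    """Build a random cosine feature map for k(x, y) = exp(-gamma * ||x - y||^2)."""
    rng = random.Random(seed)
    # For this kernel, the random frequencies are Gaussian with std sqrt(2 * gamma).
    sigma = math.sqrt(2.0 * gamma)
    weights = [[rng.gauss(0.0, sigma) for _ in range(dim)]
               for _ in range(n_features)]
    # Random phase offsets, uniform on [0, 2*pi).
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def transform(x):
        # z(x)_i = sqrt(2 / D) * cos(w_i . x + b_i)
        return [scale * math.cos(sum(w_j * x_j for w_j, x_j in zip(w, x)) + b)
                for w, b in zip(weights, offsets)]

    return transform


def rbf_kernel(x, y, gamma):
    """Exact kernel value, used here only to check the approximation."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


if __name__ == "__main__":
    gamma = 0.5
    x = [0.5, -1.0, 0.2]
    y = [0.1, -0.7, 0.4]
    transform = make_random_features(dim=3, n_features=4000, gamma=gamma)
    # The inner product of the random features approximates the kernel value.
    print("exact kernel :", rbf_kernel(x, y, gamma))
    print("approximation:", dot(transform(x), transform(y)))
```

Once every sample is mapped through `transform`, a linear model can be trained on the resulting vectors, so the cost depends on the number of random features rather than on a full kernel matrix over all pairs of samples.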

ljinstat's Repositories

ljinstat/Structured_Data_Random_Features_for_Large-Scale_Kernel_Machines
ljinstat/Fake_News_Prediction
ljinstat/Predicting_Repeated_Buyers_Double11
ljinstat/Algorithmic-Trading
ljinstat/coreference_resolution
ljinstat/Optimization-for-Data-Science
ljinstat/Adtech
ljinstat/chat-classification
ljinstat/Crypto-token-analyses
ljinstat/Emergency-visits-analyses
ljinstat/gitbook_algo
My Gitbook for data structures and algorithms
ljinstat/prediction-electricity-consummation
Language: R
