Pinned Repositories
Adtech
Adtech is short for advertising technology. Statistical methods and machine learning are now widely used to predict how users interact with advertisements. This project is a private Kaggle competition for the Adtech course of the Master Data Science, Université de Paris Saclay. The goal of the competition is to predict how long users will watch a video advertisement.
Algorithmic-Trading
chat-classification
Neural classifiers for chat classification problems.
coreference_resolution
Coreference resolution is a practical and challenging NLP topic. It aims to find all words or phrases that refer to the same real-world entity. One of the current state-of-the-art results comes from End-to-End Coreference Resolution (Lee et al., 2017). This model is already well implemented in the Python library AllenNLP. However, some features, such as speaker information and ELMo embeddings, are not covered by the library. This repository will provide complementary implementations for the coreference model in AllenNLP.
Crypto-token-analyses
Cryptocurrencies and crypto tokens have been widely discussed since the birth of Bitcoin in 2009. Unlike traditional currencies and commodities, cryptocurrencies and crypto tokens matter not only financially but also for their promising technical uses and their potential influence on the global economy. Because of their multidisciplinary origins and complicated characteristics, analyzing a cryptocurrency or a crypto token is a challenging task. I would like to offer a mixed perspective for understanding the potential of a cryptocurrency or a crypto token.
Emergency-visits-analyses
Fake_News_Prediction
"Fake news is a type of yellow journalism or propaganda that consists of deliberate misinformation or hoaxes spread via traditional print and broadcast news media or online social media" (Wikipedia). The data come from William Yang Wang, "Liar, Liar Pants on Fire": A New Benchmark Dataset for Fake News Detection. The 12.8K short texts are extracted from PolitiFact.com, which provides a detailed analysis report and links to source documents for each case. The data are labeled with 6 classes, which makes this a multi-class classification problem.
Optimization-for-Data-Science
Optimization for Data Science is a course of the Master Data Science, Université de Paris Saclay. This repository contains three assignments and one final project.
Predicting_Repeated_Buyers_Double11
Merchants sometimes run big promotions (e.g., discounts or cash coupons) on particular dates (e.g., Boxing Day sales, "Black Friday", or "Double 11" (Nov 11th)) in order to attract a large number of new buyers. Unfortunately, many of the attracted buyers are one-time deal hunters, and these promotions may have little long-lasting impact on sales. Moreover, Tmall.com, the creator of the Chinese shopping carnival "Double 11", is threatened by other e-commerce companies such as Jingdong and Suning, which results in an increasingly high customer churn rate. As more customers take part in this shopping festival and more competitors appear in the market, Tmall.com has to reinforce user loyalty to avoid losing customers. It is well known that in online advertising, customer targeting is extremely challenging, especially for new buyers. However, with the long-term user behavior log accumulated by Tmall.com, we may be able to address this problem using machine learning models.
Structured_Data_Random_Features_for_Large-Scale_Kernel_Machines
Kernel machines such as the Support Vector Machine are widely used to solve machine learning problems, since they can approximate any function or decision boundary arbitrarily well given enough training data. However, methods that operate on the kernel matrix (Gram matrix) of the data scale poorly with the size of the training set: the kernel matrix has one entry per pair of samples, so its computation and storage costs grow at least quadratically with the number of samples, and training becomes slow or intractable on large datasets. Specialized algorithms for linear Support Vector Machines, by contrast, run much faster when the dimensionality of the data is small, because they operate on the covariance matrix rather than the kernel matrix of the training data. The paper we have chosen proposes a way to combine the advantages of the linear and nonlinear approaches. The method turns the training and evaluation of any kernel machine into the corresponding operations of a linear machine by mapping the input data to a randomized low-dimensional feature space. The random features are designed so that the inner products of the transformed data are approximately equal to those in the feature space of a user-specified shift-invariant kernel. The method gives results competitive with state-of-the-art kernel-based classification and regression algorithms. Moreover, random features remove the need to compute the kernel matrix on large training sets, and the resulting models reach similar or even better test error.
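The idea of approximating a shift-invariant kernel with randomized features can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the code from the repository: it assumes the Gaussian (RBF) kernel exp(-gamma * ||x - y||^2) as the shift-invariant kernel, and the names random_fourier_features, rbf_kernel, gamma and n_features are hypothetical choices made for this example.

```python
import numpy as np


def random_fourier_features(X, n_features, gamma, rng):
    """Map X of shape (n_samples, d) to Z so that Z @ Z.T approximates
    the RBF kernel exp(-gamma * ||x - y||^2). Assumed kernel, for illustration."""
    d = X.shape[1]
    # The Fourier transform of this RBF kernel is a Gaussian with
    # standard deviation sqrt(2 * gamma) in every dimension.
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))
    # Random phase shifts drawn uniformly from [0, 2*pi).
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)


def rbf_kernel(X, gamma):
    """Exact RBF kernel matrix, used here only to check the approximation."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
Z = random_fourier_features(X, n_features=5000, gamma=0.1, rng=rng)

# Largest entry-wise gap between the approximate and the exact kernel matrix.
max_abs_error = np.max(np.abs(Z @ Z.T - rbf_kernel(X, gamma=0.1)))
print(max_abs_error)
```

Once the data are mapped to Z, any linear model can be trained on Z directly, so the kernel matrix never has to be formed. The approximation error shrinks as n_features grows.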
ljinstat's Repositories
ljinstat/Structured_Data_Random_Features_for_Large-Scale_Kernel_Machines
ljinstat/Fake_News_Prediction
ljinstat/Predicting_Repeated_Buyers_Double11
ljinstat/Algorithmic-Trading
ljinstat/coreference_resolution
ljinstat/Optimization-for-Data-Science
ljinstat/Adtech
ljinstat/chat-classification
ljinstat/Crypto-token-analyses
ljinstat/Emergency-visits-analyses
ljinstat/gitbook_algo
My Gitbook for data structures and algorithms
ljinstat/prediction-electricity-consummation