- Master's Program in Data Science, University of Mannheim, Mannheim, Germany. (website)
- Master's Program in Statistics, National Chengchi University, Taipei, Taiwan. (website)
- Programming: Python (numpy, scipy, pandas, statsmodels, matplotlib, seaborn, scikit-learn, pytorch, nltk), R (dplyr, ggplot2, shiny), SQL, Excel VBA, Linux Bash
- Database: relational databases (PostgreSQL, MySQL, MS Access, etc.)
- Visualization: Tableau, Power BI
- data cleaning
- feature extraction
- modeling
- validation
- classification
- clustering
- k-nearest neighbor (KNN)
- Bayesian network
- support vector machine (SVM)
- link (Please open the file with "Google Colaboratory" and run it as a Jupyter Notebook.)
(Click on the project name for reports, presentations, and analyses.)
- From Allianz Global Benefit in Germany
- Apply latent semantic analysis to recover 83% of the dataset
- Due to a large number of missing disease codes, further analyses were not feasible. Using the MIMIC database from the MIT Lab, latent semantic analysis (LSA) was applied to recover the missing values; with this approach, the amount of available data increased by 83% (see the sketch after this list).
- Apply text similarity algorithm to integrate 95% of data
- Clients' demographic information and medical records were highly inconsistent, i.e., the same meaning was expressed with different words or spellings, so manual correction was in high demand. Text similarity algorithms were applied, after which 95% of the data were integrated.
- Script in VBA and Python to automate 100% of format transformations
- The reporting process involved extracting data from the database, transforming it into the reporting format, and uploading it to the reporting API. To improve efficiency, scripting in VBA and Python was applied to fully automate the format transformations.
- Build international databases on healthcare-related topics to strengthen the capacity for client analysis and marketing strategy
- To strengthen the capacity for client analysis and marketing strategy, databases were built on topics including health expenditure, claim amounts, diseases, and industries, according to multiple industrial classification standards and public health databases, i.e., WHO, OECD, and MIMIC from the MIT Lab.
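- A minimal Python sketch of the underlying idea, i.e. recovering missing entries of a patient-by-code matrix through a low-rank (LSA-style) SVD reconstruction; the toy matrix, rank, and iteration count are illustrative, not the production setup:

```python
import numpy as np

def svd_impute(X, rank=3, n_iter=50):
    """Fill missing entries (NaN) of X by iterating a low-rank SVD
    reconstruction -- the same low-rank idea that underlies LSA."""
    X = X.astype(float).copy()
    missing = np.isnan(X)
    # start by filling missing cells with column means
    col_means = np.nanmean(X, axis=0)
    X[missing] = np.take(col_means, np.where(missing)[1])
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        s[rank:] = 0.0                      # keep only the leading concepts
        X_low = (U * s) @ Vt                # rank-k reconstruction
        X[missing] = X_low[missing]         # overwrite only the missing cells
    return X

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # toy patient-by-code indicator matrix with ~20% of entries missing
    M = (rng.random((30, 12)) > 0.7).astype(float)
    M_obs = M.copy()
    M_obs[rng.random(M.shape) < 0.2] = np.nan
    print(svd_impute(M_obs, rank=3).round(2))
```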
- Discount Rate Generator for IFRS 17 and Insurance Capital Standards (ICS 2.0) (Python) (Industrial Implementation)
- IFRS 17 is a principles-based standard that requires significant interpretation before it can be implemented in practice. A key consideration is the discount rate to be used in measuring liabilities, among other related financial assumptions. As for the Insurance Capital Standards (ICS 2.0), liability portfolios are separated into three “buckets” of decreasing degrees of asset-liability cash-flow matching and consequent recognition of spread.
- I developed an application that helps insurance companies address IFRS / ICS requirements by modeling and implementing the risk-free interest rate with Python (see the sketch below).
- How to use it?
- Download the files “sw_col.xls” and “generator.exe” using the LINK, and have them in the same folder.
- Execute the file “generator.exe” and follow the instructions in each step.
- When asked "Please enter the file name for Smith-Wilson risk free rate and then press ENTER:", enter “sw_col.xls” and press ENTER.
- IFRS 17, Insurance Capital Standards (ICS 2.0), Model Building, Discount Rate, GUI
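- A minimal Python sketch of the Smith-Wilson extrapolation that such a generator builds on, assuming zero-coupon spot rates as input; the sample rates, UFR, and alpha below are illustrative, not the values shipped with the tool:

```python
import numpy as np

def smith_wilson(rates, maturities, ufr, alpha, t_out):
    """Extrapolate zero-coupon spot rates with the Smith-Wilson method.

    rates      : observed annually compounded spot rates at `maturities`
    ufr        : ultimate forward rate (annually compounded)
    alpha      : convergence speed parameter
    t_out      : maturities (years) at which to return extrapolated rates
    """
    u = np.asarray(maturities, dtype=float)
    m = (1.0 + np.asarray(rates)) ** (-u)        # observed zero-coupon prices
    omega = np.log(1.0 + ufr)

    def W(t, s):
        # Smith-Wilson kernel, evaluated on the grid t x s
        t = np.asarray(t, dtype=float)[:, None]
        s = np.asarray(s, dtype=float)[None, :]
        lo, hi = np.minimum(t, s), np.maximum(t, s)
        return np.exp(-omega * (t + s)) * (
            alpha * lo
            - 0.5 * np.exp(-alpha * hi) * (np.exp(alpha * lo) - np.exp(-alpha * lo))
        )

    # calibrate the zeta weights so that observed prices are reproduced
    zeta = np.linalg.solve(W(u, u), m - np.exp(-omega * u))
    t = np.asarray(t_out, dtype=float)
    p = np.exp(-omega * t) + W(t, u) @ zeta      # extrapolated prices
    return p ** (-1.0 / t) - 1.0                 # back to spot rates

if __name__ == "__main__":
    # illustrative liquid market rates for 1..10 years
    obs = [0.010, 0.012, 0.014, 0.016, 0.017, 0.018, 0.019, 0.020, 0.021, 0.021]
    out = smith_wilson(obs, range(1, 11), ufr=0.036, alpha=0.13,
                       t_out=range(1, 61))
    print(np.round(out, 4))
```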
- Asset and Liability Modeling for Solvency Test (MATLAB) (Industrial Implementation)
- Build models for interest rates, exchange rates, stock prices, and rent to project asset and liability performance, and forecast and analyze future performance (see the scenario sketch below).
- Follow up with analyses of products offering different types of guaranteed benefits and of the various items of the solvency tests.
- Solvency Test, Model Building, Forecast, GUI
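- The original models were implemented in MATLAB; the Python sketch below only illustrates the kind of scenario generation involved, with a Vasicek short rate and a geometric Brownian motion equity index under illustrative parameters:

```python
import numpy as np

def simulate_scenarios(n_paths=1000, n_years=30, dt=1/12, seed=42):
    """Monte Carlo scenarios for a short rate (Vasicek) and an equity index
    (geometric Brownian motion), as used in asset/liability projections."""
    rng = np.random.default_rng(seed)
    n_steps = int(round(n_years / dt))

    # Vasicek: dr = kappa*(theta - r)*dt + sigma_r*dW
    kappa, theta, sigma_r, r0 = 0.15, 0.02, 0.01, 0.015
    # GBM:     dS = mu*S*dt + sigma_s*S*dW
    mu, sigma_s, s0 = 0.05, 0.18, 100.0

    r = np.full((n_paths, n_steps + 1), r0)
    s = np.full((n_paths, n_steps + 1), s0)
    for k in range(n_steps):
        z_r, z_s = rng.standard_normal((2, n_paths))
        r[:, k + 1] = (r[:, k] + kappa * (theta - r[:, k]) * dt
                       + sigma_r * np.sqrt(dt) * z_r)
        s[:, k + 1] = s[:, k] * np.exp((mu - 0.5 * sigma_s**2) * dt
                                       + sigma_s * np.sqrt(dt) * z_s)
    return r, s

rates, stocks = simulate_scenarios()
print("mean short rate after 30y:", rates[:, -1].mean().round(4))
print("median equity level after 30y:", np.median(stocks[:, -1]).round(2))
```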
- SEM-based Customer Contentment Analysis on Mass Merchandiser (SAS)
- Mass merchandisers need to provide customers with correct information about their products and services. It is therefore necessary to understand what customers really need, to discover the best way to satisfy and retain them, and to follow consumer sentiment, which can provide early warnings about market conduct and performance. This study seeks to understand how to retain customers and to identify their levels of contentment with mass merchandisers. The research also focuses on helping managers assess and identify the major strengths among the critical success factors of merchandisers, so that a company can sustain the success it has achieved in the market.
- In the present work, we study the causal effects of buildings and service on satisfaction, contentment, and impression for Costco, Carrefour, and a.mart, using reliability analysis, confirmatory factor analysis (CFA), and structural equation modeling (SEM).
- Forecasting of Time Series based on VAR and Maximum Cross-correlation (R) (Master Thesis - Statistics)
- Master Thesis - Forecasting of Time Series based on Vector Autoregression Model and Maximum Cross-correlation
- The selection of methods plays an important role in prediction based on time-series data. In much of the literature, the vector autoregression (VAR) model has been a popular choice for prediction for many years. The method has some disadvantages: (i) the model selection procedure can be quite complex; (ii) the model assumptions are difficult to validate; (iii) it requires a large amount of data for model building. The objective of this thesis is to provide a new multivariate time-series prediction method based on the concept of maximum cross-correlation, which merely requires the assumption of “fair linearity” between the two time series under investigation. The thesis also compares the proposed method to the VAR model, which is widely used in time series analysis, with the expectation of providing a new prediction method for practical data analysis. We use data from Taiwan equity funds and a portfolio of those funds to compare the prediction performance of the two methods. Using the mean prediction squared error (MPSE) as the assessment criterion, the prediction method based on maximum cross-correlation performs best over all prediction periods (see the sketch below).
- Granger causality, Vector Autoregression model (VAR model), Cross-correlation, Wald test, mean prediction squared error (MPSE)
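- A simplified Python sketch of the core idea: select the lag at which the cross-correlation between a predictor series and the target is strongest, then regress the target on the predictor at that lag; the simulated series and lag range are illustrative:

```python
import numpy as np

def max_crosscorr_forecast(x, y, max_lag=12):
    """Forecast y one step ahead from x, using the lag at which the
    cross-correlation between x and y is strongest (in absolute value)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    best_lag, best_corr = 1, 0.0
    for lag in range(1, max_lag + 1):
        c = np.corrcoef(x[:-lag], y[lag:])[0, 1]   # corr(x_{t-lag}, y_t)
        if abs(c) > abs(best_corr):
            best_lag, best_corr = lag, c
    # simple linear regression y_t = a + b * x_{t-lag} at the selected lag
    b, a = np.polyfit(x[:-best_lag], y[best_lag:], deg=1)
    y_hat = a + b * x[-best_lag]                   # one-step-ahead forecast
    return y_hat, best_lag, best_corr

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = np.cumsum(rng.standard_normal(300))            # random-walk "fund" series
    y = 0.8 * np.roll(x, 3) + rng.standard_normal(300)  # y lags x by 3 steps
    y_hat, lag, corr = max_crosscorr_forecast(x[10:], y[10:])
    print(f"selected lag={lag}, corr={corr:.2f}, forecast={y_hat:.2f}")
```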
- Enhancement of LibKGE (Python) (Master Thesis - Data Science)
- Master Thesis - Revisiting Ensembles for Knowledge Graph Embeddings
- We study ensembles of KGE models with better-trained baselines. Additionally, fine-tuning and joint learning were experimented with on the ensembles. The study shows that, although ensembles generally outperformed single KGE models with better-trained baselines, fine-tuning yielded only minor gains compared to the established ensembles. The joint-learning approaches also led to results inferior to the established ensembles under the same training specification.
- The open-source LibKGE framework is a PyTorch-based library for efficient training, evaluation, and hyperparameter optimization of knowledge graph embeddings (KGE). Its key goal is to foster reproducible research into (as well as meaningful comparisons between) KGE models and training methods. As the authors argue in their ICLR 2020 paper, the choice of training strategy and hyperparameters is highly influential on model performance, often more so than the model class itself.
- I implemented new functions in LibKGE that allow the extended package to run training and validation not only on a single KGE model but on multiple models in one process, making joint training and alternating training of models possible (a simplified illustration of score-based ensembling follows below).
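- The sketch below does not use LibKGE's API; it only illustrates score-based ensembling of two link-prediction models on hypothetical score matrices, evaluated with mean reciprocal rank (MRR):

```python
import numpy as np

def mrr(scores, true_idx):
    """Mean reciprocal rank of the true entities given a (queries x entities)
    score matrix (higher score = better)."""
    order = np.argsort(-scores, axis=1)
    ranks = np.argwhere(order == true_idx[:, None])[:, 1] + 1
    return float(np.mean(1.0 / ranks))

# Hypothetical link-prediction scores from two separately trained KGE models
rng = np.random.default_rng(0)
true_idx = rng.integers(0, 500, size=200)          # gold tail entity per query
scores_a = rng.standard_normal((200, 500))
scores_b = rng.standard_normal((200, 500))
# make both models somewhat informative about the gold entity
scores_a[np.arange(200), true_idx] += 2.0
scores_b[np.arange(200), true_idx] += 2.0

def normalise(s):
    """Min-max normalise scores per query before averaging."""
    lo, hi = s.min(axis=1, keepdims=True), s.max(axis=1, keepdims=True)
    return (s - lo) / (hi - lo)

ensemble = 0.5 * normalise(scores_a) + 0.5 * normalise(scores_b)
print("MRR model A: ", round(mrr(scores_a, true_idx), 3))
print("MRR model B: ", round(mrr(scores_b, true_idx), 3))
print("MRR ensemble:", round(mrr(ensemble, true_idx), 3))
```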
- Profitability - Statistical Analysis with Excel VBA (VBA)
- Our team of actuaries plays a vital role in our organization. Our daily work involves actuarial analysis, including modeling, prediction, profit and surplus analysis, risk and uncertainty, and validation. Excel has been applied to parts of these tasks along with VBA. Here are some examples from my work.
- Click on the images to see the originals. From left to right, top to bottom: asset sheet - assumption - profit analysis - sensitivity test.
- Integrating Web Data on Video Games and Companies (Python, Java, XML / MapForce)
- In this project, we worked with the Python and Java libraries BeautifulSoup, Selenium WebDriver, and Jsoup, and applied data-analytical skills related to data translation, identity resolution, and data fusion.
- We focus on building an integrated database of video games and video game developers that is informative for video game players and professionals working in the industry alike. Our combined data can offer interesting new insights that assist business decision making and drive video game businesses to success. Simply by exploring and mining this data, one can gain a better understanding of current video game trends.
- For example, our integrated data can answer manifold questions such as ‘Which game platform currently generates the most revenue?’, ‘Which genre types are most popular among users?’, ‘How does game experts’ judgment affect market sales?’, or ‘How does the revenue of games differ across worldwide regions?’ (see the scraping sketch below).
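- The project's scrapers targeted specific sites; the sketch below only shows the requests + BeautifulSoup pattern on a hypothetical page layout (URL and table structure are placeholders):

```python
import requests
from bs4 import BeautifulSoup

def scrape_game_table(url):
    """Scrape a hypothetical HTML table of video games into a list of dicts.
    Assumes rows like: <tr><td>title</td><td>platform</td><td>sales</td></tr>."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    games = []
    for row in soup.select("table.games tr")[1:]:      # skip the header row
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) == 3:
            title, platform, sales = cells
            games.append({"title": title,
                          "platform": platform,
                          "global_sales": float(sales.rstrip("M"))})
    return games

if __name__ == "__main__":
    # URL and table layout are placeholders, not the sites used in the project
    for game in scrape_game_table("https://example.com/games")[:5]:
        print(game)
```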
- AI-Based Insurance Broker (Java, JSON, JavaScript, Java AWT, NetBeans, MongoDB, React, Flask)
- In this project, data knowledge related to ontology engineering and multiple-criteria decision analysis is applied.
- We want to shed light on the current technological state of the insurance broker industry and how AI may transform it. Furthermore, we provide a recommender system for dental insurance using the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS), a popular multiple-criteria decision-making (MCDM) method (a scoring sketch follows below). In addition, we design an architectural model which may serve as an example of how to implement an insurance recommender system as a web application with state-of-the-art technology.
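- The production recommender is built in Java/JavaScript; the Python sketch below illustrates the TOPSIS scoring itself on made-up dental-plan attributes and weights:

```python
import numpy as np

def topsis(matrix, weights, benefit):
    """Rank alternatives with TOPSIS.

    matrix  : (alternatives x criteria) decision matrix
    weights : criterion weights, summing to 1
    benefit : True for criteria to maximise, False for criteria to minimise
    """
    X = np.asarray(matrix, dtype=float)
    w = np.asarray(weights, dtype=float)
    V = X / np.linalg.norm(X, axis=0) * w            # weighted normalised matrix
    ideal = np.where(benefit, V.max(axis=0), V.min(axis=0))
    anti  = np.where(benefit, V.min(axis=0), V.max(axis=0))
    d_pos = np.linalg.norm(V - ideal, axis=1)
    d_neg = np.linalg.norm(V - anti, axis=1)
    return d_neg / (d_pos + d_neg)                   # closeness to the ideal

# Illustrative dental plans scored on coverage %, annual limit, and premium
plans   = ["PlanA", "PlanB", "PlanC"]
matrix  = [[80, 1000, 35],
           [90, 1500, 55],
           [70,  800, 25]]
weights = [0.4, 0.3, 0.3]
benefit = [True, True, False]                        # premium should be low
for plan, score in sorted(zip(plans, topsis(matrix, weights, benefit)),
                          key=lambda t: -t[1]):
    print(f"{plan}: {score:.3f}")
```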
- Data Infrastructure (SQL / PostgreSQL)
- Working in the data team of the Risk Management department, we deal with financial products and their transactions. Our work is to maintain and improve the data infrastructure of the settlement process, which involves data pipelines, data checks, data modeling, and the data warehouse. Below are examples of a data pipeline and a data model from our work (a simple data-check sketch also follows).
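- A sketch of the kind of data check involved, run from Python with psycopg2; the connection string, table names, and columns are placeholders, not the actual settlement schema:

```python
import psycopg2

# Table and column names below are placeholders for the settlement schema
CHECKS = {
    "orphaned_transactions": """
        SELECT COUNT(*) FROM transactions t
        LEFT JOIN products p ON p.product_id = t.product_id
        WHERE p.product_id IS NULL
    """,
    "missing_settlement_date": """
        SELECT COUNT(*) FROM transactions
        WHERE settlement_date IS NULL
    """,
}

def run_data_checks(dsn):
    """Run simple data-quality checks and report the number of offending rows."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        for name, sql in CHECKS.items():
            cur.execute(sql)
            bad_rows = cur.fetchone()[0]
            status = "OK" if bad_rows == 0 else f"{bad_rows} offending rows"
            print(f"{name}: {status}")

if __name__ == "__main__":
    run_data_checks("dbname=settlement user=analyst host=localhost")
```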
- Recipe Finder (Java, RDFS, SPARQL, SQL, JavaScript / Apache Jena)
- "Throw-away society" is a term often used to describe today's society. The German center for nutrition estimates that every year eleven million tonnes of food is thrown away in Germany alone. For the U.S., the figures are even worse: 150,000 tonnes of food is thrown away every day.
- The biggest reason why ingredients are thrown away is that the quantity initially bought exceeded the actually required quantity for a given period of time. This might be either because of a special promotion, the product not being available in smaller units or just due to a lack of planning for the purchase. One way to reduce some of that waste would be to provide a way to use leftover food.
- In our project, we achieved just that! We built an API using Apache Jena that provides households with an easy way to find recipes incorporating food they need to either consume today or throw away tomorrow. We tapped into the power of the Semantic Web and developed an application which allows its users to browse recipes based on the leftovers they might have in their kitchen, ultimately reducing food waste.
- Apache Jena, a Java framework, is a well-known Semantic Web programming framework. In this project, data knowledge related to Linked Open Data and ontology engineering is applied (a query sketch follows below).
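- The project uses Apache Jena (Java); to stay consistent with the Python examples, the sketch below shows the same idea with rdflib and a made-up recipe vocabulary:

```python
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/recipes#")

# Build a tiny in-memory recipe graph (the project's real data lived in Jena)
g = Graph()
for recipe, ingredients in {
    "TomatoSoup": ["tomato", "onion", "garlic"],
    "Carbonara":  ["pasta", "egg", "bacon"],
    "Shakshuka":  ["tomato", "egg", "pepper"],
}.items():
    r = EX[recipe]
    g.add((r, RDF.type, EX.Recipe))
    for ing in ingredients:
        g.add((r, EX.hasIngredient, Literal(ing)))

# SPARQL: which recipes use up a given leftover ingredient?
query = """
    PREFIX ex: <http://example.org/recipes#>
    SELECT ?recipe WHERE {
        ?recipe a ex:Recipe ;
                ex:hasIngredient ?ing .
        FILTER (?ing = "tomato")
    }
"""
for row in g.query(query):
    print(row.recipe)
```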
- Information Retrieval on NLTK corpora (Python)
- In this project, we discuss the implementation of a Latent Semantic Indexing-based information retrieval model and evaluate its performance against the Vector Space Model on a collection of 18,828 documents.
- We implemented an LSI-based retrieval system, expecting it to outperform a traditional model like VSM due to its ability to uncover latent structures in the document collection. Unfortunately, our results largely did not agree with our expectations, which prompted a deeper analysis of the collection and the evaluation methodology. We found that LSI depends heavily on the structure of the collection it is applied to, and it would be interesting to repeat the experiment on a different collection with existing relevance judgments (a minimal pipeline sketch follows below).
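- A minimal sketch of the pipeline, with a few toy documents standing in for the 18,828-document collection:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the graphics card renders 3d games",
    "hockey playoffs and baseball scores",
    "new gpu drivers improve game rendering",
    "the pitcher threw a no hitter last night",
]
query = ["which gpu is best for gaming"]

tfidf = TfidfVectorizer()
D = tfidf.fit_transform(docs)                 # VSM: TF-IDF document vectors
Q = tfidf.transform(query)

svd = TruncatedSVD(n_components=2, random_state=0)
D_lsi = svd.fit_transform(D)                  # LSI: project into latent space
Q_lsi = svd.transform(Q)

print("VSM similarities:", cosine_similarity(Q, D).round(2))
print("LSI similarities:", cosine_similarity(Q_lsi, D_lsi).round(2))
```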
- Automated ICD Coding (Python)
- To reduce coding errors and cost, this is my attempt at applying Latent Semantic Indexing to build an ICD coding machine that automatically and accurately translates free-text diagnosis descriptions into ICD codes.
- Company Name Matching (Python)
- Some input data are produced by handwriting and later scanning, or by typing, which can cause data inconsistencies and wrong inputs. This leads to substantial problems for end users, such as product managers and analysts. We want to avoid operational inefficiency from manual correction, as well as misleading statistics or analyses further along the reporting process. For company names, besides typos, different name suffixes, such as GmbH or Ltd, may cause issues of distinguishability.
- This is my attempt at applying different types of identity resolution approaches, such as Jaccard, Jaro-Winkler, Hamming, Levenshtein, and Ratcliff-Obershelp, to analyse the company name dataset. We also apply blocking with country as the criterion to avoid unnecessary comparisons and reduce the quadratic runtime complexity. The goal is to choose an approach that 1) finds the most similar name with a similarity above a threshold, and 2) ensures the most similar name leads the second most similar by a certain margin, so as to perform precise classification (see the sketch below).
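- A stdlib-only Python sketch of the approach: token Jaccard and Ratcliff-Obershelp (via difflib) similarities, blocking by country, and the rule that the best match must both clear a threshold and lead the runner-up by a margin; thresholds and sample names are illustrative:

```python
from difflib import SequenceMatcher
from itertools import groupby

def jaccard(a, b):
    """Jaccard similarity on lower-cased token sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def ratcliff(a, b):
    """Ratcliff-Obershelp similarity via difflib."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def best_match(name, candidates, sim=ratcliff, threshold=0.85, margin=0.05):
    """Return the best candidate if it is similar enough AND clearly ahead
    of the second-best candidate; otherwise return None."""
    scored = sorted(((sim(name, c), c) for c in candidates), reverse=True)
    if not scored or scored[0][0] < threshold:
        return None
    if len(scored) > 1 and scored[0][0] - scored[1][0] < margin:
        return None                           # ambiguous: needs manual review
    return scored[0][1]

# Blocking by country keeps the number of comparisons roughly linear
reference = [
    ("DE", "Siemens AG"), ("DE", "Allianz SE"),
    ("GB", "Siemens Ltd"), ("GB", "Unilever PLC"),
]
dirty = [("DE", "Siemens A.G."), ("GB", "Unilver PLC")]

blocks = {k: [n for _, n in grp]
          for k, grp in groupby(sorted(reference), key=lambda t: t[0])}
for country, name in dirty:
    print(name, "->", best_match(name, blocks.get(country, [])))
```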
- Data & Matrix (R)
- In these tasks, data-analytical skills related to matrix completion, non-negative matrix factorization, and singular value decomposition are applied (a small factorization sketch follows below).
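- The coursework was done in R; a small Python equivalent with scikit-learn's NMF, factorising an illustrative non-negative matrix into two latent factors:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
# Non-negative toy matrix, e.g. users x items ratings (values 0-5)
X = rng.integers(0, 6, size=(8, 6)).astype(float)

model = NMF(n_components=2, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(X)        # (8 x 2) basis weights
H = model.components_             # (2 x 6) parts / latent factors

X_hat = W @ H
print("reconstruction error:", round(np.linalg.norm(X - X_hat), 3))
```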