introduction to data mining
description
data mining studies the algorithms and computational paradigms that allow computers to discover knowledge and make decisions automatically from large and complex datasets. in this course, we will explain the fundamental principles, technical details, and real-life applications of data mining techniques through lectures, case studies, and course projects. the core topics include data preprocessing, classification, cluster analysis, association analysis, anomaly detection, neural networks, model evaluation, and applications such as recommender systems. the learning goals are to think systematically about how data mining can solve analytical problems, to make better-informed decisions using real-world data, and to understand the data mining process, algorithm development, and system design as a pathway to a career as a data scientist.
schedule
introduction to data mining
august 21 2023
covers
- what is the challenge and why should we do data mining? what is data mining: definition, process, and examples. data mining tasks: classification, clustering, association rules, anomaly detection
data
august 25 2023
reading
chapter 2.1 types of data
chapter 2.2 data quality
covers
- continuation of the introduction - association rules, anomaly detection
data
august 28 2023
reading
chapter 2.3 data preprocessing
chapter 2.4 measures of similarity and dissimilarity
covers
- data preprocessing - aggregation, sampling, dimensionality reduction, feature selection, feature creation, discretization and binarization, attribute transformation
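a minimal sketch of three of the preprocessing steps listed above (attribute transformation, discretization, binarization), using only the python standard library. the function names and the sample values are made up for illustration and are not taken from the textbook or the course materials.

```python
def min_max_scale(values):
    """attribute transformation: rescale values linearly to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_bins(values, k):
    """discretization: map each value to one of k equal-width bin indices."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0 for _ in values]
    # the maximum value is clamped into the last bin (index k - 1)
    return [min(int((v - lo) / width), k - 1) for v in values]


def binarize(labels):
    """binarization: one 0/1 indicator column per distinct category."""
    categories = sorted(set(labels))
    return [[1 if x == c else 0 for c in categories] for x in labels]


ages = [18, 22, 35, 47, 60]  # made-up values
print(min_max_scale(ages))
print(equal_width_bins(ages, 3))
print(binarize(["red", "blue", "red"]))
```

equal-width binning is only one possible discretization; equal-frequency or supervised binning would need a different rule for choosing the bin edges.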
data
september 01 2023
reading
chapter 2.4 measures of similarity and dissimilarity
covers
- general distance and similarity - euclidean and minkowski distance. smc / jaccard and cosine similarity, pearson's correlation. information based measures - mutual information, kl-divergence. ranking distance. distance between sets of data points
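a few of the measures named above can be sketched in plain python. this is an illustrative sketch, not the course's reference code: the function names are assumptions, vectors are plain lists of equal length, and the kl-divergence version assumes q is non-zero wherever p is non-zero.

```python
import math


def minkowski(x, y, r):
    """minkowski distance; r=1 is manhattan, r=2 is euclidean."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)


def cosine_similarity(x, y):
    """cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def smc_and_jaccard(x, y):
    """simple matching coefficient and jaccard coefficient for 0/1 vectors."""
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    smc = (f11 + f00) / len(x)
    # jaccard ignores 0-0 matches; defined here as 0.0 if nothing else remains
    non_zero_pairs = len(x) - f00
    jaccard = f11 / non_zero_pairs if non_zero_pairs else 0.0
    return smc, jaccard


def kl_divergence(p, q):
    """kl-divergence D(p || q) in nats for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


print(minkowski([0, 0], [3, 4], 2))
print(smc_and_jaccard([1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                      [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]))
```

the binary example shows why smc and jaccard can disagree on sparse data: the two vectors share many 0-0 matches, which smc counts and jaccard ignores.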
decision tree classifier
september 04 2023
reading
chapter 3.1 basic concepts
chapter 3.2 general framework for classification
chapter 3.3 decision tree classifier
covers
- supervised learning setup - regression, classification. decision tree - goal, algorithm, split, criterion, node impurity (gini index)
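as a small illustration of the gini index as a split criterion, the sketch below computes the impurity of a node and the impurity reduction of a candidate split. the labels and the split are made up, and the helper names are assumptions chosen for this example.

```python
from collections import Counter


def gini(labels):
    """gini index of a node: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def weighted_gini(partitions):
    """impurity of a split: child gini values weighted by child size."""
    total = sum(len(p) for p in partitions)
    return sum(len(p) / total * gini(p) for p in partitions)


# made-up node with 5 "yes" and 5 "no" records, split into two children
parent = ["yes"] * 5 + ["no"] * 5
children = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]

gain = gini(parent) - weighted_gini(children)
print(gini(parent), weighted_gini(children), gain)
```

a decision tree learner would evaluate this reduction for every candidate split and keep the one with the largest value.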
decision tree classifier
september 08 2023
reading
chapter 3.4 olap and multidimensional data analysis
chapter 3.5 model selection
covers
- decision tree - node impurity (entropy, misclassification error)
- split for continuous attributes
- model overfitting - definition, example, generalization error, model pruning
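the two impurity measures and the split for a continuous attribute can be sketched as follows. this is a simplified illustration under stated assumptions: candidate thresholds are the midpoints between consecutive distinct sorted values, the split is scored by weighted entropy, and the function names and data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """entropy of a node in bits."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())


def misclassification_error(labels):
    """1 - proportion of the majority class."""
    return 1.0 - max(Counter(labels).values()) / len(labels)


def best_threshold(values, labels):
    """return (threshold, weighted entropy) of the best binary split v <= t."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v <= t]
        right = [lab for v, lab in pairs if v > t]
        score = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if score < best[1]:
            best = (t, score)
    return best


# made-up continuous attribute with a clean class boundary
print(best_threshold([1, 2, 3, 10, 11, 12], ["n", "n", "n", "y", "y", "y"]))
```

this version rescans the data for every candidate threshold; a practical implementation would update class counts incrementally while walking the sorted values.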
evaluation metrics
september 11 2023
reading
python and pandas tutorial: code
chapter 3.6 model evaluation
chapter 3.7 presence of hyper-parameters
chapter 3.8
covers
- evaluation and metrics for classification models - confusion matrix, accuracy, precision, recall, F-measure, roc, auc, cross-validation
- tutorial on developing a data mining project and setting up a python environment using conda
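the classification metrics in this session follow directly from the confusion matrix. below is a minimal sketch in plain python for the binary case, together with a simple index generator for k-fold cross-validation. the function names, the convention that undefined ratios are reported as 0.0, and the example labels are assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """return (tp, fp, fn, tn) for a binary problem."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred, positive=1):
    """accuracy, precision, recall and F1 from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / denom if denom else 0.0,
    }


def k_fold_indices(n, k):
    """yield (train, test) index lists for k contiguous folds over range(n)."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size


y_true = [1, 1, 1, 0, 0, 0, 0, 1]  # made-up labels
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
print(classification_metrics(y_true, y_pred))
```

the folds here are contiguous and unshuffled; shuffling or stratifying the indices first is usually preferable when the records are ordered by class.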
logistic regression
september 15 2023
reading
scikit-learn tutorial: code
chapter 4.6
covers
- linear regression review - model, sum of squared errors loss function, gradient descent optimization
- logistic regression classifier - model, cross-entropy loss function, regularization
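the logistic regression setup above (sigmoid model, cross-entropy loss, gradient descent, optional l2 regularization) can be sketched without any libraries. this is a teaching sketch under assumed defaults: the learning rate, number of epochs, function names, and toy data are all hypothetical choices, and the bias term is left unregularized.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, w, b):
    """model: p(y=1 | x) = sigmoid(w . x + b)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def cross_entropy(xs, ys, w, b):
    """mean cross-entropy loss; probabilities are clamped to avoid log(0)."""
    eps = 1e-12
    total = 0.0
    for x, y in zip(xs, ys):
        p = min(max(predict_proba(x, w, b), eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(xs)


def train_logistic(xs, ys, lr=0.1, epochs=1000, l2=0.0):
    """batch gradient descent on the mean cross-entropy plus (l2/2)*||w||^2."""
    n, d = len(xs), len(xs[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(xs, ys):
            err = predict_proba(x, w, b) - y  # gradient of the loss w.r.t. the logit
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wj - lr * (gw / n + l2 * wj) for wj, gw in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


xs = [[0.0], [1.0], [2.0], [3.0]]  # made-up one-feature data
ys = [0, 0, 1, 1]
w, b = train_logistic(xs, ys)
print(w, b, cross_entropy(xs, ys, w, b))
```

with `l2=0.0` and separable data the weights keep growing as training continues; a positive `l2` keeps them bounded, which is the role of regularization mentioned above.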
naive bayes classifier
september 18 2023
bayesian network, k-nearest neighbor classifiers
september 22 2023
support vector machine (SVM)
september 25 2023
neural networks
september 29 2023
neural networks
october 02 2023
midterm review
october 06 2023
midterm exam (part 1)
october 09 2023
midterm exam (part 2)
october 13 2023
no class due to fall break
october 16 2023
neural networks
october 20 2023
clustering & k-means
october 23 2023
hierarchical clustering
october 27 2023
DBSCAN & cluster evaluation
october 30 2023
association rule mining
november 03 2023
association rule mining
november 06 2023
association rule mining
november 10 2023
association rule mining
november 13 2023
anomaly detection
november 17 2023
deep learning framework
november 20 2023
no class due to thanksgiving break
november 24 2023
ensemble methods and boosting
november 27 2023
final review
december 01 2023
data mining application: recommender systems
december 06 2023
stop day
december 08 2023
outline
1. introduction
- why big data? what is big data mining? why data mining? data mining processes, relation to business intelligence techniques
- introduction to data mining tasks (classification, clustering, association analysis, anomaly detection). what is a model? basic terminologies, predictive modeling
- real world data mining applications
2. data and preprocessing
- understanding of data, what is data? type of attributes, properties of attribute values, types of data, data quality
- sampling, data normalization, data cleaning, similarity measures
- feature selection / instance selection, the importance of feature selection / instance selection in various big data scenarios
3. classification
- decision-tree based approach (e.g. C4.5)
- rule-based approach (e.g. Ripper)
- instance based classifiers (e.g. k-nearest neighbor)
- support vector machines (svms)
- ensemble learning
- classification model selection and evaluation
- applications: b2b customer buying stage prediction, recommender systems
4. association analysis
- apriori algorithm and its extensions
- association pattern evaluation
- sequential patterns and frequent subgraph mining
- applications: b2b customer buying path analysis, medical informatics, telecommunication alarm diagnosis
5. clustering
- partitional and hierarchical clustering methods
- graph based methods
- density based methods
- cluster validation
- applications: customer profiling, market segmentation
6. anomaly detection
- statistical based and density based methods
7. neural networks
- neurons and network topology
- multi layer feed forward network
8. data mining case studies
- big data analytics in mobile environments
- fraud detection and prevention with data mining techniques
- big data analytics in real world business
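as one concrete example of an algorithm from the outline, the apriori algorithm listed under association analysis can be sketched in a few lines. this is a simplified illustration, not an optimized or course-provided implementation: support is an absolute count, candidates are generated by joining frequent itemsets of the previous level, and the transactions are toy data.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """return {frozenset(itemset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    # level 1: count single items
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # join step: unions of two frequent (k-1)-itemsets that have size k
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # prune step: every (k-1)-subset of a candidate must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # support counting: one pass over the transactions per level
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result


# toy market-basket transactions
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
for itemset, count in sorted(apriori(baskets, 3).items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), count)
```

the prune step is what distinguishes apriori from brute-force enumeration: a candidate is discarded without counting as soon as one of its subsets is known to be infrequent.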
grading
unit | weight |
---|---|
exam 1 | 20% |
exam 2 | 20% |
programming project | 20% |
assignments | 30% |
attendance | 10% |