Korean Thematic Analysis is a project that walks through all of the sequential steps of a text-analysis pipeline, from collecting valuable data by crawling websites to applying several models and testing with representative machine learning algorithms.
- Python3
- python on the PATH (make sure it's Python 3)
- JDK 1.7+
- The Selenium Library (pip3 install selenium)
- The Pymongo Library (pip3 install pymongo)
- The Numpy Library (pip3 install numpy)
- The Matplotlib Library (pip3 install matplotlib)
- The scikit-learn Library (pip3 install scikit-learn)
- The KoNLPy Library (pip3 install konlpy)
- The Tensorflow Library (pip3 install tensorflow)
- Chrome v70-72 (using ChromeDriver 2.45)
We are going to crawl the Naver blog site, which is organized into categories. No authorization (such as logging in) is required.
Let's access that page using ChromeDriver. Since the driver binary depends on your OS, you need to tell the crawler which one to initialize. This repository ships drivers only for macOS and Linux; if you have to run the scraper on another system, copy your driver into the directory ./driver. Add configuration like below:
OS_CONFIG = {
'os' : 'your os' # mac or linux
}
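The crawler presumably maps this setting to the matching bundled ChromeDriver binary. A minimal sketch of that lookup (the file names under ./driver are illustrative assumptions, not necessarily the repository's exact names):

```python
# Sketch: pick the bundled ChromeDriver binary based on OS_CONFIG['os'].
# The file names below are assumptions for illustration.
DRIVER_PATHS = {
    'mac': './driver/chromedriver_mac',
    'linux': './driver/chromedriver_linux',
}

def driver_path(os_config):
    """Return the ChromeDriver path matching OS_CONFIG['os'] ('mac' or 'linux')."""
    return DRIVER_PATHS[os_config['os']]

# The Selenium driver would then be started with something like:
#   from selenium import webdriver
#   driver = webdriver.Chrome(executable_path=driver_path(OS_CONFIG))
```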
We used MongoDB to store the collected documents. You need to add configuration to connect to your MongoDB instance in collect/config/config.py like below:
MONGODB_CONFIG = {
'host': 'your host',
'dbname': 'your dbname',
'port': your port
}
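PyMongo can consume these settings either as separate host/port arguments to `MongoClient` or as a single connection URI. A small helper to build the URI form from this config dict (a sketch, not code from the repository):

```python
def mongo_uri(config):
    """Build a MongoDB connection URI from MONGODB_CONFIG; pymongo's
    MongoClient(uri) accepts this string directly."""
    return 'mongodb://{host}:{port}/{dbname}'.format(**config)

# Example:
#   mongo_uri({'host': 'localhost', 'port': 27017, 'dbname': 'blogs'})
#   → 'mongodb://localhost:27017/blogs'
```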
To keep the data from being skewed, text in languages not needed for this project (such as English or Japanese) is filtered out with a regular expression. Morphological analysis is also required to analyze Korean, and Twitter Korean Text is one of the well-known morphological analyzers for Korean. The crawler saves the NLP-processed data alongside the scraped raw data as it goes, so all you have to do is run the crawler with the command below:
python3 ./collect/blog-crawling.py
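The language filtering described above can be sketched with a Hangul-only regular expression. This is a simplified illustration of the idea, not the project's exact pattern:

```python
import re

# Keep only Hangul syllables and whitespace; everything else (English,
# Japanese, digits, punctuation) is replaced and the whitespace collapsed.
NON_KOREAN = re.compile(r'[^\uac00-\ud7a3\s]')

def filter_korean(text):
    """Strip non-Korean characters, then normalize runs of whitespace."""
    return re.sub(r'\s+', ' ', NON_KOREAN.sub(' ', text)).strip()
```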
Run the command below and it will save the natural-language-processed data into local files under the directory /data/blogs.
python3 ./analyze/batch.py
The default sample limit for each category is 1000. If you want to change it, edit the value at the top of that file:
NUM_OF_SAMPLES_PER_CLASS = 1000
Before we dive into training and predicting something amazing, it is essential to explore the features of your data. Just run the command below and you can check the number of samples, the number of classes, the number of samples per class, and the median number of words per sample. You can also see plots of the sample-length distribution and the unigram frequency distribution.
python3 ./analyze/explore_data.py
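For reference, the statistics listed above take only a few lines to compute. This is a hypothetical sketch of what explore_data.py reports, not its actual code:

```python
from collections import Counter
from statistics import median

def summarize(samples, labels):
    """Compute the exploration statistics described above.
    `samples` is a list of document strings, `labels` their class labels."""
    words_per_sample = [len(s.split()) for s in samples]
    return {
        'num_samples': len(samples),
        'num_classes': len(set(labels)),
        'samples_per_class': Counter(labels),
        'median_words_per_sample': median(words_per_sample),
    }
```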
We are going to run some experiments to figure out which model fits our data best.
First, we tokenize and vectorize the text using unigrams and bigrams, then build a multi-layer perceptron (MLP) model. Tune the hyperparameters using the learning-curve plot you get at the end of training, along with the usual guidelines for avoiding underfitting or overfitting.
python3 ./analyze/train_mlp.py
You can easily tune the hyperparameters by editing the constant values at the top of the file /analyze/train_mlp.py:
LEARNING_RATE = 1e-3
EPOCHS = 1000
BATCH_SIZE = 128
LAYERS = 2
UNITS = 64
DROPOUT_RATE = 0.5