hanja-graph: A Python repository from pabloem

Hanja Graph Project

Author: Pablo Estrada < pablo (at) snu (dot) ac (dot) kr >

This repo contains the code resulting from the Hanja Graph Project, developed by Pablo Estrada as a side project.

### Folders

- Crawlers - Contains the crawlers that download the data. At the time of writing, only one crawler is implemented.
- Formatters - Contains the small Python scripts that take the files created by the crawlers and output a graph file in an accepted format.
- Analysis - Contains the scripts that analyze the graph.
- Test_data - Contains sample data, for anyone who just wants the final data without running all the processing:
  - graph.graphml - The full graph, with links between hanja and Korean words. No bipartite distinction.
  - hanja_list.json - The list of hanja, as returned by the crawler.
  - words.nospace.json - The list of Korean words, as returned by the crawler.
  - korean_unip_projection.graphml - The projection of the bipartite graph onto the Korean words. In the current version, the edge weights are 1 or 2, depending on how many Chinese characters two words share.
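
The weighted projection stored in korean_unip_projection.graphml can be illustrated with a minimal sketch in plain Python. This is not the repository's own code: the function name `project_words` and the input shape (a dict mapping each word to its hanja) are assumptions made for the example. Two words are linked when they share at least one hanja, and the edge weight is the number of shared characters.

```python
from itertools import combinations


def project_words(word_to_hanja):
    """Project a word-hanja bipartite graph onto the words.

    word_to_hanja: dict mapping a word to an iterable of its hanja
    (hypothetical input shape, chosen for this sketch).
    Returns a dict mapping a sorted (word_a, word_b) pair to the number
    of distinct hanja the two words share.
    """
    hanja_sets = {word: set(chars) for word, chars in word_to_hanja.items()}
    weights = {}
    for a, b in combinations(sorted(hanja_sets), 2):
        shared = len(hanja_sets[a] & hanja_sets[b])
        if shared:  # words are linked only when they share a hanja
            weights[(a, b)] = shared
    return weights


# Toy data: three two-character words with their hanja spelling.
example = {
    "학교": "學校",  # school
    "학생": "學生",  # student
    "교장": "校長",  # principal
}
edges = project_words(example)
```

Comparing every pair of words is quadratic in the number of words, which is fine for a toy example; for the full data set, grouping words by hanja first would avoid comparing pairs that share nothing.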

These are the utilities to scrape the Kanji information at 'http://www.manythings.org/kanji/d/'. Each serves a different purpose.

- scrape_kanji.py - This is the main scraper. It gets the data and outputs a JSON file with words and kanji. This JSON file can be used to generate the GraphML file.
- make_kanji_graph.py - This takes the JSON output from scrape_kanji.py and turns it into a GraphML file.
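
The JSON-to-GraphML step can be sketched with the standard library alone. The layout of the scraper's JSON is not documented here, so the sketch assumes a hypothetical shape, `{word: [kanji, ...]}`; the function name `json_to_graphml` is likewise made up for the example and is not the code in make_kanji_graph.py.

```python
import json
import xml.etree.ElementTree as ET


def json_to_graphml(json_text):
    """Build a GraphML string from JSON shaped like {word: [kanji, ...]}.

    The input shape is an assumption for this sketch; the real output of
    the scraper may differ.
    """
    data = json.loads(json_text)
    root = ET.Element("graphml", xmlns="http://graphml.graphdrawing.org/xmlns")
    graph = ET.SubElement(root, "graph", id="G", edgedefault="undirected")
    seen = set()

    def add_node(name):
        # Emit each node once, even if it appears in many words.
        # Note: a one-character word and its kanji would share a node here.
        if name not in seen:
            seen.add(name)
            ET.SubElement(graph, "node", id=name)

    for word, kanji_list in data.items():
        add_node(word)
        for kanji in kanji_list:
            add_node(kanji)
            ET.SubElement(graph, "edge", source=word, target=kanji)
    return ET.tostring(root, encoding="unicode")


sample = json.dumps({"学校": ["学", "校"], "学生": ["学", "生"]})
graphml = json_to_graphml(sample)
```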

Not yet available : )

To generate the synonyms training set we need to follow these steps:

(1) Use the graph dataset to obtain the features of each node pair

$> nohup ./bin/generate_csv_p.py data/hanja_unip.graphml res.csv 4

(2) Obtain the 'zeros' (non-related pairs) in the training set. We do this through random sampling from the main CSV file.

$> shuf -n 1000 data/res.csv > data/training_zeros

$> ./bin/removeFirstColumns data/training_zeros data/training_non_related.csv

(3) Obtain the 'synonyms' in the training set

* Obtain a random set of hanjas from the res.csv file

$> shuf -n 1000 data/res.csv | awk -F "," '{print $3}' > data/tmp

$> cat data/tmp | sort | uniq > data/random_hanjas.txt

$> ./bin/scrapeSynonyms data/random_hanjas.txt data/antonyms_hanja.txt data/synonyms_hanja.txt

$> ./bin/extractPairsFromCsv data/synonyms_hanja.txt data/res.csv data/synonyms.csv

(4) Use the result to run a classification scheme ;)
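
Steps (2) and (3) above rely on `shuf`, `awk`, `sort` and `uniq`. For readers without those tools, the same two operations (random sampling of lines, and collecting the unique values of the third column) can be sketched in Python. The function names and the sample rows are made up for the example, and the real column layout of res.csv is not documented here.

```python
import random


def sample_lines(lines, k, seed=None):
    """Pick up to k lines uniformly at random without replacement (like `shuf -n k`)."""
    rng = random.Random(seed)
    pool = list(lines)
    return rng.sample(pool, min(k, len(pool)))


def unique_column(lines, index=2):
    """Return the sorted, de-duplicated values of one comma-separated column.

    index=2 is the third column, matching awk's $3. Lines with too few
    fields are skipped (awk would print an empty line for them instead).
    Sorting is by code point, which may differ from a locale-aware `sort`.
    """
    values = set()
    for line in lines:
        fields = line.rstrip("\n").split(",")
        if len(fields) > index:
            values.add(fields[index])
    return sorted(values)


# Made-up rows standing in for res.csv.
rows = ["0,1,學,校,0.5", "1,2,學,生,0.1", "2,3,校,長,0.9", "3,4,生,長,0.3"]
zeros = sample_lines(rows, 2, seed=42)                   # step (2): random 'zeros'
hanjas = unique_column(sample_lines(rows, 3, seed=7))    # step (3): random hanjas
```

Passing a `seed` makes the sample reproducible, which `shuf` does not do by default.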

(1) Run the classification script

$> ./bin/get_synonyms.py data/res.csv data/synonyms_training.csv data/training_non_related.csv data/guess_syn1.txt

(2) Verify the results

$> ./bin/checkSynonyms data/guess_syn1.txt

(3) Verify the data by hand, since Naver does not know all the Hanja synonyms

$> ./bin/get_pairs_meanings.py data/guess_syn1.txt output [amount]