
Uyghur text resources crawled from websites

Apache License 2.0

UyghurTextResource

Uyghur text resources crawled from websites. Each root folder is named after the domain of the crawled website and contains three subfolders and one txt file, described below:

### data folder:

Original text content crawled from the web pages (warning: this is raw text from the website).

### content folder:

Original Uyghur text from the web pages (a single line of text, split by spaces).

### dic folder:

Word lists for the original web pages, produced by word tokenization.

### unique.txt file:

Unique word list for the entire crawled website.
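
The layout described above can be explored with a short script. The following is a minimal sketch, not part of the repository: the function names `load_unique_words` and `summarize_site` are hypothetical, and it assumes that files are UTF-8 encoded and that `unique.txt` holds one word per line, neither of which is stated in this README.

```python
from pathlib import Path


def load_unique_words(site_dir):
    """Read unique.txt from one site folder.

    Assumption: the file is UTF-8 and holds one word per line.
    """
    path = Path(site_dir) / "unique.txt"
    with path.open(encoding="utf-8") as fh:
        # Skip blank lines and strip surrounding whitespace.
        return [line.strip() for line in fh if line.strip()]


def summarize_site(site_dir):
    """Count files in the data, content and dic folders, plus unique words."""
    site = Path(site_dir)
    summary = {}
    for name in ("data", "content", "dic"):
        folder = site / name
        if folder.is_dir():
            summary[name] = sum(1 for p in folder.iterdir() if p.is_file())
        else:
            # A missing subfolder is reported as empty rather than raising.
            summary[name] = 0
    summary["unique_words"] = len(load_unique_words(site))
    return summary
```

Calling `summarize_site("example.com")`, where `example.com` stands in for one of the root folders, would return a dictionary with the file count of each subfolder and the number of entries in `unique.txt`.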