MirasText: An Automatically Generated Text Corpus for Persian

This repository contains the MirasText corpus and its description, along with what it has been used for and what it can be used for. A sample of the dataset containing 1,000 documents is provided in MirasText_sample.txt. The full dataset is hosted on Google Drive and you can download it here. To use the complete dataset, send a request to behnam@miras-tech.com and we will provide the password needed to extract MirasText.zip.

MirasText Description

MirasText is the result of crawling more than 250 Persian news websites. Each article in MirasText has the following attributes:

Content: The article content
Description: The summary provided for the article (missing for some articles, in which case the value is nan)
Keywords: The keywords associated with the article (missing for some articles, in which case the value is nan)
Title: Title of the article
Website: The website from which the article is crawled
URL: The article absolute URL

MirasText has more than 2.8 million articles and over 1.4 billion content words. The following table summarizes the statistics of the corpus:

Total Documents	2,835,414
Total Content Words	1,429,878,960
Average Content Length	504.3
Average Keywords	8.4
Average Description Length	19.8
Average Title Length	9.5
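
As a quick sanity check on the table, the average content length can be recomputed from the two totals. This is a minimal sketch that assumes Average Content Length is measured in words per document; the table does not state its units explicitly.

```python
# Totals copied from the statistics table above.
total_documents = 2_835_414
total_content_words = 1_429_878_960

# Assumption: "Average Content Length" is content words per document.
average_content_length = total_content_words / total_documents

print(round(average_content_length, 1))  # 504.3, matching the table
```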

What it has been used for

At Miras Technologies International we are using MirasText to develop several NLP applications, including:

Document Classification
Word Embedding Extraction
Summarization
Keyword Extraction

Please let us know if you have used MirasText for any other purpose so that we can add it to this list.

What it can be used for

MirasText can be used for a variety of NLP tasks. Besides the applications mentioned above, it can also be used for:

Language Modeling
Title Extraction
Named Entity Recognition (for an unsupervised approach)

Dataset Description

The dataset is provided in MirasText.zip. Extract it first, then use parser.py to read the dataset line by line. Each line contains one article. The attributes of each article are separated by a special delimiter (***) to avoid conflicts with the article text.

Each article is provided in the following format: content *** description *** keywords *** title *** website *** url (note that *** is the delimiter used to separate the attributes)
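
The line format above can be read with a few lines of Python. The following is a minimal sketch, not the parser.py shipped with this repository: the names Article, parse_line and read_articles are hypothetical, and it assumes the files are UTF-8 encoded, that missing descriptions and keywords appear as the literal string nan, and that *** never occurs inside a field.

```python
from typing import Iterator, NamedTuple, Optional

DELIMITER = "***"
# Field order as described in the format line above.
FIELDS = ("content", "description", "keywords", "title", "website", "url")


class Article(NamedTuple):
    content: str
    description: Optional[str]  # None when the source value is "nan"
    keywords: Optional[str]     # None when the source value is "nan"
    title: str
    website: str
    url: str


def parse_line(line: str) -> Optional[Article]:
    """Split one corpus line into its six attributes.

    Returns None if the line does not contain exactly six fields.
    """
    parts = [part.strip() for part in line.rstrip("\n").split(DELIMITER)]
    if len(parts) != len(FIELDS):
        return None
    content, description, keywords, title, website, url = parts
    return Article(
        content=content,
        # Assumption: missing values are stored as the literal string "nan".
        description=None if description == "nan" else description,
        keywords=None if keywords == "nan" else keywords,
        title=title,
        website=website,
        url=url,
    )


def read_articles(path: str, encoding: str = "utf-8") -> Iterator[Article]:
    """Yield articles one at a time so the full corpus never sits in memory."""
    with open(path, encoding=encoding) as handle:
        for line in handle:
            article = parse_line(line)
            if article is not None:
                yield article
```

For example, iterating over read_articles("MirasText_sample.txt") and printing article.title would list the titles in the sample file. Reading lazily, line by line, matters here because the full corpus holds more than 2.8 million articles.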

Cite

Please cite the following paper in your publication if you are using MirasText in your research: