urdu_ocr_dataset_generation

Dataset generation for Urdu OCR.

Primary language: Jupyter Notebook. License: GNU General Public License v3.0 (GPL-3.0).

Requirements:

  1. Jupyter Notebook
  2. Scrapy
  3. Pandas
  4. Selenium
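
Before starting, it can help to confirm that the four requirements are installed. The following is only a sketch: the import names in `REQUIRED_MODULES` and the helper `missing_modules` are assumptions made for illustration and are not part of this repository.

```python
from importlib.util import find_spec

# Import names assumed for the four requirements listed above
# (Jupyter Notebook is assumed to be importable as "notebook").
REQUIRED_MODULES = ("notebook", "scrapy", "pandas", "selenium")


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```

Anything reported as missing can usually be installed with `pip install` followed by the package name.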

How to Use:

  • Download the repository.
  • Go into the folder named 'bbcurdu'.
  • Open a command prompt and run `scrapy crawl bbc -o filename.csv`. This scrapes the BBC Urdu news titles from the current page and saves them in filename.csv.
  • Copy filename.csv into the main directory.
  • Open Jupyter Notebook in the main directory. In cell In [28] you can change the column name to either "content_news" or "title_headlines".
  • Run all cells.
  • Once all cells have finished, open "data_set.py" and paste your Jupyter Notebook token URL into it. You will also need the Selenium driver for the browser you are using: [Firefox](https://github.com/mozilla/geckodriver/releases), [Chrome](https://sites.google.com/a/chromium.org/chromedriver/downloads); others can be found [here](https://www.seleniumhq.org/download/). Download the driver and place it in the main directory.
  • Then run `python data_set.py`.
  • It will create two directories, "images" and "texts", containing the dataset.
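
Two things in the steps above are easy to get wrong: the scraped CSV may not contain the column selected in the notebook, and the browser driver may not be in the main directory. The following pre-flight check is a minimal sketch, not part of this repository. The helper names `available_columns` and `find_driver` and the list of driver file names are assumptions for illustration. It uses only the standard library, so it runs before Pandas or Selenium are involved.

```python
import csv
from pathlib import Path

# Column names mentioned in the steps above.
EXPECTED_COLUMNS = ("content_news", "title_headlines")
# Assumed driver executable names, with and without the Windows ".exe" suffix.
DRIVER_NAMES = ("geckodriver", "chromedriver", "geckodriver.exe", "chromedriver.exe")


def available_columns(csv_path):
    """Return the expected columns that are present in the CSV header."""
    # "utf-8-sig" reads UTF-8 files with or without a byte-order mark.
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        header = next(csv.reader(handle), [])
    return [name for name in EXPECTED_COLUMNS if name in header]


def find_driver(directory="."):
    """Return the path of the first browser driver found in `directory`, or None."""
    for name in DRIVER_NAMES:
        candidate = Path(directory) / name
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    csv_file = "filename.csv"
    if Path(csv_file).is_file():
        print("Usable columns:", available_columns(csv_file))
    else:
        print(csv_file, "not found in the current directory")
    print("Driver:", find_driver() or "not found")
```

Run it from the main directory after copying filename.csv there. An empty list of usable columns suggests the column name in the notebook needs to be changed.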