urdu_ocr_dataset_generation

Dataset generation for Urdu OCR.

Requirements:

Download the repository.
Go into the folder named 'bbcurdu'.
Open a command prompt and run `scrapy crawl bbc -o filename.csv`. This scrapes the BBC Urdu news titles from the current page and saves them in filename.csv.
Copy filename.csv into the main directory.
Open the Jupyter notebook in the main directory. In cell In [28] you can change the column name to either "content_news" or "title_headlines".
Run all cells.
Once all cells have run, open "data_set.py" and paste your Jupyter notebook token URL into it. Selenium needs a driver for the particular browser you are using: [Firefox](https://github.com/mozilla/geckodriver/releases), [Chrome](https://sites.google.com/a/chromium.org/chromedriver/downloads); others can be found [here](https://www.seleniumhq.org/download/). Download the driver and place it in the main directory.
Then run `python data_set.py`.
It will create two directories, "images" and "texts", containing the dataset.
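
The steps above produce two things worth checking: the scraped filename.csv and the "images"/"texts" directories. The sketch below is not part of the repository. It is a small optional sanity check that uses only the Python standard library. The helper names `csv_columns` and `unpaired_samples` are hypothetical, and the pairing check assumes that each image and its text file share a base name (for example 0.png and 0.txt), which may not match how data_set.py actually names its files.

```python
import csv
from pathlib import Path


def csv_columns(csv_path):
    """Return the header row of a CSV file as a list of column names."""
    # utf-8 is assumed here because the scraped titles are Urdu text.
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return next(csv.reader(handle), [])


def unpaired_samples(images_dir="images", texts_dir="texts"):
    """Return base names that appear in only one of the two directories.

    Assumption: an image and its text file share a base name.
    Adjust the comparison if data_set.py uses a different naming scheme.
    """
    image_stems = {p.stem for p in Path(images_dir).iterdir() if p.is_file()}
    text_stems = {p.stem for p in Path(texts_dir).iterdir() if p.is_file()}
    # Symmetric difference: names present on one side but not the other.
    return sorted(image_stems ^ text_stems)
```

From the main directory, `csv_columns("filename.csv")` shows which column name ("content_news" or "title_headlines") to use in the notebook, and `unpaired_samples()` should return an empty list if every image has a matching text file under the naming assumption above.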