Dataset generation for Urdu OCR.
- Jupyter Notebook
- Scrapy
- Pandas
- Selenium
- Download the repository.
- Go into the folder named 'bbcurdu'.
- Open a command prompt and run `scrapy crawl bbc -o filename.csv`. This scrapes the BBC Urdu news titles from the current page and saves them to filename.csv.
- Copy filename.csv into the main directory.
- Open Jupyter Notebook in the main directory. In cell `In [28]` you can change the column name to either "content_news" or "title_headlines".
- Run all cells.
- Once all cells have finished running, open "data_set.py" and paste your Jupyter Notebook token URL into it. You will also need the Selenium driver for the particular browser you are using: [Firefox](https://github.com/mozilla/geckodriver/releases), [Chrome](https://sites.google.com/a/chromium.org/chromedriver/downloads); others can be found [here](https://www.seleniumhq.org/download/). Download the driver and place it in the main directory.
- Then run `python data_set.py`.
- It will create two directories, "images" and "texts", containing the dataset.
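
Before running the notebook, it can help to check that the scraped CSV actually contains the column you plan to use. The snippet below is a minimal sketch of that check with Pandas, not the code from the notebook. The helper name `load_text_column` and the inline sample data are hypothetical. It also assumes the CSV header uses the column names "title_headlines" and "content_news" mentioned in the steps above.

```python
import io

import pandas as pd


def load_text_column(csv_source, column="title_headlines"):
    """Return the non-empty strings from one column of the scraped CSV.

    `csv_source` may be a path such as "filename.csv" or a file-like object.
    The column names are assumptions based on the steps above.
    """
    frame = pd.read_csv(csv_source)
    if column not in frame.columns:
        raise KeyError(
            f"{column!r} not found; available columns: {list(frame.columns)}"
        )
    # Drop missing values, then strip surrounding whitespace.
    texts = frame[column].dropna().astype(str).str.strip()
    # Discard rows that are empty after stripping.
    return [text for text in texts if text]


# Hypothetical stand-in for filename.csv so the sketch runs on its own.
sample_csv = (
    "title_headlines,content_news\n"
    "پہلی خبر,پہلا متن\n"
    "دوسری خبر,\n"
)

titles = load_text_column(io.StringIO(sample_csv), "title_headlines")
contents = load_text_column(io.StringIO(sample_csv), "content_news")
print(len(titles), "titles,", len(contents), "content rows")
```

To use it on real data, pass the path to your own file, for example `load_text_column("filename.csv", "content_news")`.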