Indicwiki-internship

This project creates programs that generate Telugu and Hindi content using artificial intelligence, machine learning, natural language toolkits, and openly available structured data sources in a specific domain such as programming languages, software companies, cities, scientists, BSE companies, hospitals, universities/colleges, movies, or actors.

Primary language: Jupyter Notebook

Requirements

The project's virtual environment requires specific package versions.
Link to the requirements: here

Domain: Indian Companies

More than 2 million companies exist in India. Our aim is to generate Telugu articles for 50000+ of them, so we divided the companies into 4 categories based on their nature and organization.

BSE & NSE Companies

The BSE SENSEX (also known as the S&P Bombay Stock Exchange Sensitive Index or simply SENSEX) is a free-float market-weighted stock market index of 30 well-established and financially sound companies listed on the Bombay Stock Exchange.

Public Sector Undertakings (PSU)

A Public Sector Undertaking (PSU), or Public Sector Enterprise (PSE), is a government entity in India, also known as a government-owned enterprise, government-owned corporation, statutory corporation, government-owned company, or nationalised company. The government establishes PSUs to promote development, to limit monopoly by private sector entities, to offer products and services to citizens at an affordable price, and to earn profit for the government.

Companies in Wikipedia

Several Indian companies have Wikipedia articles, and most of them also belong to the BSE & NSE or PSU categories. The wiki companies category contains only those that are not present in the above 2 categories.
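The exclusion rule above can be sketched with set operations. This is a minimal illustration, not the project's actual code: the function name `wiki_only_companies` and the company names are hypothetical.

```python
def wiki_only_companies(wikipedia, bse_nse, psu):
    """Return companies found on Wikipedia but in neither earlier category."""
    return wikipedia - (bse_nse | psu)

# Hypothetical company names, for illustration only.
bse_nse = {"Company A", "Company B"}
psu = {"Company C"}
wikipedia = {"Company A", "Company C", "Company D"}

wiki_only = wiki_only_companies(wikipedia, bse_nse, psu)
print(sorted(wiki_only))
```

In practice the names would need to be normalized (case, suffixes such as "Ltd") before comparison, since the same company may be spelled differently across sources.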

Other Indian Companies

We scraped several websites and collected data for 60000+ companies based on their capital.

Data collection & Scraping

Most of the scraping is done using Selenium and BeautifulSoup.
To install them, run the following commands in your command prompt or Anaconda prompt.

pip install beautifulsoup4
pip install selenium 
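
As a rough illustration of the parsing step, the sketch below extracts rows from an HTML table with BeautifulSoup. The HTML snippet, the table id `companies`, and the function `parse_companies` are hypothetical; in a real run the HTML would typically come from a page loaded with Selenium (for example through `driver.page_source`).

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a scraped page.
html = """
<table id="companies">
  <tr><th>Name</th><th>Capital</th></tr>
  <tr><td>Example Ltd</td><td>100000</td></tr>
  <tr><td>Sample Pvt Ltd</td><td>250000</td></tr>
</table>
"""

def parse_companies(page_html):
    """Return one dict per data row of the hypothetical companies table."""
    soup = BeautifulSoup(page_html, "html.parser")
    rows = soup.find("table", id="companies").find_all("tr")
    companies = []
    for row in rows[1:]:  # skip the header row
        name, capital = [cell.get_text(strip=True) for cell in row.find_all("td")]
        companies.append({"name": name, "capital": int(capital)})
    return companies

companies = parse_companies(html)
print(companies)
```

The built-in `html.parser` backend is used here so that no extra parser package is needed.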

Translation and Transliteration

Translation and transliteration were an important part of the project. We tested several packages, such as deep-translator, anuvaad and deeptranslit. A given translator may work well only for some kinds of data, so we need to choose the best fit for our data. To install the above packages, run the following commands in your command prompt or Anaconda prompt.

pip install deep-translator
pip install anuvaad
pip install deeptranslit

Jinja Templating

We use the Jinja templating tool to create articles in Telugu. We randomize the sentence formation to make sure that each article uses different kinds of sentences.
Jinja templates for all four categories of companies:

  • BSE & NSE companies : here
  • PSU companies : here
  • wiki companies : here
  • other companies : here
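
The randomized sentence formation described above could look roughly like the following sketch. The sentence variants, field names, and the function `render_intro` are hypothetical, and the variants are written in English only for readability; the project's real templates are in Telugu and are the ones linked above.

```python
import random
from jinja2 import Template

# Hypothetical sentence variants; the real templates are in Telugu.
INTRO_VARIANTS = [
    "{{ name }} is a company headquartered in {{ city }}.",
    "Headquartered in {{ city }}, {{ name }} is an Indian company.",
    "The company {{ name }} is based in {{ city }}.",
]

def render_intro(company, rng):
    """Pick one variant at random and render it with the company's fields."""
    template = Template(rng.choice(INTRO_VARIANTS))
    return template.render(**company)

rng = random.Random(0)  # seeded only so that runs are repeatable
company = {"name": "Example Ltd", "city": "Hyderabad"}
print(render_intro(company, rng))
```

Choosing the variant in Python keeps the selection reproducible with a seed. Jinja also provides a built-in `random` filter if the choice should happen inside the template itself.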

Article(XML) Generation

We generated 4 XML files, one for each category.

The code for generating the XML files is provided here: XML generator

  • BSE & NSE companies XML file : here
  • PSU companies XML file : here
  • Wikipedia companies XML file : here
  • Other Indian companies XML file : here
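
For orientation, the sketch below shows one way rendered articles could be wrapped into a single XML file using Python's standard library. It is not the linked XML generator: the function `build_xml` is hypothetical, and the element layout (`mediawiki`, `page`, `title`, `revision`, `text`) is an assumption loosely modelled on the MediaWiki export format, which requires additional elements for a real import.

```python
import xml.etree.ElementTree as ET

def build_xml(articles):
    """Build a <mediawiki> tree with one <page> per (title, text) pair."""
    root = ET.Element("mediawiki")
    for title, text in articles:
        page = ET.SubElement(root, "page")
        ET.SubElement(page, "title").text = title
        revision = ET.SubElement(page, "revision")
        ET.SubElement(revision, "text").text = text
    return ET.ElementTree(root)

# Hypothetical article, for illustration only.
articles = [("Example Ltd", "Article text for Example Ltd.")]
tree = build_xml(articles)
xml_string = ET.tostring(tree.getroot(), encoding="unicode")
print(xml_string)
```

To save the result to disk, `tree.write("companies.xml", encoding="utf-8", xml_declaration=True)` writes the tree with an XML declaration; the file name here is a placeholder.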