This repository contains the source code of Tamil Actors Search Engine implemented using Python and Elasticsearch.
├── analyzers : Custom Elasticsearch filters (Stems, Stopwords, Synonyms)
├── data : Data scraped from the website (https://ta.wikipedia.org/)
├── data_actors.json : Contains all the scraped data of actors
├── data_actors_links.csv : Contains web urls of actors to scrape data from.
├── data_actors_with_info.json : Contains scraped data of actors having non-null 'date_of_birth'
├── data_actors_with_movies.json : Contains scraped data of actors having non-empty 'movies'
├── data_actors_with_info_movies.json : Contains scraped data of actors having both non-null 'date_of_birth' and non-empty 'movies'
├── app : Frontend of the web app
├── app.py : Backend of the web app created using Flask
├── data_uploading.py : Python file to convert JSON data to bulkdata, and upload to Elasticsearch Bulk API
├── query_searching.py : Search API functions
├── requirements.txt : Python dependencies required for the project
├── search_queries.txt : Example search queries
├── setup_queries.txt : Some queries to setup the elasticsearch "tamilactors" index
├── web_scraping.py : Python file to scrape data, preprocess and store them in a JSON file
- Clone the repo and install the required Python dependencies.
git clone https://github.com/thuvarahan97/Tamil-Actors-Search-Engine.git cd Tamil-Actors-Search-Engine virtualenv -p python3 envname source env/bin/activate pip3 install -r requirements.txt
- Download and install Elasticsearch
- Create 'analyze' folder inside the config folder of Elasticsearch and copy files from analyzers.
- Create index "tamilactors" in Elasticsearch along with Custom filters (refer "Custom stemmer, stopper and synonyms" section in the setup_queries.txt file) using Postman.
- Add mapping to "tamilactors" Elasticsearch index (refer "Aggregation" section in the setup_queries.txt file) using Postman.
- If you want to scrape new data, run
python web_scraping.py
in a terminal. - Start the elasticsearch.
- Run
python data_uploading.py
to upload data to the Elasticsearch index "tamilactors" in a terminal. - Run
python app.py
in a terminal. - Go to http://127.0.0.1:5000/
- Search for actors (Works only for basic search queries at the moment using this web app)
- For advanced queries, try the given Sample Queries using Postman or Kibana.
The data have been scraped in Tamil language from the Wikipedia website https://ta.wikipedia.org/w/index.php?title=பகுப்பு:தமிழ்த்_திரைப்பட_நடிகர்கள் using the HTML/XML parsing library BeautifulSoup. This web page contains a list of names of tamil actors under the section "தமிழ்த் திரைப்பட நடிகர்கள்" பகுப்பிலுள்ள கட்டுரைகள் in which each name contains a web link to the main page of the actor.
Each actor contains the following data fields.
- name - Name of the actor
- date_of_birth - Date of birth of the actor
- date_of_death - Date of death of the actor (if the actor is already dead)
- place_of_birth - Place of birth of the actor
- spouse - List of spouse of the actor
- children - List of children of the actor
- movies - List of movies acted by the actor
(for each movie in the list)- year - Release year of the movie
- movie - Name of the movie
- other_occupations - List of occupations carried out by the actor other than acting
- awards - List of awards received by the actor
- description - Description of the actor
- Search for actors by any of the listed data fields.
E.g.- "கமல்ஹாசன்"
{
"query": {
"query_string": {
"query":"கமல்ஹாசன்"
}
}
}
- Search for actors specifying the field when you just know any of the listed data fields.
E.g.- "விருது பத்மஸ்ரீ"
{
"query" : {
"match" : {
"awards" : "பத்மஸ்ரீ"
}
}
}
- Search with WildCard when you are not sure about the spelling of the word.
E.g.- "கமல்*" for "கமல்ஹாசன்"
{
"query" : {
"wildcard" : {
"name" : "கமல்*"
}
}
}
- Search when you think one term might show up in multiple fields
E.g.- "தேசிய விருது"
{
"query" : {
"multi_match" : {
"query" : "தேசிய விருது",
"fields": ["awards", "description"]
}
}
}
- Search for 20 young actors who are directors where young is decided based on "date_of_birth"
E.g. - 20 இளைய இயக்குநர்
{
"size": 20,
"sort": [
{ "date_of_birth": {"order" : "desc"}}
],
"query": {
"multi_match": {
"fields":["other_occupations"],
"query" : "இயக்குநர்",
"fuzziness": "AUTO"
}
}
}
- Search for actors who acted in year 2020 (Nested query)
E.g. - 2020 ஆம் ஆண்டில் நடித்த நடிகர்
{
"query": {
"nested" : {
"path" : "movies",
"score_mode" : "avg",
"query" : {
"bool" : {
"must" : [
{ "match" : {"movies.year" : "2020"} }
]
}
}
}
}
}
- Search with query spanning multiple fields
E.g.- சிறந்த நடிகர் விருது பெற்ற தமிழ்நாடு நடிகர்
{
"query": {
"bool": {
"must": [
{ "match": { "awards": "சிறந்த நடிகர் விருது" }},
{ "match": { "place_of_birth": "தமிழ்நாடு" }}
]
}
}
}
- Seach for actors who are singers died recently (Range Query) where died recently is based on "date_of_death"
E.g.- சமீபத்தில் இறந்த பாடகர்
{
"query": {
"bool": {
"must": [
{
"match": {
"other_occupations": "பாடகர்"
}
},
{
"range": {
"date_of_death" : {
"gte" : "2021"
}
}
}
]
}
}
}
- Search for actors who are lyricists and were born in Chennai (Filtered query)
E.g. - சென்னை பாடலாசிரியர்
{
"query": {
"bool": {
"must": {
"match": {
"other_occupations": "பாடலாசிரியர்"
}
},
"filter": {
"term": {
"place_of_birth": "சென்னை"
}
}
}
}
}
- Get only prefered fields when searching with other field
E.g.- 20 இளைய இயக்குநர் துணைவர்/பிள்ளைகள்
{
"size": 20,
"sort": [
{ "date_of_birth": {"order" : "desc"}}
],
"query": {
"multi_match": {
"fields":["other_occupations"],
"query" : "இயக்குநர்",
"fuzziness": "AUTO"
}
},
"_source":{
"includes":["spouse", "children"]
}
}
- Search for details only with description of actors (Text Mining)
{
"query": {
"more_like_this": {
"fields": [
"description"
],
"like": "தமிழ்நாட்டு மாதிரி நடிகர், குறும்பட நடிகர் மற்றும் தொலைக்காட்சி நடிகர். சிறப்பாக நடித்ததற்காக சிறந்த துணை நடிகருக்கான இந்திய தேசிய திரைப்பட விருது கிடைத்தது.",
"min_term_freq": 1,
"max_query_terms": 20
}
}
}
- Can do aggregated bucket querying with terms
{
"aggs": {
"Occupations": {
"terms": {
"field": "other_occupations",
"size": 10
}
}
}
}