CS4642 - Data Mining & Information Retrieval
- title - song title in English
- artist - artist name in English
- album - album name in English
- releasedYear - released year of the song
- lyricist - lyricist name in English
- lyrics - song lyrics in Sinhala
- metaphor - part of the song which contains a metaphor in Sinhala
- meaning - meaning of the metaphor in English
- source - source domain of the metaphor in English
- target - target domain of the metaphor in English
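
As a rough illustration, the fields above could be described to Elasticsearch with a mapping along these lines. The field names come from the list; the field types (`text` for everything except an `integer` year) and the helper name `build_mapping` are assumptions, not taken from the project.

```python
# Field names are from the list above; the types are assumed for illustration.
TEXT_FIELDS = [
    "title", "artist", "album", "lyricist", "lyrics",
    "metaphor", "meaning", "source", "target",
]


def build_mapping():
    """Return a hypothetical index mapping body for the song documents."""
    properties = {name: {"type": "text"} for name in TEXT_FIELDS}
    # Assumption: the released year is stored as a number.
    properties["releasedYear"] = {"type": "integer"}
    return {"mappings": {"properties": properties}}
```

The resulting dictionary has the shape of a create-index request body and can be adapted once the real field types are known.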
- Searching songs by title, artist, lyricist, or album
- Limiting the number of results (e.g. 2 songs of ....)
- Searching songs by metaphor (e.g. metaphors for girl)
- Searching songs by metaphor meaning (e.g. meaning beautiful girl)
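
One possible way to support limited searches such as "2 songs of ...." is to pull the first number out of the query and use it as the result size. The sketch below is only an assumption about how this could work; `extract_limit` and the default size of 10 are hypothetical, and a query containing a year would need extra handling.

```python
import re

DEFAULT_SIZE = 10  # assumed default number of results


def extract_limit(query, default=DEFAULT_SIZE):
    """Return (size, remaining_query), using the first integer in the query as the size."""
    match = re.search(r"\b(\d+)\b", query)
    if match is None:
        return default, query
    # Drop the number and normalise the whitespace left behind.
    remaining = (query[:match.start()] + query[match.end():]).split()
    return int(match.group(1)), " ".join(remaining)
```

For example, `extract_limit("2 songs of Bathiya")` yields a size of 2 and the remaining text `songs of Bathiya`, which can then be sent as the search text.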
- Tokenization
- Whitespace tokenizer - Splits the analyzed text into tokens (individual words) on whitespace. Any sequence of whitespace characters (such as spaces, tabs, or line breaks) acts as a delimiter between tokens.
- Edge n-gram filter - Breaks each token into n-grams of a given size range, with the n-grams anchored at the start (front) of the token. Here, min_gram is set to 4 and max_gram is set to 18, so the filter emits the prefixes of each token that are 4 to 18 characters long. This filter supports prefix matching queries.
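
The tokenization setup described above can be sketched as an analysis settings fragment, together with a small pure-Python imitation of what it produces. The `min_gram` and `max_gram` values are from the text; the names `edge_ngram_filter` and `song_analyzer` are hypothetical, and the imitation is only an approximation of Elasticsearch's behaviour.

```python
# Settings sketch: filter and analyzer names are made up for illustration.
ANALYSIS_SETTINGS = {
    "analysis": {
        "filter": {
            "edge_ngram_filter": {
                "type": "edge_ngram",
                "min_gram": 4,
                "max_gram": 18,
            }
        },
        "analyzer": {
            "song_analyzer": {
                "type": "custom",
                "tokenizer": "whitespace",
                "filter": ["edge_ngram_filter"],
            }
        },
    }
}


def edge_ngrams(token, min_gram=4, max_gram=18):
    """Return the prefixes of `token` whose length is between min_gram and max_gram."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def analyze(text):
    """Approximate the analyzer: split on whitespace, then expand each token."""
    return [gram for token in text.split() for gram in edge_ngrams(token)]
```

In this imitation a token shorter than `min_gram` produces no n-grams at all, so `analyze("Bathiya sang")` gives the prefixes `Bath`, `Bathi`, `Bathiy`, `Bathiya` followed by `sang`.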
- Stop word filtering
- A custom filter is used for stop word handling. In addition to the default English stop words provided by Elasticsearch, it removes certain common words that are specific to this application. The ignore_case option is set to true, which means that the filter matches the words in the stopwords array regardless of case.
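
A stop word filter of this kind might look like the fragment below, followed by a plain Python imitation of the matching. The `ignore_case` and `stopwords` options are from the text; the filter name `app_stop`, the application-specific words, and the idea of listing `_english_` next to custom words in one array are assumptions. The English set used in the imitation is only a small subset of the real list.

```python
APP_STOPWORDS = ["song", "songs"]  # hypothetical application-specific words

# Settings sketch: assumes the predefined English list and custom words share one array.
STOP_FILTER = {
    "app_stop": {
        "type": "stop",
        "ignore_case": True,
        "stopwords": ["_english_", *APP_STOPWORDS],
    }
}

# Small illustrative subset of English stop words, not the full default list.
ENGLISH_SUBSET = {"a", "an", "and", "by", "for", "in", "of", "the"}


def remove_stopwords(tokens, ignore_case=True):
    """Drop tokens found in the stop word set, optionally ignoring case."""
    stopwords = ENGLISH_SUBSET | set(APP_STOPWORDS)
    kept = []
    for token in tokens:
        candidate = token.lower() if ignore_case else token
        if candidate not in stopwords:
            kept.append(token)
    return kept
```

With `ignore_case` enabled, `remove_stopwords(["Songs", "of", "Bathiya"])` keeps only `Bathiya`; with it disabled, the capitalised `Songs` would survive.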
- Field boosting
- Certain keywords and named entities have been used to boost the relevant fields such that they count more towards the relevance score of the query.
- Using words like 'written' in the search will boost the lyricist field.
- Using words like 'sung', 'performed' will boost the artist field.
- Using words like 'metaphors' in the search will boost the source and the target fields.
- Using words like 'meaning' will boost the meaning field.
- Using names in the named entity set will boost the artist and/or the lyricist fields accordingly. For example, using the word 'Bathiya' in the search query will boost the artist field.
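
The boosting rules above could be turned into a query roughly as follows. The keyword-to-field pairs and the 'Bathiya' example are from the list; the boost factor of 2, the choice of a `multi_match` query, the searched field list, and the function names are assumptions made for this sketch.

```python
SEARCH_FIELDS = [
    "title", "artist", "album", "lyricist", "lyrics",
    "metaphor", "meaning", "source", "target",
]

# Keyword triggers taken from the rules above.
KEYWORD_BOOSTS = {
    "written": ["lyricist"],
    "sung": ["artist"],
    "performed": ["artist"],
    "metaphors": ["source", "target"],
    "meaning": ["meaning"],
}

# Tiny stand-in for the named entity set; only 'Bathiya' is from the text.
NAMED_ENTITIES = {"bathiya": ["artist"]}

BOOST = 2.0  # assumed boost factor


def boosted_fields(query):
    """Return field specs such as 'artist^2' based on trigger words in the query."""
    boosts = {field: 1.0 for field in SEARCH_FIELDS}
    for word in query.lower().split():
        for field in KEYWORD_BOOSTS.get(word, []) + NAMED_ENTITIES.get(word, []):
            boosts[field] = BOOST
    return [f"{field}^{boost:g}" for field, boost in boosts.items()]


def build_query(query):
    """Build a multi_match query body with the boosted field list."""
    return {
        "query": {
            "multi_match": {"query": query, "fields": boosted_fields(query)}
        }
    }
```

For a query like "songs sung by Bathiya", the artist field is listed as `artist^2` while the remaining fields keep a boost of 1, so matches on the artist count more towards the relevance score.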