Problem Statement

Recurrence: Group similar incidents together, and find the incidents most similar to any given incident.

Model’s Parameters:

  • TopRecSent: Sets the threshold for showing the top recurring sentences. By default it is set to 100.
  • SimilarityPer: Sets the threshold for the similarity percentage. By default it is 0.6, which means sentences with more than 60% similarity are grouped together.
  • DateColumn: This is required to prepare the final dataframe.
  • IncidentDescription: This is the text column on which we are trying to find the recurrence pattern.
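
As a rough illustration, the four parameters could be held in one small configuration object. This is only a sketch: the class name RecurrenceConfig, the validation rules, and the example column names ("opened_at", "description") are assumptions, and only the parameter names and the two defaults come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecurrenceConfig:
    # Hypothetical container; only the four field names and the two
    # default values are taken from the parameter list.
    DateColumn: str            # column used to prepare the final dataframe
    IncidentDescription: str   # text column searched for recurrence
    TopRecSent: int = 100      # threshold for the top recurring sentences
    SimilarityPer: float = 0.6 # sentences above this similarity are grouped

    def __post_init__(self):
        # Assumed sanity checks, not stated in the original notes.
        if not 0.0 <= self.SimilarityPer <= 1.0:
            raise ValueError("SimilarityPer must be between 0 and 1")
        if self.TopRecSent < 1:
            raise ValueError("TopRecSent must be at least 1")


# Example column names are made up for illustration.
config = RecurrenceConfig(DateColumn="opened_at", IncidentDescription="description")
```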

Model Preparation Steps:

    1. Data cleaning.
    1. Sentence embedding (TF-IDF).
    1. Cosine similarity between the vectors generated by TF-IDF (matrix size NxN).
    1. Creating a new matrix that keeps only the pairs with more than 60% similarity.
    1. Creating a dataframe with the details below:

Sentence Text, Repeated Sentences Index, Repeated Count
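
The five preparation steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the actual implementation: the cleaning rule (lowercase, alphanumeric tokens only), the smoothed IDF formula, the choice to exclude a sentence from its own repeat list, and all function names are assumptions. A list of dicts stands in for the dataframe, and in practice a library vectorizer such as scikit-learn's TfidfVectorizer would likely be used instead of the hand-written one.

```python
import math
import re
from collections import Counter


def clean(text):
    # Step 1: lowercase and keep alphanumeric tokens (an assumed cleaning rule).
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    # Step 2: one sparse TF-IDF vector (term -> weight) per tokenized document.
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}  # smoothed IDF
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def cosine(a, b):
    # Step 3: cosine similarity between two sparse vectors.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def recurrence_table(sentences, similarity_per=0.6):
    vectors = tfidf_vectors([clean(s) for s in sentences])
    n = len(sentences)
    # Full N x N similarity matrix.
    sim = [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
    rows = []
    for i in range(n):
        # Step 4: keep only the other sentences above the threshold.
        repeated = [j for j in range(n) if j != i and sim[i][j] > similarity_per]
        # Step 5: one output row per sentence.
        rows.append({
            "Sentence Text": sentences[i],
            "Repeated Sentences Index": repeated,
            "Repeated Count": len(repeated),
        })
    return rows


# Made-up incident descriptions, for illustration only.
incidents = [
    "Database server is down",
    "database server is down!",
    "User cannot reset password",
]
table = recurrence_table(incidents)
```

Here the first two descriptions clean to the same tokens, so each lists the other as a repeat, while the third shares no terms with them and has a repeated count of zero.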

Studied Approach:

    1. After the 5 steps above, taking all the data with a repeated count of less than 10.
    1. Creating sentence embeddings using TF-IDF.
    1. Cosine similarity between the vectors.
    1. Creating a new matrix that keeps only the pairs with more than 60% similarity.
    1. Creating a dataframe as per step 5.
    1. Taking the data with a repeated count of less than 10 and adding the rest to the dataframe created in step 5.
    1. Repeating steps 6 to 10 in a loop.

Final Approach:

    1. Sentence embedding (TF-IDF) of the Sentence Text column above.
    1. Cosine similarity between the vectors generated by TF-IDF (matrix size NxN).
    1. Creating a new matrix that keeps only the pairs with more than 60% similarity.
    1. Creating the final dataframe with the details below:

Sentence Text, Repeated Sentences Index, Repeated Count, Recent Date
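
The notes do not say how the Recent Date column is derived. One plausible reading, sketched below, is the latest date among a sentence and the sentences it repeats. That interpretation, the function name add_recent_date, and the sample rows and dates are all assumptions made for illustration.

```python
from datetime import date


def add_recent_date(rows, dates):
    # rows: dicts carrying "Repeated Sentences Index";
    # dates: one date per sentence, aligned by position.
    out = []
    for i, row in enumerate(rows):
        # Assumed rule: consider the sentence itself plus all of its repeats.
        members = [i] + list(row["Repeated Sentences Index"])
        out.append({**row, "Recent Date": max(dates[j] for j in members)})
    return out


# Made-up rows and dates, for illustration only.
rows = [
    {"Sentence Text": "Database server is down", "Repeated Sentences Index": [1], "Repeated Count": 1},
    {"Sentence Text": "database server is down!", "Repeated Sentences Index": [0], "Repeated Count": 1},
    {"Sentence Text": "User cannot reset password", "Repeated Sentences Index": [], "Repeated Count": 0},
]
dates = [date(2024, 1, 5), date(2024, 3, 2), date(2024, 2, 10)]
final = add_recent_date(rows, dates)
```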

Similarity Search (Prediction):

    1. Taking input from the user (text data).
    1. Creating an embedding of the input sentence.
    1. Cosine similarity between the input sentence and the Sentence Text of the final dataframe (step 9).
    1. Showing the top 3 similar sentences in the format below:

Sentence Text, Repeated Sentences Index, Repeated Count, Recent Date
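
The prediction steps can be sketched as below. This is a hedged illustration: it assumes the query is embedded with the IDF weights fitted on the stored Sentence Text values (query terms never seen there are ignored), and the function names, tokenization rule, and sample rows are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumed cleaning rule: lowercase, alphanumeric tokens only.
    return re.findall(r"[a-z0-9]+", text.lower())


def fit_idf(docs):
    # Smoothed IDF learned from the stored sentences.
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    return {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}


def embed(tokens, idf):
    # Terms unseen at fit time are dropped so the query shares the same vector space.
    return {t: tf * idf[t] for t, tf in Counter(tokens).items() if t in idf}


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def top_similar(query, rows, k=3):
    docs = [tokenize(r["Sentence Text"]) for r in rows]
    idf = fit_idf(docs)
    q = embed(tokenize(query), idf)
    scored = [(cosine(q, embed(d, idf)), r) for d, r in zip(docs, rows)]
    # Stable sort: ties keep their original row order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in scored[:k]]


# Made-up rows; the other final-dataframe columns are omitted for brevity.
rows = [
    {"Sentence Text": "Database server is down"},
    {"Sentence Text": "User cannot reset password"},
    {"Sentence Text": "Printer is out of paper"},
    {"Sentence Text": "VPN connection keeps dropping"},
]
results = top_similar("the database server went down", rows)
```

Because the returned items are the stored rows themselves, any extra columns they carry (Repeated Sentences Index, Repeated Count, Recent Date) come back alongside the sentence text.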