  This project is a chatbot for patient screening in hospitals. It distinguishes COVID cases from other conditions with similar symptoms, such as allergies, colds, and flu. At the end of the conversation, it generates a medical record for the patient with personal information (name, age, occupation), the symptoms, the number of days the person has been sick, and the result of the prediction made using artificial intelligence techniques.
  To use the chatbot, it is necessary to run randomforest_knn.py and trainingBot.py before running chatBot.py.


  I used a dataset available on Kaggle that contains illness cases with their symptoms, each classified as allergy, cold, COVID, or flu. The dataset is imbalanced: it has many more allergy and flu cases than COVID and cold cases. This is a problem because it can lead to many "false alarms," with the diagnosis leaning heavily towards allergy and flu due to their higher sample count. The symptoms of the four illnesses are also quite similar, which makes it hard to judge how effective a model really is. I therefore applied under-sampling with RandomUnderSampler() from the imblearn.under_sampling module.
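
The idea behind random under-sampling can be sketched with the standard library alone. This is not the imblearn implementation, only an illustration of the technique: every class is randomly reduced to the size of the smallest class. The function name, the seed, and the toy class sizes are assumptions made for this example and are not the Kaggle counts.

```python
import random
from collections import Counter, defaultdict

def random_under_sample(rows, labels, seed=42):
    """Randomly drop samples until every class matches the smallest class."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    # The smallest class sets the target size for all classes.
    target = min(len(members) for members in by_class.values())
    rng = random.Random(seed)

    out_rows, out_labels = [], []
    for label, members in sorted(by_class.items()):
        # rng.sample draws `target` distinct rows without replacement.
        for row in rng.sample(members, target):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

# Hypothetical imbalanced toy data (class sizes are made up for illustration).
labels = ["allergy"] * 50 + ["flu"] * 40 + ["covid"] * 8 + ["cold"] * 5
rows = [[i] for i in range(len(labels))]

balanced_rows, balanced_labels = random_under_sample(rows, labels)
counts = Counter(balanced_labels)
print(counts)
```

After resampling, each class contributes the same number of rows, so a classifier can no longer score well simply by favoring the majority classes. The cost is that most of the majority-class rows are discarded.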

Without under sampling

With under sampling
  To address the classification problem, I used two techniques: Random Forest and K-Nearest Neighbors. For Random Forest, I set the number of decision trees to 10, the split criterion to entropy, and random_state to 42. For K-Nearest Neighbors, I set the number of neighbors to 10 and used the Euclidean metric. The models are trained in the file randomforest_knn.py. After training, the program automatically saves the models with the joblib library, as they are used for classification in the chatBot.py file.
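
To make the K-Nearest Neighbors settings concrete, here is a minimal standard-library sketch of the prediction rule with 10 neighbors and the Euclidean metric. It does not reproduce randomforest_knn.py; the function name and the toy symptom vectors are assumptions made for this example.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=10):
    """Return the majority label among the k training points closest to query."""
    # math.dist computes the Euclidean distance between two points.
    ranked = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], query))
    nearest_labels = [label for _, label in ranked[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical binary symptom vectors: [fever, cough, sneezing, itchy eyes].
# Labels follow the mapping used in this README: 0 = Allergy, 2 = COVID.
train_X = [[0, 0, 1, 1]] * 12 + [[1, 1, 0, 0]] * 12
train_y = [0] * 12 + [2] * 12

print(knn_predict(train_X, train_y, [0, 0, 1, 1]))
print(knn_predict(train_X, train_y, [1, 1, 0, 0]))
```

Because the vote is taken over the 10 closest samples, a class that dominates the training data can outvote the true class near a decision boundary, which is one reason the balanced dataset matters for this model.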
  For the chatbot itself, I created a simple neural network with three dense layers and two dropout layers. I used Adam as the optimizer with a learning rate of 0.01, as it was the best optimizer I had found when working on assignment 4 of this course (I also tested SGD, but Adam performed better). The chatbot is trained in the trainingBot.py file and run from the chatBot.py file. I had to implement the chatbot in English because I used the WordNetLemmatizer class from the nltk library, which groups the inflected forms of a word so that they can be analyzed as a single item. For example, the English word 'work' has variants such as 'works', 'worked', and 'working'; lemmatization maps all of these forms to 'work'. I didn't find an easy way to do this in Portuguese, so I apologize if you find any errors in the English.
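
The effect of lemmatization on the chatbot's input can be illustrated with a toy lookup table. This is only a sketch: the table and function names are assumptions made for this example, and the real project relies on nltk, which consults a full lexical database rather than a hand-written dictionary.

```python
# Toy lemma table (an assumption for illustration); a real lemmatizer such as
# nltk's WordNetLemmatizer consults a full lexical database instead.
LEMMAS = {
    "works": "work",
    "worked": "work",
    "working": "work",
    "coughs": "cough",
    "coughing": "cough",
}

def lemmatize(token):
    """Map an inflected form to its base form; unknown words pass through."""
    token = token.lower()
    return LEMMAS.get(token, token)

def normalize(sentence):
    """Split on whitespace, strip punctuation, then lowercase and lemmatize."""
    tokens = (word.strip(".,!?") for word in sentence.split())
    return [lemmatize(token) for token in tokens if token]

print(normalize("I worked yesterday and I am working today."))
```

Mapping every variant to one base form shrinks the vocabulary the neural network has to learn, so two messages that differ only in verb tense produce the same input features.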
  Finally, after collecting all the data, such as the patient's personal information and symptoms, the chatBot.py program generates a docx file that serves as the patient's medical record.


   For KNN and Random Forest, I used several metrics to analyze effectiveness, so that we don't rely on a single one and get a false impression of the model's performance. The class labels are: {0: Allergy, 1: Cold, 2: COVID, 3: Flu}
  Random Forest: With an overall accuracy of 93%, the confusion matrix shows that true positives and true negatives occur much more frequently than false positives and false negatives. Flu cases, which are easily confused with COVID, are the most prone to errors (recall of 0.88 for class 3).

    ---------------- Random Forest ----------------
                  precision    recall  f1-score   support

               0       0.96      0.98      0.97       211
               1       0.93      0.94      0.93       211
               2       0.88      0.93      0.91       211
               3       0.96      0.88      0.92       211

        accuracy                           0.93       844
       macro avg       0.93      0.93      0.93       844
    weighted avg       0.93      0.93      0.93       844

Confusion Matrix Random Forest

  K-Nearest Neighbors: With an overall accuracy of 86%, the confusion matrix shows that true positives and true negatives occur much more frequently than false positives and false negatives. Flu cases, which are easily confused with COVID and colds, are the most prone to errors (recall of 0.60 for class 3).

    ---------------- K-Nearest Neighbors ----------------
                  precision    recall  f1-score   support

               0       1.00      0.87      0.93       211
               1       0.78      1.00      0.88       211
               2       0.77      0.96      0.85       211
               3       1.00      0.60      0.75       211

        accuracy                           0.86       844
       macro avg       0.89      0.86      0.85       844
    weighted avg       0.89      0.86      0.85       844

Confusion Matrix KNN
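
As a sanity check on the two reports, the macro averages can be recomputed from the per-class rows. The sketch below copies the rounded values from the reports above, so results agree only to within rounding; the dictionary and function names are assumptions made for this example.

```python
# Per-class (precision, recall, f1) rows copied from the two reports in this README.
# Class labels: 0 = Allergy, 1 = Cold, 2 = COVID, 3 = Flu; each class has 211 test samples.
REPORTS = {
    "Random Forest": [
        (0.96, 0.98, 0.97),
        (0.93, 0.94, 0.93),
        (0.88, 0.93, 0.91),
        (0.96, 0.88, 0.92),
    ],
    "K-Nearest Neighbors": [
        (1.00, 0.87, 0.93),
        (0.78, 1.00, 0.88),
        (0.77, 0.96, 0.85),
        (1.00, 0.60, 0.75),
    ],
}

def macro_average(rows):
    """Unweighted mean of each metric column across classes."""
    n = len(rows)
    return tuple(sum(column) / n for column in zip(*rows))

for name, rows in REPORTS.items():
    precision, recall, f1 = macro_average(rows)
    # With equal support per class, overall accuracy equals macro-averaged recall.
    print(f"{name}: precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Because under-sampling left every class with the same support (211), the macro and weighted averages coincide, and the overall accuracy matches the macro-averaged recall. This is why a single accuracy figure can hide the weak flu recall of the KNN model.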

  For the neural network used in building the chatbot, we achieved an accuracy of 98.25% after training for 200 epochs.
Training complete


   Despite the high accuracy of the models, many flu and cold cases are still confused with COVID. Since COVID is a highly contagious disease that caused a worldwide pandemic, it may not be advisable to rely too heavily on the models to tell COVID, flu, and colds apart, as they are confused more frequently than I expected. However, the models were excellent at distinguishing these cases from allergies, which is valuable because allergy symptoms are very similar to those of COVID.


