preferences_detection_from_text: A Python repository from WrathXL

This project is part of a larger investigation into detecting privacy leaks in human-written text on social media. Here, the aim is to detect a person's preferences and the things they do or like from the text of their conversations. The repository contains:

Annotated dataset
ML models that recognize the defined entities

Some context follows below, but the project is still largely a collection of investigation notes and tests. If you would like to know more about the investigation, feel free to contact the author.

Dataset

The text data was taken from Topical-Chat. The conversations there are very similar in style to those on social networks such as Reddit. The sentences were annotated with the web tool WebAnno, using the scheme below.

Annotation Scheme

Subject
Preference
Activity
Object
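
As an illustration only, the four labels above could be turned into token-level tags for sequence models such as a CRF or BERT. The sketch below assumes a BIO encoding and token-index spans; the helper name `spans_to_bio`, the span format, and the sample sentence are hypothetical and are not taken from the repository or the annotated corpus.

```python
from typing import List, Sequence, Tuple

# The four labels of the annotation scheme.
LABELS = ("Subject", "Preference", "Activity", "Object")

# A span is (start_token, end_token_exclusive, label). This format is an
# assumption made for the sketch, not the WebAnno export format.
Span = Tuple[int, int, str]


def spans_to_bio(tokens: Sequence[str], spans: Sequence[Span]) -> List[str]:
    """Convert labeled token spans into one BIO tag per token."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label}")
        if not 0 <= start < end <= len(tokens):
            raise ValueError(f"span out of range: {(start, end)}")
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


# Invented sentence, annotated under one possible reading of the scheme.
tokens = ["My", "sister", "enjoys", "science", "fiction"]
spans = [(0, 2, "Subject"), (2, 3, "Preference"), (3, 5, "Object")]
tags = spans_to_bio(tokens, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Keeping the span format separate from the tag sequence makes it possible to swap in a different tagging convention (for example BILOU) without touching the annotations themselves.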

Examples:

Data Processing

Several steps for cleaning, organizing, and formatting the data were applied; they can be found in scripts and the corpus_porcess file.

Models

The first approach is a CRF (conditional random field), which serves as a baseline model and is evaluated at the entity level.
The second approach fine-tunes a pre-trained BERT model. Its results are very promising and can be found in the logs folder.
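
To make "entity-level evaluation" concrete, here is a minimal sketch of one common way to compute it: entities are extracted from BIO tag sequences, and a prediction counts as correct only if both its boundaries and its label match a gold entity exactly. The BIO encoding, the exact-match criterion, and the function names are assumptions for illustration; the repository may score its models differently.

```python
from typing import List, Sequence, Set, Tuple

Entity = Tuple[int, int, int, str]  # (sentence index, start, end_exclusive, label)


def extract_entities(sentences: Sequence[Sequence[str]]) -> Set[Entity]:
    """Collect all entities from a list of BIO-tagged sentences."""
    entities: Set[Entity] = set()
    for s_idx, tags in enumerate(sentences):
        start, label = None, None
        # A trailing "O" sentinel closes an entity that ends the sentence.
        for i, tag in enumerate(list(tags) + ["O"]):
            continues = tag.startswith("I-") and label == tag[2:]
            if continues:
                continue
            if label is not None:
                entities.add((s_idx, start, i, label))
            start, label = None, None
            # A stray "I-" tag is leniently treated as the start of an entity.
            if tag.startswith(("B-", "I-")):
                start, label = i, tag[2:]
    return entities


def entity_prf(gold: Sequence[Sequence[str]],
               pred: Sequence[Sequence[str]]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over entities."""
    gold_e, pred_e = extract_entities(gold), extract_entities(pred)
    tp = len(gold_e & pred_e)
    precision = tp / len(pred_e) if pred_e else 0.0
    recall = tp / len(gold_e) if gold_e else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented tag sequences: the predicted Object span is one token too short.
gold: List[List[str]] = [["B-Subject", "O", "B-Preference", "B-Object", "I-Object"]]
pred: List[List[str]] = [["B-Subject", "O", "B-Preference", "B-Object", "O"]]
precision, recall, f1 = entity_prf(gold, pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Under this criterion a partially overlapping span earns no credit, which is stricter than token-level accuracy and therefore a more demanding comparison between the CRF baseline and the BERT model.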
