
Entss: Entailment Classification and Semantic Scaling

Entss is a library for inferring political beliefs from text using transformers and Bayesian IRT. It provides tools to simplify the process of cleaning, labeling, and modeling your data so that you don't have to be an expert in NLP or Bayesian statistics, and can save you a lot of programming even if you are!

Under the hood, Entss uses zero-shot stance detection via entailment classification to label documents based on the beliefs they express, and estimates the ideology of document authors with semantic scaling.

Entss is a modular package designed around three classes, the Cleaner(), Classifier(), and Scaler(). Sensible defaults are provided for each class to enable fast inference, but users can also pass their own cleaning functions, models, etc. to the classes.

Installation

Entss relies on PyTorch to label documents and CmdStan to estimate ideal points. It's highly recommended that you set these up in a conda environment before using Entss (and I recommend the Mamba version of Conda). If you want to use the Scaler() you will need an install of CmdStan. The CmdStanPy installation instructions recommend installing CmdStanPy with Conda/Mamba, which will automatically install CmdStan. For example, the following command will create a new environment called 'entss' with CmdStanPy and CmdStan installed:

conda create -n entss -c conda-forge cmdstanpy

If you want to use GPU acceleration for labeling documents, make sure to follow the PyTorch installation instructions for your version of CUDA and install it to your newly created environment.
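After installing PyTorch, you can quickly confirm that it can see your GPU:

# Returns True if PyTorch can use a CUDA device
import torch
print(torch.cuda.is_available())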

Entss can be installed with:

pip install git+https://github.com/MLBurnham/entss.git#egg=entss

A Minimal Example

import entss as en
df = en.load_newsletters()
df.head()
targets = ['biden', 'trump']
dimensions = ['supports', 'opposes', 'neutral']

# Clean
mrclean = en.Cleaner(keyword_list = targets)
df = mrclean.clean(df, synonyms = False, scrub = True, split = True, keywords = True)

# Label
mturk = en.Classifier(targets = targets, dimensions = dimensions)
df = mturk.label(df, aggregate_on = 'Last Name')

# Model
lizardy = en.Scaler()
fit, summary = lizardy.stan_fit(df, targets = targets, dimensions = ['supports', 'opposes'],
                                left_init_cols = 'trump_opposes', right_init_cols = 'trump_supports',
                                summary = True)

Getting Started

Entss uses zero-shot entailment classification to detect the expressed beliefs in a document. Entailment is a classification task that determines the logical relationship between two sentences. For example, the sentence:

Cats like all sausages.

Entails the sentence:

Cats like salami as a treat.

Contradicts the sentence:

Cats hate pepperoni.

and is neutral to the sentence:

Dogs like salami.

By pairing documents about political topics (e.g. "I'm voting for Biden in 2024") with statements about the author's beliefs (e.g. "The author of this text supports Biden"), we can use entailment classification to infer how many times someone in our dataset expressed support for a political position.
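To see what this looks like in practice, here is a minimal sketch of the same idea using the Hugging Face zero-shot pipeline directly. This is an illustration rather than Entss's internal code, and the model checkpoint is just an example:

from transformers import pipeline

# Any model trained for NLI / zero-shot classification will work here
nli = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

doc = "I'm voting for Biden in 2024"
result = nli(doc,
             candidate_labels = ["supports", "opposes", "is neutral towards"],
             hypothesis_template = "The author of this text {} Biden")
print(result["labels"][0])  # the best-scoring label, e.g. "supports"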

Entss comes with a sample dataset of newsletters sent by members of Congress:

import entss as en
df = en.load_newsletters()
df.head()
text Last Name
0 News from Congressman John Moolenaar My team ... Moolenaar
1 News from Congressman Brian Mast HONORING THE... Mast
2 A message from Congresswoman Ann Wagner About... Wagner
3 Dear , Happy New Year! I hope your holiday s... O’Halleran
4 Commitment to service is part of the tapestry... Palazzo

The first step is to determine which issues or "targets" you are interested in (e.g. biden, trump, abortion etc.) and along which dimensions of belief you want to label documents (e.g. support, oppose, neutral).

targets = ['biden', 'trump']
dimensions = ['supports', 'opposes', 'neutral']

Data Cleaning

Entailment classification can require a lot of data preparation. Entss streamlines this process with the Cleaner() class. The class can scrub text of URLs and text artifacts, split sentences, tag documents for targets or keywords they contain, and locate synonyms for your keywords.

In this example, we split newsletters into sentences, scrub the text, and label each sentence that contains a mention of Biden or Trump.

mrclean = en.Cleaner(keyword_list = targets)

df = mrclean.clean(df, synonyms = False, scrub = True, split = True, keywords = True)

df.head()
doc_num text Last Name biden trump
0 0 News from Congressman John Moolenaar My team a... Moolenaar 0 0
1 0 Starting on December 27 , the new phone number... Moolenaar 0 0
2 0 Then, on January 3 , the new office address wi... Moolenaar 0 0
3 0 We look forward to continuing that work in the... Moolenaar 0 0
4 0 You can also submit your information using thi... Moolenaar 0 0

The resulting dataframe contains a doc_num index indicating which document each sentence belongs to and a binary column for each keyword indicating whether the sentence mentions it.

Document Labeling

The classifier uses a template to generate belief statements about our targets and dimensions that we use for entailment classification. The default template is "The author of this text {{dimension}} {{target}}", but you can also supply your own. In our example, the Classifier() will generate the following statements about Biden:

The author of this text supports Biden
The author of this text opposes Biden
The author of this text is neutral towards Biden

Each sentence that contains the word Biden will be paired with these statements, and a zero-shot entailment classifier will choose the best label.
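For illustration, generating the statements amounts to filling in the template for every target-dimension pair. Entss does this for you when the Classifier() is instantiated; the snippet below is a sketch of the idea, not the library's internal code:

# Roughly how belief statements are generated from the template
template = "The author of this text {dimension} {target}"
statements = [template.format(dimension = d, target = t)
              for t in targets for d in ['supports', 'opposes']]
# Note: the 'neutral' dimension is phrased as "is neutral towards",
# as in the generated statements shown above.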

# Belief statements are automatically generated when the class is instantiated.
mturk = en.Classifier(targets = targets, dimensions = dimensions)
# If you pass a column name to the aggregate_on argument, the classifier will
# group the data on that column and produce aggregate counts. Otherwise a
# dataframe with document labels is returned.
df = mturk.label(df, aggregate_on = 'Last Name')

You can pass any model from the HuggingFace Hub to the classifier, but it is recommended you use a model trained for zero-shot entailment classification.
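For example, assuming the Classifier() accepts a model name at instantiation (the argument name model_name below is illustrative and may differ in your version of Entss):

# DeBERTa-v3 checkpoints trained on NLI data work well for entailment
mturk = en.Classifier(targets = targets, dimensions = dimensions,
                      model_name = 'MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli')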

We now have a dataframe showing how many times each person in our dataset expressed a given opinion about Biden or Trump:

df.head()
Last Name doc_num biden trump biden_neutral biden_opposes biden_supports trump_neutral trump_opposes trump_supports
0 Adams 5264 0 0 0.0 0.0 0.0 0.0 0.0 0.0
1 Allen 11997 1 0 0.0 0.0 1.0 0.0 0.0 0.0
2 Amodei 22262 1 0 0.0 1.0 0.0 0.0 0.0 0.0
3 Armstrong 12587 0 0 0.0 0.0 0.0 0.0 0.0 0.0
4 Auchincloss 6720 1 1 0.0 0.0 1.0 0.0 1.0 0.0

Scaling

Entss uses a Bayesian IRT model to estimate ideology based on how many of a person's documents expressed a particular belief. When we instantiate the model we can specify the number of chains, how many chains to run in parallel, and whether we want to run a multi-threaded model. Here we will just use the defaults.
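If you do want to override the defaults, something like the following should work. These keyword names mirror CmdStanPy's sample() arguments and are an assumption about the Scaler() API, so check the docstring for your version:

# Run 4 chains in parallel, with 2 threads per chain
lizardy = en.Scaler(chains = 4, parallel_chains = 4, threads_per_chain = 2)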

You can pass a dataframe to the Scaler(), or if you have data already formatted for Stan you can pass that as a dictionary. If passing a dataframe you need to specify columns to calculate the initial values. These are columns that you expect people on the left end of the scale to have higher values for (left_init_cols) and columns you expect people on the right end of the scale to have higher values for (right_init_cols).

lizardy = en.Scaler()

fit, summary = lizardy.stan_fit(df, targets = targets, dimensions = ['supports', 'opposes'], 
                              left_init_cols = 'trump_opposes', right_init_cols = 'trump_supports', 
                              summary = True)

stan_fit() will return a Stan fit object you can use to evaluate the model and extract ideal point estimates. If summary = True, it will also return a dataframe of parameter estimates, standard deviations, and R-hats.
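Since fit is a CmdStanPy object, the usual CmdStanPy methods apply. For example, you could extract posterior draws for the ideal points like this (the parameter name 'theta' is an assumption about how the Stan model names the ideal points):

# Posterior draws for the ideal points, one column per author
theta = fit.stan_variable('theta')
ideal_points = theta.mean(axis = 0)  # posterior mean ideal point per author
print(summary.head())                # estimates, standard deviations, R-hats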