Primary language: R. License: Creative Commons Attribution 4.0 International (CC-BY-4.0).

gzipr

Lifecycle: experimental

The goal of gzipr is to provide a generalized R implementation of the method described in “Less is More: Parameter-Free Text Classification with Gzip” by Zhiying Jiang, Matthew Y.R. Yang, Mikhail Tsirlin, Raphael Tang, and Jimmy Lin (https://arxiv.org/abs/2212.09410). The implementation works for tabular data, text data, and general lists of arbitrary (similarly structured) observations.

Quoting the abstract of the original paper:

Deep neural networks (DNNs) are often used for text classification tasks as they usually achieve high levels of accuracy. However, DNNs can be computationally intensive with billions of parameters and large amounts of labeled data, which can make them expensive to use, to optimize and to transfer to out-of-distribution (OOD) cases in practice. In this paper, we propose a non-parametric alternative to DNNs that’s easy, light-weight and universal in text classification: a combination of a simple compressor like gzip with a k-nearest-neighbor classifier. Without any training, pre-training or fine-tuning, our method achieves results that are competitive with non-pretrained deep learning methods on six in-distributed datasets. It even outperforms BERT on all five OOD datasets, including four low-resource languages. Our method also performs particularly well in few-shot settings where labeled data are too scarce for DNNs to achieve a satisfying accuracy.
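To make the idea in the abstract concrete, here is a minimal base-R sketch of a gzip-based nearest-neighbor classifier. It is not gzipr's implementation. It assumes the Normalized Compression Distance (NCD) is used as the distance, with observations joined by a space before compression. The helper names `clen`, `ncd`, and `knn_ncd` are hypothetical and exist only for illustration.

```r
# Compressed size in bytes, using base R's in-memory gzip compressor.
clen <- function(x) length(memCompress(charToRaw(x), type = "gzip"))

# Normalized Compression Distance between two strings:
# (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
ncd <- function(x, y) {
  cx <- clen(x)
  cy <- clen(y)
  cxy <- clen(paste(x, y))  # assumption: join with a single space
  (cxy - min(cx, cy)) / max(cx, cy)
}

# Hypothetical helper: k-nearest-neighbor prediction by majority vote.
knn_ncd <- function(train_x, train_y, test_x, k = 1) {
  vapply(test_x, function(obs) {
    # Distance from this test observation to every training observation.
    d <- vapply(train_x, ncd, numeric(1), y = obs, USE.NAMES = FALSE)
    # Labels of the k closest training observations.
    nearest <- train_y[order(d)[seq_len(k)]]
    # Most frequent label among them (ties go to the first table entry).
    names(which.max(table(nearest)))
  }, character(1), USE.NAMES = FALSE)
}

train_x <- c("I love to play football", "Basketball is my favorite sport")
train_y <- c("A", "B")
pred <- knn_ncd(train_x, train_y, c("I enjoy football"), k = 1)
```

There is no training step: all the work happens at prediction time, which costs one compression per pair of training and test observations.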

Features

  • {gzipr} works with data frames, lists, and text data.
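A compressor operates on bytes, so each observation has to be turned into a string before it can be compared with another. How gzipr encodes rows and list elements is not described here. The sketch below shows only one plausible encoding, and `as_text` is a hypothetical helper, not part of the package.

```r
# Hypothetical helper: turn each observation into a single string.
as_text <- function(x) {
  if (is.data.frame(x)) {
    # One string per row, with column values separated by a space.
    return(do.call(paste, c(unname(as.list(x)), sep = " ")))
  }
  # Lists and character vectors: one string per element.
  vapply(
    x,
    function(el) paste(as.character(unlist(el)), collapse = " "),
    character(1),
    USE.NAMES = FALSE
  )
}

rows <- as_text(head(mtcars, 3))           # three strings, one per row
items <- as_text(list(c(1, 2), c(3, 4)))   # one string per list element
texts <- as_text(c("a", "b"))              # character input is unchanged
```

With every input type reduced to a character vector, the same compression-based distance can be applied to all three.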

Installation

You can install the development version of {gzipr} like so:

# install.packages("devtools")
devtools::install_github("CorradoLanera/gzipr")

Example

The basic example below fits a gzipr model on tabular data and on text data, then predicts labels for held-out observations:

library(gzipr)

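## Tabular data
# Use the first 20 rows of mtcars for training and the remaining 12 for testing
train <- mtcars[1:20, ]
test <- mtcars[21:32, ]

# Column 2 (cyl) is the label; all other columns are predictors
train_x <- train[, -2]
train_y <- train[[2]]

test_x <- test[, -2]
test_y <- test[[2]]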

model <- gzipr(train_x, train_y)
result <- predict(model, test_x)

# accuracy
mean(result == test_y)
#> [1] 0.75


## Text data
train_text_x <- c(
  "I love to play football",
  "Basketball is my favorite sport",
  "Football is amazing",
  "Basketball rules"
)

train_text_y <- c(
  "A",
  "B",
  "A",
  "B"
)

test_text_x <- c("I enjoy football", "Basketball is the best")

model_text <- gzipr(train_text_x, train_text_y)
res_text <- predict(model_text, test_text_x)
res_text
#>       I enjoy football Basketball is the best 
#>                    "A"                    "B"

Code of Conduct

Please note that the gzipr project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.