DaTikZ Dataset

DaTikZ is a dataset containing a wide variety of TikZ drawings. It is intended to support research and development of machine learning models that can generate or manipulate vector graphics in LaTeX.

There are two main distributions publicly available: DaTikZ_v1 (introduced in AutomaTikZ) and DaTikZ_v2 (introduced in DeTikZify). In compliance with licensing agreements, certain TikZ drawings are excluded from these public versions of the dataset. This repository provides tools and methods to recreate the complete dataset from scratch.

Note

The datasets you produce might vary slightly from the originally created ones, as the sources used for crawling are subject to continuous updates.

Installation

DaTikZ relies on a full TeX Live installation and also requires ghostscript and poppler. Python dependencies can be installed as follows:

pip install -r requirements.txt
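Before starting a long run, it can help to confirm that the external tools are reachable on PATH. The following is a minimal sketch and not part of the repository: the helper missing_tools is hypothetical, and the executable names (pdflatex for TeX Live, gs for ghostscript, pdftoppm for poppler) are assumptions about which binaries matter.

```python
import shutil

# Assumed executable names: pdflatex (TeX Live), gs (ghostscript),
# pdftoppm (poppler). Which binaries DaTikZ actually invokes is not
# stated in this README.
REQUIRED_TOOLS = ("pdflatex", "gs", "pdftoppm")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing executables:", ", ".join(missing))
    else:
        print("All assumed prerequisites were found on PATH.")
```

The lookup function is passed in as a parameter so the check can be exercised without touching the real PATH.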

For processing arXiv source files (optional), you additionally need to preprocess arXiv bulk data using arxiv-latex-extract.

Usage

To generate the dataset, run the main.py script. Use the --help flag to view the available options. The commands for the official distributions are as follows:

DaTikZ_v1: main.py --arxiv_files "${DATIKZ_ARXIV_FILES[@]}" --size 334 --captionize
DaTikZ_v2: main.py --arxiv_files "${DATIKZ_ARXIV_FILES[@]}" --size 386

In these commands, DATIKZ_ARXIV_FILES should be a shell array variable (it is expanded with "${DATIKZ_ARXIV_FILES[@]}") containing the paths to either the jsonl files obtained with the arxiv-latex-extract utility, or archives that include these files.
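For runs driven from Python rather than a shell, the same invocation can be assembled as an argument list. This is only a sketch: build_command is a hypothetical helper, the arxiv-extract directory is a placeholder for wherever the arxiv-latex-extract output lives, and only the flags shown in the commands above (--arxiv_files, --size, --captionize) are used.

```python
import sys
from pathlib import Path


def build_command(arxiv_files, size, captionize=False, script="main.py"):
    """Assemble the argument list for main.py from the flags shown above."""
    cmd = [sys.executable, script, "--arxiv_files"]
    cmd.extend(str(path) for path in arxiv_files)
    cmd.extend(["--size", str(size)])
    if captionize:
        cmd.append("--captionize")
    return cmd


if __name__ == "__main__":
    # Hypothetical location of the arxiv-latex-extract output.
    files = sorted(Path("arxiv-extract").glob("*.jsonl"))
    cmd = build_command(files, size=386)  # settings listed for DaTikZ_v2
    print(" ".join(cmd))
    # To launch the run, pass the list to subprocess.run(cmd, check=True).
```

Passing the files as separate list items mirrors what the "${DATIKZ_ARXIV_FILES[@]}" expansion does in the shell: each path becomes its own argument, so paths containing spaces stay intact.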

When executed successfully, the script generates the following output files:

datikz-raw.parquet: The raw, unsplit dataset without additional augmentation.
datikz-train.parquet: The training split of the DaTikZ dataset.
datikz-test.parquet: The test split consisting of 1k items.
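A quick way to sanity-check a finished run is to look for these files and print their size and columns. The sketch below is illustrative only: find_outputs and summarize are hypothetical helpers, it assumes the files were written to the current working directory, and summarize needs pandas together with a Parquet engine such as pyarrow. No column names are assumed; the script simply prints whatever it finds.

```python
from pathlib import Path

# File names as listed above.
OUTPUT_FILES = (
    "datikz-raw.parquet",
    "datikz-train.parquet",
    "datikz-test.parquet",
)


def find_outputs(directory="."):
    """Map each expected output file name to its path, if the file exists."""
    base = Path(directory)
    return {name: base / name for name in OUTPUT_FILES if (base / name).is_file()}


def summarize(path):
    """Print the row count and column names of one Parquet file."""
    import pandas as pd  # imported lazily so find_outputs works without pandas

    frame = pd.read_parquet(path)
    print(f"{path}: {len(frame)} rows, columns: {list(frame.columns)}")


if __name__ == "__main__":
    # Assumes the output directory is the current working directory.
    for path in find_outputs().values():
        summarize(path)
```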
