Webis-STEREO-21 Corpus

This is the code repository containing all resources used to build the Webis-STEREO-21 corpus on scientific text reuse in open access publications.

It consists of general-purpose Spark jobs for scalable text reuse detection in large document collections.

Organization

Each stage of the pipeline is defined as a separate file in the jobs directory. Alongside, each job has an associated submit script in the scripts directory that handles resource allocation on a Spark cluster. The tools directory contains the source code for the standalone alignment component (written in Go) and a standalone converter for Grobid output. The analysis-example.ipynb notebook features an exemplary analysis of the Webis-STEREO-21 corpus as a starting point to facilitate data reuse.

Usage

Each job can be invoked to run either locally or on the cluster. See the makefile for available predefined targets.

Resource allocation is handled by the submit script associated with each job.

Jobs

Data Processing Pipeline

1. Preprocess

make preprocess-cluster (cluster mode using the corresponding submit script) or make preprocess-local (local mode)

Reads the stereo document collection from S3 and converts it to a standardized (id, content) format.

| Parameter | Description | Default |
|-----------|-------------|---------|
| `input_path` | Path to read data from | `<YOUR INPUT GROBID DUMP HERE>` |
| `output_path` | Path to write data to | `stereo-grobid-preprocessed.parquet` |
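
A minimal sketch of this step, assuming the dump can be read as JSON and exposes doi and text fields; the actual schema depends on the Grobid converter in the tools directory:

```python
# Hedged sketch of the preprocess step; the field names "doi" and "text"
# are assumptions, not the actual schema of the Grobid dump.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("preprocess-sketch").getOrCreate()

raw = spark.read.json("s3a://<YOUR INPUT GROBID DUMP HERE>")

# Standardize to the (id, content) schema consumed by all downstream jobs.
standardized = raw.select(F.col("doi").alias("id"),
                          F.col("text").alias("content"))
standardized.write.parquet("stereo-grobid-preprocessed.parquet")
```
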
2. Filter (Optional)

make filter-cluster (cluster mode using the corresponding submit script) or make filter-local (local mode)

Reads the preprocessed stereo document collection and filters it down to the subset specified by the supplied list of DOIs.

| Parameter | Description | Default |
|-----------|-------------|---------|
| `input_path` | Path to read data from | `stereo-grobid-preprocessed.parquet` |
| `output_path` | Path to write data to | `stereo-filtered.parquet` |
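
A minimal sketch of the filtering, assuming the DOI list is a plain-text file with one DOI per line (the file name dois.txt is hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("filter-sketch").getOrCreate()

docs = spark.read.parquet("stereo-grobid-preprocessed.parquet")
# Hypothetical DOI list: one DOI per line, matching the documents' id column.
dois = spark.read.text("dois.txt").withColumnRenamed("value", "id")

# A left-semi join keeps exactly the documents whose id appears in the list.
filtered = docs.join(dois, on="id", how="left_semi")
filtered.write.parquet("stereo-filtered.parquet")
```
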
3. Vectorize

make vectorize-cluster (cluster mode using the corresponding submit script) or make vectorize-local (local mode)

Splits each document into sequential fixed-size chunks and represents each chunk as a binary term vector.

| Parameter | Description | Default |
|-----------|-------------|---------|
| `input_path` | Path to read data from | `stereo-filtered.parquet/*` |
| `output_path` | Path to write data to | `stereo-vectorized.parquet` |
| `ngram_length` | Length of the chunks documents are split into | `50` |
| `num_features` | Dimension of the word feature vector for each chunk | `2**18` |
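
Chunking and binary term vectors can be sketched as follows; the column names and the whitespace tokenization are assumptions, and Spark ML's HashingTF with binary=True stands in for the featurizer the job actually uses:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.feature import HashingTF

spark = SparkSession.builder.appName("vectorize-sketch").getOrCreate()

NGRAM_LENGTH = 50      # tokens per chunk (job default)
NUM_FEATURES = 2**18   # dimension of the binary term vectors (job default)

docs = spark.read.parquet("stereo-filtered.parquet")  # assumed columns: id, content

# Split the token sequence into sequential, non-overlapping 50-token chunks.
def chunk(tokens):
    return [tokens[i:i + NGRAM_LENGTH] for i in range(0, len(tokens), NGRAM_LENGTH)]

chunk_udf = F.udf(chunk, "array<array<string>>")
chunks = (docs
          .withColumn("tokens", F.split("content", r"\s+"))
          .select("id", F.explode(chunk_udf("tokens")).alias("chunk")))

# binary=True records term presence (0/1) per dimension instead of counts.
tf = HashingTF(inputCol="chunk", outputCol="vector",
               numFeatures=NUM_FEATURES, binary=True)
tf.transform(chunks).write.parquet("stereo-vectorized.parquet")
```
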
4. Hash

make hash-cluster (cluster mode using the corresponding submit script) or make hash-local (local mode)

Calculates a set of hashes for each feature vector to enable MinHash similarity detection.

| Parameter | Description | Default |
|-----------|-------------|---------|
| `input_path` | Path to read data from | `stereo-vectorized.parquet/*` |
| `output_path` | Path to write data to | `stereo-hashed.parquet` |
| `num_hashes` | Number of hashes for the MinHash calculation; n hashes allow for 1/n Jaccard distance precision | `5` |
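
Spark ML ships a MinHashLSH estimator matching this description; whether the job uses it or a hand-rolled MinHash is an assumption:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import MinHashLSH

spark = SparkSession.builder.appName("hash-sketch").getOrCreate()
vectorized = spark.read.parquet("stereo-vectorized.parquet")

# Five hash functions estimate Jaccard similarity at 1/5 granularity:
# two chunks agreeing on k out of 5 hashes look roughly k/5 similar.
mh = MinHashLSH(inputCol="vector", outputCol="hashes", numHashTables=5)
model = mh.fit(vectorized)
model.transform(vectorized).write.parquet("stereo-hashed.parquet")
```
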
5. Reduce

make reduce-cluster (cluster mode using the corresponding submit script) or make reduce-local (local mode)

Encodes the hash set of each chunk as a one-hot binary vector and reduces all chunks of a document into a single vector using a logical OR over the binary vectors.

| Parameter | Description | Default |
|-----------|-------------|---------|
| `input_path` | Path to read data from | `stereo-hashed.parquet/*` |
| `output_path` | Path to write data to | `stereo-reduced.parquet` |
| `num_features` | Number of dimensions for the binary document vector | `2**18` |
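
On index sets, the one-hot encoding plus logical OR amounts to a union of hash indices per document, as this plain-Python illustration shows (the actual job performs the same reduction as a Spark aggregation):

```python
NUM_FEATURES = 2**18  # dimensions of the binary document vector

def reduce_document(chunk_hashes):
    """chunk_hashes: one hash set per chunk of a document.
    Returns the set-bit indices of the document's binary vector."""
    doc_vector = set()
    for hashes in chunk_hashes:
        # One-hot encode the chunk's hashes and OR them into the document
        # vector; on sets of indices, a logical OR is simply a union.
        doc_vector |= {h % NUM_FEATURES for h in hashes}
    return doc_vector

# Two chunks sharing one hash set three bits in total:
print(sorted(reduce_document([{12, 99}, {99, 2048}])))  # [12, 99, 2048]
```
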
6. Partition

make partition-cluster (cluster mode using the corresponding submit script) or make partition-local (local mode)

Builds an inverted list of hash->doi pairs and partitions it by hash. Allows for efficient batching of the pairing job.

| Parameter | Description | Default |
|-----------|-------------|---------|
| `input_path` | Path to read data from | `stereo-reduced.parquet/*` |
| `output_path` | Path to write data to | `stereo-partitioned.parquet` |
| `num_partitions` | Number of partitions to split the index into | `5000` |
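
A sketch of the inversion and partitioning, assuming the reduced vectors are stored as an array of hash indices per DOI:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("partition-sketch").getOrCreate()
reduced = spark.read.parquet("stereo-reduced.parquet")  # assumed columns: doi, hashes

# Explode the document vectors into (hash, doi) postings and spread them
# over a fixed number of partitions so the pair job can run in batches.
inverted = (reduced
            .select(F.explode("hashes").alias("hash"), "doi")
            .repartition(5000, "hash"))
inverted.write.parquet("stereo-partitioned.parquet")
```
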
7. Pair

make pair-cluster (cluster mode using the corresponding submit script) or make pair-local (local mode)

Transforms document vectors into document pairs. Each pair denotes two documents that share at least one hash across all of their chunks. Works by filtering the Cartesian product of all documents down to the pairs that have at least one "1" in the logical AND of their binary document vectors.

Operates in batches; using at least 100 batches ("00" to "99") is recommended.

| Parameter | Description | Default |
|-----------|-------------|---------|
| `input_path` | Path to read data from | `stereo-partitioned.parquet/*` |
| `output_path` | Path to write data to | `stereo-paired.parquet` |
| `batch` | Unix file wildcard identifying the part-* files used in this batch | `"00"` |
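
Over the inverted index, this filter is equivalent to a self-join on the hash: a pair survives exactly when the two documents share at least one posting. A hedged sketch of one batch (column names assumed):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pair-sketch").getOrCreate()

# Read one batch of the partitioned index; the "00" wildcard selects the
# part-* files belonging to this batch.
index = spark.read.parquet("stereo-partitioned.parquet/part-*00*")

# Self-join on the hash: keeps exactly the pairs whose binary document
# vectors have at least one common set bit (non-zero logical AND).
pairs = (index.alias("a")
         .join(index.alias("b"), on="hash")
         .where(F.col("a.doi") < F.col("b.doi"))  # drop self-pairs and duplicates
         .select(F.col("a.doi").alias("doi_a"),
                 F.col("b.doi").alias("doi_b"))
         .distinct())
pairs.write.parquet("stereo-paired.parquet")
```
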
8. Join

make join-cluster (cluster mode using the corresponding submit script) or make join-local (local mode)

Joins each row of the pair dataframe with the corresponding document texts.

| Parameter | Description | Default |
|-----------|-------------|---------|
| `input_path` | Path to read data from | `stereo-paired.parquet/*` |
| `output_path` | Path to write data to | `stereo-joined.parquet` |
| `batch` | Batch to join on (100 batches recommended) | `00` |
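
A sketch of the double join, assuming pair columns doi_a/doi_b and text columns id/content:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-sketch").getOrCreate()

pairs = spark.read.parquet("stereo-paired.parquet/*")    # assumed: doi_a, doi_b
texts = spark.read.parquet("stereo-filtered.parquet/*")  # assumed: id, content

# Attach both documents' full texts to every candidate pair.
joined = (pairs
          .join(texts.withColumnRenamed("id", "doi_a")
                     .withColumnRenamed("content", "content_a"), on="doi_a")
          .join(texts.withColumnRenamed("id", "doi_b")
                     .withColumnRenamed("content", "content_b"), on="doi_b"))
joined.write.parquet("stereo-joined.parquet")
```
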
9. Align

make align-cluster (cluster mode using the corresponding submit script) or make align-local (local mode)

Produces the exact alignment of all given document pairs. Operates in batches, analogous to the join job.

| Parameter | Description | Default |
|-----------|-------------|---------|
| `pair_path` | Path to read pair data from | `stereo-paired.parquet/*` |
| `text_path` | Path to read text data from | `stereo-filtered.parquet/*` |
| `output_path` | Path to write data to | `stereo-aligned.parquet` |
| `batch` | Batch prefix from the join job | `"00"` |
| `NGRAM_LENGTH` | Length of the n-grams used for alignment | `8` |
| `NGRAM_OVERLAP` | Overlap between consecutive n-grams | `7` |
| `THETA` | Alignment threshold parameter | `250` |
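
The parameters suggest seed-and-extend alignment over heavily overlapping n-grams (length 8, overlap 7, i.e. stride 1). The Go component's internals are not reproduced here; this fragment only illustrates the seeding idea:

```python
NGRAM_LENGTH = 8   # tokens per n-gram
NGRAM_OVERLAP = 7  # tokens shared by consecutive n-grams (stride 1)

def ngrams(tokens):
    stride = NGRAM_LENGTH - NGRAM_OVERLAP
    return {tuple(tokens[i:i + NGRAM_LENGTH])
            for i in range(0, len(tokens) - NGRAM_LENGTH + 1, stride)}

def seed_matches(text_a, text_b):
    """N-grams occurring in both texts: the seeds an aligner would extend
    and merge into maximal reused passages."""
    return ngrams(text_a.split()) & ngrams(text_b.split())
```
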
10. Metadata

make metadata-cluster (cluster mode using the corresponding submit script) or make metadata-local (local mode)

Extracts metadata from the Microsoft Open Academic Graph dataset and maps it to the DFG classification of scientific disciplines.

| Parameter | Description | Default |
|-----------|-------------|---------|
| `input_path` | Path to read data from | `file:/mnt/ceph/storage/corpora/corpora-thirdparty/corpus-microsoft-open-academic-graph-v1/*.txt` |
| `output_path` | Path to write data to | `stereo-oag.parquet` |

11. Unify

make unify-cluster (cluster mode using the corresponding submit script) or make unify-local (local mode)

Joins metadata and reuse cases.

| Parameter | Description | Default |
|-----------|-------------|---------|
| `case_path` | Path to read case data from | `stereo-core-aligned.parquet/*/*` |
| `metadata_path` | Path to read metadata from | `stereo-metadata.parquet/*` |
| `output_path` | Path to write data to | `stereo-corpus.jsonl` |

12. Finalize

make finalize-cluster (cluster mode using the corresponding submit script) or make finalize-local (local mode)

Transforms each data record into its final form, filters publication metadata, and assigns unique IDs to each case.

| Parameter | Description | Default |
|-----------|-------------|---------|
| `case_path` | Path to read case data from | `stereo-core-aligned.parquet/*/*` |
| `text_path` | Path to read publication text data from | `stereo-core-aligned.parquet/*/*` |
| `metadata_path` | Path to read metadata from | `stereo-metadata.parquet/*` |
| `output_cases_full` | Path to write case data to | `webis-stereo21/cases-full` |
| `output_cases_metadata_only` | Path to write metadata-only case data to | `webis-stereo21/cases-metadata-only` |
| `output_publications_full` | Path to write publication data to | `webis-stereo21/publications-full` |
| `output_publications_metadata_only` | Path to write metadata-only publication data to | `webis-stereo21/publications-metadata-only` |
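
Purely as an assumption about how unique case IDs could be assigned at scale, Spark's monotonically_increasing_id fits this step:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("finalize-sketch").getOrCreate()
cases = spark.read.parquet("stereo-core-aligned.parquet/*/*")

# Unique (though not consecutive) 64-bit ids, computed per partition
# without a global shuffle.
cases = cases.withColumn("case_id", F.monotonically_increasing_id())
```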