Dataset collected by DEXTER: Large-Scale Discovery and Extraction of Product Specifications on the Web

License: GNU General Public License v2.0 (GPL-2.0)

DEXTER

DEXTER is a research project for the large-scale discovery and extraction of product specifications on the Web.

This repository provides information on how to access the DEXTER dataset described in the VLDB 2015 research paper:

DEXTER: Large-Scale Discovery and Extraction of Product Specifications on the Web link

This repository contains the output dataset generated by DEXTER, organized as follows:

  • Output XML: dump with all the attribute/value pairs collected
  • URLs of product pages: list of product URLs built by DEXTER
  • Pages dump: a dump of the processed pages

Specification Output XML:

Under output-xml we provide a dump of the specifications of the discovered products. Each file is a compressed (.7z) archive containing an XML dump with all the discovered products for a specific category.

The XML dump follows this structure:

<products>
	<product>
		<site>www.amazon.com</site>
		<category>camera</category>
		<url>http://www.amazon.com/...</url>
		<attribute_1>value_attribute_1</attribute_1>
		...
	</product>
	<product>
		...
	</product>
	...
</products>

To each product we have added three additional attributes: the URL from which the specification was extracted, the category associated with the page, and the website.
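The structure above can be processed with a streaming parser, since a category dump may be large. The following sketch builds a small in-memory sample mirroring the layout shown above (the `megapixels` attribute is a hypothetical example, as attribute tags vary per product) and yields one dict per `<product>`:

```python
import io
import xml.etree.ElementTree as ET

# Minimal sample mirroring the dump structure; <megapixels> is a
# hypothetical product attribute used only for illustration.
sample = """<products>
  <product>
    <site>www.amazon.com</site>
    <category>camera</category>
    <url>http://www.amazon.com/item1</url>
    <megapixels>20</megapixels>
  </product>
</products>"""

def iter_products(source):
    """Stream <product> elements; yield each as a dict of tag -> text."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "product":
            yield {child.tag: child.text for child in elem}
            elem.clear()  # release already-processed elements

records = list(iter_products(io.StringIO(sample)))
print(records[0]["site"], records[0]["megapixels"])
```

In practice you would first extract the .7z archive and pass the resulting XML file path to `ET.iterparse` instead of the in-memory sample.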

Dataset

The dataset contains the HTML pages collected by our focused crawler. It is organised under the dexter-pages bucket in the following folders:

  1. data
  2. dexter_sources
  3. dataset_local_categories.json

Data

Under /data/*

The folder contains one subfolder for each crawled website. Pages of a given website are stored as .gz files with incremental file names (<i>.txt.gz); the mapping between each dumped file and its original URL is stored in an index.txt file.

Each line of the index.txt file stores a tab-separated pair: the file name <i>.txt and the original page URL <file_url>.

An example is:

1.txt 	http://www.sample_website.com/productAAAA
2.txt 	http://www.sample_website.com/productBBBB
3.txt 	http://www.sample_website.com/productCCCC

Example of index file Link
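Loading such an index into a file-to-URL map is a one-liner per line; a minimal sketch (the sample lines below mirror the example above, and the gzip call in the comment assumes the standard <i>.txt.gz naming described earlier):

```python
import gzip

# Sample index content: one tab-separated "<i>.txt<TAB><file_url>" pair per line.
index_text = (
    "1.txt\thttp://www.sample_website.com/productAAAA\n"
    "2.txt\thttp://www.sample_website.com/productBBBB\n"
    "3.txt\thttp://www.sample_website.com/productCCCC\n"
)

mapping = {}
for line in index_text.splitlines():
    fname, url = line.split("\t", 1)
    mapping[fname.strip()] = url.strip()

# The corresponding page dump would then be read with, e.g.:
#   html = gzip.open("1.txt.gz", "rt", errors="replace").read()
print(mapping["2.txt"])
```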

Dexter Sources

Under /dexter_sources/*

We also provide the output of the DEXTER classification. Page URLs are grouped into sources (a <category, website> pair); the folder contains one JSON file for each source classified by DEXTER.

Files are named with the following pattern: <category>_<site>.json

Each file contains, for each website, a map with the following information:

  1. "<website_name>": list of page URLs
  2. "entry_page": list of category entry pages
  3. "pages_number": number of pages

Example of Dexter category file Link
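A source file with the three keys listed above can be consumed as plain JSON. A minimal sketch, using a hypothetical camera_www.sample_website.com.json content (the site name and URLs are invented for illustration):

```python
import json

# Hypothetical content of a <category>_<site>.json source file,
# following the three-key map described above.
source = json.loads("""{
  "www.sample_website.com": [
    "http://www.sample_website.com/productAAAA",
    "http://www.sample_website.com/productBBBB"
  ],
  "entry_page": ["http://www.sample_website.com/cameras"],
  "pages_number": 2
}""")

# The website-name key is whichever key is not one of the two fixed ones.
site = next(k for k in source if k not in ("entry_page", "pages_number"))
page_urls = source[site]
print(site, source["pages_number"], len(page_urls))
```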

Dataset Local Categories Link

In dataset_local_categories.json

We present the local categories crawled directly from the discovered websites. The file is a nested JSON organised as follows:

{
	"site1": {
		"category_1": [
			url1,
			url2,
			...
		],
		...
	},
	"site2": {
		...
	}
}
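Traversing this site -> category -> URL-list nesting takes two loops. A minimal sketch over a hypothetical two-site snippet (site names, category names, and URLs are invented for illustration):

```python
import json

# Hypothetical snippet following the nested layout of
# dataset_local_categories.json shown above.
local_categories = json.loads("""{
  "site1": {
    "camera": ["http://site1/cam1", "http://site1/cam2"],
    "tv": ["http://site1/tv1"]
  },
  "site2": {
    "camera": ["http://site2/camA"]
  }
}""")

# Count crawled category URLs per site.
counts = {}
for site, categories in local_categories.items():
    for category, urls in categories.items():
        counts[(site, category)] = len(urls)
        print(site, category, len(urls))
```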