Java library for parsing various datasets:
- DBLP dataset
- Reuters-21578 dataset
- Text file dataset
- ENRON email dataset
- Wikipedia web pages
- Synthetic Gaussian mixture
- Scale-invariant feature transform (SIFT) dataset
These parsers are implemented as iterators, which makes them suitable for processing large datasets. You can also process only part of a dataset, since items are handled on the fly without reading the entire dataset into memory.
Using Maven:
<dependency>
<groupId>info.debatty</groupId>
<artifactId>java-datasets</artifactId>
<version>RELEASE</version>
</dependency>
Or check the GitHub releases.
Usually, you simply have to:
- Initialize the Dataset object, using the path to the file or directory containing the data
- Iterate over the dataset items as long as you want...
import info.debatty.java.datasets.reuters.*;

public class MyClass {

    public static void main(String[] args) {
        // We will use the Reuters news dataset
        Dataset reuters_dataset = new Dataset("/path/to/reuters/folder");

        // Iterate over the news items
        for (News news : reuters_dataset) {
            System.out.println(news.title);
        }
    }
}
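Because the items are produced by an iterator, you can stop after a handful of them and the rest is never read. The following standalone sketch illustrates that pattern only; `CountingDataset` and `sumFirst` are hypothetical names made up for this illustration and are not part of the library.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

public class LazyDatasetSketch {

    /** Hypothetical dataset that produces its items one at a time. */
    static class CountingDataset implements Iterable<Integer> {
        private final int size;
        int produced = 0; // number of items actually generated so far

        CountingDataset(int size) {
            this.size = size;
        }

        @Override
        public Iterator<Integer> iterator() {
            return new Iterator<Integer>() {
                private int next = 0;

                @Override
                public boolean hasNext() {
                    return next < size;
                }

                @Override
                public Integer next() {
                    if (!hasNext()) {
                        throw new NoSuchElementException();
                    }
                    produced++;
                    return next++;
                }
            };
        }
    }

    /** Sums at most `limit` items, then stops pulling from the iterator. */
    static long sumFirst(Iterable<Integer> dataset, int limit) {
        long sum = 0;
        int seen = 0;
        Iterator<Integer> it = dataset.iterator();
        while (seen < limit && it.hasNext()) {
            sum += it.next();
            seen++;
        }
        return sum;
    }

    public static void main(String[] args) {
        CountingDataset dataset = new CountingDataset(1_000_000);
        System.out.println(sumFirst(dataset, 3)); // 0 + 1 + 2
        System.out.println(dataset.produced);     // only 3 items were generated
    }
}
```

The same idea applies to a file-backed dataset: if the iterator reads the next record only when `next()` is called, breaking out of the loop early leaves the remainder of the file untouched.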
One of the datasets lets you easily generate random data drawn from a Gaussian mixture:
package info.debatty.java.datasets.examples;

import info.debatty.java.datasets.gaussian.Dataset;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class GaussianMixtureExample {

    private static final int DIMENSIONALITY = 2;
    private static final int CENTERS = 10;
    private static final int SIZE = 10000;

    public static void main(String[] args) throws IOException, ClassNotFoundException {
        Dataset dataset = new Dataset.Builder(DIMENSIONALITY, CENTERS)
                .setOverlap(Dataset.Builder.Overlap.MEDIUM)
                .varyDeviation(true)
                .varyWeight(true)
                .setSize(SIZE).build();

        // You can serialize and save your Dataset.
        // This does not save all the points, only the Dataset object
        // (including any random seeds), which makes it possible to
        // reproduce the dataset while storing only a small amount of data.
        File file = File.createTempFile("testfile", ".ser");
        dataset.save(new FileOutputStream(file));

        // Read the serialized dataset back from the file
        Dataset d2 = (Dataset) Dataset.load(new FileInputStream(file));

        // You can also save the complete data to disk if needed
        // (e.g. for plotting with Gnuplot)
        d2.saveCsv(new BufferedOutputStream(
                new FileOutputStream(File.createTempFile("gaussian", ".dat"))));

        // Get all the data at once (can be very large!)
        double[] data = d2.getAll();
    }
}
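To see why storing only the parameters and the random seed is enough to reproduce the points, here is a minimal standalone sketch. It is not the library's implementation: `SeededMixtureSketch` and its members are hypothetical, and it assumes equally weighted components sharing a single standard deviation, which is simpler than what `varyDeviation` and `varyWeight` suggest.

```java
import java.util.Arrays;
import java.util.Random;

public class SeededMixtureSketch {

    private final long seed;          // the only "random" state that is stored
    private final double[][] centers; // one row per mixture component
    private final double deviation;   // shared standard deviation (assumption)

    SeededMixtureSketch(long seed, double[][] centers, double deviation) {
        this.seed = seed;
        this.centers = centers;
        this.deviation = deviation;
    }

    /** Regenerates the same points on every call by restarting from the seed. */
    double[][] generate(int size) {
        Random rand = new Random(seed);
        int dim = centers[0].length;
        double[][] points = new double[size][dim];
        for (int i = 0; i < size; i++) {
            // Pick a component uniformly at random, then add Gaussian noise.
            double[] center = centers[rand.nextInt(centers.length)];
            for (int j = 0; j < dim; j++) {
                points[i][j] = center[j] + deviation * rand.nextGaussian();
            }
        }
        return points;
    }

    public static void main(String[] args) {
        double[][] centers = {{0.0, 0.0}, {10.0, 10.0}};
        SeededMixtureSketch mixture = new SeededMixtureSketch(42L, centers, 1.0);
        double[][] first = mixture.generate(1000);
        double[][] second = mixture.generate(1000);
        // Both runs start from the same seed, so the points are identical.
        System.out.println(Arrays.deepEquals(first, second));
    }
}
```

Serializing an object like this one takes a few dozen bytes regardless of how many points it can generate, which is the trade-off described in the comments of the example above.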
For the other datasets, check the examples, or the documentation.