
HDT Sampler

Sample subgraphs from RDF Graphs stored as HDT Documents.



  • Install Python
  • Follow the pyHDT and RDFLib installation instructions to install both libraries
  • Installation in a virtualenv is advised
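
Once the dependencies are installed, a quick check that both are importable can save a failed run later. The following is a minimal sketch, not part of the repository; it assumes pyHDT is importable as `hdt` and RDFLib as `rdflib`, and the helper name `missing_modules` is hypothetical.

```python
import importlib.util


def missing_modules(names):
    """Return the names in `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumption: pyHDT installs a top-level module called `hdt`,
    # and RDFLib installs one called `rdflib`.
    missing = missing_modules(["hdt", "rdflib"])
    if missing:
        print("Missing dependencies:", ", ".join(missing))
    else:
        print("All dependencies found.")
```

Run it with the interpreter of the virtualenv you installed into, so that the check looks at the right environment.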



python hdt_sampler.py -f myHDTFile.hdt -s 0.1 -m unweighted
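
To illustrate what a call like this asks for, here is a minimal sketch of subject-based sampling on an in-memory list of triples. It is not the code in `hdt_sampler.py`: it assumes that "unweighted" means every subject is equally likely to be drawn, that the sample keeps all triples of each drawn subject, and that the number of subjects is rounded up. The function name `sample_subgraph` and the `seed` parameter are hypothetical.

```python
import math
import random


def sample_subgraph(triples, size, seed=None):
    """Keep all triples whose subject is in a uniform random sample of subjects.

    `triples` is a list of (subject, predicate, object) tuples and `size`
    is the fraction of distinct subjects to draw, in the range [0,1].
    """
    # Sort the distinct subjects so that a fixed seed gives a reproducible sample.
    subjects = sorted({s for s, _, _ in triples})
    # Assumption: round up, so any non-zero size yields at least one subject.
    k = math.ceil(size * len(subjects))
    rng = random.Random(seed)
    chosen = set(rng.sample(subjects, k))
    # The subgraph consists of every triple of every chosen subject.
    return [t for t in triples if t[0] in chosen]
```

With `size=0.1`, roughly one in ten subjects is drawn and each drawn subject keeps all of its predicates and objects.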

CLI Arguments:

  -h, --help            show this help message and exit
  -f FILE, --file FILE  HDT File to be sampled from (required)
  -s SIZE, --size SIZE  Percentage of subjects to be sampled, range: [0,1]
  -n NUMBER, --number NUMBER
                        Number of samples to be created (default=1)
  -m {unweighted,weighted,hybrid}, --method {unweighted,weighted,hybrid}
                        Sampling method to be used (required: unweighted,
                        weighted, hybrid)
  -r RATIO, --ratio RATIO
                        Ratio for hybrid sampling, range: [0,1] (default=0.5)
                        Set logging level (optional)
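
For readers who want to wrap or test the interface, the options listed above can be reconstructed roughly as follows. This is a sketch derived only from the help text, not the parser in `hdt_sampler.py`; the names `build_parser` and `unit_interval` are hypothetical, the range check on `-s` and `-r` is an assumption, and the logging option is left out because its flag is not shown above.

```python
import argparse

METHODS = ("unweighted", "weighted", "hybrid")


def unit_interval(text):
    """Parse a float and reject values outside [0,1]."""
    value = float(text)
    if not 0.0 <= value <= 1.0:
        raise argparse.ArgumentTypeError(f"{text} is outside the range [0,1]")
    return value


def build_parser():
    """Build a parser that mirrors the documented options."""
    parser = argparse.ArgumentParser(
        prog="hdt_sampler.py",
        description="Sample subgraphs from RDF graphs stored as HDT documents.",
    )
    parser.add_argument("-f", "--file", required=True,
                        help="HDT file to be sampled from (required)")
    # Assumption: no default is documented for the size, so none is set here.
    parser.add_argument("-s", "--size", type=unit_interval,
                        help="Share of subjects to be sampled, range: [0,1]")
    parser.add_argument("-n", "--number", type=int, default=1,
                        help="Number of samples to be created (default=1)")
    parser.add_argument("-m", "--method", choices=METHODS, required=True,
                        help="Sampling method to be used (required)")
    parser.add_argument("-r", "--ratio", type=unit_interval, default=0.5,
                        help="Ratio for hybrid sampling, range: [0,1] (default=0.5)")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["-f", "myHDTFile.hdt", "-s", "0.1", "-m", "unweighted"]
    )
    print(args)
```

Because `--method` is restricted to a fixed set of choices, a misspelled method name is rejected with a usage error before any sampling starts.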


In the scripts directory we provide additional scripts:

  • compute_CSPF_proto.py: Prototype implementation that computes the characteristic sets profile feature (CSPF) of an N-Triples file. The script takes the path of an N-Triples file as its only argument. It shuffles the triples, sorts them, and computes the CSPF, then prints statistics about the computation.
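
As an illustration of the underlying idea, the sketch below counts characteristic sets, using the usual definition that the characteristic set of a subject is the set of predicates it occurs with. It is not the code in `compute_CSPF_proto.py` and does not reproduce its shuffling, sorting, or statistics output. The function name `characteristic_sets` is hypothetical, and the line splitting is a simplification rather than a full N-Triples parser: it relies on subjects and predicates containing no whitespace.

```python
from collections import Counter, defaultdict


def characteristic_sets(lines):
    """Map each characteristic set to the number of subjects that have it.

    `lines` is an iterable of N-Triples lines. The result is a Counter
    keyed by frozensets of predicate terms.
    """
    predicates_by_subject = defaultdict(set)
    for line in lines:
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # Subject and predicate are the first two whitespace-separated
        # terms; the object and the trailing " ." are ignored here.
        subject, predicate, _rest = line.split(None, 2)
        predicates_by_subject[subject].add(predicate)
    return Counter(frozenset(preds) for preds in predicates_by_subject.values())


if __name__ == "__main__":
    example = [
        '<http://example.org/a> <http://example.org/name> "A" .',
        '<http://example.org/a> <http://example.org/knows> <http://example.org/b> .',
        '<http://example.org/b> <http://example.org/name> "B" .',
    ]
    for preds, count in characteristic_sets(example).items():
        print(sorted(preds), count)
```

To process a file, pass the open file object directly, since iterating over it yields one line at a time.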

Related Publication

Heling, Lars, and Maribel Acosta.
"Estimating Characteristic Sets for RDF Dataset Profiles Based on Sampling."
European Semantic Web Conference, 2020.