/icgc

International Cancer Genome Consortium

Primary LanguageShellMIT LicenseMIT

ICGC RDF

The aim of this repository is to generate a RDF dataset from the table-format data publicly available at the ICGC Data Portal, for better reusability and interoperability in the Semantic Web.

How to generate RDF data

Get scripts

  • Ubuntu 18.04
  • 4GB or more memory capacity is recommended (In AWS, t2.medium has 4GB memory)
  • For locating the all RDF datasets along with the original datasets, 250GB disk space is required

Get scripts

Pull this repository

$ git clone https://github.com/med2rdf/icgc.git

Install Docker

Add the OS user to docker group and login again to the console.

$ sudo groupadd docker
$ sudo usermod -aG docker $USER
$ exit

Install docker via apt.

$ sudo apt-get update
$ sudo apt install docker.io
$ sudo systemctl start docker
$ docker --version
Docker version 20.10.7, build 20.10.7-0ubuntu1~18.04.1

Install Oracle Database

Get Dockerfile to build Docker image of Oracle Database.

$ mkdir oracle
$ cd oracle
$ git clone https://github.com/oracle/docker-images.git

Build docker image (needs 4GB memory).

$ cd ~/oracle/docker-images/OracleDatabase/SingleInstance/dockerfiles/18.4.0/
$ docker build -t oracle/database:18.4.0-xe -f Dockerfile.xe .

Launch Oracle Database on a docker container.

$ docker run --name oracle \
  -p 1522:1521 -e ORACLE_PWD=Welcome1 \
  -v $HOME:/host-home \
  oracle/database:18.4.0-xe

Once you got the message below, you can quit with Ctl+C.

#########################
DATABASE IS READY TO USE!
#########################

Configure Oracle Database

Start the conteiner.

$ docker start oracle

Configure the database as a triplestore.

$ docker exec -it oracle \
  sqlplus sys/Welcome1@XEPDB1 as sysdba @/host-home/icgc/scripts/setup.sql

Create a user.

$ docker exec -it oracle \
  sqlplus sys/Welcome1@XEPDB1 as sysdba @/host-home/icgc/scripts/00_user.sql

Download Data

For downloading the latest project list, access Data Portal, click Available Data Type > SSM, and click "Export Table as TSV" icon.

Create project list: project.tsv

$ cd scripts/download/
$ bash 01_projects.sh projects_2021_08_23_03_37_00.tsv > projects.tsv

Download test datasets.

$ bash 02_download_all.sh projects_test.tsv

Download all files. Use projects_test.tsv instead for testing.

$ bash 02_download_all.sh projects.tsv

Generate RDF Data

Use projects_test.tsv for test datasets.

$ docker exec -it oracle \
  sh /host-home/icgc/scripts/00_run.sh download/projects_test.tsv \
  > ~/icgc/log/main.log

For all datasets.

$ docker exec -it oracle \
  sh /host-home/icgc/scripts/00_run.sh download/projects.tsv \
  > ~/icgc/log/main.log

How to test RDF data

Run Virtuoso.

$ docker run \
  --name virtuoso_docker_test \
  --env DBA_PASSWORD=dba \
  --publish 1111:1111 \
  --publish 8890:8890 \
  --volume `pwd`:/database \
  --env VIRT_Parameters_NumberOfBuffers=680000 \
  --env VIRT_Parameters_MaxDirtyBuffers=500000 \
  --env VIRT_Parameters_DirsAllowed="., ../vad, /usr/share/proj, /database" \
  openlink/virtuoso-opensource-7:latest

Run a test.

$ cd scripts/
$ bash test.sh download/projects.tsv

Ontologies

Ontologies referenced

Guideline