
The source code for the new version of the EMBL-EBI BioSamples database

BioSamples

BioSamples (https://www.ebi.ac.uk/biosamples/) stores and supplies descriptions and metadata about biological samples used in research and development by academia and industry. Samples are either 'reference' samples (e.g. from 1000 Genomes, HipSci, FAANG) or have been used in an assay database such as the European Nucleotide Archive (ENA) or ArrayExpress.

BioSamples also synchronizes data with the NCBI BioSample database and imports data from ENA.

This document describes how to install the BioSamples database locally and set up a development environment.


Software

  • Git 2.17.1

  • Java 8

  • JDK 8

  • Maven 3.6.0

  • Docker 18.06

Setup

  1. Run the following in a terminal to install the required software.

    sudo apt-get update
    sudo apt-get install openjdk-8-jdk maven git docker.io docker-compose
  2. Please make sure the software versions are correct.

    mvn -v
    # Output:
    # Apache Maven 3.6.0
    
    docker -v
    # Output
    # Docker version 18.06.1-ce
    
    java -version
    # Output
    # openjdk version "1.8.0_222"
  3. Install BioSamples on your computer.

    This process sets up a local compiled version of all biosamples tools. It requires a large download of Spring dependencies and uses up to two threads per core of your machine. The installation might take several minutes.

    git clone https://github.com/EBIBioSamples/biosamples-v4.git
    cd biosamples-v4
    mvn -T 2C package
  4. Start BioSamples on your own machine.

    docker-compose up

    If this fails with "ERROR: Couldn't connect to Docker daemon - you might need to run docker-machine start default.", run sudo docker-compose up instead.

  5. You can now access the public interface at http://localhost:8081/biosamples/. At this point the local instance contains no data.

  6. Create an AAP account for API authentication and data upload.

    An AAP account is required to upload data through the API. You can register one at https://explore.aai.ebi.ac.uk/registerUser. Detailed instructions on user accounts and authentication are available at https://www.ebi.ac.uk/biosamples/docs/guides/authentication.

    Please replace ALL 'https://aai.ebi.ac.uk' in the authentication guide with 'https://explore.aai.ebi.ac.uk' to use the local BioSamples API.

  7. Upload first test data

    TOKEN=$(curl -u Username https://explore.api.aai.ebi.ac.uk/auth)
    
    curl 'http://localhost:8081/biosamples/samples' -i -X POST -H "Content-Type: application/json;charset=UTF-8" -H "Accept: application/hal+json" -H "Authorization: Bearer $TOKEN" -d '{
     "name" : "FakeSample",
     "update" : "2019-07-16T09:47:20.003Z",
     "release" : "2019-07-16T09:47:20.003Z",
     "domain" : "self.ExampleDomain"
    }'
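
The request in step 7 hard-codes its "update" and "release" timestamps. As a minimal sketch, the same payload could be generated with the current UTC time instead. make_sample_json is a hypothetical helper (not part of BioSamples), and it assumes the name and domain contain no characters that need JSON escaping.

```shell
# Print a minimal sample document whose "update" and "release" fields are set
# to the current UTC time, in the same timestamp format used in step 7.
# make_sample_json is a hypothetical helper: $1 = sample name, $2 = domain.
make_sample_json() {
  _now="$(date -u +%Y-%m-%dT%H:%M:%S.000Z)"
  printf '{\n "name" : "%s",\n "update" : "%s",\n "release" : "%s",\n "domain" : "%s"\n}\n' \
    "$1" "$_now" "$_now" "$2"
}

# Example use, assuming $TOKEN was obtained as in step 7:
# make_sample_json "FakeSample" "self.ExampleDomain" |
#   curl 'http://localhost:8081/biosamples/samples' -i -X POST \
#     -H "Content-Type: application/json;charset=UTF-8" \
#     -H "Accept: application/hal+json" \
#     -H "Authorization: Bearer $TOKEN" -d @-
```

Passing -d @- makes curl read the request body from standard input, so the generated document never has to be written to disk.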

Data import

NCBI

Download the XML dump (~400 MB) to the current directory:

Run the pipeline to send the data to the submission API via REST:

docker-compose up biosamples-pipelines-ncbi

Note: You will need to mount the location that the XML dump was downloaded to within the docker container. A docker-compose.override.yml file is the easiest way to do that.
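
As a rough sketch of such an override file, the helper below prints one that bind-mounts a host directory into the biosamples-pipelines-ncbi container. write_ncbi_override is a hypothetical helper, and it assumes Compose file format 2 or later (which uses a top-level services key). The format version, the host directory, and the container path are all supplied by you, because the path the pipeline reads from is not stated here; check docker-compose.yml for the correct values.

```shell
# Print a docker-compose.override.yml that bind-mounts a host directory into
# the biosamples-pipelines-ncbi container.
#   $1  Compose file format version; should match the one in docker-compose.yml
#   $2  absolute host directory that contains the XML dump
#   $3  directory inside the container where the pipeline looks for the dump
# write_ncbi_override is a hypothetical helper, not part of BioSamples.
write_ncbi_override() {
  cat <<EOF
version: '$1'
services:
  biosamples-pipelines-ncbi:
    volumes:
      - $2:$3
EOF
}

# Example (the version and both paths are placeholders):
# write_ncbi_override 2 "$PWD" /data > docker-compose.override.yml
```

docker-compose reads docker-compose.override.yml automatically when it sits next to docker-compose.yml, so no extra command-line flag is needed.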

ENA

You can run pipelines-ena to import ENA samples. To do so, you need to add security settings to your Maven configuration to gain access to Oracle's private driver repository.

MongoDB notes

A cross-platform, easy-to-use MongoDB management tool is available at http://www.mongoclient.com.

Developing

Docker can be run from within a virtual machine, e.g. VirtualBox. This is useful if Docker causes problems on your machine or if your OS is not supported.

You might want to share a directory between the virtual machine and the host, so you can work in a standard IDE outside the VM. VirtualBox supports this.

If you are using a virtual machine, you might also want to configure docker-compose to start by default.

As you make changes to the code, you can recompile it via Maven with:

mvn -T 2C package

To get the new packages into the Docker containers, rebuild the containers with:

docker-compose build

If needed, you can rebuild just a single container by specifying its name e.g.

docker-compose build biosamples-pipelines

Starting a service with docker-compose also starts any dependent services it requires, e.g.

docker-compose up biosamples-webapp-api

will also start solr, neo4j, mongo, and rabbitmq.

To run an executable in a Docker container and start its dependencies first, use something like:

docker-compose run --service-ports biosamples-pipelines

If you want to add command-line arguments, note that these entirely replace the executable defined in the docker-compose.yml file, so you need to repeat it in full:

docker-compose run --service-ports biosamples-pipelines java -jar pipelines-4.0.0-SNAPSHOT.jar --debug

If you want to connect debugging tools to the Java applications running inside Docker containers, see the instructions at http://www.jamasoftware.com/blog/monitoring-java-applications/

Note that you can combine Maven and Docker into a single command line:

mvn -T 2C package && docker-compose build && docker-compose run --service-ports biosamples-pipelines

Beware: Docker tars and copies every file on the filesystem from the location of docker-compose.yml downward. If you keep data files there (e.g. downloads from NCBI, Docker volumes, logs), that step can take so long that using Docker becomes impractical.

As docker-compose creates new volumes each time, you may fill the disk docker is working on. To delete all docker volumes use:

docker volume ls -q | xargs -r docker volume rm

To delete all docker images use:

docker images -q | xargs -r docker rmi

Note: these commands remove all Docker volumes and images on the machine, not just those belonging to this project.
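
A narrower alternative, sketched below, is to let docker-compose clean up only what belongs to this project. cleanup_project is a hypothetical helper, and it has to be run from the directory that contains docker-compose.yml. With --rmi local, docker-compose removes only images that have no custom tag set through an image field, so explicitly tagged images would remain.

```shell
# Stop this project's containers and remove the volumes and locally built
# images that belong to it, leaving other projects untouched.
# cleanup_project is a hypothetical helper, not part of BioSamples.
# DOCKER_COMPOSE may be set to override the binary (default: docker-compose).
cleanup_project() {
  "${DOCKER_COMPOSE:-docker-compose}" down --volumes --rmi local
}

# Example:
# cd biosamples-v4 && cleanup_project
```

The --volumes flag covers named volumes declared in the compose file and anonymous volumes attached to the project's containers.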

Client usage

There is a Spring client and a Spring Boot starter module for use with BioSamples. To use them in a Maven project, add the following to the appropriate sections of your pom.xml:

<dependencies>
    <dependency>
        <groupId>uk.ac.ebi.biosamples</groupId>
        <artifactId>biosamples-spring-boot-starter</artifactId>
        <version>4.0.4</version>
    </dependency>
</dependencies>
<repositories>
    <repository>
      <id>spotnexus</id>
      <url>https://www.ebi.ac.uk/spot/nexus/repository/maven-releases/</url>
    </repository>
</repositories>

The client can then be configured through several Spring application.properties settings, including biosamples.client.uri, which specifies the base URI of the BioSamples instance to use.

Issues and troubleshooting

Problems with spring-data-rest

The project originally used spring-data-rest to expose a REST API for the repositories. This caused a number of problems (listed below), so it was scrapped in favor of custom HATEOAS-compliant endpoints.

Content type negotiation is not possible, because the spring-data-rest URLs cannot overlap with those of the Thymeleaf controllers, and it cannot serve XML even when the appropriate converters are supplied.

When JSON is sent repeatedly for a list of items with optional components, the optional parts can become mixed up if the list ordering changes. Using a map keyed by attribute type instead of a list might remedy this.

Known issues

Solr has a limit on the field size (technically, the term vector). As a result, attribute values longer than 255 characters are not indexed in Solr.
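
To spot affected values before submitting, something like the following sketch could be used. find_long_values is a hypothetical helper that reads one attribute value per line and reports the lines exceeding the limit; the default of 255 is the figure quoted above. Some awk implementations count bytes rather than characters, so results for non-ASCII text may differ.

```shell
# Read one attribute value per line on stdin and report every line that is
# longer than the limit (first argument, default 255).
# find_long_values is a hypothetical helper, not part of BioSamples.
find_long_values() {
  awk -v max="${1:-255}" 'length($0) > max {
    printf "line %d: %d characters\n", NR, length($0)
  }'
}

# Example (values.txt is a placeholder file with one value per line):
# find_long_values < values.txt
```

A value of exactly 255 characters is not reported, since only values over the limit are affected.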

License