
APAeval


Welcome to the APAeval GitHub repository.

APAeval is a community effort to evaluate computational methods for the detection and quantification of poly(A) sites and the estimation of their differential usage across RNA-seq samples.


What is APAeval?

APAeval comprises three benchmarking events, each consisting of a set of challenges for bioinformatics methods (=participants) that:

  1. Identify polyadenylation sites
  2. Quantify polyadenylation sites
  3. Calculate differential usage of polyadenylation sites

For more info, please refer to our landing page.

How to get involved?

If you would like to contribute to APAeval, the first things you would need to do are:

  • Drop us an email at apaeval@irnacosi.org
  • Please use this form to provide us with your usernames/handles for the various services we are using (see below) so that we can add you to the corresponding repositories/organizations
  • Wait for us to reach out to you with an invitation to our Slack space, then use the invitation to sign up and post a short intro message in the #general channel

Overview

(Overview schema)

  1. APAeval consists of three benchmarking events that evaluate the performance of the methods of interest (=participants) on different tasks: PAS identification, PAS quantification, and assessment of differential PAS usage. A method can participate in one, two or all three events, depending on its functionality.
  2. Raw data: For challenges within the benchmarking events, APAeval uses data from several selected publications. Generally, one dataset (consisting of one or more samples) corresponds to one challenge (here, datasets for challenges x and y are depicted). All raw RNA-seq data is processed with nf-core/rnaseq for quality control and mapping. For each dataset we provide a matching ground truth file, created from 3’ end sequencing data from the same publications as the raw RNA-seq data, which will be used in the challenges to assess the performance of participants.
  3. Sanctioned input files: The processed input data is made available in .bam format. Additionally, for each dataset a GENCODE annotation in .gtf format, as well as a reference PAS atlas in .bed format (not shown) for participants that depend on pre-defined PAS, are provided.
  4. In order to evaluate each participant in different challenges, a re-usable "execution workflow" has to be written in either Snakemake or Nextflow. Within this workflow, all pre- and post-processing steps needed to get from the input formats provided by APAeval (see 3.) to the output specified in the APAeval metrics specifications (see 5.) have to be performed.
  5. To ensure compatibility with the OEB benchmarking events, specifications for file formats (output of execution workflows = input for summary workflows) are provided by APAeval.
  6. Within a benchmarking event, one or more challenges will be performed. A challenge is primarily defined by the input dataset used for performance assessment. A challenge is computed within a summary workflow, which is run on the OEB infrastructure, for each participant. The summary workflow will compute all metrics relevant for the challenge.
  7. In order to compare the performance of participants, OEB will collect the respective output files from all eligible participant summary workflows and will visualize all results per challenge, such that performance of participants can be compared for each metric.

What is there to do?

The bulk of the work falls into roughly two tasks: writing participants' execution workflows and writing benchmarking events' summary workflows.

Execution workflows

Execution workflows contain all steps that need to be run per method:

  1. Pre-processing: Convert the input files the APAeval team has prepared into the input files your participant consumes, if applicable.
  2. Method execution: Execute the method in any way necessary to compute the output files for all benchmarking events the participant qualifies for.
  3. Post-processing: Convert the output files of the method into the formats consumed by the summary workflows as specified by the APAeval team, if applicable.

Execution workflows should be implemented in either Nextflow or Snakemake, and individual steps should be isolated through the use of either Conda virtual environments (deprecated; to run on AWS we need containerized workflows) or Docker/Singularity containers.
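
To illustrate the general layout, below is a minimal, hypothetical sketch of an execution workflow in Nextflow DSL2. All process names, parameters and file names are placeholders (the actual templates and file format specifications in this repository are authoritative), and the shell commands merely stand in for real pre-processing, method execution and post-processing steps:

nextflow.enable.dsl = 2

// Hypothetical parameters; APAeval provides .bam, .gtf and .bed input files
params.bam    = 'sample.bam'
params.outdir = 'results'

process PREPROCESS {
    container 'ubuntu:20.04'  // each step should run in its own container

    input:
    path bam

    output:
    path 'method_input.txt'

    script:
    """
    # convert the APAeval-provided input into whatever the method consumes
    echo "preprocessed ${bam}" > method_input.txt
    """
}

process RUN_METHOD {
    container 'ubuntu:20.04'

    input:
    path method_input

    output:
    path 'method_output.txt'

    script:
    """
    # invoke the participant method here
    cp ${method_input} method_output.txt
    """
}

process POSTPROCESS {
    container 'ubuntu:20.04'
    publishDir params.outdir, mode: 'copy'

    input:
    path method_output

    output:
    path 'identification_output.bed'  // file format as specified by APAeval

    script:
    """
    # convert the method output into the APAeval-specified format
    cp ${method_output} identification_output.bed
    """
}

workflow {
    PREPROCESS(Channel.fromPath(params.bam))
    RUN_METHOD(PREPROCESS.out)
    POSTPROCESS(RUN_METHOD.out)
}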

Summary workflows

Summary workflows contain all steps that need to be run per challenge, using outputs of the individual participant execution workflows as inputs. They follow the OpenEBench workflow model, described here, and are implemented in Nextflow. OpenEBench workflows consist of the following 3 steps:

  1. Validation: Output data of the various execution workflows is validated against the provided specifications to ensure data consistency.
  2. Metrics computation: Actual benchmarking metrics are computed as specified, e.g., by comparisons to ground truth/gold standard data sets.
  3. Consolidation: Data is consolidated for consistency with other benchmarking efforts, based on the OpenEBench/ELIXIR benchmarking data model.

Following the OpenEBench workflows model also ensures that result visualizations are automatically generated, as long as the desired graph types are supported by OpenEBench.
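
As a rough and purely illustrative sketch of how these three steps map onto a Nextflow workflow (all names are hypothetical and the echo/cp commands are placeholders; in practice, please start from the APAeval summary workflow template derived from the OpenEBench example workflow):

nextflow.enable.dsl = 2

// Hypothetical inputs: one participant result file and the matching ground truth
params.input        = 'participant_output.bed'
params.goldstandard = 'ground_truth.bed'

process VALIDATION {
    input:
    path participant

    output:
    path 'validated.json'

    script:
    """
    # placeholder: check ${participant} against the APAeval format specification
    echo '{"validated": true}' > validated.json
    """
}

process METRICS_COMPUTATION {
    input:
    path validated
    path gold

    output:
    path 'assessment.json'

    script:
    """
    # placeholder: compare participant results to ${gold} and compute the metrics
    echo '{"metric": 0}' > assessment.json
    """
}

process CONSOLIDATION {
    input:
    path assessment

    output:
    path 'consolidated_results.json'

    script:
    """
    # placeholder: consolidate results according to the OpenEBench benchmarking data model
    cp ${assessment} consolidated_results.json
    """
}

workflow {
    VALIDATION(Channel.fromPath(params.input))
    METRICS_COMPUTATION(VALIDATION.out, Channel.fromPath(params.goldstandard))
    CONSOLIDATION(METRICS_COMPUTATION.out)
}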

Miscellaneous

Apart from writing execution and summary workflows, there are various other smaller jobs that you could work on, for example:

  • Pull request reviews
  • Pre-processing RNA-Seq input data via the nf-core RNA-Seq analysis pipeline
  • Writing additional benchmark specifications
  • Housekeeping jobs (improving documentation, helping to keep the repository clean, enforcing good coding practices, etc.)
  • Work with our partner OpenEBench on their open issues, e.g., by extending their portfolio of supported visualizations

If you do not know where to start, simply ask us!

How do we work?

To account for everyone's different agendas and time zones, we are organized such that contributors can work, as much as possible, in their own time.

Open Science, licenses & attribution

Following best practices for writing software and sharing data and code is important to us, and we therefore want to apply, as much as possible, the FAIR Principles to data and software alike. This includes publishing all code open source, under permissive licenses approved by the Open Source Initiative, and all data under a permissive Creative Commons license.

In particular, we publish all code under the MIT license and all data under the CC0 license. The summary workflows are an exception: they are published under the GPLv3 license, as the provided template is derived from an OpenEBench example workflow that is itself licensed under GPLv3. A copy of the MIT license is also shipped with this repository.

We also believe that attribution, provenance and transparency are crucial for an open and fair work environment in the sciences, especially in a community effort like APAeval. Therefore, we would like to make clear from the beginning that in all publications deriving from APAeval (journal manuscripts, data and code repositories), any non-trivial contributions will be acknowledged by authorship. All authors will be listed strictly alphabetically by last name, with no exceptions, wherever possible under the name of The APAeval Team and accompanied by a more detailed description of how people contributed.

We expect that all contributors accept the license and attribution policies outlined above.

Communication

Chat

We are making use of Slack (see above for how to become a member) for asynchronous communication. Please use the most appropriate channels for discussions, questions, etc.:

  • #general: Introduce yourself and find pointers to get you started. All APAeval-wide announcements will be put here!
  • #admin: Ask questions about the general organization of APAeval.
  • #tech-support: Ask questions about the technical infrastructure and relevant software, e.g., AWS, GitHub, Nextflow, Snakemake.
  • #execution_workflows: Discussions channel for all execution workflows.
  • #oeb: Discussions channel for all OEB related matters and summary workflows.
  • #random: Post anything that doesn't fit into any of the other channels.
  • #github-ticker: Get notified about activities in the APAeval github repo.

Video calls

Despite the event taking place mostly asynchronously, we do have a few video calls to increase the feeling of collaboration. In particular, we have a bi-weekly meeting on Wednesday at 9am EDT/3pm CET.

This calendar contains all video call events, including the necessary login info, and we would like to kindly ask you to subscribe to it:

  • Calendar ID: 59bboug9agv30v32r6bvaofdo4@group.calendar.google.com
  • Public address

Please do not download the ICS file and then import it, as any updates to the calendar will not be synced. Instead, copy the calendar ID or public address and paste it in the appropriate field of your calendar application. Refer to your calendar application's help pages if you do not know how to subscribe to a calendar.

Video calls usually take place in the following Zoom room:

  • Direct link
  • Meeting ID: 656 9429 1427
  • Passcode: APAeval

There is also a meeting agenda.

For more lively meetings, participants are encouraged to switch on their cameras. But please mute your microphones if you are not currently speaking.

Social coding

We are making extensive use of GitHub's project management resources to allow contributors to work independently on individual, largely self-contained, issues. There are several Kanban project boards, listing relevant issues for different kinds of tasks, such as drafting benchmarking specifications and implementing/running execution workflows.

The idea is that people assign themselves to open issues (i.e., issues that are not yet assigned to someone else). Note that in order to do so, you will need to be a member of this GitHub repository (see above for how to become a member). Once you have assigned yourself, you can move/drag the issue from the To do to the In progress column of the Kanban board.

When working on an issue, please start by cloning (preferred) or forking the repository. Then create a feature branch and implement your code/changes. Once you have made some progress, please create a pull request against the main branch, making sure to fill in the provided template (in particular, please refer to the original issue you are addressing with this pull request) and to assign two reviewers. If you're not quite happy with your solution yet and would like to have some help, you can mark the pull request as a draft, and lead discussions with other members directly on your code.
Pull request reviews are also always a welcome contribution to APAeval. For some guidelines on PR reviews you can refer to Sam's PR review guide.
This workflow ensures collaborative coding and is sometimes referred to as GitHub flow. If you are not familiar with Git, GitHub or the GitHub flow, there are many useful tutorials online, e.g., those listed below.

Cloud infrastructure

AWS has kindly sponsored credits for their compute and storage infrastructure that we can use to run any heavy-duty computations in the cloud (e.g., RNA-Seq data pre-processing or method execution workflows).

This also includes credits to run Seqera Labs' Nextflow Tower, a convenient web-based platform to run Nextflow workflows, such as the nf-core RNA-Seq analysis workflow we are using for pre-processing RNA-Seq data. Seqera Labs kindly held a workshop on Nextflow and Nextflow Tower during the hackathon and continues to provide technical support.

Setting up the AWS organization and infrastructure is still ongoing, and we will update this section with more information as soon as that is done.

OpenEBench

We are partnering with OpenEBench, a benchmarking and technical monitoring platform for bioinformatics tools. OpenEBench development, maintenance and operation is coordinated by Barcelona Supercomputing Center (BSC) together with partners from the European Life Science infrastructure initiative ELIXIR.

OpenEBench tooling will facilitate the computation and visualization of benchmarking results and store the results of all benchmarking events and challenges in their databases, making it easy for others to explore results. This should also make it easy to add additional participants to existing benchmarking events later on. OpenEBench developers are also advising us on creating benchmarks that are compatible with good practices in the wider community of bioinformatics challenges.

Software

Here are some pointers and tutorials for the main software tools that we are using at APAeval:

Note that you don't need to know about all of these; e.g., one of Conda (deprecated; to run on AWS we need containerized workflows), Docker and/or Singularity will typically be enough. See below for a discussion of the supported workflow languages/management systems. Again, working with one will be enough for most issues.

In addition to these, basic programming/scripting skills may be required for most, but not all, issues. For those that do require them, you are generally free to choose your preferred language, although if you have experience with Python, we recommend going with that. It simply makes it easier for others to review your code, and it typically integrates better with our templates and the general bioinformatics ecosystem/community.

Note that even if you don't have experience with any of these tools/languages, and you don't feel like learning them or have no time to do so, there is surely still something that you can help us with. Just ask us and we will try to put your individual skills to good use! 💪

Nextflow or Snakemake?

As mentioned further above, we would like execution workflows to be written in one of two "workflow languages": Nextflow or Snakemake. Specifying workflows in such a language rather than, say, by stringing together Bash commands, is considered good practice, as it increases reusability and reproducibility, and so is in line with our goal of adhering to FAIR software principles.

But why only Nextflow and Snakemake, and not, e.g., the Common Workflow Language, the Workflow Description Language or Galaxy? There are no particular reasons other than that the APAeval organizers have experience with these workflow languages and are thus able to provide technical support. If you are an experienced workflow developer and prefer another workflow language, you are welcome to use that one instead, but note that we have no templates available and will not be able to help you much in case you encounter issues during development or execution.

As for summary workflows, we are bound to implement these in Nextflow, as they are executed on OpenEBench, which currently only accepts Nextflow workflows.

For this reason, as well as the fact that we will provide Nextflow Tower for convenient execution of Nextflow workflows on AWS cloud infrastructure (see above) and use a Nextflow analysis pipeline for pre-processing RNA-Seq data sets, we recommend that novices without any other considerations (e.g., colleagues already working with Snakemake) use Nextflow.
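
If you have never used a workflow language before, the following toy example (not part of APAeval, purely illustrative) shows the two basic Nextflow building blocks, a process and a workflow. Saved as, say, hello.nf, it can be run with nextflow run hello.nf:

nextflow.enable.dsl = 2

process SAY_HELLO {
    input:
    val name

    output:
    stdout

    script:
    """
    echo "Hello, ${name}, from an isolated Nextflow task."
    """
}

workflow {
    Channel.of('APAeval') | SAY_HELLO | view
}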

Conda environment file

In order to execute scripts with either Nextflow or Snakemake in a reproducible manner, we need to ensure that the versions of this software are specified. To do that, we have created a Conda environment file that pins specific versions of Nextflow, Snakemake and some core libraries. To use this environment, you first need to create it with:

conda env create -f apaeval_env.yaml

You then need to activate the environment with:

conda activate apaeval_execution_workflows

NOTE: If you're working on Windows or macOS, you might have to look up how to set up a virtual machine for running Singularity. Alternatively, you could remove the Singularity installation from apaeval_env.yaml and work with Conda environments only (deprecated, as we need containers for cloud execution).

You can now execute the workflows!

Code of Conduct

Please be kind to one another and mind the Contributor Covenant's Code of Conduct in all interactions with the community. A copy of the Code of Conduct is also shipped with this repository. Please report any violations of the Code of Conduct to either or both of CJ and Alex via Slack.

Contributors ✨

Thanks goes to these wonderful people (emoji key):


Chelsea Herdman

📆 📋 🤔 👀 📢 📖

CJ Herrmann

💻 🔣 📖 🎨 📋 🧑‍🏫 📆 💬 👀 📢 🤔

Euan McDonnell

💻 🤔 🧑‍🏫

Alex Kanitz

πŸ› πŸ’» πŸ“– πŸ’‘ πŸ“‹ πŸ€” πŸš‡ 🚧 πŸ§‘β€πŸ« πŸ“† πŸ’¬ πŸ‘€ πŸ“’

Yuk Kei Wan

πŸ› πŸ“ πŸ’» πŸ”£ πŸ“– πŸ’‘ πŸ“‹ πŸ€” πŸ§‘β€πŸ« πŸ“† πŸ’¬ ⚠️ βœ…

Ben

🔣 🤔 📆

pjewell-biociphers

🚧

mzavolan

🔣 📖 📋 💵 🤔 🧑‍🏫 📆 💬 👀 📢

Mervin Fansler

πŸ› πŸ’» πŸ“– πŸ“‹ πŸ€” πŸ§‘β€πŸ« πŸ“† πŸ’¬ πŸ‘€

Maria Katsantoni

💻 🤔 🧑‍🏫 💬

daneckaw

💻 🔣 📋 🤔 📆 ✅

Dominik Burri

πŸ› πŸ’» πŸ”£ πŸ“– πŸ’‘ πŸ“‹ πŸ€” πŸš‡ πŸ§‘β€πŸ« πŸ“† πŸ’¬ ⚠️ βœ…

mrgazzara

💻 📖 🔣 📋 🤔 🚇 🚧 📆 🧑‍🏫 📢

Christina Fitzsimmons

📖 📋 🤔 📆 📢

Leo SchΓ€rfen

💻 🤔 📢

poonchilam

💻 🤔 💬

dseyres

💻 📖 🤔

Pierre-Luc

🔣 📖 📋 🤔 📆

SamBryce-Smith

💻 🤔

Pin-Jou Wu

💻 🤔

yoseopyoon

💻 🤔

This project follows the all-contributors specification. Contributions of any kind welcome!