This is a work in progress, intended mostly for personal use while experimenting with ChatGPT for Python package development. Feel free to explore and try out the code, but don't expect it to work out of the box. If you are interested in toy datasets for bioinformatic pipeline development and would like to contribute to this project, feel free to contact me at nolson@nist.gov.
The Bioinformatics Test Data Generator is a Python package that provides utilities for generating test datasets for use in developing bioinformatic pipelines for analyzing human genome sequencing data. It allows you to generate multi-chromosome reference sequences, simulated sequencing reads, alignments (BAM files), and variant calls (VCF files) based on different sequencing technologies.
- Generate multi-chromosome reference sequences in FASTA format.
- Simulate reads based on different sequencing technologies, such as Illumina, PacBio, and Oxford Nanopore Technologies.
- Generate alignments (BAM files) from simulated reads and reference sequences.
- Generate variant calls (VCF files) from alignments.
You can install the package using pip:
pip install genosim
The package provides a command-line interface (CLI) for generating the test datasets. Here are some example commands:
Generate a multi-chromosome reference sequence:
genosim --technology illumina --output_prefix my_test_data --num_chromosomes 2
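To make concrete what a toy multi-chromosome reference looks like, here is a minimal Python sketch. It is not genosim's implementation: the function names `random_reference` and `format_fasta`, the `chrN` naming scheme, and the uniform base composition are all assumptions made for illustration.

```python
import random


def random_reference(num_chromosomes, length, seed=0):
    """Return {name: sequence} with uniformly random A/C/G/T bases.

    Hypothetical helper for illustration; not part of genosim.
    """
    rng = random.Random(seed)  # seeded so the toy data is reproducible
    return {
        f"chr{i}": "".join(rng.choices("ACGT", k=length))
        for i in range(1, num_chromosomes + 1)
    }


def format_fasta(records, width=60):
    """Format {name: sequence} as FASTA text, wrapping sequence lines at `width`."""
    lines = []
    for name, seq in records.items():
        lines.append(f">{name}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    reference = random_reference(num_chromosomes=2, length=150)
    print(format_fasta(reference), end="")
```

A fixed seed keeps the generated sequences identical between runs, which is usually what you want when the data backs pipeline tests.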
Generate sequencing reads from a reference FASTA:
genosim --technology pacbio --output_prefix my_test_data --num_reads 1000 --read_length 100
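The idea behind read simulation can be sketched in a few lines of Python. This is a deliberately simplified illustration and not how genosim works internally: it samples error-free, fixed-length, forward-strand reads at uniform positions and assigns a constant quality score, whereas a realistic simulator would model technology-specific error profiles and read-length distributions. The names `simulate_reads` and `format_fastq` are hypothetical.

```python
import random


def simulate_reads(reference, num_reads, read_length, seed=0):
    """Sample error-free reads uniformly from a {name: sequence} reference.

    Hypothetical helper for illustration; not part of genosim.
    Returns a list of (read_name, read_sequence) tuples.
    """
    rng = random.Random(seed)
    # Only chromosomes at least as long as the read can yield a read.
    eligible = [(n, s) for n, s in reference.items() if len(s) >= read_length]
    if not eligible:
        raise ValueError("read_length exceeds every chromosome length")
    # Weight chromosomes by their number of valid start positions.
    weights = [len(s) - read_length + 1 for _, s in eligible]
    reads = []
    for i in range(num_reads):
        name, seq = rng.choices(eligible, weights=weights, k=1)[0]
        start = rng.randrange(len(seq) - read_length + 1)
        reads.append((f"read{i}_{name}_{start}", seq[start:start + read_length]))
    return reads


def format_fastq(reads, quality_char="I"):
    """Format reads as FASTQ text with a constant per-base quality character."""
    records = []
    for name, seq in reads:
        records.append(f"@{name}\n{seq}\n+\n{quality_char * len(seq)}\n")
    return "".join(records)


if __name__ == "__main__":
    toy_reference = {"chr1": "ACGT" * 50, "chr2": "TTGCA" * 30}
    print(format_fastq(simulate_reads(toy_reference, num_reads=3, read_length=25)), end="")
```

Encoding the source chromosome and start position in each read name makes it easy to check later whether an aligner placed the read correctly.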
Generate a BAM file from reads and reference:
genosim --technology ont --output_prefix my_test_data --read_group "RG1:ONT:1:lib1:sample1"
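For orientation, a read group ends up in a BAM file as an `@RG` header line with tab-separated `TAG:value` fields. The sketch below converts a colon-separated string like the one above into such a line. The field order (ID, PL, PU, LB, SM) is an assumption made for this example; the README does not state how genosim interprets the string, and `read_group_header` is a hypothetical helper.

```python
# Assumed field order for the colon-separated string; genosim may differ.
ASSUMED_TAGS = ("ID", "PL", "PU", "LB", "SM")


def read_group_header(read_group):
    """Build a SAM @RG header line from a colon-separated read-group string.

    Hypothetical helper for illustration; not part of genosim.
    """
    values = read_group.split(":")
    if len(values) != len(ASSUMED_TAGS):
        raise ValueError(
            f"expected {len(ASSUMED_TAGS)} colon-separated fields, got {len(values)}"
        )
    fields = [f"{tag}:{value}" for tag, value in zip(ASSUMED_TAGS, values)]
    return "\t".join(["@RG"] + fields)


if __name__ == "__main__":
    print(read_group_header("RG1:ONT:1:lib1:sample1"))
```

The tags themselves (ID, PL, PU, LB, SM) are standard SAM read-group tags; only their mapping onto the positions in the string is assumed here.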
For more information and options, use the --help flag with the respective command.
For detailed usage instructions and API documentation, please refer to the documentation.
Contributions are welcome! Please see the contribution guidelines for more information.
This project is licensed under the MIT License. See the LICENSE file for details.