Genome-Viz
3300 Term Project for Processing and Visualization of Genomic Data
MergePackage
TSV file data handler driven from the command line via argparse.
Example
python3 Genome_Viz.py test_data/ test -d MPXVInfected2wks MPXVInfected8wks -b 0 1
Tasks:
- Import and utilize Argparse Module to allow flexibility in file selection.
- Evaluate program efficiency using tm.time().
- Implement file identification, selection, and sorting function under title: loadData().
- Based on file selection, create an identifier row to denote sample status as infected or naive.
- Implement a labeling function, getNames(), to appropriately track and identify files in the final data frame.
- Initialize an empty dictionary to hold concatenated data.
- Trim the dictionary filling function fillDict().
- fillDict() should ignore empty sequence entries and disregard non-data columns while processing data for entry.
- fillDict() should be able to process flawed files with unexpected formatting.
- fillDict() should store each sequence under a key specific to the file the sequence was read from.
- fillDict() should return a dictionary of dictionaries upon completion.
- Sprint 1 package should print the running time and the complete dictionary of dictionaries.
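
As a rough illustration of the fillDict() behavior listed above, the sketch below reads delimited files into a dictionary of dictionaries keyed by file label and reports the elapsed time. The function signature, the column names ("sequence", "count"), and the rule of summing repeated sequences are assumptions made for this example, not the project's actual code.

```python
import csv
import io
import time as tm


def fill_dict(sources, seq_col="sequence", count_col="count", sep="\t"):
    """Return {label: {sequence: count}} from an iterable of (label, file-like)."""
    data = {}
    for label, handle in sources:
        per_file = {}
        for row in csv.DictReader(handle, delimiter=sep):
            seq = (row.get(seq_col) or "").strip()
            if not seq:
                continue  # ignore empty sequence entries
            try:
                count = float(row.get(count_col) or "")
            except ValueError:
                continue  # skip rows whose count is missing or malformed
            # Only the two named columns are read; other columns are disregarded.
            per_file[seq] = per_file.get(seq, 0.0) + count
        data[label] = per_file
    return data


# Hypothetical in-memory file: one good row, one empty sequence,
# one non-numeric count, and one short row missing its last column.
demo = [
    (
        "MPXVInfected2wks_a",
        io.StringIO("sequence\tcount\tnote\nCASSL\t3\tx\n\t5\ty\nCASSQ\tbad\tz\nCASSL\t2\n"),
    )
]

start = tm.time()
result = fill_dict(demo)
elapsed = tm.time() - start
print(result, f"({elapsed:.4f} s)")
```

Keying the outer dictionary by file label keeps each sample separate until a later step decides how to combine them.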
Tasks:
- For current coherence, single cohort data frames should be recorded and expressed as .csv files.
- For coherence, the prototype should notify the user when single cohort files have been processed into a single dictionary.
- Argparse functionality should be added to determine location and naming of output single cohort data file.
- The .csv should be written to by line, with column headers of filenames serving as the first written row.
- Column specific exposure based labels should be assigned to each header.
- To keep irrelevant data out of the output, the .csv writer should only record a TCR sequence that is expressed in at least a minimum number of samples.
- As in projected dataframe expression, each row should be declared as an empty list and filled from relevant dictionary entries.
- Relevant columns lacking data dictionary entries for a certain sequence should be represented with 0’s in the row.
- Relevant efficiency and timekeeping functions should be updated to accommodate new workloads.
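
The .csv writing tasks above could look something like the following sketch. The function name, the min_samples parameter, and the choice to put the exposure labels in a second row under the filename headers are assumptions for illustration; the actual writer may lay the labels out differently.

```python
import csv
import io


def write_cohort_csv(data, labels, out, min_samples=2):
    """Write one row per sequence with one column per file; return rows written."""
    files = sorted(data)
    writer = csv.writer(out)
    writer.writerow(["sequence"] + files)                     # filenames as headers
    writer.writerow(["status"] + [labels[f] for f in files])  # exposure labels
    sequences = sorted({s for per_file in data.values() for s in per_file})
    written = 0
    for seq in sequences:
        present = sum(1 for f in files if seq in data[f])
        if present < min_samples:
            continue  # sequence is expressed in too few samples
        row = [seq]  # each row starts as a short list and is filled per file
        for f in files:
            row.append(data[f].get(seq, 0))  # 0 where a file lacks the sequence
        writer.writerow(row)
        written += 1
    return written


# Hypothetical single-cohort dictionary of dictionaries.
data = {"a.tsv": {"CASSL": 5.0, "CASSQ": 1.0}, "b.tsv": {"CASSL": 2.0}}
labels = {"a.tsv": "infected", "b.tsv": "naive"}

buf = io.StringIO()
n_rows = write_cohort_csv(data, labels, buf, min_samples=2)
lines = buf.getvalue().splitlines()
print(f"Wrote {n_rows} sequence row(s)")
```

With min_samples=1 the CASSQ row would also be written, with a 0 in the b.tsv column.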
Sprint 3 Burndown
Tasks: (6 Tasks, 2 Weeks)
- Line separation should be mutable between \t and ,.
- Column specification should be mutable through argparse specification.
- Filetype should be mutable through argparse specification.
- Institute filtering of relevant data for classification.
- Mutable line separation variable is employed on a line by line basis when processing data.
- Dataset specification will allow for multiple datasets to be joined from a single directory specified in the command line.
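
One way the mutable separator, column, and file type could be exposed through argparse is sketched below. The flag names (--sep, --column, --ext) and the single positional argument are hypothetical and do not reproduce the actual Genome_Viz.py interface.

```python
import argparse

# Map user-facing names to the actual separator characters.
SEPARATORS = {"tab": "\t", "comma": ","}


def build_parser():
    parser = argparse.ArgumentParser(description="Separator/column/filetype sketch")
    parser.add_argument("directory", help="directory holding the data files")
    parser.add_argument("--sep", choices=sorted(SEPARATORS), default="tab")
    parser.add_argument("--column", type=int, default=1, help="index of the data column")
    parser.add_argument("--ext", default=".tsv", help="file extension to load")
    return parser


def split_line(line, sep):
    """Split one line on the chosen separator, dropping the line ending."""
    return line.rstrip("\r\n").split(sep)


args = build_parser().parse_args(
    ["test_data/", "--sep", "comma", "--column", "2", "--ext", ".csv"]
)
fields = split_line("CASSL,3,7\n", SEPARATORS[args.sep])
value = float(fields[args.column])
print(fields, value)
```

Because the separator is looked up once and passed to split_line(), the same line-by-line loop can handle either format.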
Sprint 4 Burndown
Tasks: (9 Tasks, 2 Weeks)
- Cohort data frames should be processed and interpreted as floats.
- Cohort representations should be created through concatenation of all labeled files.
- The joining of this genomic data should be done by averaging sequence expression over the cohort.
- For each sequence shared between the two cohorts, a percentage deviation should be calculated.
- To avoid skewed data, the percentage deviation should be normalized between 0 and 100%.
- The mean of these normalized percentages will be displayed as the overall deviation between cohorts.
- These normalized percentages should be appended to the dataframe.
- The sequences with the largest deviations should be presented as high-impact characteristic sequences.
- The program should display the 50 highest impact sequences and their mean representation in each cohort.
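
The cohort comparison described above can be sketched in plain Python as follows. The README does not state the normalization formula, so this example assumes a symmetric percentage difference, 100 * |a - b| / (a + b), which stays between 0 and 100 for non-negative values. Treating a sequence missing from a file as 0 when averaging is also an assumption, as are all function names.

```python
def cohort_mean(files):
    """Average each sequence's expression over all files in a cohort."""
    sequences = {s for per_file in files for s in per_file}
    # Assumption: a file that lacks a sequence contributes 0 to the mean.
    return {s: sum(f.get(s, 0.0) for f in files) / len(files) for s in sequences}


def percent_deviation(a, b):
    """Symmetric percentage difference, in [0, 100] for non-negative inputs."""
    total = a + b
    return 0.0 if total == 0 else 100.0 * abs(a - b) / total


def rank_sequences(infected, naive, top_n=50):
    """Return (top rows, overall mean deviation) for sequences in both cohorts."""
    shared = infected.keys() & naive.keys()
    rows = [
        (s, infected[s], naive[s], percent_deviation(infected[s], naive[s]))
        for s in shared
    ]
    rows.sort(key=lambda r: (-r[3], r[0]))  # largest deviation first
    overall = sum(r[3] for r in rows) / len(rows) if rows else 0.0
    return rows[:top_n], overall


# Hypothetical per-file dictionaries for two cohorts.
infected_files = [{"CASSL": 6.0, "CASSQ": 2.0}, {"CASSL": 2.0, "CASSQ": 2.0, "CASSX": 1.0}]
naive_files = [{"CASSL": 1.0, "CASSQ": 2.0}]

top, overall = rank_sequences(cohort_mean(infected_files), cohort_mean(naive_files))
for seq, inf_mean, naive_mean, dev in top:
    print(f"{seq}\t{inf_mean:.2f}\t{naive_mean:.2f}\t{dev:.1f}%")
print(f"Overall deviation: {overall:.1f}%")
```

CASSX appears only in the infected cohort, so it is not a shared sequence and is left out of the ranking.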
Sprint 5 Burndown
Tasks: (12 Tasks, 1 Week)
- The program needs to import dash and dash_bio for their visualization capabilities.
- The program should have an html template and dash app for server creation.
- The dash app object should be created with coherent text and color.
- The visualization process should receive the data of the 50 highest impact sequences, sorted by cohort deviation.
- The cohorts should be visualized side by side in a clustergram to properly visualize differences.
- The clustergram colors should be distinct and conducive to differentiation.
- The clustergram should identify possible clusters and recurrent behavior.
- Cohort columns should be labelled appropriately.
- Rows should be identified by sequence.
- The graph should be fitted to the screen while being large enough to distinguish individual rows.
- The graph should be added to the template in the app object.
- The server should be run without debugging to maintain an interactive single-run visualization.
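
A minimal sketch of how the visualization tasks above might fit together is given below. It assumes the top sequences arrive as (sequence, infected mean, naive mean, deviation) tuples; the function names, colors, title text, and figure size are placeholders, not the project's actual values. The dash, dash_bio, and numpy imports sit inside make_app() so that the data preparation can be exercised without those packages installed.

```python
def build_matrix(top_rows):
    """Split (sequence, infected_mean, naive_mean, deviation) rows into labels and values."""
    row_labels = [r[0] for r in top_rows]          # rows identified by sequence
    column_labels = ["Infected", "Naive"]          # one column per cohort
    values = [[r[1], r[2]] for r in top_rows]
    return row_labels, column_labels, values


def make_app(top_rows):
    """Build a Dash app holding one clustergram of the two cohorts."""
    import numpy as np
    import dash_bio
    from dash import Dash, dcc, html

    row_labels, column_labels, values = build_matrix(top_rows)
    figure = dash_bio.Clustergram(
        data=np.array(values, dtype=float),
        row_labels=row_labels,
        column_labels=column_labels,
        cluster="row",  # cluster sequences, keep the cohorts side by side
        color_map=[[0.0, "#2166ac"], [0.5, "#f7f7f7"], [1.0, "#b2182b"]],
        height=900,
        width=700,
    )
    app = Dash(__name__)
    app.layout = html.Div(
        [html.H1("Genome-Viz: highest impact sequences"), dcc.Graph(figure=figure)]
    )
    return app


# Hypothetical rows, already sorted by cohort deviation.
demo_rows = [("CASSL", 4.0, 1.0, 60.0), ("CASSQ", 2.0, 2.0, 0.0)]
labels, columns, matrix = build_matrix(demo_rows)
print(labels, columns, matrix)
```

To serve the page, call make_app(rows).run(debug=False) with the real ranked rows; older Dash releases use run_server() instead of run().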