cgat-developers/cgat-flow

No transcript_info table with pipeline_genesets

Closed this issue · 4 comments

ORIGINALLY POSTED @ CGATOxford/CGATPipelines:

pipeline_annotations used to make a table called transcript_info https://github.com/CGATOxford/CGATPipelines/blob/94b975256fa5142f15e82511f4b62a6c19ae38b9/obsolete/pipeline_annotations.py#L1049-L1054 which is used in multiple pipelines, for example in a tracker for rnaseqdiffexpression https://github.com/CGATOxford/CGATPipelines/blob/94b975256fa5142f15e82511f4b62a6c19ae38b9/CGATPipelines/pipeline_docs/pipeline_rnaseqdiffexpression/trackers/Genelists.py#L5-L30.

The config.ini for pipeline_genesets still contains an option for this table https://github.com/CGATOxford/CGATPipelines/blob/8ebe37408aa512ec910b767e07ade4d4b1733177/CGATPipelines/pipeline_genesets/pipeline.ini#L318 but the pipeline doesn't use this option and doesn't generate the table. This breaks at least one of my pipelines which integrates with pipeline_annotations/genesets. Is there any reason pipeline_genesets doesn't generate this table?
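In the meantime, a downstream pipeline could check whether the table exists before querying it. This is only a minimal sketch, assuming the annotations database is SQLite. The `has_table` helper, the in-memory database and the `some_other_table` name are hypothetical stand-ins, not part of CGATPipelines.

```python
import sqlite3


def has_table(connection, table_name):
    """Return True if a table with this name exists in the SQLite database."""
    row = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table_name,),
    ).fetchone()
    return row is not None


# In-memory database standing in for the annotations database (assumption).
dbh = sqlite3.connect(":memory:")
# Hypothetical table, created only so the database is not empty.
dbh.execute("CREATE TABLE some_other_table (gene_id TEXT)")

if not has_table(dbh, "transcript_info"):
    # A real pipeline would fail early here or build the table itself.
    print("transcript_info is missing from the annotations database")
```

Failing early like this gives a clearer message than an "no such table" error raised deep inside a tracker or report.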

REPLY FROM @Acribbs:
When developing the genesets pipeline, the intention was to initially support only the bare minimum needed for the CGAT-flow pipelines to work. The table was removed because it wasn't used in pipelines and CGATreports isn't supported going forward. If you think you need it then feel free to add it back in. I am using the new cgat-developers code now though.

Just checked and it's being used in PipelineGO.py as well so I guess this should go back in assuming PipelineGO.py is being retained?

cc.execute("SELECT DISTINCT gene_name, gene_id FROM transcript_info").fetchall())

[EDIT] Note, two functions in the PipelineGeneset.py module use transcript_info but don't appear to be used in any pipelines currently. Whether we keep these functions depends on whether they are in use elsewhere I guess?

"FROM transcript_info")]))

FROM transcript_info

There was a discussion as to whether pipeline GO should be kept or not. I think it was to be removed because Katy's pipeline enrichment covered all of it. However, I haven't managed to get round to it yet.

OK. In that case, it appears the transcript_info table should not be created by pipeline_genesets.py if the idea is to only provide the minimum set of inputs required for the cgatflow pipelines.

Seeing as the loadEnsemblTranscriptInformation function is still in PipelineGeneset, I'll just use this function to create the table in my pipeline.
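For anyone else who hits this, here is a rough sketch of what downstream code expects once the table exists. It does not call or reproduce loadEnsemblTranscriptInformation, whose signature isn't shown in this thread. The `gene_id` and `gene_name` columns come from the PipelineGO.py query quoted above. The `transcript_id` column, the SQLite in-memory database and the placeholder rows are assumptions for illustration only.

```python
import sqlite3

# Placeholder rows (transcript_id, gene_id, gene_name); not real annotation data.
rows = [
    ("T1", "G1", "geneA"),
    ("T2", "G1", "geneA"),
    ("T3", "G2", "geneB"),
]

# In-memory database standing in for the pipeline's own database (assumption).
dbh = sqlite3.connect(":memory:")
dbh.execute(
    "CREATE TABLE transcript_info "
    "(transcript_id TEXT, gene_id TEXT, gene_name TEXT)"
)
dbh.executemany("INSERT INTO transcript_info VALUES (?, ?, ?)", rows)
dbh.commit()

# Same query shape as the one used in PipelineGO.py, with an ORDER BY added
# so the result order is deterministic.
gene_pairs = dbh.execute(
    "SELECT DISTINCT gene_name, gene_id FROM transcript_info ORDER BY gene_id"
).fetchall()
```

With several transcripts per gene, the DISTINCT query collapses the table to one (gene_name, gene_id) pair per gene.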