List parameters and their annotations in `global` section with `-h` option
gaow opened this issue · 34 comments
So suppose I have this workflow:
$ sos run workflows/fastqtl_to_mash.ipynb -h -v3
Available workflows
export
default
convert
Sections
convert_0: Generate utility functions
convert_1: Convert summary stats gzip format to HDF5
convert_2: Merge single study data to multivariate data
default_3: Extract data to fit MASH model
default_4: Compute for MASH EZ model and save to RDS
export: Export notebook to HTML file
it neatly lists all steps in the workflow. But under `[global]` I have something like this:
parameter: cwd = path('./gtex6_workflow_output')
parameter: data_list = path("data/eQTL_summary_files.txt")
parameter: gene_list = path()
parameter: msg = "GTEx eQTL summary statistics"
# maximum number of groups per HDF5 file
parameter: maxsize = 1000
parameter: cols = [8, 9, 7]
parameter: keep_ensg_version = 0
# number of null samples per gene to train MASH mixture
parameter: null_per_gene = 9
# a list of effect names (SNP names) to exclude from mash analysis
parameter: exclude_mash_effects_list = 'NULL'
# a list of condition names (tissue names) to include in mash output.
# Default includes all conditions.
parameter: include_mash_condition_list = 'NULL'
# size of MASH train set
parameter: mash_train_size = 20000
Correct me if I am wrong, but I thought we could list parameters with `-v`? I do not think it works, though. It would be great if we could list parameters along with what they are, e.g.:
--exclude_mash_effects_list: a list of effect names (SNP names) to exclude from mash analysis
Not sure if `#` is the best identifier for parameter comments to show, but we could also do `##` or `#'` (roxygen convention). I'm just saying that showing parameter options could be useful when I make quick software-type workflows for others to run without looking into the details.
We had something like this when all parameters were defined in a [parameters]
section. It is more difficult to do now because parameters are scattered in the script, and are parsed only when the steps are called. In the extreme case, parameters are only invoked with dynamically created subworkflows. That is to say, we cannot determine in advance which parameters belong to which workflow, etc.
So the best that can be done, and perhaps already good enough, is to list all parameter definitions (and their comments) in the script, regardless of the workflow. Then there can be weird things such as parameters with the same names but different types/meanings that were designed for different workflows.
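A best-effort collector along these lines could look like the following Python sketch (a hypothetical helper, not SoS's actual parser): it walks the script and pairs each `parameter:` definition with the comment lines immediately above it, regardless of which workflow the parameter belongs to.

```python
import re

# Hypothetical sketch (not SoS's actual implementation): collect every
# `parameter:` definition in a script together with the comment lines
# immediately above it, regardless of which workflow it belongs to.
PARAM_RE = re.compile(r'^parameter:\s*(\w+)\s*=\s*(.*)$')

def collect_parameters(script_text):
    params, pending = [], []
    for line in script_text.splitlines():
        stripped = line.strip()
        if stripped.startswith('#'):
            pending.append(stripped.lstrip('#').strip())
            continue
        m = PARAM_RE.match(stripped)
        if m:
            params.append((m.group(1), m.group(2), ' '.join(pending)))
        pending = []          # any non-comment line resets the block
    return params

script = '''
# maximum number of groups per HDF5 file
parameter: maxsize = 1000
parameter: cols = [8, 9, 7]
'''
for name, default, doc in collect_parameters(script):
    print(f'--{name:20} {default:12} {doc}')
```

As the comment above notes, this deliberately ignores section boundaries, which is why same-named parameters from different workflows would all show up in one list.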
I see, that is why I thought we had that option before.
Then there can be weird things such as parameters with the same names but different type/meaning that were designed for different workflows.
I'm a bit confused here: for this workflow for example:
[a]
parameter: t=1
[b]
parameter: t=path('/tmp')
Then on the command line, will `--parameter 1` work? I assume it will for `a` but not `b`? Or, I am assuming `a` and `b` never get executed in one workflow, that is, `sos_run('a+b')` should quit with some error due to parameter conflicts? I'm just wondering what the behaviors will be and whether they are intuitive.
The behavior is that `sos run a` or `sos run b` will include relevant sections and create workflows, so hopefully only one version of the parameter will be used. `sos_run('a+b')` will lead to undefined behavior; the current behavior is that whichever `parameter: t=` statement comes first will consume the command line argument, and the next `parameter: t=` will be ignored.
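The first-consumer behavior described here can be modeled roughly as follows (a toy sketch, not SoS internals):

```python
# Toy model of the described behavior (not SoS internals): the first
# `parameter: t=` statement encountered consumes the command-line
# value; later definitions of the same name are silently ignored.
def bind_parameters(definitions, cli_args):
    """definitions: list of (name, default) in parse order;
    cli_args: mapping of command-line overrides."""
    bound = {}
    for name, default in definitions:
        if name in bound:
            continue          # duplicate definition: ignored
        bound[name] = cli_args.get(name, default)
    return bound

# workflows a and b combined, as in sos_run('a+b')
defs = [('t', 1), ('t', '/tmp')]
print(bind_parameters(defs, {'t': '5'}))  # first definition wins: {'t': '5'}
print(bind_parameters(defs, {}))          # no override: {'t': 1}
```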
I see, okay I think this is fair enough. Banning it entirely may be a bit restrictive.
Let me know how this works.
Note that in early versions of SoS, when all parameters were defined in a separate section, we included the comments before a parameter definition as comments for the parameter. This was deprecated because it can be confused with the section comment, which we still honor. E.g.
[10]
# this steps does this
parameter: a = 1
We could enforce something like
[10]
# this is section comment
parameter: a = 1 # for whatever purpose
if it is helpful.
Just noticed that a comment after `parameter:` will be displayed in the help message due to the way the parameter value is saved during the parsing stage.
parameter: a = 1 # for whatever purpose
This is great! Here is what I have for a workflow I'm on right now, for example:
$ sos run gtex6_mash_analysis.ipynb -h
Usage:
sos-runner gtex6_mash_analysis.ipynb [workflow_name] [options] [workflow_options]
workflow_name: Single or combined workflows defined in this script
options: Single-hyphen sos parameters (see "sos-runner -h" for details)
workflow_options: Double-hyphen workflow-specific parameters
Workflows:
sfa_download
default
flash
mash-paper
mash_scripts_download
mash-fast
sfa
export
Global Parameters:
cwd path('./gtex6_workflow_output')
data path("data/MatrixEQTLSumStats.Portable.Z.rds")
mash_src file_target("/opt/mash-paper/main.R")
sfa_exe file_target("/opt/sfa/bin/sfa_linux")
algorithm 'paper'
vhat 1
alpha 1
Sections
sfa_download: Download / install SFA (no need if running from docker `gaow/mash-paper`)
sfa: Perform SFA analysis (time estimate: <1min)
Parameters:
K 5
mash_scripts_download: Download / install MASH scripts (no need if running from docker `gaow/mash-paper`)
mash-paper_1: Compute data-driven prior matrices (time estimate: 40min to 4hrs depending on the
machine power)
mash-paper_2: Add in canonical configurations and single rank SFA priors (time estimate: <1min)
mash-paper_3: Fit MASH mixture model (time estimate: ~2.5hr)
mash-paper_4: Posterior inference on the "top" set of gene-snp pairs (time estimate: ~5hr on single
thread)
flash: Perform FLASH analysis (time estimate: 20min)
mash-fast_1: Compute data-driven / canonical prior matrices (time estimate: ~2h)
mash-fast_2: Fit MASH mixture model (time estimate: <5min)
mash-fast_3: Posterior inference on the "top" set of gene-snp pairs (time estimate: <3min)
export: Export notebook to HTML file
default: Run all analysis in this notebook
Now some suggestions:
- Replace `sos-runner` with `sos run`, because most people would use the `sos run` command?
- How are Workflows ordered? They are not ordered the same way as they appear in the notebook, which is a bit less readable ...
- For parameters, global or section-specific: can we display `--` so that users will know how to change them? E.g.:
Global Parameters:
--cwd path('./gtex6_workflow_output')
--data path("data/MatrixEQTLSumStats.Portable.Z.rds")
--mash_src file_target("/opt/mash-paper/main.R")
--sfa_exe file_target("/opt/sfa/bin/sfa_linux")
--algorithm 'paper'
--vhat 1
--alpha 1
Sections
sfa_download: Download / install SFA (no need if running from docker `gaow/mash-paper`)
sfa: Perform SFA analysis (time estimate: <1min)
Parameters:
--K 5
which makes it very clear that those are parameters?
Other than that, this is mostly perfect, making SoS now very much like an executable, particularly when dockerized.
Try again.
Cool, although when I tried to add comments:
Global Parameters:
--cwd path('./gtex6_workflow_output')
--data_list path("data/eQTL_summary_files.txt")
--gene_list path()
--msg "GTEx eQTL summary statistics"
--maxsize 1000 # maximum number of groups per HDF5 file
--cols [8, 9, 7]
--keep_ensg_version 0
--random_per_gene 9 # number of random SNPs to draw per gene for fitting MASH mixture
--random_snp_size 20000 # size of mash random SNPs. This specifies the size of `random` SNP set and the rest of random SNPs go into `random_test` SNP set.
--effects_list path('NULL') # a list of effect names (SNP names) to include in mash analysis.
--conditions_list path('NULL') # a list of condition names (tissue names) to include in mash analysis
This is fine I guess, because I figure it might be hard to properly add line-wrapping etc. unless we do some serious parsing ...
We have not decided how to add comments to parameters yet. `# comment` right now works as a side effect of having `#` as part of the default value (before parsing/evaluation). If we agree on this convention, I can display it properly with line wrapping etc.
Well, I would like this more:
# number of random SNPs to draw per gene for fitting MASH mixture
parameter: random_per_gene = 9
if possible :)
But we have the ambiguity of
[10]
# step or parameter comment?
parameter: a = 1
Unless we can introduce some convention such as
[10]
# step comment
## parameter comment
parameter: a = 1
Indeed. I was originally proposing the roxygen convention `#'`; not sure how you feel about it.
My gut feeling is that sos should try to "understand what users write" instead of "force users to write in sos way". I mean, instead of introducing some complex rules such as
[10]
#! step comment
# free comment
#' parameter comment
parameter: a = 1
I would rather do
[10]
# step comment
# free comment
# parameter comment
parameter: a = 1
The convention
[10]
# step comment
# free comment
# parameter comment
parameter: a = 1
is implemented. Let me know how it works.
Okay, you are saying that comments immediately before a parameter will be the parameter comment? That is reasonable to me. And the way to distinguish is to use line breaks?
I checked the current behavior. It is good!
Yes. Consecutive comments immediately after a section header will be the section comment; consecutive comments immediately before a parameter definition will be the parameter comment.
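As a rough sketch of that rule (hypothetical code, not the actual SoS implementation), the blank line is what separates the section-comment block from a later parameter-comment block:

```python
def classify(lines):
    """Toy model of the stated rule: a comment block touching the
    section header documents the section; a comment block touching a
    `parameter:` line documents that parameter."""
    section_comment, param_comments = [], {}
    pending, touches_header = [], False
    for raw in list(lines) + ['']:        # sentinel flushes the last block
        s = raw.strip()
        if s.startswith('#'):
            pending.append(s.lstrip('#').strip())
            continue
        if s.startswith('parameter:') and pending and not touches_header:
            name = s.split(':', 1)[1].split('=')[0].strip()
            param_comments[name] = ' '.join(pending)
        elif touches_header and pending:
            section_comment = pending
        touches_header = s.startswith('[') and s.endswith(']')
        pending = []
    return ' '.join(section_comment), param_comments

example = """[10]
# step comment

# parameter comment
parameter: a = 1""".splitlines()
print(classify(example))   # ('step comment', {'a': 'parameter comment'})
```

Without the blank line, the comment block touches the header and is treated as the section comment, which is the ambiguity discussed above.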
Okay, maybe another final minor point, sorry for being a nuisance: in the help message, those parameters are termed `workflow_options: Double-hyphen workflow-specific parameters`. So I think maybe we should also change `Global Parameters` to `Global workflow options`, and `Parameters` under `Sections` to `workflow options`?
BTW this looks nice to me:
usage: sos run fastqtl_to_mash.ipynb
[workflow_name] [options] [workflow_options]
workflow_name: Single or combined workflows defined in this script
options: Single-hyphen sos parameters (see "sos run -h" for details)
workflow_options: Double-hyphen workflow-specific parameters
Workflows:
convert
default
export
Global Parameters:
--cwd path('./gtex6_workflow_output')
--data_list path("data/eQTL_summary_files.txt")
A text file containing data-set names
--gene_list path() Optionally, a list of gene names.
--msg "eQTL mapping summary statistics"
Meta-info tag for HDF5 database
--cols [8, 9, 7] Columns for betahat, se(betahat) and p-value (1-based indexing)
--keep_ensg_version 0
Sections
convert_0: Generate utility functions
convert_1: Convert summary stats gzip format to HDF5
convert_2: Merge single study data to multivariate data
default_3: Extract data to fit MASH model
Parameters:
--random_per_gene 9
Number of random SNPs to draw per gene for fitting MASH mixture
default_4: Compute for MASH EZ model and save to RDS
Parameters:
--random_snp_size 20000
Size of mash random SNPs. This specifies the size of `random` SNP set and the rest of
random SNPs go into `random_test` SNP set.
--effects_list path('NULL')
A list of effect names (SNP names) to include in mash analysis.
--conditions_list path('NULL')
A list of condition names (tissue names) to include in mash analysis
export: Export notebook to HTML file
Need to go. I also tried to change
--data_list path("data/eQTL_summary_files.txt")
to
--data_list data/eQTL_summary_files.txt (as path)
You can change this or other formats at the end of `utils.py`.
Great! I find the 24-character rule does not always work well, so I changed it to a new line for all comments:
usage: sos run fastqtl_to_mash.ipynb
[workflow_name] [options] [workflow_options]
workflow_name: Single or combined workflows defined in this script
options: Single-hyphen sos parameters (see "sos run -h" for details)
workflow_options: Double-hyphen workflow-specific parameters
Workflows:
convert
default
document_it
Global Workflow Options:
--cwd gtex6_workflow_output (as path)
--data-list data/eQTL_summary_files.txt (as path)
A text file containing data-set names
--gene-list . (as path)
Optionally, a list of gene names.
--msg 'eQTL mapping summary statistics'
Meta-info tag for HDF5 database
--cols 8 9 7 (as list)
Columns for betahat, se(betahat) and p-value (1-based
indexing)
--keep-ensg-version 0 (as int)
Sections
convert_0: Generate utility functions
convert_1: Convert summary stats gzip format to HDF5
convert_2: Merge single study data to multivariate data
default_3: Extract data to fit MASH model
Workflow Options:
--random-per-gene 9 (as int)
Number of random SNPs to draw per gene for fitting MASH
mixture
default_4: Compute for MASH EZ model and save to RDS
Workflow Options:
--random-snp-size 20000 (as int)
Size of mash random SNPs. This specifies the size of
`random` SNP set and the rest of random SNPs go into
`random_test` SNP set.
--effects-list NULL (as path)
A list of effect names (SNP names) to include in mash
analysis.
--conditions-list NULL (as path)
A list of condition names (tissue names) to include in
mash analysis
document_it: Export workflow to HTML document
I'm quite happy with the outcome.
The output is actually not accurate, because workflows can also be triggered by targets. These are more difficult to collect and display, though.
This update lists the targets the script provides, and another way to start the script (`-t targets`). Please close the ticket if you are OK with the end result.
After more thought, I think having the section comment before the section makes more sense and is more consistent. I mean
# step comment
[10]
# free comment
# parameter comment
parameter: a = 1
That is to say, anything immediately before a section header or parameter will be its comment.
One additional rule could be that the second comment block is the script description:
#!/usr/bin/env sos-runner
# this is script comment
#
# this is script comment
# this is private stuff
# global parameter
parameter: a = 1
# this is first section
[10]
# this is implementation details
# parameter
parameter: b = 2
Documented here. The advantage of the new convention is that you can do
#global par
parameter: gp = 1
[10]
# section par
parameter: sp = 2
without worrying about conflicts with script and section comments.
I want to demonstrate this feature with one of the SoS examples but I noticed this issue:
sos-docs/src/examples$ sos run Process_Oriented.sos -h
Workflow Options:
--N 40 200 (as tuple)
training and testing samples
--rstd 3 (as int)
training and testing samples
--replicate 1 2 3 4 5 (as list)
training and testing samples
The script has
# training and testing samples
parameter: N = ..
parameter: rstd = ..
...
Apparently, because I did not have a comment for `rstd` etc., those parameters incorrectly use the comment from the first parameter `N` and display it.
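The reported behavior looks like a "sticky buffer" bug; a minimal illustration (hypothetical code, not the actual fix in SoS) is that the pending comment block must be cleared once the first parameter consumes it:

```python
# Minimal illustration of the reported bug: if the pending comment
# buffer is not cleared after a parameter consumes it, the same comment
# leaks into every following undocumented parameter.
def attach_comments(lines, clear_after_use=True):
    docs, pending = {}, []
    for raw in lines:
        s = raw.strip()
        if s.startswith('#'):
            pending.append(s.lstrip('#').strip())
        elif s.startswith('parameter:'):
            name = s.split(':', 1)[1].split('=')[0].strip()
            docs[name] = ' '.join(pending)
            if clear_after_use:
                pending = []      # the fix: comment is used up
        else:
            pending = []
    return docs

script = [
    '# training and testing samples',
    'parameter: N = (40, 200)',
    'parameter: rstd = 3',
]
print(attach_comments(script, clear_after_use=False))  # buggy: rstd inherits N's comment
print(attach_comments(script))                         # fixed: only N is documented
```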
Try again.
In my RNA-seq DE example, I noticed that for comment strings starting at the beginning of a cell, the content will not be displayed with `-h`. For example:
$ sos run RNASeqDE.ipynb -h
usage: sos run RNASeqDE.ipynb [workflow_name | -t targets] [options] [workflow_options]
workflow_name: Single or combined workflows defined in this script
targets: One or more targets to generate
options: Single-hyphen sos parameters (see "sos run -h" for details)
workflow_options: Double-hyphen workflow-specific parameters
Workflows:
sra
hg19_reference
star
align
Global Workflow Options:
--cwd /home/gaow/GIT/LargeFiles/RNAseqDE (as path)
--samples SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520 SRR1039521 (as list)
--ncpu 4 (as int)
Sections
sra:
hg19_reference_1:
hg19_reference_2: Use `twoBitToFa` to extract `hg19.fa` from `hg19.2bit`
hg19_reference_3: Download `Homo_sapiens.GRCh38.91.gtf.gz` from Ensembl
star: Prepare STAR reference data
align_1:
align_2:
you see that `sra`, `hg19_reference_1`, `align_1` and `align_2` have no documentation, but they in fact do have it in the notebook! I guess it is because they appear at the first line of a cell and somehow got ignored?
Yes, I changed the comment style for plain sos, but forgot to change sos notebook.
Perhaps you have not noticed, but the SoS comment style has been changed to "before content". That is to say, comments immediately before the section header (and parameter definition) will be the comment for the header (and parameter). This style is more consistent than, for example, having comments after the section header but before the parameter, and avoids confusion in a common scenario:
[section]
# comment
parameter: a = 1
When it comes to SoS Notebook: currently sos extracts only lines after section headers, so that we do not extract magics in cases such as
%run
[section]
# comment
This worked before, since a `# comment` after `[section]` counted, but not now. Therefore, the code needs to be changed to include comments immediately before `[section]`.
We still have a problem with the script comment, though. Currently the plain format takes the second comment block as the script comment, but there is no such concept in the notebook format.
Perhaps you have not noticed, the sos comment style has been changed to "before content"
Yes, and this is how it is currently implemented in SoS Notebook. I think, with the exception of the first comment in the cell, everything works as expected, even though you said you did not change it for notebooks. That is, in SoS Notebook lines before the section header are actually extracted, but not when they are the first lines of a cell.
Please check if everything works ok now (extract from sos, not sos-notebook, which is still broken), and if not, change the sample notebook and test case to demonstrate the problem.
It works now for my example:
usage: sos run RNASeqDE.ipynb [workflow_name | -t targets] [options] [workflow_options]
workflow_name: Single or combined workflows defined in this script
targets: One or more targets to generate
options: Single-hyphen sos parameters (see "sos run -h" for details)
workflow_options: Double-hyphen workflow-specific parameters
Workflows:
sra
hg19_reference
star
align
Global Workflow Options:
--cwd /home/gaow/GIT/LargeFiles/RNAseqDE (as path)
--samples SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520 SRR1039521 (as list)
--ncpu 4 (as int)
Sections
sra: decrypt SRA files
hg19_reference_1: Download `hg19.2bit` and `twoBitToFa` from UCSC
hg19_reference_2: Use `twoBitToFa` to extract `hg19.fa` from `hg19.2bit`
hg19_reference_3: Download `Homo_sapiens.GRCh38.91.gtf.gz` from Ensembl
star: Prepare STAR reference data
align_1: Align to SAM file
align_2: Convert SAM to BAM file & zap the input
and this is extracted from SoS Notebook (am I missing something again?)! So I'm not sure if I should close the ticket ...
The extraction was implemented in `sos` so that it does not have to depend on `sos-notebook`. It is re-implemented as a converter in sos-notebook, and again in JS for sos-notebook for magics to work...
I see what you meant now! Okay, this ticket is about `-h`, and since it also seems you've fixed the ipynb-related JS code, I guess we can close the ticket now and use new tickets for future issues.