ratt-ru/shadeMS

Cannot plot arbitrarily-named data columns

Opened this issue · 11 comments

I've added a DIR1 column for DD-cal purposes. I think this check is preventing me from plotting it:

https://github.com/ratt-ru/shadeMS/blob/master/shade_ms/data_mappers.py#L135

I could call it DIR1_DATA as a workaround I guess, but I think it's friendlier to change the plotting tool rather than the plotting tool asking everyone to change their behaviour.

Possible workaround is to flip this check on its head, and check for non-data and non-spectrum columns (TIME, UVW, etc.). That still denies ultimate freedom to the plot-loving user, but I can't think of many use cases where a user would add a non-standard column to the MS that wasn't a data column.
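A rough sketch of what flipping the check might look like. The function names and the denylist contents are hypothetical, and the suffix rule is only my guess at what the linked check does, not a copy of the shadems code:

```python
# Hypothetical denylist of standard MS columns that are not visibility-style.
# The set is illustrative and deliberately incomplete.
NON_DATA_COLUMNS = frozenset({
    "TIME", "UVW", "ANTENNA1", "ANTENNA2", "FIELD_ID", "DATA_DESC_ID",
    "SCAN_NUMBER", "INTERVAL", "EXPOSURE", "FLAG_ROW",
})


def is_data_column_by_suffix(name):
    """Suffix-style rule: only names ending in DATA count as data columns."""
    return name.endswith("DATA")


def is_data_column_by_denylist(name):
    """Flipped rule: anything not known to be a non-data column counts."""
    return name not in NON_DATA_COLUMNS


# Compare the two rules on a few column names.
for name in ("DATA", "DIR1_DATA", "DIR1", "UVW"):
    print(name, is_data_column_by_suffix(name), is_data_column_by_denylist(name))
```

With the flipped rule DIR1 is accepted without renaming, at the cost of also accepting any other non-standard column that is missing from the denylist.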

...but then again, I couldn't think of many use cases where a user would add a visibility column not ending with DATA!

No, that just feels like pushing the lump in the carpet around... gotta be a better way...

Come to think of it, how does dask-ms decide that an arbitrarily-named column is of shape nrow,nfreq,ncorr, @sjperkins? Does it just assume every 3D column has that shape? Shadems should be copacetic.

It shouldn't. You have to specify the dimension schema for non-standard columns; otherwise it sets up the schema based on the column name. The relevant point in the docs is the table_schema kwarg: https://dask-ms.readthedocs.io/en/latest/api.html#daskms.xds_from_table

Without this, I'd expect the dimension schema for MYCOL to be something like ('row', 'MYCOL-1', 'MYCOL-2')
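For illustration, a minimal sketch of supplying such a schema. The helper names are hypothetical, and the list form (the "MS" defaults followed by a dict of per-column overrides) is my reading of the table_schema documentation linked above, so check it against the installed dask-ms version:

```python
def user_column_schema(column_names):
    """Build a table_schema value: MS defaults plus (chan, corr) dims
    for each user column. The row dimension is implicit."""
    overrides = {name: {"dims": ("chan", "corr")} for name in column_names}
    return ["MS", overrides]


def open_ms_with_user_columns(ms_path, column_names):
    """Open a Measurement Set with the extra columns declared up front.
    Hypothetical helper; daskms is imported lazily so the schema builder
    above can be used without it."""
    from daskms import xds_from_ms

    return xds_from_ms(ms_path, table_schema=user_column_schema(column_names))


schema = user_column_schema(["DIR1"])
print(schema)
```

Declared this way, DIR1 should come back with dims ('row', 'chan', 'corr') rather than ('row', 'DIR1-1', 'DIR1-2').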

Right, so non-standard columns need to be specified in the schema up front, or things will fall down anyway.

Is there some way to look up the default schema via the API? I'm just trying to nail down what logic shadems should use to answer the question "is this a data-style column". Something like

  1. Look up column name in default schema. If it exists, the schema has the answer.

  2. If it doesn't exist, open table and check the column description. If this is a fixed-shape column, we have an answer.

  3. If it is a variable-shaped column, read the cell in row 0 and look at its shape for an answer.

  4. If there's no cell in row 0, assume it's a data-style column. Or just give up and curl into a ball.
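The four steps above could be sketched roughly as follows. Everything here is hypothetical: DEFAULT_DIMS is a tiny stand-in for whatever default schema lookup dask-ms offers, and the column description is assumed to be a dict that carries a "shape" entry only for fixed-shape columns:

```python
# Hypothetical stand-in for a default schema: column name -> per-row dims.
DEFAULT_DIMS = {
    "DATA": ("chan", "corr"),
    "MODEL_DATA": ("chan", "corr"),
    "FLAG": ("chan", "corr"),
    "UVW": ("uvw",),
    "TIME": (),
}


def is_data_style(column, default_dims=DEFAULT_DIMS, coldesc=None, row0_shape=None):
    """Return True if the column looks like a (row, chan, corr) column."""
    # 1. Known column: the default schema has the answer.
    if column in default_dims:
        return tuple(default_dims[column]) == ("chan", "corr")
    # 2. Fixed-shape column: a two-axis cell means (chan, corr).
    if coldesc is not None and coldesc.get("shape") is not None:
        return len(coldesc["shape"]) == 2
    # 3. Variable-shaped column: inspect the cell in row 0, if there is one.
    if row0_shape is not None:
        return len(row0_shape) == 2
    # 4. Nothing to go on: assume it is a data-style column.
    return True


print(is_data_style("UVW"))
print(is_data_style("DIR1", coldesc={"shape": [1024, 4]}))
print(is_data_style("DIR1", coldesc={}, row0_shape=(3,)))
```

The caller is expected to fetch coldesc and row0_shape from the table itself; the sketch only encodes the order in which the answers are trusted.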

In fact, there's a fair amount of this lurking in undocumented internal APIs. Part of it is implemented here: https://github.com/ska-sa/dask-ms/blob/master/daskms/columns.py#L95.

At a higher level, what might be more appropriate is the (undocumented) Table Descriptor Builder API in this directory:

https://github.com/ska-sa/dask-ms/tree/master/daskms/descriptors
https://github.com/ska-sa/dask-ms/blob/master/daskms/descriptors/ms.py
https://github.com/ska-sa/dask-ms/blob/master/daskms/descriptors/ratt_ms.py

Could do a zoom tomorrow to discuss?

...but then again, I couldn't think of many use cases where a user would add a visibility column not ending with DATA!

If you've got a code snippet for renaming a column I'm all eyes.

I think this could simply be done in shadems as follows:

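# Relabel 3D user columns ending in "DATA" with the standard visibility dims
relabelled = []

for dataset in datasets:
    user_data_cols = {}

    for column, variable in dataset.data_vars.items():
        if column.endswith("DATA") and len(variable.dims) == 3:
            user_data_cols[column] = (("row", "chan", "corr"), variable.data, variable.attrs)

    # assign() returns a new dataset, so keep the result rather than rebinding the loop variable
    relabelled.append(dataset.assign(user_data_cols))

datasets = relabelled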

The above from memory, but should work.

As simple as that!

Cheers.

Seems to have failed in a different way when I fall into line with my column names:

2020-07-16 14:39:11 - shadems - INFO - :   1953/1953 baselines present
2020-07-16 14:39:11 - shadems - INFO - :   corrs/Stokes XX XY YX YY I Q U V
2020-07-16 14:39:11 - shadems - INFO - ------------------------------------------------------
2020-07-16 14:39:11 - shadems - INFO - : Data selected for plotting:
2020-07-16 14:39:11 - shadems - INFO - Antenna(s)       : all
2020-07-16 14:39:11 - shadems - INFO - Baseline(s)      : all except autocorrelations
2020-07-16 14:39:11 - shadems - INFO - Field(s)         : all
2020-07-16 14:39:11 - shadems - INFO - SPW(s)           : all
2020-07-16 14:39:11 - shadems - INFO - Scan(s)          : all
2020-07-16 14:39:11 - shadems - INFO - Channels         : all
2020-07-16 14:39:11 - shadems - INFO - Corr/Stokes      : XX XY YX YY
2020-07-16 14:39:11 - shadems - INFO - ------------------------------------------------------
2020-07-16 14:39:11 - shadems - INFO - loading minmax cache from 1561266559_sdp_l0_1024ch_CDFS_2_4-minmax-cache.json
2020-07-16 14:39:11 - shadems - INFO - axis: UV(UVW), range (None, None), discretization None
2020-07-16 14:39:11 - shadems - INFO - axis: amp(DIR1_DATA), corr 0, range (None, None), discretization None
2020-07-16 14:39:11 - shadems - INFO -                  : you have asked for 1 plots employing 2 unique datums
2020-07-16 14:39:13 - shadems - INFO - : Indexing MS and building dataframes (5952868 rows, chunk size is 5000)
Traceback (most recent call last):
  File "/mnt/home/ianh/venv/shadems/bin/shadems", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
  File "/mnt/home/ianh/Software/shadeMS/bin/shadems", line 8, in <module>
    main.main([a for a in sys.argv[1:]])
  File "/mnt/home/ianh/Software/shadeMS/shade_ms/main.py", line 620, in main
    row_chunk_size=options.row_chunk_size)
  File "/mnt/home/ianh/Software/shadeMS/shade_ms/data_plots.py", line 191, in get_plot_data
    value = axis.get_value(group, corr, extras, flag=flag, flag_row=flag_row, chanslice=chanslice)
  File "/mnt/home/ianh/Software/shadeMS/shade_ms/data_mappers.py", line 383, in get_value
    return dama.masked_array(coldata, da.logical_or(flag, bad_data))
  File "/mnt/home/ianh/venv/shadems/lib/python3.6/site-packages/dask/array/ma.py", line 229, in masked_array
    "%s." % (repr(data.shape), repr(mask.shape))
numpy.ma.core.MaskError: Mask and data not compatible: data shape is (5952868, 1024), and mask shape is (5952868, 1024, 1024).

The DIR1_DATA column is a clone of DATA in terms of the coldesc.

CASA <5>: mod = tb.getcol(columnname='MODEL_DATA',nrow=1000)

CASA <6>: mod.shape
Out[6]: (4, 1024, 1000)

CASA <7>: dir1 = tb.getcol(columnname='DIR1_DATA',nrow=1000)

CASA <8>: dir1.shape
Out[8]: (4, 1024, 1000)

CASA <9>: fg = tb.getcol(columnname='FLAG',nrow=1000)

CASA <10>: fg.shape
Out[10]: (4, 1024, 1000)
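Note that the CASA shapes above are ordered (corr, chan, row), while the shapes in the traceback have the row axis first. A minimal NumPy sketch of how the two orderings relate, plus a shape check before masking. It uses small made-up arrays and hypothetical helper names, and it does not reproduce or diagnose the shadems failure itself:

```python
import numpy as np

# Small stand-ins for the CASA shapes, ordered (ncorr, nchan, nrow).
casa_shape = (4, 16, 10)
dir1_casa = np.zeros(casa_shape, dtype=np.complex64)
flag_casa = np.zeros(casa_shape, dtype=bool)

# Reverse the axes to get the row-first ordering (nrow, nchan, ncorr).
dir1 = dir1_casa.transpose(2, 1, 0)
flag = flag_casa.transpose(2, 1, 0)


def select_corr(array, corr):
    """Pick one correlation from a (row, chan, corr) array."""
    return array[..., corr]


def masked(data, mask):
    """Build a masked array, failing early if the shapes disagree."""
    if data.shape != mask.shape:
        raise ValueError(f"data shape {data.shape} != mask shape {mask.shape}")
    return np.ma.masked_array(data, mask=mask)


amp = np.abs(select_corr(dir1, 0))
amp_masked = masked(amp, select_corr(flag, 0))
print(dir1.shape, amp_masked.shape)
```

Passing the full 3D flag array to masked() instead of the per-correlation slice raises the ValueError, which is the same kind of mismatch the MaskError above is reporting.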