cdisc-org/cdisc-rules-engine

Rule blocked: CORERULES-254/1186/3701 - PANDAS Implementation

Closed this issue · 3 comments

see #557

https://jira.cdisc.org/browse/CORERULES-254
https://jira.cdisc.org/browse/CORERULES-1186
https://jira.cdisc.org/browse/CORERULES-3701

Describe the bug

  • Target_is_sorted_by needs to be validated to work with Dask; in its present implementation the operator is rooted in Pandas.
  • It needs to be made more robust and able to handle partial dates as well as date and time comparisons for sorting.

The operator should convert both values to the broadest contiguous date/time part for comparison.

If a time part is missing, it should be substituted by "00:00:00" to be able to do the sorting; afterwards it gets cut off again.
When comparing a date and a time, only those parts that are available will be compared. If a date is compared with a date/time, only the date part will be compared.
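
A minimal sketch of the "00:00:00" substitution described above, assuming values that carry a complete date and either a full time or no time at all. The helper name `sort_key` is hypothetical and not part of the engine. Using the padded value only as a sort key means nothing has to be cut off afterwards, because the original strings are never modified.

```python
def sort_key(value: str) -> str:
    """Pad a date-only value with a midnight time so all keys have the same shape.

    Assumption: the value is either YYYY-MM-DD or YYYY-MM-DDThh:mm:ss.
    Partial times and time zones are not handled in this sketch.
    """
    return value if "T" in value else value + "T00:00:00"


values = ["2024-01-02", "2024-01-01T12:00:00", "2024-01-01"]

# The padded form is used for ordering only; the stored values stay as they were.
ordered = sorted(values, key=sort_key)
print(ordered)
```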

I have returned this to In Progress after my discussion with Gerry.
The addition of blank numbers for normalizing before the sort may not be needed, as ISO 8601 allows for sorting without this. I will look at the behavior to ensure it is correct.
Also included in the AC:

  • Ensure that missing date components (represented by a hyphen, e.g. 2005---10) are properly addressed. This would be the tenth of some month in 2005. The precision for sorting is only the date. Gerry believes the sort should move this before any month, as - is ASCII 45 and digits start at 48 for string sorts.
  • Time zones and durations should also be accounted for; we may need to talk to @nickdedonder about this when he returns.
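
The point about hyphens sorting before digits can be checked with a plain string sort. This is only an illustration with made-up sample values, not engine code:

```python
# A hyphen placeholder for a missing month sorts before any numeric month,
# because "-" has a lower code point than the digits.
hyphen_code = ord("-")
zero_code = ord("0")

values = ["2005-01-10", "2005---10", "2005-12-10", "2004-06-01"]
ordered = sorted(values)
print(hyphen_code, zero_code)
print(ordered)
```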

image.png

per SDTMIG 4.4.1 and 4.4.2

@SFJohnson24

2 potential Algorithms

Without grouping

  1. Sort all records by grouping vars (USUBJID) + order vars (--SEQ)
  2. For each record, n, flag an issue if:
    • n>0 (0-indexed)
    • grouping vars n matches grouping vars n-1
    • comparator var n (--STDTC) is date_less_than comparator var n-1
    • or comparator var n (--STDTC) is date_equal_but_different_precision comparator var n-1
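
A rough pandas sketch of the "without grouping" steps above. The function name `flag_unsorted`, the sample column names, and the simplified `date_less_than` stand-in are all assumptions for illustration. The stand-in only truncates to the shorter string and does not handle hyphen placeholders, and the date_equal_but_different_precision condition is left out.

```python
import pandas as pd


def date_less_than(current: str, previous: str) -> bool:
    """Simplified stand-in: truncate both strings to the shorter length, then
    do a strict string comparison. Hyphen placeholders are not handled."""
    width = min(len(current), len(previous))
    return current[:width] < previous[:width]


def flag_unsorted(df, group_vars, order_vars, comparator):
    # Step 1: sort all records by grouping vars + order vars.
    ordered = df.sort_values(group_vars + order_vars)
    # Step 2: compare each record n with record n-1 using shift.
    prev_groups = ordered[group_vars].shift(1)
    prev_value = ordered[comparator].shift(1)
    # The first row compares against NaN, so same_group is False there (n > 0).
    same_group = (ordered[group_vars] == prev_groups).all(axis=1)
    flags = [
        bool(same) and date_less_than(cur, prev)
        for same, cur, prev in zip(same_group, ordered[comparator], prev_value)
    ]
    # Return the flags in the original row order.
    return pd.Series(flags, index=ordered.index).reindex(df.index)


df = pd.DataFrame(
    {
        "USUBJID": ["S1", "S1", "S1", "S2", "S2"],
        "AESEQ": [1, 2, 3, 1, 2],
        "AESTDTC": ["2024-01-10", "2024-01-05", "2024-02", "2023-12-01", "2024-01"],
    }
)
flags = flag_unsorted(df, ["USUBJID"], ["AESEQ"], "AESTDTC")
print(flags.tolist())
```

Only the second record of S1 is flagged here: the first record of S2 is earlier than the last record of S1, but it belongs to a different subject, so the grouping check suppresses it.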

With grouping

  1. Group all records by grouping vars, sorted by order vars
  2. For each record within each group, flag an issue if:
    • n>0
    • comparator var n (--STDTC) is date_less_than comparator var n-1
    • or comparator var n (--STDTC) is date_equal_but_different_precision comparator var n-1
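
The "with grouping" variant could look like the sketch below, where a grouped shift replaces the explicit group-match check. As before, `flag_unsorted_grouped` and the simplified `date_less_than` stand-in are hypothetical names, the stand-in ignores hyphen placeholders, and the date_equal_but_different_precision condition is omitted. It also assumes the comparator column has no missing values.

```python
import pandas as pd


def date_less_than(current: str, previous: str) -> bool:
    """Simplified stand-in: truncate to the shorter string, strict comparison."""
    width = min(len(current), len(previous))
    return current[:width] < previous[:width]


def flag_unsorted_grouped(df, group_vars, order_vars, comparator):
    # Sort by order vars within the grouping vars.
    ordered = df.sort_values(group_vars + order_vars)
    # A grouped shift yields NaN for the first record of every group,
    # which covers the n > 0 condition without comparing group keys.
    previous = ordered.groupby(group_vars)[comparator].shift(1)
    flags = [
        bool(pd.notna(prev)) and date_less_than(cur, prev)
        for cur, prev in zip(ordered[comparator], previous)
    ]
    return pd.Series(flags, index=ordered.index).reindex(df.index)


df = pd.DataFrame(
    {
        "USUBJID": ["S2", "S1", "S1", "S2", "S1"],
        "AESEQ": [2, 1, 2, 1, 3],
        "AESTDTC": ["2023-11", "2024-01-10", "2024-01", "2023-12-01", "2023-06-30"],
    }
)
grouped_flags = flag_unsorted_grouped(df, ["USUBJID"], ["AESEQ"], "AESTDTC")
print(grouped_flags.tolist())
```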

date_less_than

I'm not sure if we already have a function date_less_than, but if not, it should do the following:

  • Take as input date n, date n-1
  • Truncate both dates to the least precision
    • (2024-01-01, 2023-01) -> (2024-01, 2023-01)
    • (2024---01, 2023-01-01) -> (2024, 2023)
  • String compare the resulting dates. Note that this should be strictly less than, not less than or equal (LTE).
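
One possible reading of the date_less_than steps above, as a sketch rather than the engine's implementation. The helper name `leading_components` and both regular expressions are assumptions. Time zones, fractional seconds, and durations are deliberately not handled. Comparing lists of fixed-width components gives the same result as comparing the truncated strings.

```python
import re

# Each component is either digits or a single "-" placeholder for "missing".
_DATE_RE = re.compile(r"^(\d{4}|-)(?:-(\d{2}|-))?(?:-(\d{2}|-))?$")
_TIME_RE = re.compile(r"^(\d{2}|-)(?::(\d{2}|-))?(?::(\d{2}|-))?$")


def leading_components(value: str) -> list:
    """Return the contiguous leading components (year, month, ...) of a value."""
    date_part, _, time_part = value.partition("T")
    date_match = _DATE_RE.match(date_part)
    if date_match is None:
        raise ValueError(f"unsupported date: {value!r}")
    parts = list(date_match.groups())
    if time_part:
        time_match = _TIME_RE.match(time_part)
        if time_match is None:
            raise ValueError(f"unsupported time: {value!r}")
        parts += list(time_match.groups())
    leading = []
    for part in parts:
        # Stop at the first absent or hyphen-placeholder component.
        if part is None or part == "-":
            break
        leading.append(part)
    return leading


def date_less_than(current: str, previous: str) -> bool:
    """Truncate both values to the lower precision, then compare strictly."""
    cur = leading_components(current)
    prev = leading_components(previous)
    precision = min(len(cur), len(prev))
    return cur[:precision] < prev[:precision]


print(leading_components("2024---01"))
print(date_less_than("2023-01", "2024-01-01"))
```

With the two examples from the list, (2024-01-01, 2023-01) is compared as (2024-01, 2023-01) and (2024---01, 2023-01-01) as (2024, 2023); neither is flagged, since 2024 is not less than 2023.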

date_equal_but_different_precision

(Is this part necessary?)

For comparing across adjacent rows, you could use pandas/dask functions like shift or map_overlap:
https://docs.dask.org/en/latest/generated/dask_expr._collection.Series.map_overlap.html#dask_expr._collection.Series.map_overlap