worldbank/ietoolkit

`iedorep`: Re-write `recursive` option to expand sub-do-files in place for proper execution

luisesanmartin opened this issue · 12 comments

The command iedorep dofile1.do works as expected with these two do-files,

  • dofile1.do
set obs 100
do "dofile2.do"
  • dofile2.do
count
gen var5 = runiform()

returning this error message:

. iedorep try_iedorep.do
 
Processing: try_iedorep.do
Entering try_iedorep.do run....
Done with try_iedorep.do!

  +------------------------------------------------------+
  | Line |           Data |        Seed | Sort | Subfile |
  |------+----------------+-------------+------+---------|
  |    2 | ERROR! Changed | ERROR! Used |      |     Yes |
  +------------------------------------------------------+

. 

However, using the option recursive does not flag the reproducibility error in dofile2.do

. iedorep try_iedorep.do, recursive
 
Processing: try_iedorep.do
Entering try_iedorep.do run....
Done with try_iedorep.do!

  +------------------------------------------------------+
  | Line |           Data |        Seed | Sort | Subfile |
  |------+----------------+-------------+------+---------|
  |    2 | ERROR! Changed | ERROR! Used |      |     Yes |
  +------------------------------------------------------+
 
Errors detected in the following sub do-files; starting recursion.

  +--------------------------+
  | Line |              Path |
  |------+-------------------|
  |    2 | "try_iedorep2.do" |
  +--------------------------+
 
Processing: "try_iedorep2.do"
Entering "try_iedorep2.do" run....
Done with "try_iedorep2.do"!
 
No errors detected in sub do-files; recursion halted.

. 

Shouldn't line 2 of dofile2.do be flagged in a third table in the console?
Note that when I add the options alldata, allsort, and allseed there's a third table noting that the data changed in line 2 of dofile2.do

. iedorep try_iedorep.do, alldata allsort allseed recursive
 
Processing: try_iedorep.do
Entering try_iedorep.do run....
Done with try_iedorep.do!

  +------------------------------------------------------+
  | Line |           Data |        Seed | Sort | Subfile |
  |------+----------------+-------------+------+---------|
  |    1 |        Changed |             |      |         |
  |    2 | ERROR! Changed | ERROR! Used |      |     Yes |
  +------------------------------------------------------+
 
Errors detected in the following sub do-files; starting recursion.

  +--------------------------+
  | Line |              Path |
  |------+-------------------|
  |    2 | "try_iedorep2.do" |
  +--------------------------+
 
Processing: "try_iedorep2.do"
Entering "try_iedorep2.do" run....
Done with "try_iedorep2.do"!

  +----------------------------------------+
  | Line |    Data | Seed | Sort | Subfile |
  |------+---------+------+------+---------|
  |    2 | Changed |      |      |         |
  +----------------------------------------+
 
No errors detected in sub do-files; recursion halted.

. 

The issue is arising from the fact that dofile2.do, standing alone, has no reproducibility problems! When it is run a second time, there are no observations -- therefore it produces no data changes and does not draw any seeds, so no errors are caught. Obviously, that's not the way it will function in practice. Great catch on this.

I am not sure how to fix this right away, I need to think about it, as it is a memory management issue. It would require saving the Stata state right before sub-do-files are run, so that they can be restored at the recursive runs. I think I can figure it out with some time.

@luisesanmartin @kbjarkefur @luizaandrade What do you all think of this? It is rather challenging to resolve, as I am not aware of a way to freeze the entire Stata environment at a given point and reload it later, especially given the need of this command to clear nearly everything before each run. It is kind of fundamental here that each sub-do-file "stand alone".

Is this something that can be sufficiently resolved by adding flags to the execution, or to the help-file? Or should we have a technical solution? For example, I could imagine tracking that an error was expected and then throwing an error if it is not detected in a sub-file run, therefore it must be this type of file structure.

@luisesanmartin I am assuming this type of thing hasn't caused any issues with passing globals down etc -- I can see how it would if they were coded clumsily or used (dynamically) as locals, since globals are not wiped between runs.

Proposed partial fix in cc562cb. Result here is now:

  +---------------------+
  | Line |         Path |
  |------+--------------|
  |    2 | "dofile2.do" |
  +---------------------+
 
Processing: "dofile2.do"
Entering "dofile2.do" run....
Done with "dofile2.do"!
 
No errors detected in sub do-files; recursion halted.
 
!!------ WARNING ------!!
An error was expected in this sub do-file but none was found.
  This can occur when sub do-files are not independent.
  Check that this sub do-file does not depend on the calling file.
  This includes both the data state and global macros;
    the data is reset on each run, but macros are not.
 
something that should be true of your data is not
r(459);

@bbdaniels have you considered saving the Stata state before the run of a sub do-file as a tempfile? I understand the challenge would be that running iedorep on a sub do-file would delete tempfiles out of the memory, but then it's probably possible to modify it so when it's run recursively for a sub do-file it does not do that. I think this would be somewhat similar to the change you introduced for do-files using clear or clear all when you found a way to control what to delete from the memory so the command would work.

I also see that these might be challenging to implement. For v1 I think it's okay to go ahead with the change of cc562cb. @kbjarkefur , @luizaandrade do you have any thoughts?

Before digging into this and to try to better understand what is going on, here is feedback the solution that is based on explaining the warning with my level of understanding.

It is not obvious to me what is meant exactly about not independent in This can occur when sub do-files are not independent.. Or what stand alone means in It is kind of fundamental here that each sub-do-file "stand alone". I think they mean the same thing, but I can think of many ways this can exactly mean in practice.

I think that is what you intend to do in here, but still not clear to me, and I think we can assume that I have a better understanding of these topics than the average intended user.

Check that this sub do-file does not depend on the calling file.
This includes both the data state and global macros;
the data is reset on each run, but macros are not.

What is a "calling file"? It would be the master script right? (or a sub-file calling a sub-sub-file). So what does depend mean here? If a file calls a sub-file then there is always some dependency relation there.

If the data is reset each run, why does it matter if it depends on the calling script? Or how is that different in terms of "not depend on the calling file" from the macros that are not reset.

I think this warning needs an explanation on what makes something "dependent" in a way that iedorep does not allow instead of (or in addition to) what is being reset between runs. Right now the warning seems to more be a message to the developer of iedorep rather than a user of it. A mistake I commit all the time myself, so no fingers pointed. Just giving this feedback before I dig into the details and also get a developers pair of eyes on this issue.

Ok, I have gone through the command and not sure I still get it. Let me first check if I understand the pseudi-recursive flow.

But first some feedback on terminology. I think what the command now calls "errors" should be called "mismatches". I associate "errors" with the command not running properly. If iedorep ran my file twice and found that the RNG state was different the two runs and it reported that to me, then I think iedorep ran without error, but it identified a "mismatch", "instability", "discrepancy" or whatever we want to call it. Just not "error". This was confusing to me when reading the output before having analyzed the code in the command.

I will below refer to testing and identify a discrepancy in the data state, the RNG state or the sortseed as "test for mismatch in the 3 states" below.

My understanding from helpfile and code analysis is this. Let's assume:

  • The project has three dofiles: A.do, B1.do and B2.do.
  • A.do is the master script that includes the lines do B1.do and do B2.do.
  • When using iedorep, the user is in this case expected to specify the command as iedorep A.do.

The my understanding of the intended work flow is:

  • iedorep runs on A.do treating the lines do B1.do and do B2.do test for mismatch the 3 states after those lines in the first and in the second line just like the command would for any other line of code.
  • If there were no mismatches in the 3 states when running A.do, then we assume there are no line-by-line mismatches in the 3 states in the sub-dofiles do B1.do and do B2.do. (a fair assumption to make in real life code, but maybe possible to design some extreme case this is possible)
    • If a user do not care about the details, then if it perfectly fine for them to be content and consider all three do-files to be stable and end here.
  • However, if there were a mismatch and it was in a sub-file,, then to find out where in the sub-file, the user can run the command iedorep A.do, recursive.
  • Then the command first repeats the step above where do B1.do and do B2.do are treated like any other code.
  • Then the command makes the recursive call iedorep B1.do, recursive, if and only if, there is was a mismatch in the 3 states for the line do B1.do. (And the same for B2.do if needed and any third level deep nested file C1.do etc.).

The implied assumption of this work flow (and the source of the error in this issue) is that all files are expected to be written such that it does not matter if they run as a part of the master script of independently.

I see two problems with this workflow and assumption.

  1. The error itself the Luise produced. And potentially there are more. And I do not think we can or want to have to assume that a user has, or request that a user will code a sub-dofile a certain way.
  2. It took me a couple of hours of code analysis to understand the recursive workflow and only then did I fully understand the output I thought I mostly understood a few hours ago. Meaning I think the output for a real life project codebase run from a master script wont make sense to the user.

I think this is the solution. When creating the _temp file for the iedorep A.do file, then it the line with do B1.do is simply replace with the content of B1.do and then once that is done recursively for all sub-files of A.do and all of their sub-files then, and only then, is all lines in the single _temp file tested for mismatches in the 3 states a single time. And this is the default behavior. Maybe there should be an option norecurse where the lines are not expanded and the line do B1.do is treated like any code.

The challenge of this is to output a line number in the table that makes sense, but it should be doable. It could then be presented as:

  +-----------------------------------------------------+
  | Entering: A.do
  | Line |           Data |        Seed |          Sort |
  +-----------------------------------------------------+
  |    5 |        Changed |             |               |        
  |   11 |        Changed |             |               |
  +-----------------------------------------------------+
  | Entering: B1.do                                     |
  | Line |           Data |        Seed |          Sort | 
  +-----------------------------------------------------+
  |   1 |                |             |        Sorted  |
  |   3 |                |             | ERROR! Sorted  |
  | Done: B1.do                                         |
  +-----------------------------------------------------+
  | Returning to: A.do                                  |
  | Line |           Data |        Seed |          Sort |
  |   19 | ERROR! Changed |             |               |
  |   20 | ERROR! Changed | ERROR! Used |               |
  |   22 |                |        Used |               |
  +-----------------------------------------------------+
  | Entering: B2.do                                     |
  | Line |           Data |        Seed |          Sort |
  +-----------------------------------------------------+
  | Done: B1.do                                         |
  +-----------------------------------------------------+
  |   1 |                |             |        Sorted  |
  |   3 |                |             | ERROR! Sorted  |
  | Done: B1.do                                         |
  +-----------------------------------------------------+
  | Returning to: A.do                                  |
  | Line |           Data |        Seed |          Sort |
  +-----------------------------------------------------+
  |   24 | ERROR! Changed |             | ERROR! Sorted | 
  |   26 | ERROR! Changed |             | ERROR! Sorted |
  +-----------------------------------------------------+

or

  +------------------------------------------------------------+
  | File | Line |           Data |        Seed |          Sort |
  +------------------------------------------------------------+
  | 1    | 5    |        Changed |             |               |        
  | 1    | 11   |        Changed |             |               |
  | 2    | 1    |                |             |        Sorted |
  | 2    | 3    |                |             | ERROR! Sorted |
  | 1    | 19   | ERROR! Changed |             |               |
  | 1    | 20   | ERROR! Changed | ERROR! Used |               |
  | 1    | 22   |                |        Used |               |
  | 3    | 1    |                |             |        Sorted |
  | 3    | 3    |                |             | ERROR! Sorted |
  | 1    | 24   | ERROR! Changed |             | ERROR! Sorted | 
  | 1    | 26   | ERROR! Changed |             | ERROR! Sorted |
  +------------------------------------------------------------+
  | File legend:
  | File 1: /path/A.do
  | File 2: /path/B1.do
  | File 3: /path/B2.do
  +------------------------------------------------------------+

I like the latter better. And I think instead of displaying ERROR! (or whatever we will call it) it should be a red X.

I do not know how to best implement this as I am not as familiar with the command. But it should be possible to keep track of in a local, or by knowing the number of lines in each file be able to convert the line number in the _temp file to which file that should be and what line in that file it corresponds to.

Finally, I understand this would be a bit of work and I understand that you want to release. I am afraid that I do not think this implementation of the recursive option is ready to be released, and the main reason is that the output does not make enough sense with nested files. My compromise suggestion is that you drop recursive for this release and if there is a state mismatch in a sub-file output an instruction first saying this will be possible in future versions and then provide the suggestion that the user themselves can run the command on the sub-file if it is stand-alone.

Let me know if there is something I do not understand.

@kbjarkefur, this is a super thorough and useful comment, and I think it is exactly on the nose about the underlying issue and is a very useful solution. However! I do not have time for it now. Are you OK with releasing 1.0 as it is, or with minor modifications to the current error message / results report? Then I will come back and work on implementing something like this in the new year.

The compromise solution I am suggesting is to drop everything related to the recursive call, and add a note in help file that sub-dofiles is a feature that will come. That should be quick.

As in remove this line.

[Recursive] ///

and these

/*****************************************************************************
Pseudo-recursion
*****************************************************************************/
if "`recursive'" != "" {
qui keep if !(Err_1 == "" & Err_2 == "" & Err_3 == "") & (Path != "")
if `c(N)' == 0 {
di as err " "
di as err "No errors detected in sub do-files; recursion completed."
di as err " "
}
else {
di as err " "
di as err "Errors detected in the following sub do-files; starting recursion."
li Line Path , noobs divider
qui duplicates drop Path, force
sort Line
forvalues i = 1/`c(N)' {
local file = Path[`i']
iedorep `file' , `debug' `qui' `recursive' `alldata' `allsort' `allseed'
if `r(errors)' == 0 {
di as err "!!------ WARNING ------!!"
di as err "An error was expected in this sub do-file but none was found."
di as err " This can occur when sub do-files are not independent."
di as err " Check that this sub do-file does not depend on the calling file."
di as err " This includes both the data state and global macros;"
di as err " the data is reset on each run, but macros are not."
di as err " "
error 459
}
}
}
}

And you can write a quick blurb in help file

I have left it undocumented instead so we can still use it in the released version for testing. Done in 4976cd5

What about not using the word "error" for mismatches?

Screen Shot(21)

The suggestion does not work in list unfortunately...

@kbjarkefur thanks for the review and comments. I think the second solution your propose in your comment is the one that would work best. I'm leaving this issue open so that we remember that it's pending and will approve #261 so v1 will be published