gvegayon/parallel

No dataset for instance 0002.

jibanes opened this issue · 14 comments

Preliminaries

Before submitting an issue, please check (with x in brackets) that you:

  • Are using the newest release (see here for latest release version number).
  • Have checked that the examples in the help work.
  • Have read the help (HTML version) and the gallery of examples.
  • Have checked that there is not already an existing issue for what you are reporting.

Expected behavior and actual behavior

Describe what you expected to see and what you actually see

I'm running the attached Stata do-file in a loop. It fails after many iterations, typically within 1-2 hours, with the error message "No dataset for instance 0002.", even though __pllXXX_dta0002.dta is present on disk (and can be loaded with use or append). I've reproduced this 4 times, and every time the error message pointed to instance 0002.

Important data point: it fails only after many iterations.
All the previous iterations (sometimes dozens, sometimes in the hundreds) succeed. The failure is random, and I have found no way to reproduce it other than having the do-file call itself and waiting a few hours (typically 1-2). The NFS is used by a large number of machines with no known issues or outages (it's a commercial NFS appliance from a well-known Fortune 500 company).

See attached error below:
error.txt

I've emailed the _pll files leading to the failure to @gvegayon.

Steps to reproduce the problem

Attached. The script calls itself and will typically fail within 1-2 hours.
parallel.txt

System information

Some relevant information

I've reproduced this using both the stata-mp and stata binaries.

  • Stata version and flavor (e.g. v14 MP): stata 16.1 MP (31 Mar 2020)
  • OS type and version (e.g. Windows 10): linux
  • kernel: Linux 4.18.13 #1 SMP Thu Oct 11 23:02:29 PDT 2018 x86_64 GNU/Linux
  • Parallel version: github (1.20.0 19mar2019) retrieved 2020-04-26.

Output from creturn list:

System values

-----------------------------------------------------------------------------------------------
    c(current_date) = "27 Apr 2020"
    c(current_time) = "11:06:48"
       c(rmsg_time) = 0                          (seconds, from set rmsg)
-----------------------------------------------------------------------------------------------
   c(stata_version) = 16.1
         c(version) = 16.1                       (version)
     c(userversion) = 16.1                       (version)
  c(dyndoc_version) = 2                          (dyndoc)
-----------------------------------------------------------------------------------------------
       c(born_date) = "31 Mar 2020"
          c(flavor) = "IC"
             c(bit) = 64
              c(SE) = 0
              c(MP) = 0
      c(processors) = 1                          (Stata/MP, set processors)
  c(processors_lic) = 1
 c(processors_mach) = .
  c(processors_max) = 1
            c(mode) = ""
         c(console) = ""
-----------------------------------------------------------------------------------------------
              c(os) = "Unix"
           c(osdtl) = ""
        c(hostname) = "sea-cpu-035"
    c(machine_type) = "PC (64-bit x86-64)"
       c(byteorder) = "lohi"
        c(username) = "jibanes"
-----------------------------------------------------------------------------------------------

Directories and paths

-----------------------------------------------------------------------------------------------
    c(sysdir_stata) = "/opt/stata/"              (sysdir)
     c(sysdir_base) = "/opt/stata/ado/ba.."      (sysdir)
     c(sysdir_site) = "/opt/ado/"                (sysdir)
     c(sysdir_plus) = "~/ado/plus/"              (sysdir)
 c(sysdir_personal) = "~/ado/personal/"          (sysdir)
 c(sysdir_oldplace) = "~/ado/"                   (sysdir)
          c(tmpdir) = "/tmp"
-----------------------------------------------------------------------------------------------
         c(adopath) = "BASE;SITE;.;PERSO.."      (adopath)
             c(pwd) = "/nfs/hydrogen/hom.."      (cd)
          c(dirsep) = "/"
-----------------------------------------------------------------------------------------------

System limits

-----------------------------------------------------------------------------------------------
    c(max_N_theory) = 2147483620
    c(max_k_theory) = 2048
c(max_width_theory) = 1048576
-----------------------------------------------------------------------------------------------
      c(max_matdim) = 800
-----------------------------------------------------------------------------------------------
    c(max_it_cvars) = 64
    c(max_it_fvars) = 8
-----------------------------------------------------------------------------------------------
    c(max_macrolen) = 264392
        c(macrolen) = 264392
         c(charlen) = 67783
      c(max_cmdlen) = 264408
          c(cmdlen) = 264408
     c(namelenbyte) = 128
     c(namelenchar) = 32
           c(eqlen) = 1337
-----------------------------------------------------------------------------------------------

Numerical and string limits

-----------------------------------------------------------------------------------------------
       c(mindouble) = -8.9884656743e+307
       c(maxdouble) = 8.9884656743e+307
       c(epsdouble) = 2.22044604925e-16
  c(smallestdouble) = 2.2250738585e-308
-----------------------------------------------------------------------------------------------
        c(minfloat) = -1.70141173319e+38
        c(maxfloat) = 1.70141173319e+38
        c(epsfloat) = 1.19209289551e-07
-----------------------------------------------------------------------------------------------
         c(minlong) = -2147483647
         c(maxlong) = 2147483620
-----------------------------------------------------------------------------------------------
          c(minint) = -32767
          c(maxint) = 32740
-----------------------------------------------------------------------------------------------
         c(minbyte) = -127
         c(maxbyte) = 100
-----------------------------------------------------------------------------------------------
    c(maxstrvarlen) = 2045
   c(maxstrlvarlen) = 2000000000
    c(maxvlabellen) = 32000
-----------------------------------------------------------------------------------------------

Current dataset

-----------------------------------------------------------------------------------------------
           c(frame) = "default"
               c(N) = 5000
               c(k) = 3
           c(width) = 9
         c(changed) = 1
        c(filename) = ""
        c(filedate) = ""
-----------------------------------------------------------------------------------------------

Memory settings

-----------------------------------------------------------------------------------------------
          c(memory) = 33554432
          c(maxvar) = 2048
        c(niceness) = 5                          (set min_memory)
      c(min_memory) = 0                          (set min_memory)
      c(max_memory) = .                          (set max_memory)
     c(segmentsize) = 33554432                   (set segmentsize)
         c(adosize) = 1000                       (set adosize)
-----------------------------------------------------------------------------------------------

Output settings

-----------------------------------------------------------------------------------------------
            c(more) = "off"                      (set more)
            c(rmsg) = "off"                      (set rmsg)
              c(dp) = "period"                   (set dp)
        c(linesize) = 99                         (set linesize)
        c(pagesize) = 43                         (set pagesize)
         c(logtype) = "smcl"                     (set logtype)
         c(noisily) = 1
-----------------------------------------------------------------------------------------------
         c(iterlog) = "on"                       (set iterlog)
-----------------------------------------------------------------------------------------------
           c(level) = 95                         (set level)
          c(clevel) = 95                         (set clevel)
-----------------------------------------------------------------------------------------------
  c(showbaselevels) = ""                         (set showbaselevels)
  c(showemptycells) = ""                         (set showemptycells)
     c(showomitted) = ""                         (set showomitted)
         c(fvlabel) = "on"                       (set fvlabel)
          c(fvwrap) = 1                          (set fvwrap)
        c(fvwrapon) = "word"                     (set fvwrapon)
        c(lstretch) = ""                         (set lstretch)
-----------------------------------------------------------------------------------------------
         c(cformat) = ""                         (set cformat)
         c(sformat) = ""                         (set sformat)
         c(pformat) = ""                         (set pformat)
-----------------------------------------------------------------------------------------------
  c(coeftabresults) = "on"                       (set coeftabresults)
            c(dots) = "on"                       (set dots)

Interface settings

-----------------------------------------------------------------------------------------------
      c(reventries) = 5000                       (set reventries)
      c(fastscroll) = "on"                       (set fastscroll)
         c(linegap) = 1                          (set linegap)
   c(scrollbufsize) = 204800                     (set scrollbufsize)
           c(maxdb) = 50                         (set maxdb)
-----------------------------------------------------------------------------------------------

Graphics settings

-----------------------------------------------------------------------------------------------
        c(graphics) = "on"                       (set graphics)
          c(scheme) = "s2color"                  (set scheme)
      c(printcolor) = "automatic"                (set printcolor)
   c(min_graphsize) = 1                          (region_options)
   c(max_graphsize) = 100                        (region_options)
-----------------------------------------------------------------------------------------------

Network settings

-----------------------------------------------------------------------------------------------
        c(checksum) = "off"                      (set checksum)
        c(timeout1) = 30                         (set timeout1)
        c(timeout2) = 180                        (set timeout2)
-----------------------------------------------------------------------------------------------
       c(httpproxy) = "off"                      (set httpproxy)
   c(httpproxyhost) = ""                         (set httpproxyhost)
   c(httpproxyport) = 8080                       (set httpproxyport)
-----------------------------------------------------------------------------------------------
   c(httpproxyauth) = "off"                      (set httpproxyauth)
   c(httpproxyuser) = ""                         (set httpproxyuser)
     c(httpproxypw) = ""                         (set httpproxypw)
-----------------------------------------------------------------------------------------------

Trace (program debugging) settings

-----------------------------------------------------------------------------------------------
           c(trace) = "off"                      (set trace)
      c(tracedepth) = 32000                      (set tracedepth)
        c(tracesep) = "on"                       (set tracesep)
     c(traceindent) = "on"                       (set traceindent)
     c(traceexpand) = "on"                       (set traceexpand)
     c(tracenumber) = "off"                      (set tracenumber)
     c(tracehilite) = ""                         (set tracehilite)
-----------------------------------------------------------------------------------------------

Mata settings

-----------------------------------------------------------------------------------------------
      c(matastrict) = "off"                      (set matastrict)
        c(matalnum) = "off"                      (set matalnum)
    c(mataoptimize) = "on"                       (set mataoptimize)
       c(matafavor) = "space"                    (set matafavor)
       c(matacache) = 2000                       (set matacache)
        c(matalibs) = "lmatabase;lmatanu.."      (set matalibs)
     c(matamofirst) = "off"                      (set matamofirst)
-----------------------------------------------------------------------------------------------

Java settings

-----------------------------------------------------------------------------------------------
    c(java_heapmax) = "2048m"                    (set java_heapmax)
       c(java_home) = "/opt/stata/utilit.."      (set java_home)
-----------------------------------------------------------------------------------------------

putdocx settings

-----------------------------------------------------------------------------------------------
  c(docx_hardbreak) = "off"                      (set docx_hardbreak)
   c(docx_paramode) = "off"                      (set docx_paramode)
-----------------------------------------------------------------------------------------------

Python settings

-----------------------------------------------------------------------------------------------
     c(python_exec) = ""                         (set python_exec)
 c(python_userpath) = ""                         (set python_userpath)
-----------------------------------------------------------------------------------------------

RNG settings

-----------------------------------------------------------------------------------------------
             c(rng) = "default"                  (set rng)
     c(rng_current) = "mt64"
        c(rngstate) = "XAAe618954835c170.."      (set rngstate)
   c(rngseed_mt64s) = 123456789
       c(rngstream) = 1                          (set rngstream)
-----------------------------------------------------------------------------------------------

Unicode settings

-----------------------------------------------------------------------------------------------
       c(locale_ui) = "en_US"                    (set locale_ui)
c(locale_functions) = "en_US"                    (set locale_functions)
  c(locale_icudflt) = "en_US"                    (unicode locale)
-----------------------------------------------------------------------------------------------

Other settings

-----------------------------------------------------------------------------------------------
            c(type) = "float"                    (set type)
         c(maxiter) = 300                        (set maxiter)
   c(searchdefault) = "all"                      (set searchdefault)
       c(varabbrev) = "on"                       (set varabbrev)
      c(emptycells) = "keep"                     (set emptycells)
         c(fvtrack) = "term"                     (set fvtrack)
          c(fvbase) = "on"                       (set fvbase)
         c(odbcmgr) = "iodbc"                    (set odbcmgr)
      c(odbcdriver) = "unicode"                  (set odbcdriver)
         c(fredkey) = ""                         (set fredkey)
-----------------------------------------------------------------------------------------------

Other

-----------------------------------------------------------------------------------------------
              c(pi) = 3.141592653589793
           c(alpha) = "a b c d e f g h i.."
           c(ALPHA) = "A B C D E F G H I.."
            c(Mons) = "Jan Feb Mar Apr M.."
          c(Months) = "January February .."
           c(Wdays) = "Sun Mon Tue Wed T.."
        c(Weekdays) = "Sunday Monday Tue.."
              c(rc) = 0                          (capture)
-----------------------------------------------------------------------------------------------

.

It looks like you're working on a cluster (the hostnames option was being used). Do you get the error if you keep it local to one machine? It might be something with the ssh connections timing out.

Brian,

I will do a run without hostnames(), stay tuned.

Brian,

Same error: "No dataset for instance 0002."

Same script as above, but I removed hostnames() from the parallel command and left ssh() and procexec() in place:
parallel initialize 36, f statapath(XXX) ssh("ssh -o 'StrictHostKeyChecking no' -q") procexec(2)
where XXX is the location of the binary.

It failed after a few dozen successful runs (~30-45 minutes).

Can you remove the ssh as well? (Also, procexec is only for Windows so you can remove that). I don't think this is your issue, but on some clusters the tmp space is cleared out periodically, so I've had to start Stata with a temp directory that was local to my username. Also can you try to view the log from the failed subprocess?
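A minimal sketch of the two checks suggested above, run from the parent Stata session right after the failure. The `parallel printlog #` syntax is the one hinted at in the package's own output; the child number 2 is chosen only because the error names instance 0002. The temporary directory itself is normally changed before Stata starts (on Unix through an environment variable such as TMPDIR, as far as I know), so the first command only reports the current value.

```stata
* Show where this Stata session writes its temporary files
display "`c(tmpdir)'"

* Print the log of child process 2 from the most recent parallel run
parallel printlog 2
```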

Brian,

Same error "No dataset for instance 0002." with:
parallel initialize 36, f statapath(XXX)

It took roughly the same number of successful tries before it failed.

__pllhr8undy119.tar.gz

Contains all dta, do, sh files.

Note: I replaced the paths with "XXX" manually from the output below.

. parallel initialize 36, f statapath(XXX)
N Child processes: 36
Stata dir:  XXX

. parallel, prog(parfor): parfor y_pll
--------------------------------------------------------------------------------
Exporting the following program(s): parfor

parfor:
  1.   args var
  2.   di "`c(hostname)'"
  3.   di "`=_N' obs"
  4.   forval i=1/`=_N' {
  5.     qui replace `var' = sqrt(x) in `i'
  6.     replace hostname = "`c(hostname)'"
  7.   }
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Parallel Computing with Stata
Child processes: 36
pll_id         : hr8undy119
Running at     : XXX
Randtype       : datetime

Waiting for the child processes to finish...
child process 0001 has exited without error...
child process 0002 has exited without error...
child process 0003 has exited without error...
child process 0004 has exited without error...
child process 0005 has exited without error...
child process 0006 has exited without error...
child process 0007 has exited without error...
child process 0008 has exited without error...
child process 0009 has exited without error...
child process 0010 has exited without error...
child process 0011 has exited without error...
child process 0012 has exited without error...
child process 0013 has exited without error...
child process 0014 has exited without error...
child process 0015 has exited without error...
child process 0016 has exited without error...
child process 0017 has exited without error...
child process 0018 has exited without error...
child process 0019 has exited without error...
child process 0020 has exited without error...
child process 0021 has exited without error...
child process 0022 has exited without error...
child process 0023 has exited without error...
child process 0024 has exited without error...
child process 0025 has exited without error...
child process 0026 has exited without error...
child process 0027 has exited without error...
child process 0028 has exited without error...
child process 0029 has exited without error...
child process 0030 has exited without error...
child process 0031 has exited without error...
child process 0032 has exited without error...
child process 0033 has exited without error...
child process 0034 has exited without error...
child process 0035 has exited without error...
child process 0036 has exited without error...
--------------------------------------------------------------------------------
Enter -parallel printlog #- to checkout logfiles.
--------------------------------------------------------------------------------
No dataset for instance 0002.
r(601);
[...]

I've made an interesting discovery and explored a few options.

First, it doesn't look like file descriptor exhaustion; I was wondering whether that could be preventing the append. After repeated tests I've noticed that it always fails on the 60th try. My guess is that this is how deep the recursion (nested do operations) can go: if you look at the parallel.do script attached above, you will see that it calls itself, and 60 levels must either exhaust a local resource or hit the maximum level of nested operations.

To check this, I called parallel.do without the recursion 1000 times independently, and not a single run failed; with nested calls it fails consistently after 60 tries.
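One possible way to run the job repeatedly without nesting is a flat driver loop, sketched below. It assumes the self-call has been removed from parallel.do and that the file sits in the working directory; the driver file name and the 1000-iteration count are illustrative, and this is not necessarily how the non-recursive runs above were launched.

```stata
* driver.do (hypothetical name): repeat the job from a flat loop,
* so every run of parallel.do starts at the same nesting depth.
forvalues rep = 1/1000 {
    capture noisily do parallel.do
    local rc = _rc                  // keep the return code from -capture-
    if `rc' {
        display as error "iteration `rep' failed with return code `rc'"
        exit `rc'
    }
}
```

Because each `do` returns before the next one starts, the nesting depth stays constant however many iterations are run.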

Brian, does this sound plausible: could the recursion in the do script cause the append to fail?

That sounds right. I think I've hit recursion limits before in Stata, though it's been a while. I suppose it could be tested w/o doing parallel to make sure. Seems like there might not be much we can do at our end, except note the issue.
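A test without parallel could look like the sketch below: a do-file that only calls itself and prints how deep it has gone. The file name nest.do and the depth argument are hypothetical; the last depth printed before Stata stops with an error gives the nesting limit on that installation, which can then be compared with the 60 tries observed above (Stata's documentation describes a fixed limit on nested do-files, which I recall as 64).

```stata
* nest.do (hypothetical name): calls itself until Stata refuses to nest further
args depth
if "`depth'" == "" local depth 1    // first call is made with no argument
display "nesting depth: `depth'"
local next = `depth' + 1
do nest.do `next'
```

Run it with `do nest.do` from the directory that contains the file.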

I agree. Before I close the issue, do you see any problem with listing the same hostname twice (or more) in the hostnames() argument, in order to balance the load more efficiently across machines of different speeds? I've done some testing and it looks fine.
For example:
hostnames("a a a b c c c c")
assuming here that a has 3x as many cores as b, and c has 4x as many cores as b.

Having duplicate hostnames should be fine. That's a good use for them.