SebKrantz/collapse

FIFO

Steviey opened this issue · 9 comments

Could it be that collapse causes accumulating FIFO connections, and if so, how do I get rid of them? I'm working with extensive data-set windows.

[screenshot: error "Too many open files"]

Hi, thanks, but I have no idea what FIFO is, and I'm not going to study it unless there is clear evidence that collapse is doing something harmful. Please submit a reproducible example and explain what is going on and why it needs to be changed. collapse passes rigorous compiled code checks on CRAN, so I don't think memory leakage is an issue.

R latest, collapse latest, Ubuntu 20.04.5 LTS

Well, I have an extensive script developed mainly with collapse. Unfortunately it runs 10+ hours and I can't isolate or simulate parts of it; the only glimpse I have is that after 2+ hours there are too many file handles open. The script calculates ML features over rolling windows, meaning it does the same calculations over and over again at intervals of several minutes. The error message "Too many open files" arises whenever something tries to save something, but it takes 2+ hours to happen. I read somewhere about FIFO in the context of C coding, but nothing specific to this issue. FIFOs should be connections which do not close automatically. Do you have any ideas on my issue?

P.S.: Some related Linux-commands:

open file limits:
ulimit -n

open files:
lsof -bT -p <pid of rsession or R or rserver>

most notorious open-file progs:
lsof | awk '{ print $1 " " $2; }' | sort -rn | uniq -c | sort -rn | head -15

...As you can see in the picture above, I can only trace the problem down to the R process and to the handle types FIFO and REG. As far as I know, there is currently no built-in function in R to locate such problems.
I'll try to narrow it down a little further and let you know.

Another error I get from time to time (which might be related) is:

 *** caught segfault ***
address 0x30, cause 'memory not mapped'

I could split the data to be processed into parts, but this would only be a workaround, since the rolling-window calculations build on each other and are time consuming.

Searching the problem...

cat('\n')
msg    <- paste0("lsof -V 2>/dev/null -bT -p ", pid)
output <- system(msg, intern = TRUE)
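
For reference, a minimal sketch of how this lsof output could be summarised per round, grouped by handle TYPE (count_handles() is just an illustrative helper, assuming lsof is installed and prints TYPE as its fifth column):

count_handles <- function(pid = Sys.getpid()) {
  # list open files of this R process and drop the lsof header line
  out <- system(paste0("lsof -bT -p ", pid), intern = TRUE)
  if (length(out) < 2L) return(table(character(0)))
  # TYPE (REG, FIFO, ...) is the fifth whitespace-separated field
  type <- vapply(strsplit(out[-1L], "\\s+"), `[`, character(1), 5L)
  sort(table(type), decreasing = TRUE)
}
count_handles()   # logging this every few rounds shows when FIFO handles start to accumulate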

At the beginning of the calculations the lsof output looks relatively normal.
In the second screenshot I'm filtering for TYPE == 'REG' | TYPE == 'FIFO'.
Notice that there is no FIFO entry at all (137 rounds of calculations, 30 min.).
[screenshot: summary of file handles by type]

[screenshot: file handles of type REG/FIFO]

The only thing that happens is adding columns (features) to a tibble, based on calculations that mainly use collapse methods.

The most frequent part is the following, embedded in an S4 class:

df <- df %>%
	collapse::fselect(variable = idTxt, value = valTxt) %>%
	collapse::funique(.) %>%
	collapse::fmutate(title = paste0(titleTxt, variable)) %>%
	collapse::fmutate(ids = 1) %>% # always exactly one row!!!
	collapse::pivot(.,
		ids      = 'ids'
		,values  = 'value'
		,names   = 'title'
		,how = 'wider', na.rm = TRUE
	)
df[['ids']] <- NULL
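
For illustration, a minimal toy version of this reshaping step (the variable names and the feat_ prefix are made up; assumes a collapse version that provides pivot(), i.e. >= 2.0):

library(collapse)
toy <- data.frame(variable = c("a", "b"), value = c(1.5, 2.5))
toy <- fmutate(toy, title = paste0("feat_", variable), ids = 1)
pivot(toy, ids = "ids", values = "value", names = "title", how = "wider", na.rm = TRUE)
# expected: a single row with columns ids, feat_a = 1.5, feat_b = 2.5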

I tried:

  • gc() (no success)
  • Sys.sleep() (no success)
  • unloading and reloading packages, e.g. collapse (no success)
  • drastically reducing the number of columns/features to calculate (success, but not the desired outcome)

With an artificial column limit of 3108, I noticed an automatic reduction of open file handles at min. 72, round 232: from 3609 down to 2286.
So far no FIFO was involved.

And then, at min. 96, round 285: a dramatic increase in file handles; FIFO comes into play.
There is no further information about the location other than 'pipe':

[screenshot: FIFO entries listed only as 'pipe']

Additional searches regarding the FIFO pipes lead to the executed R script, but not to any particular library/package.

Thanks! This is helpful, but I'd need to know the exact function call that is causing problems. The segfault is a clear indication of some memory error somewhere. Which function is causing it?

Currently I cannot name a specific function call, even using lobstr's tree.

The final crash can happen whenever something tries to save something after we have a bunch of open file handles, meaning it can happen everywhere. For example, after sampling the new features/columns and saving them via saveRDS. But even lobstr (which I use to get line info for debugging) won't work properly after that, since it tries to save something too ;-). So the actual problem is the accumulation of file handles, for which we have no line number and no function indication yet. Meanwhile I'm trying to get some information about the contents of the FIFO pipes at directory level...

        get sender and client of pipe id:
        lsof -n -P | grep 1359259

        print command of sender pid (result=triggered R shell script):
        ps -p 255386 -o cmd

        print client (result=R, useless):
        ps -p 255488 

        print content of a relevant pipe (failed):
        cat /proc/255386/fd/999

        ...try to solve with ChatGPT...
  

I also read something, in general, about problems in R with too many columns.

Ok, just a question: are you using data.tables in this code? And are you using indexed series or frames from collapse?

I guess I don't use data.tables directly, but I'm not sure. I don't know what indexed series or frames from collapse are. Mainly I use matrices and tibbles, and often the collapse conversion functions qTBL() and qM().

For the second error you mentioned I have a traceback, but it is relatively useless: since it is an irrecoverable exception, it only points to the initial starting point.

51: dplyr::mutate(., metaData = purrr::map(data, ~doRoll(.)))
52: setupGrid %>% dplyr::group_by(mapId) %>% tidyr::nest(data = -any_of(c("mapId"))) %>% dplyr::mutate(metaData = purrr::map(data, ~doRoll(.)))

An irrecoverable exception occurred. R is aborting now ...
./prepPreds.sh: line 427: 249493 Segmentation fault      Rscript --vanilla --max-ppsize=500000 newTrigger.R ten66Roll $algoMode

Ok, thanks. I don't know then which part of collapse code would open file handles. data.tables would be a possibility, since they need to be shallow copied, and indexed structures also create external pointers. But the other code is pretty normal, so I don't think this has anything to do with these features in your case. At the moment there is nothing I can do unless you can narrow the problem down to a specific function call. There is no general problem with collapse as a package, but a specific C function might be accessing memory in non-existing places (generally causing a segfault).

R latest, collapse latest, Ubuntu 20.04.5 LTS

Ah ok thank you. I will search a little further.

Update: I was ultimately not able to find the causing function(s) at file level, so I decided to split up not the data, but the calculations. The refactoring takes some effort, but it will give me the most reliability and flexibility.

Setup:

Server: AMD 8 core, 32 GB RAM, standalone
collapse::set_collapse(
  nthreads = 2
  ,sort    = FALSE
  ,mask    = c("mutate", "unique","summarise","select","group_by","nrow","ncol")
  ,remove  = "old"
  ,verbose = FALSE
)
RcppParallel::setThreadOptions(numThreads=2, stackSize = "auto")

Splitting up the calculations over

  • a constant ~3500 rows
  • 667 rolling windows of 150 rows each

...into 4 parts/R sessions shows:

  • part 1: +315 cols, runtime: 106.8 min., FIFO: 1, no significant accumulation
  • part 2: +1472 cols, runtime: 66.73 min., FIFO: 1, no significant accumulation
  • part 3: +52752 cols, runtime: 127.07 min., FIFO: 1, no significant accumulation
  • part 4: +42610 cols, runtime: 100+ min., FIFO: 1, no significant accumulation

Update I: The problem went away after splitting up the calculations into 4 different R sessions. Currently there is no further explanation (strange).

Update II: A hardware check shows a defective RAM module on the remote server.
Theoretically this could be related.

D3SL commented

A bit late to the party here, but I think I can offer some insight. What you're describing sounds like a very low-level issue that has to do with how R handles interacting with the filesystem and other processes. There are a number of ways this could become an issue, and without seeing the actual code it's impossible to be specific.

The obvious one is of course reading and writing files. R treats those as connections, and depending on how you're handling everything you may not actually ever be closing them. This isn't a collapse-specific issue but rather R itself, which is why several other people have had the same problem with everything from read_csv() to base functions like file() and download.file().
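
For example, a pattern like the following leaks one connection per call unless it is closed explicitly (a generic illustration, not code from the script above):

# leaky: the connection stays open after readLines() returns (or errors)
read_header <- function(path) {
  con <- file(path, open = "r")
  readLines(con, n = 1L)
}

# safer: on.exit() guarantees the connection is closed, even on error
read_header_safe <- function(path) {
  con <- file(path, open = "r")
  on.exit(close(con), add = TRUE)
  readLines(con, n = 1L)
}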

If I'm right you might be able to find the culprit by running your original script while periodically logging showConnections(all=TRUE).
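
A minimal sketch of such periodic logging (the round counter and the log-file name are just placeholders):

log_connections <- function(round, file = "connections.log") {
  con_tab <- showConnections(all = TRUE)
  cat(sprintf("--- round %d: %d connections ---", round, nrow(con_tab)),
      capture.output(print(con_tab)),
      sep = "\n", file = file, append = TRUE)
}
log_connections(1)   # call this every few rounds of the feature loop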

As an aside if you're running into this issue there's likely a lot of other places you can optimize your code as well:

  • fread() and fwrite() are vastly faster than their tidyverse equivalents, and arrow's .parquet files are better still.
  • data.tables are faster and more efficient than tibbles, especially when keyed well.
  • Pipes (particularly the base pipe) are pretty fast, but with absolutely huge objects the overhead can be non-trivial.
  • If you're doing operations that make deep copies, then update-by-reference will be massively faster and use half the memory (see the sketch after this list).
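
For instance, with data.table an added feature column can be written in place instead of copying the whole table (a minimal sketch; the column names are placeholders):

library(data.table)
dt <- data.table(mapId = rep(1:2, each = 3), value = rnorm(6))
# := modifies dt in place by reference; the existing columns are not copied
dt[, value_mean := mean(value), by = mapId]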

And on a much more advanced level there's multithreading. By nature, every call to map() is a parallelizable operation. The question is whether the objects involved are serializable, whether serialization overhead will outstrip parallelization gains, and whether you have enough RAM for each child process. Crew's efficient Mirai backend nearly moots the overhead issue but is a bit more complicated to use than the others. The use of external R processes may also help with your connection issues, since you'll have greater control over the workers and the controller.
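
As a rough sketch of that idea using only base R's parallel package (forked workers, so Linux/macOS only; the window list and doRoll() are toy stand-ins echoing the 4-session split above):

library(parallel)
window_list <- lapply(seq_len(667), function(i) i:(i + 149))   # toy: 667 windows of 150 rows
doRoll      <- function(idx) mean(idx)                          # dummy per-window calculation
chunks      <- split(window_list, cut(seq_along(window_list), 4, labels = FALSE))
# one forked R process per chunk; each child holds its own file handles
results <- mclapply(chunks, function(ws) vapply(ws, doRoll, numeric(1)), mc.cores = 4)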