Move backend processing back from Paphlagon to BEVi
Closed this issue · 21 comments
Now that I've done the RAM upgrade, we should restore the data processing workflow on the cluster
We'll need to install R package dependencies to handle the number of infections model, which hasn't been run on BEVi before.
[gopigugan@BEVi ~]$ python3 --version
Python 3.6.8
[gopigugan@BEVi ~]$ R --version
R version 3.6.0 (2019-04-26) -- "Planting of a Tree"
Copyright (C) 2019 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under the terms of the
GNU General Public License versions 2 or 3.
For more information about these matters see
https://www.gnu.org/licenses/.
We are currently running older versions of Python and R. Should these be updated on BEVi?
My subscription for the clusterware system has expired, so I cannot use the package manager to update Python on BEVi. The easiest workaround would probably be to do a local installation (i.e. into /usr/local/bin) of a newer version of Python and make sure that it is in the $PATH. Same thing to get R up to version 4.0.
- Updated R to version 4.3 at /usr/local
- Updated Python to 3.11 at /usr/local
- ran into SSL certificate issues downloading packages
@GopiGugan to run some tests on BEVi before we switch back over
Running into issues installing the tidyquant R package on BEVi:
[gopigugan@BEVi ~]$ R -e "install.packages('tidyquant',dependencies=TRUE, repos='http://cran.rstudio.com/')"
...
ERROR: dependency ‘textshaping’ is not available for package ‘ragg’
...
ERROR: dependency ‘ragg’ is not available for package ‘tidyverse’
...
ERROR: dependency ‘tidyverse’ is not available for package ‘tidyquant’
Looks like some dependencies are not available for packages on version 4.3.2 of R. We are currently using version 4.2.2 on Paphlagon.
Downgrading R from version 4.3.2 to 4.2.2 on BEVi
- gcc v4.8.5 is the latest version supported by the package manager
- R package tidyquant has dependencies that are failing to install because gcc is too old. Compiled gcc version 11.2.0 locally (/usr/local), however gfortran was missing. Reinstalling gcc with the following:
./configure --prefix=/usr/local --enable-languages=c,c++,fortran --disable-multilib
make
make install
Successfully installed R packages. Now running into an error when installing the rpy2 package:
# pip3 install .
Processing /home/gopigugan/rpy2-RELEASE_3_5_14
Installing build dependencies ... done
Getting requirements to build wheel ... error
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [45 lines of output]
R was not built as a library
/home/gopigugan/rpy2-RELEASE_3_5_14/./rpy2/situation.py:335: UserWarning: No libraries as -l arguments to the compiler.
warnings.warn('No libraries as -l arguments to the compiler.')
R was not built as a library
/home/gopigugan/rpy2-RELEASE_3_5_14/./rpy2/situation.py:322: UserWarning: No include specified
warnings.warn('No include specified')
/tmp/tmp_pw_r_7nwyffu6/test_pw_r.c:1:10: fatal error: Rinterface.h: No such file or directory
1 | #include <Rinterface.h>
Issue seems to be the following: R was not built as a library. Reinstalling R version 4.2.2 with the --enable-R-shlib option:
make clean
./configure --prefix=/usr/local --enable-R-shlib
make
make install
rpy2 successfully installed, but there is an error when importing rpy2:
# python3
Python 3.11.3 (main, Jan 16 2024, 01:12:27) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from rpy2.robjects import pandas2ri
Error in glue(.Internal(R.home()), "library", "base", "R", "base", sep = .Platform$file.sep) :
4 arguments passed to .Internal(paste) which requires 3
Error: could not find function "attach"
Error: object '.ArgsEnv' not found
Fatal error: unable to initialize the JIT
Had to set the following variable to resolve the error:
export LD_LIBRARY_PATH="$(python3 -m rpy2.situation LD_LIBRARY_PATH)":${LD_LIBRARY_PATH}
Pipeline ran successfully with the test data file:
[covizu@BEVi covizu]$ python3 batch.py --dry-run --infile dev.2000.json.xz
🏄 [0:00:01.038814] Processing GISAID feed data
🏄 [0:00:03.346096] aligned 0 records
🏄 [0:00:03.430148] filtered 1066 problematic features
🏄 [0:00:03.430193] 671 genomes with excess missing sites
🏄 [0:00:03.430204] 163 genomes with excess divergence
🏄 [0:00:03.430838] Parsing Pango lineage designations
🏄 [0:00:05.122239] Identifying lineage representative genomes
🏄 [0:00:05.185900] Reconstructing tree with fasttree2
FastTree Version 2.1.11 Double precision (No SSE3)
...
🏄 [0:01:58.282415][5/56] starting BA.2.1
🏄 [0:02:04.622877][0/56] starting BF.7.5
🏄 [0:02:04.949943][0/56] starting BA.5.1.3
🏄 [0:02:05.022366] Parsing output files
R[write to console]: In addition:
R[write to console]: There were 50 or more warnings (use warnings() to see the first 50)
R[write to console]:
R[write to console]: In addition:
R[write to console]: There were 50 or more warnings (use warnings() to see the first 50)
R[write to console]:
🏄 [0:03:30.864626] All done!
Initiated a dry run to verify there are no issues: nohup python3 batch.py --dry-run > ~/iss499.log &
@GopiGugan reports a successful run
- where are we going to store the database?
- BEVi has a couple of RAIDs:
  - /dev/md126 is mounted at /home and is a RAID1 of two drives for 1.8 TB storage (790 GB currently available)
  - /dev/md127 is mounted at /data and is a RAID5 of four drives for 11 TB storage (9.8 TB currently available)
- I don't think there would be a latency difference in writing to one RAID versus the other, other than the performance hit of different formats.
- current estimate for database size is on the order of 10 GB
- I think we can get away with storing it on /home for now, for the improved write performance of RAID1 over RAID5 - we can do database dump backups to /data
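As a sanity check on the capacities above, the reported totals are consistent with how each RAID level spends disk. A rough sketch, with per-drive sizes inferred from the totals rather than taken from the hardware:

```python
def usable_capacity_tb(raid_level, n_drives, drive_tb):
    """Approximate usable capacity for the two RAID levels above.

    RAID1 mirrors all drives, so usable space equals one drive;
    RAID5 spends one drive's worth of capacity on parity.
    """
    if raid_level == 1:
        return drive_tb
    if raid_level == 5:
        return (n_drives - 1) * drive_tb
    raise ValueError("unsupported RAID level")

# /dev/md126: RAID1 of two ~1.8 TB drives -> ~1.8 TB usable
print(round(usable_capacity_tb(1, 2, 1.8), 1))  # 1.8
# /dev/md127: RAID5 of four ~3.7 TB drives -> ~11.1 TB usable
print(round(usable_capacity_tb(5, 4, 3.7), 1))  # 11.1
```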
Investigating a KeyError while building the database:
Traceback (most recent call last):
File "/home/covizu/covizu/batch.py", line 250, in <module>
by_lineage = process_feed(args, cur, cb.callback)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/covizu/covizu/batch.py", line 179, in process_feed
return gisaid_utils.sort_by_lineage(filtered, callback=callback)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/covizu/covizu/covizu/utils/gisaid_utils.py", line 277, in sort_by_lineage
for i, record in enumerate(records):
File "/home/covizu/covizu/covizu/utils/gisaid_utils.py", line 220, in filter_problematic
for record in records:
File "/home/covizu/covizu/covizu/utils/gisaid_utils.py", line 179, in extract_features
record = new_records[qname]
~~~~~~~~~~~^^^^^^^
KeyError: 'hCoV-19/South'
The issue is that when there is a space in the virus name (qname), e.g. hCoV-19/South Africa/....., it gets cut off in the minimap2 output:

Lines 67 to 72 in db11b2f

So the lookup was failing when trying to retrieve a record by qname:

covizu/covizu/utils/gisaid_utils.py
Lines 177 to 180 in db11b2f
The pipeline is also failing because we are retrieving and inserting records into the database based on the qname instead of the accession ID, and qname is not unique:
covizu/covizu/utils/gisaid_utils.py
Lines 122 to 124 in db11b2f
covizu/covizu/utils/gisaid_utils.py
Lines 179 to 187 in db11b2f
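The gist of the fix can be sketched as follows. minimap2 reports only the part of the sequence name before the first whitespace, so a lookup table keyed on the full header either has to apply the same truncation or, better, key records on the unique accession ID instead of qname. The helper and field names below are illustrative, not the actual gisaid_utils code:

```python
def sam_qname(header):
    """Mimic minimap2's behaviour: the query name is the sequence
    header truncated at the first whitespace."""
    return header.split()[0]

def index_by_accession(records):
    """Key records by their unique accession ID instead of the
    non-unique, possibly-truncated virus name (hypothetical fields)."""
    return {rec["accession"]: rec for rec in records}

# Two distinct genomes whose names collide after truncation:
records = [
    {"accession": "EPI_ISL_0001", "name": "hCoV-19/South Africa/xxx/2021"},
    {"accession": "EPI_ISL_0002", "name": "hCoV-19/South Africa/yyy/2021"},
]

print(sam_qname(records[0]["name"]))                 # hCoV-19/South
print(len({sam_qname(r["name"]) for r in records}))  # 1 -- qname collision
print(len(index_by_accession(records)))              # 2 -- accessions unique
```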
Let's write database dumps to the filesystem on the following basis:
- weekly (with every run)
- for dumps older than three months, erase 3 weeks out of 4 (i.e. retain monthly dumps from 3 months back to the present)
- in the long run (?) retain only quarterly dumps beyond 3 years
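A minimal sketch of that retention rule, assuming each dump's date is known; the 90-day/3-year cutoffs and the "first dump of the month/quarter" heuristic are my reading of the proposal, not an agreed implementation:

```python
from datetime import date, timedelta

def keep_dump(dump_date, today):
    """Decide whether a weekly dump should be retained:
    - keep everything from the last ~3 months (weekly cadence)
    - keep roughly one dump per month between 3 months and 3 years
    - keep roughly one dump per quarter beyond 3 years
    """
    age = today - dump_date
    if age <= timedelta(days=90):
        return True
    if age <= timedelta(days=3 * 365):
        return dump_date.day <= 7  # ~first weekly dump of the month
    # ~first weekly dump of each quarter (Jan/Apr/Jul/Oct)
    return dump_date.month in (1, 4, 7, 10) and dump_date.day <= 7

today = date(2024, 1, 15)
print(keep_dump(date(2023, 12, 20), today))  # True  (within 3 months)
print(keep_dump(date(2023, 8, 20), today))   # False (thinned to monthly)
print(keep_dump(date(2023, 8, 3), today))    # True  (monthly keeper)
```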
@GopiGugan testing out a script for clearing out expired logs
@GopiGugan to push the clean-up script to the repo and close