Move backend processing back from Paphlagon to BEVi
Closed this issue · 21 comments
Now that I've done the RAM upgrade, we should restore the data processing workflow on the cluster
We'll need to install R package dependencies to handle the number of infections model, which hasn't been run on BEVi before.
[gopigugan@BEVi ~]$ python3 --version
Python 3.6.8
[gopigugan@BEVi ~]$ R --version
R version 3.6.0 (2019-04-26) -- "Planting of a Tree"
Copyright (C) 2019 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under the terms of the
GNU General Public License versions 2 or 3.
For more information about these matters see
https://www.gnu.org/licenses/.
We are currently running older versions of Python and R. Should these be updated on BEVi?
My subscription for the clusterware system has expired, so I cannot use the package manager to update Python on BEVi. The easiest workaround would probably be to do a local installation (i.e. into /usr/local/bin) of a newer version of Python and make sure that it is in the $PATH. Same thing to get R up to version 4.0.
- Updated R to version 4.3 at /usr/local
- Updated Python to 3.11 at /usr/local
- ran into SSL certificate issues downloading packages
@GopiGugan to run some tests on BEVi before we switch back over
Running into issues installing the tidyquant R package on BEVi:
[gopigugan@BEVi ~]$ R -e "install.packages('tidyquant',dependencies=TRUE, repos='http://cran.rstudio.com/')"
...
ERROR: dependency ‘textshaping’ is not available for package ‘ragg’
...
ERROR: dependency ‘ragg’ is not available for package ‘tidyverse’
...
ERROR: dependency ‘tidyverse’ is not available for package ‘tidyquant’
Looks like some dependencies are not available for packages on version 4.3.2 of R. We are currently using version 4.2.2 on Paphlagon.
Downgrading R from version 4.3.2 to 4.2.2 on BEVi
- gcc v4.8.5 is the latest version supported by the package manager
- R package tidyquant has dependencies that are failing to install because gcc is too old. Compiled gcc version 11.2.0 locally (/usr/local), however gfortran was missing. Reinstalling gcc with the following:
./configure --prefix=/usr/local --enable-languages=c,c++,fortran --disable-multilib
make
make install
Successfully installed R packages. Now running into an error when installing the rpy2 package:
# pip3 install .
Processing /home/gopigugan/rpy2-RELEASE_3_5_14
Installing build dependencies ... done
Getting requirements to build wheel ... error
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [45 lines of output]
R was not built as a library
/home/gopigugan/rpy2-RELEASE_3_5_14/./rpy2/situation.py:335: UserWarning: No libraries as -l arguments to the compiler.
warnings.warn('No libraries as -l arguments to the compiler.')
R was not built as a library
/home/gopigugan/rpy2-RELEASE_3_5_14/./rpy2/situation.py:322: UserWarning: No include specified
warnings.warn('No include specified')
/tmp/tmp_pw_r_7nwyffu6/test_pw_r.c:1:10: fatal error: Rinterface.h: No such file or directory
1 | #include <Rinterface.h>
Issue seems to be the following: R was not built as a library. Reinstalling R version 4.2.2 with the --enable-R-shlib option:
make clean
./configure --prefix=/usr/local --enable-R-shlib
make
make install
rpy2 successfully installed, but there is an error when importing rpy2:
# python3
Python 3.11.3 (main, Jan 16 2024, 01:12:27) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from rpy2.robjects import pandas2ri
Error in glue(.Internal(R.home()), "library", "base", "R", "base", sep = .Platform$file.sep) :
4 arguments passed to .Internal(paste) which requires 3
Error: could not find function "attach"
Error: object '.ArgsEnv' not found
Fatal error: unable to initialize the JIT
Had to set the following variable to resolve the error:
export LD_LIBRARY_PATH="$(python3 -m rpy2.situation LD_LIBRARY_PATH)":${LD_LIBRARY_PATH}
Pipeline ran successfully with the test data file:
[covizu@BEVi covizu]$ python3 batch.py --dry-run --infile dev.2000.json.xz
🏄 [0:00:01.038814] Processing GISAID feed data
🏄 [0:00:03.346096] aligned 0 records
🏄 [0:00:03.430148] filtered 1066 problematic features
🏄 [0:00:03.430193] 671 genomes with excess missing sites
🏄 [0:00:03.430204] 163 genomes with excess divergence
🏄 [0:00:03.430838] Parsing Pango lineage designations
🏄 [0:00:05.122239] Identifying lineage representative genomes
🏄 [0:00:05.185900] Reconstructing tree with fasttree2
FastTree Version 2.1.11 Double precision (No SSE3)
...
🏄 [0:01:58.282415][5/56] starting BA.2.1
🏄 [0:02:04.622877][0/56] starting BF.7.5
🏄 [0:02:04.949943][0/56] starting BA.5.1.3
🏄 [0:02:05.022366] Parsing output files
R[write to console]: In addition:
R[write to console]: There were 50 or more warnings (use warnings() to see the first 50)
R[write to console]:
R[write to console]: In addition:
R[write to console]: There were 50 or more warnings (use warnings() to see the first 50)
R[write to console]:
🏄 [0:03:30.864626] All done!
Initiated a dry run to verify there are no issues: nohup python3 batch.py --dry-run > ~/iss499.log &
@GopiGugan reports a successful run
- where are we going to store the database?
- BEVi has a couple of RAIDs:
  - /dev/md126 is mounted at /home and is a RAID1 of two drives for 1.8 TB storage (790 GB currently available)
  - /dev/md127 is mounted at /data and is a RAID5 of four drives for 11 TB storage (9.8 TB currently available)
- I don't think there would be a latency difference in writing to one RAID versus the other, other than the performance hit of different formats.
- current estimate for database size is on the order of 10 GB
- I think we can get away with storing it on /home for now, for the improved write performance of RAID1 over RAID5 - we can do database dump backups to /data
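As a sanity check on the capacities above, the reported totals are consistent with how each RAID level spends disk. A rough sketch, with per-drive sizes inferred from the totals rather than taken from the hardware:

```python
def usable_capacity_tb(raid_level, n_drives, drive_tb):
    """Approximate usable capacity for the two RAID levels above.

    RAID1 mirrors all drives, so usable space equals one drive;
    RAID5 spends one drive's worth of capacity on parity.
    """
    if raid_level == 1:
        return drive_tb
    if raid_level == 5:
        return (n_drives - 1) * drive_tb
    raise ValueError("unsupported RAID level")

# /dev/md126: RAID1 of two ~1.8 TB drives -> ~1.8 TB usable
print(round(usable_capacity_tb(1, 2, 1.8), 1))  # 1.8
# /dev/md127: RAID5 of four ~3.7 TB drives -> ~11.1 TB usable
print(round(usable_capacity_tb(5, 4, 3.7), 1))  # 11.1
```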
Investigating a KeyError while building the database:
Traceback (most recent call last):
File "/home/covizu/covizu/batch.py", line 250, in <module>
by_lineage = process_feed(args, cur, cb.callback)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/covizu/covizu/batch.py", line 179, in process_feed
return gisaid_utils.sort_by_lineage(filtered, callback=callback)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/covizu/covizu/covizu/utils/gisaid_utils.py", line 277, in sort_by_lineage
for i, record in enumerate(records):
File "/home/covizu/covizu/covizu/utils/gisaid_utils.py", line 220, in filter_problematic
for record in records:
File "/home/covizu/covizu/covizu/utils/gisaid_utils.py", line 179, in extract_features
record = new_records[qname]
~~~~~~~~~~~^^^^^^^
KeyError: 'hCoV-19/South'
The issue is that when there is a space in the virus name (qname), e.g. hCoV-19/South Africa/....., it gets cut off in the minimap2 output:

Lines 67 to 72 in db11b2f

So the lookup was failing when trying to retrieve a record by qname:

covizu/covizu/utils/gisaid_utils.py
Lines 177 to 180 in db11b2f
The pipeline is also failing because we are retrieving and inserting records into the database based on the qname instead of the accession ID, and qname is not unique:
covizu/covizu/utils/gisaid_utils.py
Lines 122 to 124 in db11b2f
covizu/covizu/utils/gisaid_utils.py
Lines 179 to 187 in db11b2f
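The gist of the fix can be sketched as follows. minimap2 reports only the part of the sequence name before the first whitespace, so a lookup table keyed on the full header either has to apply the same truncation or, better, key records on the unique accession ID instead of qname. The helper and field names below are illustrative, not the actual gisaid_utils code:

```python
def sam_qname(header):
    """Mimic minimap2's behaviour: the query name is the sequence
    header truncated at the first whitespace."""
    return header.split()[0]

def index_by_accession(records):
    """Key records by their unique accession ID instead of the
    non-unique, possibly-truncated virus name (hypothetical fields)."""
    return {rec["accession"]: rec for rec in records}

# Two distinct genomes whose names collide after truncation:
records = [
    {"accession": "EPI_ISL_0001", "name": "hCoV-19/South Africa/xxx/2021"},
    {"accession": "EPI_ISL_0002", "name": "hCoV-19/South Africa/yyy/2021"},
]

print(sam_qname(records[0]["name"]))                 # hCoV-19/South
print(len({sam_qname(r["name"]) for r in records}))  # 1 -- qname collision
print(len(index_by_accession(records)))              # 2 -- accessions unique
```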
Let's write database dumps to the filesystem on the following basis:
- weekly (with every run)
- for dumps older than three months, erase 3 weeks out of 4 (i.e. retain monthly dumps from 3 months back to the present)
- in the long run (?) retain only quarterly dumps beyond 3 years
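A minimal sketch of that retention rule, assuming each dump's date is known; the 90-day/3-year cutoffs and the "first dump of the month/quarter" heuristic are my reading of the proposal, not an agreed implementation:

```python
from datetime import date, timedelta

def keep_dump(dump_date, today):
    """Decide whether a weekly dump should be retained:
    - keep everything from the last ~3 months (weekly cadence)
    - keep roughly one dump per month between 3 months and 3 years
    - keep roughly one dump per quarter beyond 3 years
    """
    age = today - dump_date
    if age <= timedelta(days=90):
        return True
    if age <= timedelta(days=3 * 365):
        return dump_date.day <= 7  # ~first weekly dump of the month
    # ~first weekly dump of each quarter (Jan/Apr/Jul/Oct)
    return dump_date.month in (1, 4, 7, 10) and dump_date.day <= 7

today = date(2024, 1, 15)
print(keep_dump(date(2023, 12, 20), today))  # True  (within 3 months)
print(keep_dump(date(2023, 8, 20), today))   # False (thinned to monthly)
print(keep_dump(date(2023, 8, 3), today))    # True  (monthly keeper)
```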
@GopiGugan testing out a script for clearing out expired logs
@GopiGugan to push the clean-up script to the repo and close