biocommons/uta

Refreshing tx_similarity_mv takes several hours

Closed this issue · 9 comments

fdlk commented

I'm importing the postgres dump of uta_20190927.
The download and import of the dump take just a couple of minutes but then it gets stuck on the refresh of the last materialized view:

15-9-2020 20:53:15 REFRESH MATERIALIZED VIEW uta_20190927.tx_similarity_mv; 
16-9-2020 07:14:36 REFRESH MATERIALIZED VIEW

Is this normal?

Seems that uta_20190926 has the same issue, but uta_20190925 doesn't

reece commented

Invitae (where I used to work) uses tx_similarity views often. In uta_20190926, they materialized that view for performance before making the database dump used for docker installs. In a large db instance of AWS RDS, that takes 8 hours; On my fast laptop with SSD, it ran for 12 hours before I killed it.

I've already spoken with Invitae and they've agreed to make that materialization a post-install step rather than part of the database dump.

This problem is isolated to uta_20190926. We intend to support the docker distribution for the foreseeable future.

For now, please use another backup. I'll leave this issue open until we solve it. No estimate for when that will be yet.

Thanks!

fdlk commented

For now, please use another backup. I'll leave this issue open until we solve it. No estimate for when that will be yet.

@reece I picked this version because it was the latest available on http://dl.biocommons.org/uta/. Which version would you recommend? And should it match the version of seqrepo that I'm using?

I had the same issue with http://dl.biocommons.org/uta/uta_20201027.pgd.gz. A quick workaround is to ignore all lines starting with REFRESH MATERIALIZED VIEW:

gzip -cdq uta_${UTA_VERSION}.pgd.gz | grep -v "^REFRESH MATERIALIZED VIEW" | psql -h localhost -U uta_admin --echo-errors --single-transaction -v ON_ERROR_STOP=1 -d uta

I tried the workaround suggested by @oyvinev using a docker container. Briefly, I used the Dockerfile and load-uta.sh entrypoint script from this repo and modified the entrypoint script to filter out REFRESH MATERIALIZED VIEW commands while building the container image (see below code excerpt from load-uta.sh).

The image builds fine and then starts the UTA database quickly without any error. However, after setting up local SeqRepo and the hgvs python package and running the test (as suggested in the hgvs docs) using hgvs-shell, I get an error materialized view "tx_def_summary_mv" has not been populated.
Does this mean that the latest version of hgvs package is expecting the materialized view? Am I missing anything here? Any pointers will be helpful. Thanks!

The details of my implementation are below

hgvs-shell -- interactive hgvs
hgvs version: 1.5.1
data provider url: postgresql://anonymous@localhost:15432/uta/uta_20210129
schema_version: 1.1
data_version: uta_20210129
sequences source: SeqRepo (/mnt/linux-data/persistent_storage/seqrepo/20210129)

Below is the stack trace of the error

In [1]: v = hp.parse_hgvs_variant("NM_004333.6:c.1799T>A")

In [2]: am37.c_to_g(v)
WARNING:hgvs.dataproviders.uta:Lost connection to postgresql://anonymous@localhost:15432/uta/uta_20210129; attempting reconnect
WARNING:hgvs.dataproviders.uta:Reconnected to postgresql://anonymous@localhost:15432/uta/uta_20210129
---------------------------------------------------------------------------
ObjectNotInPrerequisiteState              Traceback (most recent call last)
/mnt/linux-data/projects/py-venvs/genomics/lib/python3.8/site-packages/hgvs/dataproviders/uta.py in _get_cursor(self, n_retries)
    581 
--> 582                 yield cur
    583 

/mnt/linux-data/projects/py-venvs/genomics/lib/python3.8/site-packages/hgvs/dataproviders/uta.py in _fetchall(self, sql, *args)
    220         with self._get_cursor() as cur:
--> 221             cur.execute(sql, *args)
    222             return cur.fetchall()

/mnt/linux-data/projects/py-venvs/genomics/lib/python3.8/site-packages/psycopg2/extras.py in execute(self, query, vars)
    145         self._query_executed = True
--> 146         return super(DictCursor, self).execute(query, vars)
    147 

ObjectNotInPrerequisiteState: materialized view "tx_def_summary_mv" has not been populated
HINT:  Use the REFRESH MATERIALIZED VIEW command.


During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
/mnt/linux-data/projects/py-venvs/genomics/lib/python3.8/site-packages/hgvs/shell.py in <module>
----> 1 am37.c_to_g(v)

/mnt/linux-data/projects/py-venvs/genomics/lib/python3.8/site-packages/hgvs/assemblymapper.py in c_to_g(self, var_c)
    113     def c_to_g(self, var_c):
    114         alt_ac = self._alt_ac_for_tx_ac(var_c.ac)
--> 115         var_out = super(AssemblyMapper, self).c_to_g(
    116             var_c, alt_ac, alt_aln_method=self.alt_aln_method)
    117         return self._maybe_normalize(var_out)

/mnt/linux-data/projects/py-venvs/genomics/lib/python3.8/site-packages/hgvs/variantmapper.py in c_to_g(self, var_c, alt_ac, alt_aln_method)
    283             raise HGVSInvalidVariantError("Expected a cDNA (c.); got " + str(var_c))
    284         if self._validator:
--> 285             self._validator.validate(var_c)
    286         var_c.fill_ref(self.hdp)
    287         mapper = self._fetch_AlignmentMapper(

/mnt/linux-data/projects/py-venvs/genomics/lib/python3.8/site-packages/hgvs/validator.py in validate(self, var, strict)
     38         if strict is None:
     39             strict = self.strict
---> 40         return self._ivr.validate(var, strict) and self._evr.validate(var, strict)
     41 
     42 

/mnt/linux-data/projects/py-venvs/genomics/lib/python3.8/site-packages/hgvs/validator.py in validate(self, var, strict)
     82             var_n = var
     83         elif var.type == "c":
---> 84             var_n = self.vm.c_to_n(var)
     85         if var_n is not None:
     86             res, msg = self._n_within_transcript_bounds(var_n)

/mnt/linux-data/projects/py-venvs/genomics/lib/python3.8/site-packages/hgvs/variantmapper.py in c_to_n(self, var_c)
    327             self._validator.validate(var_c)
    328         var_c.fill_ref(self.hdp)
--> 329         mapper = self._fetch_AlignmentMapper(
    330             tx_ac=var_c.ac, alt_ac=var_c.ac, alt_aln_method="transcript")
    331         pos_n = mapper.c_to_n(var_c.posedit.pos)

/mnt/linux-data/projects/py-venvs/genomics/lib/python3.8/site-packages/hgvs/decorators/lru_cache.py in wrapper(*args, **kwds)
    185                         stats[HITS] += 1
    186                         return result
--> 187                 result = user_function(*args, **kwds)
    188                 with lock:
    189                     root, = nonlocal_root

/mnt/linux-data/projects/py-venvs/genomics/lib/python3.8/site-packages/hgvs/variantmapper.py in _fetch_AlignmentMapper(self, tx_ac, alt_ac, alt_aln_method)
    474         possibly caching the result.
    475         """
--> 476         return hgvs.alignmentmapper.AlignmentMapper(
    477             self.hdp, tx_ac=tx_ac, alt_ac=alt_ac, alt_aln_method=alt_aln_method)
    478 

/mnt/linux-data/projects/py-venvs/genomics/lib/python3.8/site-packages/hgvs/alignmentmapper.py in __init__(self, hdp, tx_ac, alt_ac, alt_aln_method)
    114 
    115             # this covers the identity cases n <-> c
--> 116             tx_identity_info = hdp.get_tx_identity_info(self.tx_ac)
    117             if tx_identity_info is None:
    118                 raise HGVSDataNotAvailableError(

/mnt/linux-data/projects/py-venvs/genomics/lib/python3.8/site-packages/hgvs/decorators/lru_cache.py in wrapper(*args, **kwds)
    185                         stats[HITS] += 1
    186                         return result
--> 187                 result = user_function(*args, **kwds)
    188                 with lock:
    189                     root, = nonlocal_root

/mnt/linux-data/projects/py-venvs/genomics/lib/python3.8/site-packages/hgvs/dataproviders/uta.py in get_tx_identity_info(self, tx_ac)
    358 
    359         """
--> 360         rows = self._fetchall(self._queries['tx_identity_info'], [tx_ac])
    361         if len(rows) == 0:
    362             raise HGVSDataNotAvailableError(

/mnt/linux-data/projects/py-venvs/genomics/lib/python3.8/site-packages/hgvs/dataproviders/uta.py in _fetchall(self, sql, *args)
    220         with self._get_cursor() as cur:
    221             cur.execute(sql, *args)
--> 222             return cur.fetchall()
    223 
    224     ############################################################################

/usr/lib/python3.8/contextlib.py in __exit__(self, type, value, traceback)
    160                     return False
    161                 raise
--> 162             raise RuntimeError("generator didn't stop after throw()")
    163 
    164 

RuntimeError: generator didn't stop after throw()

modified load-uta.sh

...
# Build post-processing script
cat <<EOF >/tmp/$UTA_VERSION.psql

EOF

# Temp workaround to avoid refresh materialized view creation during build time.
# https://github.com/biocommons/uta/issues/228

gzip -cdq < "${UTA_PGD_FN}" \
    | grep -v "^REFRESH MATERIALIZED VIEW" | psql -1e -U uta_admin -d uta -v ON_ERROR_STOP=1

# gzip -cdq < "${UTA_PGD_FN}" \
#     | perl -n \
# 	   -e 'm/CREATE SCHEMA (uta_\d+)/ && print("ALTER DATABASE :DBNAME SET search_path=$1;\n");' \
# 	   -e 'print if s/CREATE MATERIALIZED VIEW (.\S+) AS/REFRESH MATERIALIZED VIEW $1;/' \
#     | psql -1e -U uta_admin -d uta -v ON_ERROR_STOP=1

...
reece commented

Finally addressing this issue. So sorry everyone for the very long delay.

Log

Copy uta_20210129 to uta_dev database, without refreshing the expensive view

zgrep -v 'REFRESH MATERIALIZED VIEW uta_20210129.tx_similarity_mv;' uta_20210129.pgd.gz | psql -h uta.biocommons.org -U postgres -d uta_dev -1be -v ON_ERROR_STOP=1

Rename the schema and use it

alter schema uta_20210129 rename to uta_20210129b;
set search_path = uta_20210129b ;

Remove the materialize view and update the interface view

drop view tx_similarity_v ;
drop materialized view tx_similarity_mv ;
lter view tx_similarity_dv rename to tx_similarity_v;

N.B. The convention in UTA is that views with a _v suffix are the interfaces to data. If those are found to be slow, I rename the _v to _dv ("defining view") with a corresponding materialize view (_mv), and then recreate the interface view on top of the mat view.

Copy the new schema to the production uta database

pg_dump -h uta.biocommons.org -U postgres -d uta_dev -n uta_20210129b | psql -h uta.biocommons.org -U postgres -d uta -1be -v ON_ERROR_STOP=1

I ❤️ postgresql.

Updated the uta/misc/dump/Makefile

So, now:

snafu$ make dumps
pg_dump -d uta_dev -U uta_admin --no-owner -n uta_20210129b               | gzip -cq >uta_20210129b.pgd.gz
pg_dump -d uta_dev -U uta_admin --no-owner -n uta_20210129b --schema-only | gzip -cq >uta_20210129b-schema.pgd.gz
sha512sum uta_20210129b.pgd.gz >uta_20210129b.pgd.gz.sha512
sha512sum uta_20210129b-schema.pgd.gz >uta_20210129b-schema.pgd.gz.sha512

snafu$ make dl-push
rsync --progress uta_20210129b.pgd.gz uta_20210129b-schema.pgd.gz uta_20210129b.pgd.gz.sha512 uta_20210129b-schema.pgd.gz.sha512 stuart:/biocommons/dl.biocommons.org/uta/
uta_20210129b-schema.pgd.gz
          4,879 100%    3.99MB/s    0:00:00 (xfr#1, to-chk=3/4)
uta_20210129b-schema.pgd.gz.sha512
            158 100%    0.00kB/s    0:00:00 (xfr#2, to-chk=2/4)
uta_20210129b.pgd.gz
    294,563,368 100%    2.95GB/s    0:00:00 (xfr#3, to-chk=1/4)
uta_20210129b.pgd.gz.sha512
            151 100%    1.59kB/s    0:00:00 (xfr#4, to-chk=0/4)

snafu$ make db-push -n
gzip -cdq <uta_20210129b.pgd.gz | time psql -d uta -1be -v ON_ERROR_STOP=1 -f -

Update the docker image

cd misc/docker
UTA_VERSION=uta_20210129b
Then follow the README

Test it!

Following the instructions at https://github.com/biocommons/uta#installing-with-docker-preferred, and with no layers or dumps downloaded , zero-to-installed took me <3 min.

snafu$ date
Sun Sep 10 08:22:14 PM PDT 2023

snafu$ docker run \
   -d \
   -e POSTGRES_PASSWORD=some-password-that-you-make-up \
   -v /tmp:/tmp \
   -v uta_vol:/var/lib/postgresql/data \
   --name uta_20210129b \
   --network=host \
   biocommons/uta:uta_20210129b

snafu$ docker logs -f uta_20210129b
Emulate Docker CLI using podman. Create /etc/containers/nodocker to quiet msg.
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

fixing permissions on existing directory /var/lib/postgresql/data/uta_20210129b ... ok
⋮
2023-09-11 03:24:54.863 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2023-09-11 03:24:54.879 UTC [135] LOG:  database system was shut down at 2023-09-11 03:24:54 UTC
2023-09-11 03:24:54.894 UTC [1] LOG:  database system is ready to accept connections

snafu$ date
Sun Sep 10 08:25:19 PM PDT 2023

snafu$ psql -X -h localhost -U anonymous -d uta -c "select * from $uta_v.meta"
      key       |                               value                                
----------------+--------------------------------------------------------------------
 schema_version | 1.1
 created on     | 2015-08-21T10:53:50.666152
 license        | CC-BY-SA (http://creativecommons.org/licenses/by-sa/4.0/deed.en_US
 uta version    | 0.2.0a2.dev11+n52ed6e969cfc

reece commented

Done!

@jsstevenson @theferrit32 : Would you please confirm?

docker run \
   -e POSTGRES_PASSWORD=some-password-that-you-make-up \
   -v /tmp:/tmp \
   -v uta_vol:/var/lib/postgresql/data \
   --name uta_20210129b \
   --network=host \
   biocommons/uta:uta_20210129b

Success! Thanks @reece

Screenshot 2023-09-11 at 7 14 18 AM