Refreshing tx_similarity_mv takes several hours
Closed this issue · 9 comments
I'm importing the postgres dump of uta_20190927.
The download and import of the dump take just a couple of minutes but then it gets stuck on the refresh of the last materialized view:
15-9-2020 20:53:15 REFRESH MATERIALIZED VIEW uta_20190927.tx_similarity_mv;
16-9-2020 07:14:36 REFRESH MATERIALIZED VIEW
Is this normal?
Seems that uta_20190926
has the same issue, but uta_20190925
doesn't
Invitae (where I used to work) uses tx_similarity views often. In uta_20190926, they materialized that view for performance before making the database dump used for docker installs. In a large db instance of AWS RDS, that takes 8 hours; On my fast laptop with SSD, it ran for 12 hours before I killed it.
I've already spoken with Invitae and they've agreed to make that materialization a post-install step rather than part of the database dump.
This problem is isolated to uta_20190926. We intend to support the docker distribution for the foreseeable future.
For now, please use another backup. I'll leave this issue open until we solve it. No estimate for when that will be yet.
Thanks!
For now, please use another backup. I'll leave this issue open until we solve it. No estimate for when that will be yet.
@reece I picked this version because it was the latest available on http://dl.biocommons.org/uta/
. Which version would you recommend? And should it match the version of seqrepo that I'm using?
I had the same issue with http://dl.biocommons.org/uta/uta_20201027.pgd.gz. A quick workaround is to ignore all lines starting with REFRESH MATERIALIZED VIEW:
gzip -cdq uta_${UTA_VERSION}.pgd.gz | grep -v "^REFRESH MATERIALIZED VIEW" | psql -h localhost -U uta_admin --echo-errors --single-transaction -v ON_ERROR_STOP=1 -d uta
I tried the workaround suggested by @oyvinev using a docker container. Briefly, I used the Dockerfile and load-uta.sh entrypoint script from this repo and modified the entrypoint script to filter out REFRESH MATERIALIZED VIEW commands while building the container image (see below code excerpt from load-uta.sh).
The image builds fine and then starts the UTA database quickly without any error. However, after setting up local SeqRepo and the hgvs python package and running the test (as suggested in the hgvs docs) using hgvs-shell, I get an error materialized view "tx_def_summary_mv" has not been populated
.
Does this mean that the latest version of hgvs package is expecting the materialized view? Am I missing anything here? Any pointers will be helpful. Thanks!
The details of my implementation are below
hgvs-shell -- interactive hgvs
hgvs version: 1.5.1
data provider url: postgresql://anonymous@localhost:15432/uta/uta_20210129
schema_version: 1.1
data_version: uta_20210129
sequences source: SeqRepo (/mnt/linux-data/persistent_storage/seqrepo/20210129)
Below is the stack trace of the error
In [1]: v = hp.parse_hgvs_variant("NM_004333.6:c.1799T>A")
In [2]: am37.c_to_g(v)
WARNING:hgvs.dataproviders.uta:Lost connection to postgresql://anonymous@localhost:15432/uta/uta_20210129; attempting reconnect
WARNING:hgvs.dataproviders.uta:Reconnected to postgresql://anonymous@localhost:15432/uta/uta_20210129
---------------------------------------------------------------------------
ObjectNotInPrerequisiteState Traceback (most recent call last)
/mnt/linux-data/projects/py-venvs/genomics/lib/python3.8/site-packages/hgvs/dataproviders/uta.py in _get_cursor(self, n_retries)
581
--> 582 yield cur
583
/mnt/linux-data/projects/py-venvs/genomics/lib/python3.8/site-packages/hgvs/dataproviders/uta.py in _fetchall(self, sql, *args)
220 with self._get_cursor() as cur:
--> 221 cur.execute(sql, *args)
222 return cur.fetchall()
/mnt/linux-data/projects/py-venvs/genomics/lib/python3.8/site-packages/psycopg2/extras.py in execute(self, query, vars)
145 self._query_executed = True
--> 146 return super(DictCursor, self).execute(query, vars)
147
ObjectNotInPrerequisiteState: materialized view "tx_def_summary_mv" has not been populated
HINT: Use the REFRESH MATERIALIZED VIEW command.
During handling of the above exception, another exception occurred:
RuntimeError Traceback (most recent call last)
/mnt/linux-data/projects/py-venvs/genomics/lib/python3.8/site-packages/hgvs/shell.py in <module>
----> 1 am37.c_to_g(v)
/mnt/linux-data/projects/py-venvs/genomics/lib/python3.8/site-packages/hgvs/assemblymapper.py in c_to_g(self, var_c)
113 def c_to_g(self, var_c):
114 alt_ac = self._alt_ac_for_tx_ac(var_c.ac)
--> 115 var_out = super(AssemblyMapper, self).c_to_g(
116 var_c, alt_ac, alt_aln_method=self.alt_aln_method)
117 return self._maybe_normalize(var_out)
/mnt/linux-data/projects/py-venvs/genomics/lib/python3.8/site-packages/hgvs/variantmapper.py in c_to_g(self, var_c, alt_ac, alt_aln_method)
283 raise HGVSInvalidVariantError("Expected a cDNA (c.); got " + str(var_c))
284 if self._validator:
--> 285 self._validator.validate(var_c)
286 var_c.fill_ref(self.hdp)
287 mapper = self._fetch_AlignmentMapper(
/mnt/linux-data/projects/py-venvs/genomics/lib/python3.8/site-packages/hgvs/validator.py in validate(self, var, strict)
38 if strict is None:
39 strict = self.strict
---> 40 return self._ivr.validate(var, strict) and self._evr.validate(var, strict)
41
42
/mnt/linux-data/projects/py-venvs/genomics/lib/python3.8/site-packages/hgvs/validator.py in validate(self, var, strict)
82 var_n = var
83 elif var.type == "c":
---> 84 var_n = self.vm.c_to_n(var)
85 if var_n is not None:
86 res, msg = self._n_within_transcript_bounds(var_n)
/mnt/linux-data/projects/py-venvs/genomics/lib/python3.8/site-packages/hgvs/variantmapper.py in c_to_n(self, var_c)
327 self._validator.validate(var_c)
328 var_c.fill_ref(self.hdp)
--> 329 mapper = self._fetch_AlignmentMapper(
330 tx_ac=var_c.ac, alt_ac=var_c.ac, alt_aln_method="transcript")
331 pos_n = mapper.c_to_n(var_c.posedit.pos)
/mnt/linux-data/projects/py-venvs/genomics/lib/python3.8/site-packages/hgvs/decorators/lru_cache.py in wrapper(*args, **kwds)
185 stats[HITS] += 1
186 return result
--> 187 result = user_function(*args, **kwds)
188 with lock:
189 root, = nonlocal_root
/mnt/linux-data/projects/py-venvs/genomics/lib/python3.8/site-packages/hgvs/variantmapper.py in _fetch_AlignmentMapper(self, tx_ac, alt_ac, alt_aln_method)
474 possibly caching the result.
475 """
--> 476 return hgvs.alignmentmapper.AlignmentMapper(
477 self.hdp, tx_ac=tx_ac, alt_ac=alt_ac, alt_aln_method=alt_aln_method)
478
/mnt/linux-data/projects/py-venvs/genomics/lib/python3.8/site-packages/hgvs/alignmentmapper.py in __init__(self, hdp, tx_ac, alt_ac, alt_aln_method)
114
115 # this covers the identity cases n <-> c
--> 116 tx_identity_info = hdp.get_tx_identity_info(self.tx_ac)
117 if tx_identity_info is None:
118 raise HGVSDataNotAvailableError(
/mnt/linux-data/projects/py-venvs/genomics/lib/python3.8/site-packages/hgvs/decorators/lru_cache.py in wrapper(*args, **kwds)
185 stats[HITS] += 1
186 return result
--> 187 result = user_function(*args, **kwds)
188 with lock:
189 root, = nonlocal_root
/mnt/linux-data/projects/py-venvs/genomics/lib/python3.8/site-packages/hgvs/dataproviders/uta.py in get_tx_identity_info(self, tx_ac)
358
359 """
--> 360 rows = self._fetchall(self._queries['tx_identity_info'], [tx_ac])
361 if len(rows) == 0:
362 raise HGVSDataNotAvailableError(
/mnt/linux-data/projects/py-venvs/genomics/lib/python3.8/site-packages/hgvs/dataproviders/uta.py in _fetchall(self, sql, *args)
220 with self._get_cursor() as cur:
221 cur.execute(sql, *args)
--> 222 return cur.fetchall()
223
224 ############################################################################
/usr/lib/python3.8/contextlib.py in __exit__(self, type, value, traceback)
160 return False
161 raise
--> 162 raise RuntimeError("generator didn't stop after throw()")
163
164
RuntimeError: generator didn't stop after throw()
modified load-uta.sh
...
# Build post-processing script
cat <<EOF >/tmp/$UTA_VERSION.psql
EOF
# Temp workaround to avoid refresh materialized view creation during build time.
# https://github.com/biocommons/uta/issues/228
gzip -cdq < "${UTA_PGD_FN}" \
| grep -v "^REFRESH MATERIALIZED VIEW" | psql -1e -U uta_admin -d uta -v ON_ERROR_STOP=1
# gzip -cdq < "${UTA_PGD_FN}" \
# | perl -n \
# -e 'm/CREATE SCHEMA (uta_\d+)/ && print("ALTER DATABASE :DBNAME SET search_path=$1;\n");' \
# -e 'print if s/CREATE MATERIALIZED VIEW (.\S+) AS/REFRESH MATERIALIZED VIEW $1;/' \
# | psql -1e -U uta_admin -d uta -v ON_ERROR_STOP=1
...
Finally addressing this issue. So sorry everyone for the very long delay.
Log
Copy uta_20210129 to uta_dev database, without refreshing the expensive view
zgrep -v 'REFRESH MATERIALIZED VIEW uta_20210129.tx_similarity_mv;' uta_20210129.pgd.gz | psql -h uta.biocommons.org -U postgres -d uta_dev -1be -v ON_ERROR_STOP=1
Rename the schema and use it
alter schema uta_20210129 rename to uta_20210129b;
set search_path = uta_20210129b ;
Remove the materialize view and update the interface view
drop view tx_similarity_v ;
drop materialized view tx_similarity_mv ;
lter view tx_similarity_dv rename to tx_similarity_v;
N.B. The convention in UTA is that views with a _v
suffix are the interfaces to data. If those are found to be slow, I rename the _v
to _dv
("defining view") with a corresponding materialize view (_mv
), and then recreate the interface view on top of the mat view.
Copy the new schema to the production uta database
pg_dump -h uta.biocommons.org -U postgres -d uta_dev -n uta_20210129b | psql -h uta.biocommons.org -U postgres -d uta -1be -v ON_ERROR_STOP=1
I ❤️ postgresql.
Updated the uta/misc/dump/Makefile
So, now:
snafu$ make dumps
pg_dump -d uta_dev -U uta_admin --no-owner -n uta_20210129b | gzip -cq >uta_20210129b.pgd.gz
pg_dump -d uta_dev -U uta_admin --no-owner -n uta_20210129b --schema-only | gzip -cq >uta_20210129b-schema.pgd.gz
sha512sum uta_20210129b.pgd.gz >uta_20210129b.pgd.gz.sha512
sha512sum uta_20210129b-schema.pgd.gz >uta_20210129b-schema.pgd.gz.sha512
snafu$ make dl-push
rsync --progress uta_20210129b.pgd.gz uta_20210129b-schema.pgd.gz uta_20210129b.pgd.gz.sha512 uta_20210129b-schema.pgd.gz.sha512 stuart:/biocommons/dl.biocommons.org/uta/
uta_20210129b-schema.pgd.gz
4,879 100% 3.99MB/s 0:00:00 (xfr#1, to-chk=3/4)
uta_20210129b-schema.pgd.gz.sha512
158 100% 0.00kB/s 0:00:00 (xfr#2, to-chk=2/4)
uta_20210129b.pgd.gz
294,563,368 100% 2.95GB/s 0:00:00 (xfr#3, to-chk=1/4)
uta_20210129b.pgd.gz.sha512
151 100% 1.59kB/s 0:00:00 (xfr#4, to-chk=0/4)
snafu$ make db-push -n
gzip -cdq <uta_20210129b.pgd.gz | time psql -d uta -1be -v ON_ERROR_STOP=1 -f -
Update the docker image
cd misc/docker
UTA_VERSION=uta_20210129b
Then follow the README
Test it!
Following the instructions at https://github.com/biocommons/uta#installing-with-docker-preferred, and with no layers or dumps downloaded , zero-to-installed took me <3 min.
snafu$ date
Sun Sep 10 08:22:14 PM PDT 2023
snafu$ docker run \
-d \
-e POSTGRES_PASSWORD=some-password-that-you-make-up \
-v /tmp:/tmp \
-v uta_vol:/var/lib/postgresql/data \
--name uta_20210129b \
--network=host \
biocommons/uta:uta_20210129b
snafu$ docker logs -f uta_20210129b
Emulate Docker CLI using podman. Create /etc/containers/nodocker to quiet msg.
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.
The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".
Data page checksums are disabled.
fixing permissions on existing directory /var/lib/postgresql/data/uta_20210129b ... ok
⋮
2023-09-11 03:24:54.863 UTC [1] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2023-09-11 03:24:54.879 UTC [135] LOG: database system was shut down at 2023-09-11 03:24:54 UTC
2023-09-11 03:24:54.894 UTC [1] LOG: database system is ready to accept connections
snafu$ date
Sun Sep 10 08:25:19 PM PDT 2023
snafu$ psql -X -h localhost -U anonymous -d uta -c "select * from $uta_v.meta"
key | value
----------------+--------------------------------------------------------------------
schema_version | 1.1
created on | 2015-08-21T10:53:50.666152
license | CC-BY-SA (http://creativecommons.org/licenses/by-sa/4.0/deed.en_US
uta version | 0.2.0a2.dev11+n52ed6e969cfc
Done!
@jsstevenson @theferrit32 : Would you please confirm?
docker run \
-e POSTGRES_PASSWORD=some-password-that-you-make-up \
-v /tmp:/tmp \
-v uta_vol:/var/lib/postgresql/data \
--name uta_20210129b \
--network=host \
biocommons/uta:uta_20210129b