package ignoring locally installed versions
knaaptime opened this issue · 9 comments
Hey folks, I'm able to install a package locally, but doing so doesn't seem to accomplish anything because quilt always defaults to the remote version:
import quilt3
quilt3.Package.install(
"census/administrative",
"s3://spatial-ucr",
)
restart notebook kernel
from quilt3.data.census import administrative
administrative.get('msa_definitions.parquet') # this should be local now
returns
's3://spatial-ucr/census/administrative/msa_definitions.parquet?versionId=y5lH1FmQZmmnCXh5x180fiVAWOjYuitb'
The files do exist where they should, at /Users/knaaptime/Library/Application Support/Quilt/packages/census/administrative/, but all the package methods ignore them. Further, if I do
admin = quilt3.Package.browse("census/administrative", "local")
it appears to be looking in the wrong place, returning this error:
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
<ipython-input-6-eb0fe4f642ff> in <module>
----> 1 admin = quilt3.Package.browse("census/administrative", "local")
~/anaconda3/envs/geosnap/lib/python3.7/site-packages/quilt3/telemetry.py in decorated(*args, **kwargs)
129 ApiTelemetry.report_api_use(self.api_name, ApiTelemetry.session_id)
130
--> 131 results = func(*args, **kwargs)
132 # print(f"{len(ApiTelemetry.pending_reqs)} request(s) pending!")
133
~/anaconda3/envs/geosnap/lib/python3.7/site-packages/quilt3/packages.py in browse(cls, name, registry, top_hash)
530 top_hash(string): top hash of package version to load
531 """
--> 532 return cls._browse(name=name, registry=registry, top_hash=top_hash)
533
534 @classmethod
~/anaconda3/envs/geosnap/lib/python3.7/site-packages/quilt3/packages.py in _browse(cls, name, registry, top_hash)
543 if top_hash is None:
544 top_hash_file = registry_parsed.join(f'.quilt/named_packages/{name}/latest')
--> 545 top_hash = get_bytes(top_hash_file).decode('utf-8').strip()
546 else:
547 top_hash = cls.resolve_hash(registry_parsed, top_hash)
~/anaconda3/envs/geosnap/lib/python3.7/site-packages/quilt3/data_transfer.py in get_bytes(src)
761 if src.is_local():
762 src_file = pathlib.Path(src.path)
--> 763 data = src_file.read_bytes()
764 else:
765 params = dict(Bucket=src.bucket, Key=src.path)
~/anaconda3/envs/geosnap/lib/python3.7/pathlib.py in read_bytes(self)
1212 Open the file in bytes mode, read it, and close the file.
1213 """
-> 1214 with self.open(mode='rb') as f:
1215 return f.read()
1216
~/anaconda3/envs/geosnap/lib/python3.7/pathlib.py in open(self, mode, buffering, encoding, errors, newline)
1206 self._raise_closed()
1207 return io.open(self, mode, buffering, encoding, errors, newline,
-> 1208 opener=self._opener)
1209
1210 def read_bytes(self):
~/anaconda3/envs/geosnap/lib/python3.7/pathlib.py in _opener(self, name, flags, mode)
1061 def _opener(self, name, flags, mode=0o666):
1062 # A stub for the opener argument to built-in open()
-> 1063 return self._accessor.open(self, flags, mode)
1064
1065 def _raw_open(self, flags, mode=0o777):
FileNotFoundError: [Errno 2] No such file or directory: '/Users/knaaptime/Dropbox/projects/geosnap/examples/local/.quilt/named_packages/census/administrative/latest'
finally, if i try to list my local packages, it returns a generator:
quilt3.list_packages()
<generator object _list_packages at 0x7fd188906cd0>
This is using quilt3 3.1.14.
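The path in the FileNotFoundError above looks like the notebook's working directory with "local" appended, which suggests the string "local" was taken as a relative directory name rather than as a keyword for the local registry. The sketch below only reproduces that path arithmetic. It is a reading of the traceback, not quilt3's actual code, and latest_pointer_path is a hypothetical helper introduced for illustration.

```python
from pathlib import PurePosixPath


def latest_pointer_path(registry, name, cwd):
    """Build the 'latest' pointer path for a registry given as a bare string.

    Assumption (inferred from the traceback): a registry string without a
    scheme, such as "local", is treated as a directory relative to the
    current working directory.
    """
    registry_dir = PurePosixPath(cwd) / registry
    return registry_dir / ".quilt" / "named_packages" / name / "latest"


path = latest_pointer_path(
    "local",
    "census/administrative",
    "/Users/knaaptime/Dropbox/projects/geosnap/examples",
)
print(path)
```

Under that assumption, the printed path matches the one in the error message, so the lookup fails simply because no directory named local exists next to the notebook.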
I think I have mostly good news.
- The generator is expected and is a performance optimization. Try list(quilt3.list_packages()).
- .get() will always return the manifest location; you can try .get_cached_path() and let me know if that works better. It will only work after you successfully install said package locally.
Let us know how that pans out.
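For readers unfamiliar with why a generator shows up instead of a list, here is a minimal sketch of the behavior. list_names is a hypothetical stand-in, not the quilt3 function; it only illustrates that a generator produces nothing until it is consumed, and that list() materializes it.

```python
def list_names():
    """Hypothetical stand-in for a lazy listing function."""
    for name in ("census/administrative", "nauto/trips"):
        yield name


names = list_names()          # no items have been produced yet
print(names)                  # prints a generator object, not the names

materialized = list(names)    # consuming the generator yields the items
print(materialized)

leftover = list(names)        # a generator can only be consumed once
print(leftover)
```

The last line is worth noting: once consumed, the generator is empty, so keep the list if you need the names more than once.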
To browse a local package do this:
p = quilt3.Package.browse('nauto/trips')
Let me know if you find any inconsistencies in the docs around how we name the local registry, but it's usually specified with None for the registry.
I thought None was the way to go, but since I was getting the behavior above using p = quilt3.Package.browse('census/administrative'), I figured I'd give "local" a shot to see if I could force it to find the local path.
Actually, none of the get_* methods are available:
dir(p)
['__class__',
'__contains__',
'__delattr__',
'__dict__',
'__dir__',
'__doc__',
'__eq__',
'__format__',
'__ge__',
'__getattribute__',
'__getitem__',
'__gt__',
'__hash__',
'__init__',
'__init_subclass__',
'__iter__',
'__le__',
'__len__',
'__lt__',
'__module__',
'__ne__',
'__new__',
'__reduce__',
'__reduce_ex__',
'__repr__',
'__setattr__',
'__sizeof__',
'__str__',
'__subclasshook__',
'__weakref__',
'_browse',
'_build',
'_children',
'_dump',
'_dump_manifest_to_scratch',
'_ensure_subpackage',
'_filter',
'_fix_sha256',
'_from_path',
'_load',
'_map',
'_meta',
'_set',
'_set_commit_message',
'_shorten_tophash',
'_split_key',
'_walk_dir_meta',
'browse',
'build',
'delete',
'diff',
'dump',
'fetch',
'filter',
'get',
'install',
'keys',
'load',
'manifest',
'map',
'meta',
'push',
'readme',
'resolve_hash',
'rollback',
'set',
'set_dir',
'set_meta',
'top_hash',
'verify',
'walk']
Ah, my mistake: wrong syntax. That should be p['msa_definitions.parquet'].get_cached_path()
The reason this came up is that I'm seeing a noticeable performance hit when accessing files that should normally be fast I/O, and it appears to be because the data are streaming rather than being read locally.
Using the Package['file']() syntax loads the remote version of the file rather than the local one (and if I cancel a long-running file load, the process that gets cut short is a boto transaction).
Does get_cached_path return anything for you after an install? If not, can you include a minimal repro here where the install succeeds but there is no cached path, and I'll have an engineer take a look?
import quilt3
import pandas as pd
quilt3.Package.install(
"census/administrative",
"s3://spatial-ucr",
)
Loading manifest: 100%|██████████| 5/5 [00:00<00:00, 7576.42entries/s]
Successfully installed package 'census/administrative', tophash=0616d02 from s3://spatial-ucr
p = quilt3.Package.browse('census/administrative')
Loading manifest: 100%|██████████| 5/5 [00:00<00:00, 4036.09entries/s]
p['msa_definitions.parquet'].get_cached_path()
'/Users/knaaptime/Library/Application Support/Quilt/packages/census/administrative/msa_definitions.parquet'
If I use the cached path to read the file in directly with pandas, I get much better performance than trying to deserialize using the () syntax. I'm confident this is because the () syntax is streaming the remote file.
%%timeit # reading in the file directly
df = pd.read_parquet(p['msa_definitions.parquet'].get_cached_path())
7.08 ms ± 242 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit # using native deserialization
df = p['msa_definitions.parquet']()
The slowest run took 4.64 times longer than the fastest. This could mean that an intermediate result is being cached.
944 ms ± 768 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
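Given the timings above, one possible workaround is a small wrapper that reads from the cached path when one exists and only falls back to the () syntax otherwise. This is a sketch, not part of quilt3: read_entry_locally is a hypothetical helper, and it assumes that get_cached_path() returns None when the file has not been cached.

```python
import os


def read_entry_locally(entry, reader):
    """Read a package entry from its local cache when possible.

    `entry` is expected to behave like p['msa_definitions.parquet'] in the
    thread: it exposes get_cached_path() and can be called for native
    deserialization. `reader` is any function that takes a file path,
    for example pandas.read_parquet.
    """
    # Assumption: get_cached_path() returns None if nothing is cached.
    cached = entry.get_cached_path()
    if cached is not None and os.path.exists(cached):
        return reader(cached)
    # No usable local copy: fall back to the entry's own deserialization,
    # which in the thread appears to stream the remote file.
    return entry()
```

With the package from this thread it could be called as read_entry_locally(p['msa_definitions.parquet'], pd.read_parquet), which should take the fast path whenever the install has populated the cache.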