quiltdata/quilt

package ignoring locally installed versions

knaaptime opened this issue · 9 comments

hey folks, I'm able to install a package locally, but doing so doesn't seem to accomplish anything because quilt always defaults to the remote version

import quilt3
quilt3.Package.install(
    "census/administrative",
    "s3://spatial-ucr",
)

restart notebook kernel

from quilt3.data.census import administrative
administrative.get('msa_definitions.parquet') # this should be local now

returns

's3://spatial-ucr/census/administrative/msa_definitions.parquet?versionId=y5lH1FmQZmmnCXh5x180fiVAWOjYuitb'

the files do exist where they should at /Users/knaaptime/Library/Application Support/Quilt/packages/census/administrative/ but all the package methods ignore them. Further, if i do

admin = quilt3.Package.browse("census/administrative", "local")

it appears to be looking in the wrong place, returning this error:

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-6-eb0fe4f642ff> in <module>
----> 1 admin = quilt3.Package.browse("census/administrative", "local")

~/anaconda3/envs/geosnap/lib/python3.7/site-packages/quilt3/telemetry.py in decorated(*args, **kwargs)
    129             ApiTelemetry.report_api_use(self.api_name, ApiTelemetry.session_id)
    130 
--> 131             results = func(*args, **kwargs)
    132             # print(f"{len(ApiTelemetry.pending_reqs)} request(s) pending!")
    133 

~/anaconda3/envs/geosnap/lib/python3.7/site-packages/quilt3/packages.py in browse(cls, name, registry, top_hash)
    530             top_hash(string): top hash of package version to load
    531         """
--> 532         return cls._browse(name=name, registry=registry, top_hash=top_hash)
    533 
    534     @classmethod

~/anaconda3/envs/geosnap/lib/python3.7/site-packages/quilt3/packages.py in _browse(cls, name, registry, top_hash)
    543         if top_hash is None:
    544             top_hash_file = registry_parsed.join(f'.quilt/named_packages/{name}/latest')
--> 545             top_hash = get_bytes(top_hash_file).decode('utf-8').strip()
    546         else:
    547             top_hash = cls.resolve_hash(registry_parsed, top_hash)

~/anaconda3/envs/geosnap/lib/python3.7/site-packages/quilt3/data_transfer.py in get_bytes(src)
    761     if src.is_local():
    762         src_file = pathlib.Path(src.path)
--> 763         data = src_file.read_bytes()
    764     else:
    765         params = dict(Bucket=src.bucket, Key=src.path)

~/anaconda3/envs/geosnap/lib/python3.7/pathlib.py in read_bytes(self)
   1212         Open the file in bytes mode, read it, and close the file.
   1213         """
-> 1214         with self.open(mode='rb') as f:
   1215             return f.read()
   1216 

~/anaconda3/envs/geosnap/lib/python3.7/pathlib.py in open(self, mode, buffering, encoding, errors, newline)
   1206             self._raise_closed()
   1207         return io.open(self, mode, buffering, encoding, errors, newline,
-> 1208                        opener=self._opener)
   1209 
   1210     def read_bytes(self):

~/anaconda3/envs/geosnap/lib/python3.7/pathlib.py in _opener(self, name, flags, mode)
   1061     def _opener(self, name, flags, mode=0o666):
   1062         # A stub for the opener argument to built-in open()
-> 1063         return self._accessor.open(self, flags, mode)
   1064 
   1065     def _raw_open(self, flags, mode=0o777):

FileNotFoundError: [Errno 2] No such file or directory: '/Users/knaaptime/Dropbox/projects/geosnap/examples/local/.quilt/named_packages/census/administrative/latest'

finally, if i try to list my local packages, it returns a generator:

quilt3.list_packages()
<generator object _list_packages at 0x7fd188906cd0>

this is using quilt3 3.1.14

I think I have mostly good news.

  • The generator is expected and a perf optimization. Try list(quilt3.list_packages()).
  • .get() will always return the manifest location; you can try .get_cached_path() and let me know if that works better. It will only work after you successfully install said package locally.
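To illustrate the first point, here is a minimal sketch of why printing a generator shows only its repr and why wrapping it in list() helps. `list_packages_sketch` is a hypothetical stand-in, not quilt3's implementation, and the package names are placeholders taken from this thread.

```python
from typing import Iterator


def list_packages_sketch() -> Iterator[str]:
    """Hypothetical stand-in for quilt3.list_packages(): yields names lazily."""
    # Placeholder names; a real registry would be read here instead.
    for name in ("census/administrative", "nauto/trips"):
        yield name


packages = list_packages_sketch()
# Printing the generator itself shows only its repr, not the names.
print(packages)
# Consuming it with list() produces the actual package names.
names = list(packages)
print(names)
```

A generator can be consumed only once, so keep the list around if the names are needed again.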

Let us know how that pans out.

To browse a local package do this:
p = quilt3.Package.browse('nauto/trips')

Let me know if you find any inconsistencies in the docs around how we name the local registry, but it's usually specified with None for the registry.
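The FileNotFoundError earlier in the thread suggests that the string "local" was treated as a relative directory rather than as a keyword for the local registry. The sketch below reproduces that path arithmetic with pathlib only. `pointer_path_sketch` is a hypothetical helper, and the assumption that a scheme-less registry string resolves against the working directory is inferred from the error message, not from quilt3's source.

```python
from pathlib import PurePosixPath


def pointer_path_sketch(registry: str, name: str, cwd: PurePosixPath) -> PurePosixPath:
    """Hypothetical helper: where the 'latest' pointer would be looked up."""
    base = PurePosixPath(registry)
    # Assumption: a registry string without a URL scheme is a filesystem path,
    # and a relative one is resolved against the current working directory.
    if not base.is_absolute():
        base = cwd / base
    return base / ".quilt" / "named_packages" / name / "latest"


cwd = PurePosixPath("/Users/knaaptime/Dropbox/projects/geosnap/examples")
print(pointer_path_sketch("local", "census/administrative", cwd))
# → /Users/knaaptime/Dropbox/projects/geosnap/examples/local/.quilt/named_packages/census/administrative/latest
```

The printed path matches the one in the traceback, which is consistent with omitting the registry argument (leaving it as None) being the way to reach the local registry.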

thanks for the impressively speedy reply, but it looks like get_cached_path doesn't seem to be available


> Let me know if you find any inconsistencies in the docs around how we name the local registry, but it's usually specified with None for the registry.

I thought None was the way to go, but when I was getting the behavior above using p = quilt3.Package.browse('census/administrative'), i figured i'd try passing "local" as the registry (as in my first post) to see if i could force it to find the local path.

actually, none of the get_* methods are available

dir(p)

['__class__',
 '__contains__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_browse',
 '_build',
 '_children',
 '_dump',
 '_dump_manifest_to_scratch',
 '_ensure_subpackage',
 '_filter',
 '_fix_sha256',
 '_from_path',
 '_load',
 '_map',
 '_meta',
 '_set',
 '_set_commit_message',
 '_shorten_tophash',
 '_split_key',
 '_walk_dir_meta',
 'browse',
 'build',
 'delete',
 'diff',
 'dump',
 'fetch',
 'filter',
 'get',
 'install',
 'keys',
 'load',
 'manifest',
 'map',
 'meta',
 'push',
 'readme',
 'resolve_hash',
 'rollback',
 'set',
 'set_dir',
 'set_meta',
 'top_hash',
 'verify',
 'walk']

ah, my mistake. Wrong syntax. that should be p['msa_definitions.parquet'].get_cached_path()

The reason this came up is that I'm seeing a noticeable performance hit when accessing files that should normally be fast I/O, and it appears to be because the data are streaming rather than being read locally


using the Package['file']() syntax loads the remote version of the file rather than the local one (and if i cancel a long-running file load, the process that gets cut short is a boto transaction)

Does get_cached_path return anything for you after an install? If not, can you include a minimal repro here where the install succeeds but there is no cached path? I'll have an engineer take a look.

import quilt3
import pandas as pd
quilt3.Package.install(
    "census/administrative",
    "s3://spatial-ucr",
)
Loading manifest: 100%|██████████| 5/5 [00:00<00:00, 7576.42entries/s]

Successfully installed package 'census/administrative', tophash=0616d02 from s3://spatial-ucr
p = quilt3.Package.browse('census/administrative')
Loading manifest: 100%|██████████| 5/5 [00:00<00:00, 4036.09entries/s]
p['msa_definitions.parquet'].get_cached_path()
'/Users/knaaptime/Library/Application Support/Quilt/packages/census/administrative/msa_definitions.parquet'

If i use the path returned by get_cached_path() to read the file directly with pandas, I get much better performance than deserializing with the () syntax. I'm confident this is because the () syntax is streaming the remote file

%%timeit   #  reading in the file directly

df = pd.read_parquet(p['msa_definitions.parquet'].get_cached_path())
7.08 ms ± 242 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit  #  using native deserialization

df = p['msa_definitions.parquet']()
The slowest run took 4.64 times longer than the fastest. This could mean that an intermediate result is being cached.
944 ms ± 768 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
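A possible workaround, sketched under assumptions: prefer the cached copy when one exists and fall back to the entry's own deserialization otherwise. `read_prefer_cached` is a hypothetical helper, not part of quilt3. It assumes only what the thread shows (an entry exposes get_cached_path() and is callable) plus the assumption that get_cached_path() returns None or a missing path when nothing is cached.

```python
from pathlib import Path
from typing import Any, Callable


def read_prefer_cached(entry: Any, reader: Callable[[str], Any]) -> Any:
    """Hypothetical helper: read the locally cached file if there is one."""
    # Assumption: get_cached_path() returns a filesystem path string,
    # or None when the entry has not been installed locally.
    cached = entry.get_cached_path()
    if cached and Path(cached).exists():
        # Fast path: hand the local file to the caller-supplied reader.
        return reader(cached)
    # Fallback: let the entry deserialize itself (may stream from S3).
    return entry()
```

With the package from this thread, usage would look like `df = read_prefer_cached(p['msa_definitions.parquet'], pd.read_parquet)`.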