Data points are discontinuous in sampling with the `--load_fast` (aka Rust board) implementation
way-zer opened this issue · 3 comments
Environment information (required)
Diagnostics
Diagnostics output
--- check: autoidentify
INFO: diagnose_tensorboard.py version df7af2c6fc0e4c4a5b47aeae078bc7ad95777ffa
--- check: general
INFO: sys.version_info: sys.version_info(major=3, minor=11, micro=8, releaselevel='final', serial=0)
INFO: os.name: posix
INFO: os.uname(): posix.uname_result(sysname='Linux', nodename='zw6f6', release='5.4.0-167-generic', version='#184-Ubuntu SMP Tue Oct 31 09:21:49 UTC 2023', machine='x86_64')
INFO: sys.getwindowsversion(): N/A
--- check: package_management
INFO: has conda-meta: True
INFO: $VIRTUAL_ENV: None
--- check: installed_packages
WARNING: no installation among: ['tb-nightly', 'tensorboard', 'tensorflow-tensorboard']
WARNING: no installation among: ['tensorflow', 'tensorflow-gpu', 'tf-nightly', 'tf-nightly-2.0-preview', 'tf-nightly-gpu', 'tf-nightly-gpu-2.0-preview']
WARNING: no installation among: ['tensorflow-estimator', 'tensorflow-estimator-2.0-preview', 'tf-estimator-nightly']
--- check: tensorboard_python_version
INFO: tensorboard.version.VERSION: '2.16.2'
--- check: tensorflow_python_version
Traceback (most recent call last):
File "/root/rl4net/examples/diagnose_tensorboard.py", line 511, in main
suggestions.extend(check())
^^^^^^^
File "/root/rl4net/examples/diagnose_tensorboard.py", line 81, in wrapper
result = fn()
^^^^
File "/root/rl4net/examples/diagnose_tensorboard.py", line 267, in tensorflow_python_version
import tensorflow as tf
ModuleNotFoundError: No module named 'tensorflow'
--- check: tensorboard_data_server_version
INFO: data server binary: '/root/micromamba/envs/rl4net/lib/python3.11/site-packages/tensorboard_data_server/bin/server'
INFO: data server binary version: b'rustboard 0.7.0'
--- check: tensorboard_binary_path
INFO: which tensorboard: b'/root/micromamba/envs/rl4net/bin/tensorboard\n'
--- check: addrinfos
socket.has_ipv6 = True
socket.AF_UNSPEC = <AddressFamily.AF_UNSPEC: 0>
socket.SOCK_STREAM = <SocketKind.SOCK_STREAM: 1>
socket.AI_ADDRCONFIG = <AddressInfo.AI_ADDRCONFIG: 32>
socket.AI_PASSIVE = <AddressInfo.AI_PASSIVE: 1>
Loopback flags: <AddressInfo.AI_ADDRCONFIG: 32>
Loopback infos: [(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('127.0.0.1', 0))]
Wildcard flags: <AddressInfo.AI_PASSIVE: 1>
Wildcard infos: [(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('0.0.0.0', 0)), (<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('::', 0, 0, 0))]
--- check: readable_fqdn
INFO: socket.getfqdn(): 'zw6f6'
--- check: stat_tensorboardinfo
INFO: directory: /tmp/.tensorboard-info
INFO: os.stat(...): os.stat_result(st_mode=16895, st_ino=62270220, st_dev=2097307, st_nlink=2, st_uid=0, st_gid=0, st_size=4096, st_atime=1710727568, st_mtime=1710727551, st_ctime=1710727551)
INFO: mode: 0o40777
--- check: source_trees_without_genfiles
INFO: tensorboard_roots (1): ['/root/micromamba/envs/rl4net/lib/python3.11/site-packages']; bad_roots (0): []
--- check: full_pip_freeze
INFO: pip freeze --all:
absl-py @ file:///home/conda/feedstock_root/build_artifacts/absl-py_1705494584803/work
aim==3.18.1
aim-ui==3.18.1
aimrecords==0.0.7
aimrocks==0.4.0
aiofiles==23.2.1
aiohttp @ file:///home/conda/feedstock_root/build_artifacts/aiohttp_1707669768135/work
aiosignal @ file:///home/conda/feedstock_root/build_artifacts/aiosignal_1667935791922/work
alembic==1.13.1
anyio @ file:///home/conda/feedstock_root/build_artifacts/anyio_1708355285029/work
ase @ file:///home/conda/feedstock_root/build_artifacts/ase_1638384343806/work
attrs @ file:///home/conda/feedstock_root/build_artifacts/attrs_1704011227531/work
base58==2.0.1
black @ file:///home/conda/feedstock_root/build_artifacts/black-recipe_1708248203050/work
blinker @ file:///home/conda/feedstock_root/build_artifacts/blinker_1698890160476/work
Brotli @ file:///home/conda/feedstock_root/build_artifacts/brotli-split_1695989787169/work
cached-property @ file:///home/conda/feedstock_root/build_artifacts/cached_property_1615209429212/work
cachetools==5.3.3
captum==0.7.0
certifi @ file:///home/conda/feedstock_root/build_artifacts/certifi_1707022139797/work/certifi
cffi @ file:///croot/cffi_1700254295673/work
charset-normalizer @ file:///home/conda/feedstock_root/build_artifacts/charset-normalizer_1698833585322/work
click @ file:///home/conda/feedstock_root/build_artifacts/click_1692311806742/work
cloudpickle==3.0.0
colorama @ file:///home/conda/feedstock_root/build_artifacts/colorama_1666700638685/work
contourpy @ file:///home/conda/feedstock_root/build_artifacts/contourpy_1699041375599/work
cryptography==42.0.5
cycler @ file:///home/conda/feedstock_root/build_artifacts/cycler_1696677705766/work
docutils @ file:///home/conda/feedstock_root/build_artifacts/docutils_1701882599793/work
exceptiongroup @ file:///home/conda/feedstock_root/build_artifacts/exceptiongroup_1704921103267/work
fastapi==0.110.0
filelock @ file:///home/conda/feedstock_root/build_artifacts/filelock_1698714947081/work
Flask @ file:///home/conda/feedstock_root/build_artifacts/flask_1707043907952/work
fonttools @ file:///home/conda/feedstock_root/build_artifacts/fonttools_1708049097969/work
frozenlist @ file:///home/conda/feedstock_root/build_artifacts/frozenlist_1702645450877/work
fsspec==2024.3.0
gmpy2 @ file:///home/conda/feedstock_root/build_artifacts/gmpy2_1666808665953/work
google-auth @ file:///opt/conda/conda-bld/google-auth_1646735974934/work
google-auth-oauthlib @ file:///work/ci_py311_2/google-auth-oauthlib_1679340681059/work
greenlet==3.0.3
grpcio @ file:///home/conda/feedstock_root/build_artifacts/grpc-split_1700258025969/work
h11 @ file:///home/conda/feedstock_root/build_artifacts/h11_1664132893548/work
h5py @ file:///home/conda/feedstock_root/build_artifacts/h5py_1702471424890/work
idna @ file:///home/conda/feedstock_root/build_artifacts/idna_1701026962277/work
imagecodecs @ file:///home/conda/feedstock_root/build_artifacts/imagecodecs_1704019718039/work
imageio @ file:///home/conda/feedstock_root/build_artifacts/imageio_1707730027807/work
importlib-metadata @ file:///home/conda/feedstock_root/build_artifacts/importlib-metadata_1703269254275/work
isodate @ file:///home/conda/feedstock_root/build_artifacts/isodate_1639582763789/work
itsdangerous @ file:///home/conda/feedstock_root/build_artifacts/itsdangerous_1648147185463/work
jedi @ file:///home/conda/feedstock_root/build_artifacts/jedi_1696326070614/work
Jinja2 @ file:///home/conda/feedstock_root/build_artifacts/jinja2_1704966972576/work
joblib @ file:///home/conda/feedstock_root/build_artifacts/joblib_1691577114857/work
kiwisolver @ file:///home/conda/feedstock_root/build_artifacts/kiwisolver_1695379920604/work
lazy_loader @ file:///home/conda/feedstock_root/build_artifacts/lazy_loader_1692295373316/work
lightning-utilities @ file:///home/conda/feedstock_root/build_artifacts/lightning-utilities_1705619433111/work
llvmlite==0.42.0
Mako==1.3.2
marimo @ file:///home/conda/feedstock_root/build_artifacts/marimo_1710189166977/work
Markdown @ file:///home/conda/feedstock_root/build_artifacts/markdown_1704908347571/work
MarkupSafe @ file:///home/conda/feedstock_root/build_artifacts/markupsafe_1706899926732/work
matplotlib @ file:///home/conda/feedstock_root/build_artifacts/matplotlib-suite_1708026439111/work
mpmath @ file:///home/conda/feedstock_root/build_artifacts/mpmath_1678228039184/work
multidict @ file:///home/conda/feedstock_root/build_artifacts/multidict_1707040702345/work
munkres==1.1.4
mypy-extensions @ file:///home/conda/feedstock_root/build_artifacts/mypy_extensions_1675543315189/work
networkx @ file:///home/conda/feedstock_root/build_artifacts/networkx_1698504735452/work
numba @ file:///home/conda/feedstock_root/build_artifacts/numba_1707024788644/work
numpy @ file:///home/conda/feedstock_root/build_artifacts/numpy_1707225376651/work/dist/numpy-1.26.4-cp311-cp311-linux_x86_64.whl#sha256=d08e1c9e5833ae7780563812aa73e2497db1ee3bd5510d3becb8aa18aa2d0c7c
oauthlib @ file:///croot/oauthlib_1679489621486/work
opt-einsum @ file:///home/conda/feedstock_root/build_artifacts/opt_einsum_1696448916724/work
packaging @ file:///home/conda/feedstock_root/build_artifacts/packaging_1696202382185/work
pandas @ file:///home/conda/feedstock_root/build_artifacts/pandas_1708708634263/work
parso @ file:///home/conda/feedstock_root/build_artifacts/parso_1638334955874/work
pathspec @ file:///home/conda/feedstock_root/build_artifacts/pathspec_1702249949303/work
patsy @ file:///home/conda/feedstock_root/build_artifacts/patsy_1704469236901/work
pillow @ file:///home/conda/feedstock_root/build_artifacts/pillow_1704252032614/work
pip==24.0
platformdirs @ file:///croot/platformdirs_1692205439124/work
protobuf==4.24.4
psutil @ file:///home/conda/feedstock_root/build_artifacts/psutil_1705722403006/work
pyasn1 @ file:///Users/ktietz/demo/mc3/conda-bld/pyasn1_1629708007385/work
pyasn1-modules==0.2.8
pycparser @ file:///tmp/build/80754af9/pycparser_1636541352034/work
pydantic==2.6.3
pydantic_core==2.16.3
Pygments @ file:///home/conda/feedstock_root/build_artifacts/pygments_1700607939962/work
PyJWT @ file:///work/ci_py311/pyjwt_1676827385359/work
pymdown-extensions @ file:///home/conda/feedstock_root/build_artifacts/pymdown-extensions_1703982974286/work
pynndescent @ file:///home/conda/feedstock_root/build_artifacts/pynndescent_1700514549498/work
pyOpenSSL @ file:///croot/pyopenssl_1708380408460/work
pyparsing @ file:///home/conda/feedstock_root/build_artifacts/pyparsing_1690737849915/work
pyrallis==0.3.1
PySocks @ file:///home/conda/feedstock_root/build_artifacts/pysocks_1661604839144/work
python-dateutil @ file:///home/conda/feedstock_root/build_artifacts/python-dateutil_1709299778482/work
pytz @ file:///home/conda/feedstock_root/build_artifacts/pytz_1706886791323/work
PyWavelets @ file:///home/conda/feedstock_root/build_artifacts/pywavelets_1695567566807/work
PyYAML @ file:///home/conda/feedstock_root/build_artifacts/pyyaml_1695373611984/work
pyzmq @ file:///home/conda/feedstock_root/build_artifacts/pyzmq_1701783162530/work
rdflib @ file:///home/conda/feedstock_root/build_artifacts/rdflib-split_1690986372614/work
requests @ file:///home/conda/feedstock_root/build_artifacts/requests_1684774241324/work
requests-oauthlib==1.3.0
RestrictedPython==7.0
-e git+egg=rl4net
rsa @ file:///tmp/build/80754af9/rsa_1614366226499/work
ruff @ file:///home/conda/feedstock_root/build_artifacts/ruff_1709955894551/work
scikit-image @ file:///home/conda/feedstock_root/build_artifacts/scikit-image_1697028611470/work/dist/scikit_image-0.22.0-cp311-cp311-linux_x86_64.whl#sha256=53d8b95f752df47007e9e71dd1c9805b9334e1e4791cf48e3762abb922636f04
scikit-learn @ file:///home/conda/feedstock_root/build_artifacts/scikit-learn_1708073809211/work
scipy @ file:///home/conda/feedstock_root/build_artifacts/scipy-split_1706041487672/work/dist/scipy-1.12.0-cp311-cp311-linux_x86_64.whl#sha256=c4f0d8ecd4373069a033d0ee818c2fe5959c8828937fa46deb00a478190f703a
setuptools==69.1.1
six @ file:///home/conda/feedstock_root/build_artifacts/six_1620240208055/work
sniffio @ file:///home/conda/feedstock_root/build_artifacts/sniffio_1708952932303/work
SQLAlchemy==2.0.28
starlette==0.36.3
statsmodels @ file:///home/conda/feedstock_root/build_artifacts/statsmodels_1702575375433/work
sympy @ file:///home/conda/feedstock_root/build_artifacts/sympy_1684180540116/work
tensorboard @ file:///home/conda/feedstock_root/build_artifacts/tensorboard_1708285739699/work/tensorboard-2.16.2-py3-none-any.whl#sha256=9f2b4e7dad86667615c0e5cd072f1ea8403fc032a299f0072d6f74855775cc45
tensorboard-data-server @ file:///home/conda/feedstock_root/build_artifacts/tensorboard-data-server_1695425375375/work/tensorboard_data_server-0.7.0-py3-none-manylinux2014_x86_64.whl#sha256=4a87e32f17958007f01c1acb90cf7aab5877e41b1a929e3a016020697c37b53d
tensorboardX @ file:///tmp/build/80754af9/tensorboardx_1621440489103/work
tensordict==0.3.1
threadpoolctl @ file:///home/conda/feedstock_root/build_artifacts/threadpoolctl_1707930541534/work
tifffile @ file:///home/conda/feedstock_root/build_artifacts/tifffile_1707824820518/work
tomlkit @ file:///home/conda/feedstock_root/build_artifacts/tomlkit_1709043728182/work
torch==2.2.1
torch-scatter @ file:///usr/share/miniconda/envs/test/conda-bld/pytorch-scatter_1706804494952/work
torch_geometric @ file:///home/conda/feedstock_root/build_artifacts/pytorch_geometric_1708619951869/work
torchmetrics @ file:///home/conda/feedstock_root/build_artifacts/torchmetrics_1701462872995/work
torchrl==0.3.1
tornado @ file:///home/conda/feedstock_root/build_artifacts/tornado_1708363099148/work
tqdm @ file:///home/conda/feedstock_root/build_artifacts/tqdm_1707598593068/work
trimesh @ file:///home/conda/feedstock_root/build_artifacts/trimesh_1709252138892/work
triton==2.2.0
typing-inspect==0.9.0
typing_extensions @ file:///home/conda/feedstock_root/build_artifacts/typing_extensions_1708904622550/work
tzdata @ file:///home/conda/feedstock_root/build_artifacts/python-tzdata_1707747584337/work
urllib3 @ file:///home/conda/feedstock_root/build_artifacts/urllib3_1708239446578/work
uvicorn @ file:///home/conda/feedstock_root/build_artifacts/uvicorn-split_1707597428881/work
websockets @ file:///home/conda/feedstock_root/build_artifacts/websockets_1697914680106/work
Werkzeug @ file:///home/conda/feedstock_root/build_artifacts/werkzeug_1698235201373/work
wheel==0.42.0
yarl @ file:///home/conda/feedstock_root/build_artifacts/yarl_1705508295175/work
zipp @ file:///home/conda/feedstock_root/build_artifacts/zipp_1695255097490/work
For browser-related issues, please additionally specify:
- Browser type and version (e.g., Chrome 64.0.3282.140): Microsoft Edge 122.0.2365.92
- Screenshot, if it’s a visual issue:
Issue description
There is a significant interruption in data point sampling when using TensorBoard.
Using `EventAccumulator`, I checked that the data file is complete (see the sketch below). Using `--samples_per_plugin=scalars=10000` also works, but slowly.
Data file: events.out.tfevents.zip
Steps to reproduce: open the tfevents file with TensorBoard.
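For reference, a minimal sketch of the kind of completeness check mentioned above, using `EventAccumulator` (the log path and the `env/delay` tag are placeholders based on this report, not the reporter's actual script):

```python
from tensorboard.backend.event_processing.event_accumulator import (
    EventAccumulator,
)

# A size guidance of 0 disables downsampling, so every stored scalar
# point is loaded.
acc = EventAccumulator("events.out.tfevents", size_guidance={"scalars": 0})
acc.Reload()

events = acc.Scalars("env/delay")  # placeholder tag from this report
steps = [e.step for e in events]
gaps = [b - a for a, b in zip(steps, steps[1:])]
# A complete file shows small, regular step gaps with no long holes.
print(f"{len(steps)} points, max step gap: {max(gaps)}")
```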
Correct, data is sampled, and the behavior can be overridden by the flag that you mentioned. This is working as intended.
The sampling algorithm has a few attributes that influenced the design choice:
- While the data is still being written, it is not known how big the population will be.
- Generally, users are interested in seeing the last logged value.
- The implementation is deterministic, so you would always see the same results.
Due to this, we use a reservoir sampling implementation that keeps the last value. You can find it here. Unfortunately, as the population grows larger than the sample size, it becomes likely that the algorithm will just keep replacing the most recently read value.
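For illustration, a stripped-down sketch of this kind of always-keep-last reservoir (a simplification for this discussion, not the actual `ReservoirBucket` code):

```python
import random

class LastKeptReservoir:
    """Simplified always-keep-last reservoir, for illustration only."""

    def __init__(self, max_size, seed=0):
        self.max_size = max_size
        self.items = []
        self.num_seen = 0
        self.rng = random.Random(seed)  # seeded, hence deterministic

    def add(self, item):
        self.num_seen += 1
        if len(self.items) < self.max_size:
            self.items.append(item)
            return
        r = self.rng.randint(0, self.num_seen - 1)
        if r < self.max_size:
            # With probability max_size / num_seen, replace a uniformly
            # random item, keeping a uniform sample of the history.
            self.items.pop(r)
            self.items.append(item)
        else:
            # Otherwise overwrite only the last slot, so the most
            # recently seen value is always present in the sample.
            self.items[-1] = item
```

As `num_seen` grows, the `else` branch dominates, which is the replacement behavior described above.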
It was interesting to think about this. I came up with an implementation that attempts to be fairer in keeping a representative sample, with a trade-off in memory usage. I think we would need to put more thought into this if we wanted to submit this change for the actual implementation, but you're welcome to fork our repo and change the implementation in the meantime, if you'd like.
Changing the code to something like this:
```python
def AddItem(self, item, f=lambda x: x):
    """Add an item to the ReservoirBucket, replacing an old item if
    necessary.

    If the bucket has reached capacity, then an old item will be replaced
    with probability (_max_size/_num_items_seen).

    It is expected that the "add" operations will be far more frequent
    than the "read" operations. Therefore, the list keeps track of
    insertion order as a tuple (index, val); the replacement is done
    in-place at a random position in the items list; and when the elements
    are read via the Items() method, the list is sorted by insertion order
    using the first value in the tuple.

    This means insertion is O(1) (at the cost of using more memory, but
    still O(k)). Reading the items should be O(n*log(n)).

    Args:
      item: The item to add to the bucket.
      f: A function to transform item before addition, if it will be kept
        in the reservoir.
    """
    with self._mutex:
        self._num_items_seen += 1
        # The count of num_items_seen serves as an insertion index, so we
        # can insert efficiently and still return the data in insertion
        # order when the items are read.
        new_item = (self._num_items_seen, f(item))
        self._latest_seen = new_item
        if self._items_len < self._max_size or self._max_size == 0:
            self.items.append(new_item)
            self._items_len += 1
        else:
            # Attempts to make the sampling unlikely to entirely replace
            # the previously seen values. As the population grows larger,
            # it becomes less likely that a value will be replaced.
            sample_ratio = self._max_size / float(self._num_items_seen)
            if self._random.random() < sample_ratio:
                r = self._random.randint(0, self._max_size - 1)
                # Replace the item without sorting, for efficient writing.
                self.items[r] = new_item

def Items(self):
    """Get all the items in the bucket.

    If self.always_keep_last is true, it will replace the last element in
    the sample with the last element seen.

    Calling this method has O(n*log(n)) runtime complexity, but reads are
    less frequent than writes, which are O(1) with this implementation,
    and it keeps a somewhat more representative sample.

    Perhaps some optimizations can be done to avoid recalculating the list
    when nothing has changed.
    """
    with self._mutex:
        sorted_list = sorted(self.items, key=lambda x: x[0])
        # Guard against an empty bucket before replacing the last slot.
        if self.always_keep_last and sorted_list:
            sorted_list[-1] = self._latest_seen
        return [x[1] for x in sorted_list]
```
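As a quick sanity check on the idea (not part of the proposed patch), here is a self-contained toy simulation of the replacement rule above; it tracks only insertion indices, and note that in the real class `_items_len` and `_latest_seen` would also need to be initialized in `__init__`:

```python
import random

def simulate_proposed(n_points=100_000, max_size=10, seed=0):
    """Toy run of the proposed rule: once the bucket is full, a new item
    replaces a uniformly random slot with probability
    max_size / num_items_seen."""
    rng = random.Random(seed)
    items = []  # insertion indices only, standing in for (index, value)
    for i in range(1, n_points + 1):
        if len(items) < max_size:
            items.append(i)
        elif rng.random() < max_size / float(i):
            items[rng.randrange(max_size)] = i
    items.sort()          # Items() sorts by insertion order
    items[-1] = n_points  # always_keep_last behavior at read time
    return items

# The kept indices spread across the whole range instead of clustering
# near the end:
print(simulate_proposed())
```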
To compare, the view with `--samples_per_plugin=scalars=20000` looks like this:

[screenshot]

And the sampled view with this implementation (still using the default sample size) looks like this:

[screenshot]
Having said that, here are a few notes to consider:
- This implementation only takes effect with the flag `--load_fast=false`. Fast loading is enabled by default whenever the `tensorboard-data-server` package is also installed; whenever `load_fast` is enabled, the Rust code is used rather than the Python code.
- I suppose we could look at and change the implementation in the Rust code as well, but we don't have the bandwidth to look into that at the moment.
- This might also be why changing the sampling was slow for you: if you don't have that package installed, installing it and then using the sampling flag may not be as slow, which would be a simpler alternative. You can learn about this a bit here (although that guide is for development; generally, if you install that package, loading should be faster in many cases by default).
- This implementation hasn't been tested much, nor analyzed for broader use cases.
First of all, thank you for your detailed explanation of sampling.
I would like to add some information.
- The issue mentioned above is based on the Rust implementation (`load_fast=true`).
- When using `load_fast=false` without `samples_per_plugin`, it also works, with an almost uniform sampling interval.
- What I said about "slow" refers to viewing the data, i.e. loading it in the frontend. With fewer than 20 experiments, for the scalar `env/delay` with `--samples_per_plugin=scalars=10000`, it takes more than 10 seconds to load the data.
- The main problem is why there is a long interruption in sampling when using the Rust implementation. Judging from this Python code, it is hard for the algorithm to always replace the last value and cause the long interruption:

tensorboard/tensorboard/backend/event_processing/reservoir.py, lines 223 to 226 at cf27fe0
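(The embedded snippet did not carry over here; paraphrased, not an exact quote of those lines, the branch looks approximately like this:)

```python
# Approximate logic of the referenced branch in ReservoirBucket.AddItem:
r = self._random.randint(0, self._num_items_seen)
if r < self._max_size:
    # Occasionally (with probability roughly max_size / num_items_seen),
    # replace a random interior item, keeping a uniform sample of the
    # whole history.
    self.items.pop(r)
    self.items.append(f(item))
elif self._always_keep_last:
    # Usually, only the final slot is overwritten with the newest value.
    self.items[-1] = f(item)
```

Under this logic the interior of the sample stays spread across the whole history, so it should not produce a long interruption.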
Ah, you are correct! The Python algorithm should work; it is the Rust implementation that has the issue.
I didn't think of the Rust implementation at the beginning, and then I guess I was trying to fit an explanation of what happened to the Python code I was looking at.
Anyway... I'll reopen this issue and rename it to emphasize that the issue is with the Rust implementation. Honestly, though, we haven't touched that code in a while, and the people who wrote it no longer work with the team, so it's unlikely that we will pick this up any time soon.