awslabs/sagemaker-debugger
Amazon SageMaker Debugger provides functionality to save tensors during training of machine learning jobs and analyze those tensors
Python · Apache-2.0
Issues
Full shap values
#294 opened by NRauschmayr - 1
pre-commit changes current master?
#664 opened by ChristopherBrix - 2
Understanding of how sagemaker-debugger works
#453 opened by anotinelg - 1
smdebug crashes with newer numpy versions
#645 opened by fredsensibill - 0
Cannot run a custom container using smdistributed/dataparallel unless USE_SMDEBUG is turned off
#609 opened by plamb-viso - 0
test_pytorch_integration.py::test_pytorch[False-False] is incompatible with PyTorch >=1.7
#580 opened by tejaschumbalkar - 0
Turn off debugger hooks in PyTorch?
#401 opened by austinmw - 5
Error while running sagemaker-debugger with custom pytorch container and custom model
#476 opened by aditya5558 - 0
TF keras.py _wrap_tape_gradient breaks for arrays
#446 opened by arewellborn - 0
Can we save tensors that match a regex pattern only for a particular collection
#434 opened by NihalHarish - 1
Compatibility with gradient accumulation
#426 opened by quasimik - 0
TypeError: os.environ.get() takes no keyword argument (breaking all PyTorch training jobs)
#419 opened by robwhelan - 0
Sagemaker debugger hooks for keras unet
#413 opened by shubham-scisar - 9
TensorBoardOutputConfig/Sagemaker Debugger does not behave as documented
#284 opened by ando-khachatryan - 1
error in atexit
#290 opened by Vikas-kum - 0
Error in atexit
#363 opened by Vikas-kum - 2
FileNotFoundError when using SageMaker Debugger with PyTorch Distributed Training on SageMaker
#392 opened by piyushghai - 1
smdebug causes an OperatorNotAllowedInGraphError inside a function decorated with tf.function
#398 opened by horietakehiro - 0
Extend Logs to report time and memory usage
#397 opened by NihalHarish - 0
tensorflow_datasets failed to load dataset with data_dir="s3://<sagemaker-bucket>" in sagemaker notebook instance
#361 opened by komushi - 0
Version 0.9.1 makes saving a tf model fail with KeyError: "callable_inputs"
#344 opened by hm-haitham - 1
Optimizer variables, Layer inputs/outputs, model inputs/output are not being saved when save_all=True
#326 opened by rahul003 - 0
MXNet hook saving more tensors than specified
#327 opened by rahul003 - 5
Training crashes with DebuggerHookConfig
#321 opened by ratulray - 1
Sagemaker Debugger with HPO
#325 opened by tvkpz - 0
Move All Datasets to S3
#296 opened by NihalHarish - 0
Remove Redundant Upgrade in Buildspec
#297 opened by NihalHarish - 0
Revert Pytest Version Pinning
#295 opened by NihalHarish - 1
Logging error: I/O operation on closed file
#270 opened by vandanavk - 1
Run Model Inputs PR with Custom Docker
#289 opened by NihalHarish - 0
Test Horovod with TF 2 non-eager mode
#286 opened by vandanavk - 0
remove the log line
#280 opened by Vikas-kum - 0
Too many warnings printed with TF 2.X
#261 opened by vandanavk - 1
Tensors not saved in PREDICT step
#269 opened by NihalHarish - 0
Codecov migration to marketplace app
#260 opened by thomasrockhu - 0
keras TF 2.2 misleading error message
#255 opened by Vikas-kum - 0
Test for DistributedValues support
#235 opened by Vikas-kum - 0
Loss Tensors Are Saved Twice On AWS Pytorch
#205 opened by NihalHarish - 1
Crash occurs when trying to register a hook on a tensor that doesn't require gradients
#184 opened by jbschlosser-zz