compile_tf2 fails in the tutorial
compile_tf2 seems to fail in the tutorial. This error was captured from the Jupyter notebook tutorial.
from checkmate.tf2.wrapper import compile_tf2
element_spec = train_ds.__iter__().__next__()
train_iteration = compile_tf2(
    model,
    loss=loss,
    optimizer=optimizer,
    input_spec=element_spec[0],  # retrieve first element of dataset
    label_spec=element_spec[1]
)
WARNING:tensorflow:From /content/checkmate/checkmate/tf2/wrapper.py:29: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
WARNING:tensorflow:From /content/checkmate/checkmate/tf2/wrapper.py:29: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
ERROR:root:[checkmate] At the moment, Checkmate does not guarentee scheduling under the specified budget. This feature will appear soon.
ERROR:root:Model infeasible
Traceback (most recent call last):
File "/content/checkmate/checkmate/core/solvers/cvxpy_solver.py", line 104, in solve_checkmate_cvxpy
r, s, u, free_e = lpsolver.solve(solver_override=solver_override)
File "/content/checkmate/checkmate/core/solvers/cvxpy_solver.py", line 97, in solve
raise ValueError("Model infeasible")
ValueError: Model infeasible
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-4-ba593e08ef17> in <module>()
6 optimizer=optimizer,
7 input_spec=element_spec[0], # retrieve first element of dataset
----> 8 label_spec=element_spec[1]
9 )
1 frames
/content/checkmate/checkmate/tf2/execution.py in edit_graph(fxn, op_dict, schedule)
33
34 duplicate_ops = []
---> 35 sched_ordered = list(enumerate([s for s in schedule if isinstance(s, OperatorEvaluation)]))
36
37 # duplicate rematerialized operation
TypeError: 'NoneType' object is not iterable
It looks like the error comes from core/solvers/cvxpy_solver.py returning an infeasible solution. Can you confirm usage of CVXPy? Will fix ASAP :)
@aninrusimha Platform to replicate this: RTX 2080 Ti and CUDA 10.1 on Linux
compile_tf2 fails in the tutorial, but with a different error message:
WARNING:tensorflow:From /content/checkmate/checkmate/tf2/wrapper.py:24: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
WARNING:tensorflow:From /content/checkmate/checkmate/tf2/wrapper.py:24: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
ERROR:root:Model infeasible
Traceback (most recent call last):
File "/content/checkmate/checkmate/core/solvers/cvxpy_solver.py", line 104, in solve_checkmate_cvxpy
r, s, u, free_e = lpsolver.solve(solver_override=solver_override)
File "/content/checkmate/checkmate/core/solvers/cvxpy_solver.py", line 97, in solve
raise ValueError("Model infeasible")
ValueError: Model infeasible
ERROR:root:[checkmate] Checkmate solver could find no feasible schedule for the specificed budget of 9626.638745600001
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-4-ba593e08ef17> in <module>()
6 optimizer=optimizer,
7 input_spec=element_spec[0], # retrieve first element of dataset
----> 8 label_spec=element_spec[1]
9 )
/content/checkmate/checkmate/tf2/wrapper.py in compile_tf2(model, loss, optimizer, input_spec, label_spec, scheduler, budget, **kwargs)
112 if not sched_result.feasible:
113 logging.error("[checkmate] Checkmate solver could find no feasible schedule for the specificed budget of {}".format(budget))
--> 114 raise ValueError("No feasible solution for specified budget of {}".format(budget))
115 logging.debug("[checkmate] Schedule solved")
116
ValueError: No feasible solution for specified budget of 9626.638745600001
I guess cvxpy does not work as intended. Could you please check? Does gurobipy work well?
Hi @hy00nc -- in this case, the open-source solver is unable to find a feasible solution at the given memory allocation. I've been able to reproduce this, and am still deriving a new solver implementation. Apologies about the delays.
Hi @hy00nc -- would you mind sending me the output of the following script?
import logging
import subprocess

import psutil
import tensorflow as tf


def _using_gpu_check():
    return tf.test.is_gpu_available() and tf.test.is_built_with_cuda()


def nvidiasmi_query(query="memory.total"):
    # from https://discuss.pytorch.org/t/access-gpu-memory-usage-in-pytorch/3192/4
    mem = subprocess.check_output(
        ["nvidia-smi", "--query-gpu={}".format(query), "--format=csv,nounits,noheader"], encoding="utf-8"
    )
    query_result_list = [int(x) for x in mem.strip().split("\n")]
    # map GPU index -> queried value
    return dict(zip(range(len(query_result_list)), query_result_list))


def _get_gpu_memory():
    if _using_gpu_check():  # choose based on available GPU RAM
        gpu_ram = nvidiasmi_query("memory.total")
        budget = min(gpu_ram.values()) * 0.9
        logging.info(
            "[checkmate] No budget specified; defaulting to the minimum amount of total GPU RAM on any single "
            "GPU, {0:.2f}MB".format(budget)
        )
    else:  # choose based on available system memory
        budget = psutil.virtual_memory().available * 0.8 / 1000000
        logging.debug("[checkmate] No GPU detected, using system DRAM on CPU")
        logging.info("[checkmate] No budget specified; defaulting to {0:.2f}MB".format(budget))
    return budget


print(_get_gpu_memory())
Hi, the error message I commented above was actually produced in the TF 2.0 tutorial in Colab.
Meanwhile, on my GPU-equipped machine, which raises the same error message as the tutorial, the output of the script is 10976.4.
Looking at the values in cost_ram in ILPSolverCVXPY, the solver may expect a budget in bytes.
(Pdb) self.g.cost_ram
{53: 4, 35: 471859200, 77: 8, 16: 235929600, 140: 4096, 46: 1, 30: 4, 69: 4096, 62: 8192, 8: 1, 94: 4, 65: 4, 49: 4, 34: 4, 64: 4, 135: 4, 67: 8192, 38: 4, 23: 471859200, 75: 40960, 24: 512, 96: 4, 73: 4096, 86: 4, 63: 4, 101: 4, 71: 4, 139: 4, 68: 40960, 3: 4, 115: 40, 57: 4, 107: 4, 113: 5120, 74: 4096, 41: 8, 50: 4, 6: 1, 120: 4, 9: 6912, 76: 40960, 51: 4096, 4: 4, 54: 4, 37: 4, 14: 1, 61: 8, 29: 471859200, 45: 4, 78: 40960, 18: 32768, 12: 235929600, 72: 45056, 59: 4, 19: 1, 79: 524288, 56: 4, 11: 256, 2: 4, 132: 40960, 10: 235929600, 126: 16, 1: 4, 121: 40960, 123: 16, 5: 4, 31: 471859200, 36: 8, 97: 235929600, 58: 40, 102: 4, 98: 235929600, 108: 235929600, 92: 32, 114: 32768, 47: 524288, 55: 4, 0: 0, 142: 16, 99: 235929600, 106: 4, 70: 4096, 80: 16, 84: 4096, 128: 16, 82: 4, 83: 524288, 105: 4, 127: 256, 111: 8, 66: 4096, 81: 16, 7: 12582912, 21: 471859200, 133: 12582912, 117: 8, 88: 471859200, 110: 32, 42: 5120, 15: 4, 134: 6912, 137: 256, 20: 32768, 122: 6912, 85: 4, 104: 4, 119: 32768, 95: 471859200, 52: 40960, 89: 471859200, 33: 471859200, 44: 4, 27: 4, 124: 8, 129: 4, 109: 4, 91: 471859200, 138: 32768, 22: 4, 40: 471859200, 87: 4, 26: 4, 116: 512, 131: 512, 25: 512, 100: 8, 60: 40960, 90: 4, 17: 1, 125: 16, 118: 512, 143: 32768, 39: 1, 144: 512, 28: 4, 93: 32, 13: 1, 130: 16, 136: 5120, 43: 4, 103: 16, 48: 4, 141: 40, 112: 4, 32: 8}
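To make the suspected unit mismatch concrete, here is a minimal sketch. It assumes that nvidia-smi reports memory.total in MiB and that the cost_ram entries are in bytes. The helpers mib_to_bytes and budget_fits are hypothetical names used only for illustration, not part of Checkmate, and this is not necessarily how the library should fix it.

```python
MIB = 1024 * 1024  # bytes per mebibyte


def mib_to_bytes(mib: float) -> float:
    """Convert a size in MiB to bytes."""
    return mib * MIB


def budget_fits(budget: float, tensor_bytes: int) -> bool:
    """Necessary condition only: the budget must hold at least this one tensor."""
    return budget >= tensor_bytes


large_tensor = 471859200         # a large entry from the cost_ram dump above, in bytes
budget_mb = 9626.638745600001    # budget value from the earlier error message

print(budget_fits(budget_mb, large_tensor))                # False: MB compared against bytes
print(budget_fits(mib_to_bytes(budget_mb), large_tensor))  # True: both sides in bytes
```

If the budget really is compared against byte counts without conversion, a single large tensor already exceeds it, which would be consistent with the "Model infeasible" error.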
The RAM cost seems to come from the following code, where it is computed as the number of bytes in each op's outputs.
def dfgraph_from_tf_function(fn) -> DFGraph:
    ...
    ...
    for op in ops:
        out_elem_count = [np.prod([i or 1 for i in list(out.shape or [])]) for out in op.outputs]
        out_dtype_len = [out.dtype.size or 4 for out in op.outputs]
        op_ram_cost = int(np.dot(out_dtype_len, out_elem_count))
        gb.add_node(op.name, cpu_cost=1, ram_cost=op_ram_cost, backward=op.name in grad_nodes)
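As a worked example of that calculation, here is a pure-Python sketch of the same arithmetic. It stands in for TensorFlow op outputs with plain (shape, dtype_size) pairs, which is an assumption made for the example; op_ram_cost here is a hypothetical standalone function, not the library's code. It needs Python 3.8+ for math.prod.

```python
import math


def op_ram_cost(outputs):
    """Sum of element_count * dtype_size over an op's outputs, in bytes.

    `outputs` is a list of (shape, dtype_size_in_bytes) pairs. Unknown (None)
    dimensions count as 1, mirroring the `i or 1` in the snippet above.
    """
    total = 0
    for shape, dtype_size in outputs:
        elem_count = math.prod(dim or 1 for dim in shape)  # empty shape -> 1 (scalar)
        total += (dtype_size or 4) * elem_count
    return total


print(op_ram_cost([((), 4)]))                 # 4: a float32 scalar
print(op_ram_cost([((None, 32, 32, 3), 4)]))  # 12288: 32*32*3 float32 values, batch dim as 1
```

A scalar float32 output costs 4 bytes under this rule, which would plausibly explain the many entries equal to 4 in the cost_ram dump.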
I tried passing a budget in bytes by changing _get_gpu_memory, then hit the following error. I set verbose to True for lpsolver.solve.
==== BUDGET: 10398833049.6
ECOS 2.0.7 - (C) embotech GmbH, Zurich Switzerland, 2012-15. Web: www.embotech.com/ECOS
It pcost dcost gap pres dres k/t mu step sigma IR | BT
0 +2.761e+05 -2.952e+14 +3e+14 5e-04 4e-01 1e+00 1e+09 --- --- 1 1 - | - -
1 +1.302e+06 -2.952e+14 +3e+14 5e-04 6e-03 1e+03 1e+09 0.0000 1e+00 0 0 0 | 0 0
No further progress possible, recovering best iterate (0) and stopping.
NUMERICAL PROBLEMS (reached feastol=4.1e-01, reltol=-nan, abstol=3.0e+14).
Runtime: 59.242615 seconds.
Solve: 61.614s
Traceback (most recent call last):
File "test_tutorial.py", line 55, in <module>
label_spec=element_spec[1]
File "/home/sejongo/workspaces/cuda/checkmate/checkmate/tf2/wrapper.py", line 131, in compile_tf2
sched_result = scheduler(g, budget, **kwargs)
File "/home/sejongo/workspaces/cuda/checkmate/checkmate/core/solvers/cvxpy_solver.py", line 104, in solve_checkmate_cvxpy
r, s, u, free_e = lpsolver.solve(solver_override=solver_override, verbose=True)
File "/home/sejongo/workspaces/cuda/checkmate/checkmate/core/solvers/cvxpy_solver.py", line 94, in solve
self.problem.solve(verbose=verbose)
File "/home/sejongo/venv/tf2/lib/python3.6/site-packages/cvxpy/problems/problem.py", line 290, in solve
return solve_func(self, *args, **kwargs)
File "/home/sejongo/venv/tf2/lib/python3.6/site-packages/cvxpy/problems/problem.py", line 575, in _solve
self.unpack_results(solution, full_chain, inverse_data)
File "/home/sejongo/venv/tf2/lib/python3.6/site-packages/cvxpy/problems/problem.py", line 718, in unpack_results
"Try another solver, or solve with verbose=True for more "
cvxpy.error.SolverError: Solver 'ECOS' failed. Try another solver, or solve with verbose=True for more information.
Once I installed the Ubuntu packages coinor-cbc and coinor-libcbc-dev, as well as the cylp Python package, the CBC solver was picked up and the tutorial works!
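For anyone checking their own environment, a small sketch of how one might confirm that CVXPY can see CBC after installing CyLP. It relies on cvxpy.installed_solvers(), which lists the solver names CVXPY detects. The pick_solver helper and its preference order are hypothetical and are not Checkmate's actual selection logic.

```python
def pick_solver(installed, preferred=("CBC", "GLPK_MI", "ECOS_BB")):
    """Return the first preferred solver name that is installed, else None."""
    for name in preferred:
        if name in installed:
            return name
    return None


try:
    import cvxpy

    available = cvxpy.installed_solvers()
except ImportError:
    available = []  # cvxpy itself is not installed

print("installed:", available)
print("would use:", pick_solver(available))
```

If "CBC" is missing from the installed list, the CyLP install most likely did not find the CBC libraries.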
@sarangsable Thank you so much for the fix! I've merged it in #147
@hy00nc Please let me know if you have any further issues using the tutorial. Please pull the latest checkmate copy and ensure that you install CyLP. I've updated the tutorial. Using a fresh Google Colab instance, CyLP took 67s to solve.