parasj/checkmate

compile_tf2 fails in the tutorial

Closed this issue · 9 comments

compile_tf2 seems to fail in the tutorial. This error was captured from the Jupyter notebook tutorial.

from checkmate.tf2.wrapper import compile_tf2
element_spec = train_ds.__iter__().__next__()
train_iteration = compile_tf2(
    model,
    loss=loss,
    optimizer=optimizer,
    input_spec=element_spec[0],  # retrieve first element of dataset
    label_spec=element_spec[1]
)
WARNING:tensorflow:From /content/checkmate/checkmate/tf2/wrapper.py:29: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
WARNING:tensorflow:From /content/checkmate/checkmate/tf2/wrapper.py:29: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
ERROR:root:[checkmate] At the moment, Checkmate does not guarentee scheduling under the specified budget. This feature will appear soon.
ERROR:root:Model infeasible
Traceback (most recent call last):
  File "/content/checkmate/checkmate/core/solvers/cvxpy_solver.py", line 104, in solve_checkmate_cvxpy
    r, s, u, free_e = lpsolver.solve(solver_override=solver_override)
  File "/content/checkmate/checkmate/core/solvers/cvxpy_solver.py", line 97, in solve
    raise ValueError("Model infeasible")
ValueError: Model infeasible
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-ba593e08ef17> in <module>()
      6     optimizer=optimizer,
      7     input_spec=element_spec[0],  # retrieve first element of dataset
----> 8     label_spec=element_spec[1]
      9 )

1 frames
/content/checkmate/checkmate/tf2/execution.py in edit_graph(fxn, op_dict, schedule)
     33 
     34     duplicate_ops = []
---> 35     sched_ordered = list(enumerate([s for s in schedule if isinstance(s, OperatorEvaluation)]))
     36 
     37     # duplicate rematerialized operation

TypeError: 'NoneType' object is not iterable

It looks like the error comes from core/solvers/cvxpy_solver.py returning an infeasible solution. Can you confirm that you're using the CVXPY solver? Will fix ASAP :)

@aninrusimha Platform to replicate this: RTX 2080 Ti and CUDA 10.1 on Linux

compile_tf2 fails in the tutorial, but with a different error message:

WARNING:tensorflow:From /content/checkmate/checkmate/tf2/wrapper.py:24: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
WARNING:tensorflow:From /content/checkmate/checkmate/tf2/wrapper.py:24: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
ERROR:root:Model infeasible
Traceback (most recent call last):
  File "/content/checkmate/checkmate/core/solvers/cvxpy_solver.py", line 104, in solve_checkmate_cvxpy
    r, s, u, free_e = lpsolver.solve(solver_override=solver_override)
  File "/content/checkmate/checkmate/core/solvers/cvxpy_solver.py", line 97, in solve
    raise ValueError("Model infeasible")
ValueError: Model infeasible
ERROR:root:[checkmate] Checkmate solver could find no feasible schedule for the specificed budget of 9626.638745600001
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-ba593e08ef17> in <module>()
      6     optimizer=optimizer,
      7     input_spec=element_spec[0],  # retrieve first element of dataset
----> 8     label_spec=element_spec[1]
      9 )

/content/checkmate/checkmate/tf2/wrapper.py in compile_tf2(model, loss, optimizer, input_spec, label_spec, scheduler, budget, **kwargs)
    112     if not sched_result.feasible:
    113         logging.error("[checkmate] Checkmate solver could find no feasible schedule for the specificed budget of {}".format(budget))
--> 114         raise ValueError("No feasible solution for specified budget of {}".format(budget))
    115     logging.debug("[checkmate] Schedule solved")
    116 

ValueError: No feasible solution for specified budget of 9626.638745600001

It seems cvxpy does not work as intended. Could you please check? Does gurobipy work correctly?

Hi @hy00nc -- in this case, the open-source solver is unable to find a feasible solution at the given memory allocation. I've been able to reproduce this, and am still deriving a new solver implementation. Apologies for the delay.
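
In the meantime, one possible workaround (untested) is to pass an explicit budget to compile_tf2; the budget parameter is visible in the wrapper signature in the traceback above. The units the solver expects are unclear at this point in the thread, so the value below is a placeholder:

# Untested workaround sketch: pass an explicit budget to compile_tf2.
# The units are an assumption here and may need to be bytes instead of MB.
train_iteration = compile_tf2(
    model,
    loss=loss,
    optimizer=optimizer,
    input_spec=element_spec[0],  # retrieve first element of dataset
    label_spec=element_spec[1],
    budget=8 * 1024,  # placeholder: ~8 GB expressed in MB
)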

Hi @hy00nc -- would you mind sending me the output of the following script?

import logging
import subprocess

import psutil
import tensorflow as tf

def _using_gpu_check():
    return tf.test.is_gpu_available() and tf.test.is_built_with_cuda()


def nvidiasmi_query(query="memory.total"):
    # from https://discuss.pytorch.org/t/access-gpu-memory-usage-in-pytorch/3192/4
    mem = subprocess.check_output(
        ["nvidia-smi", "--query-gpu={}".format(query), "--format=csv,nounits,noheader"], encoding="utf-8"
    )
    query_result_list = [int(x) for x in mem.strip().split("\n")]
    return dict(zip(range(len(query_result_list)), query_result_list))


def _get_gpu_memory():
    if _using_gpu_check():  # choose based on available GPU RAM
        gpu_ram = nvidiasmi_query("memory.total")
        budget = min(gpu_ram.values()) * 0.9
        logging.info(
            "[checkmate] No budget specified; defaulting to the minimum amount of total GPU RAM on any single "
            "GPU, {0:.2f}MB".format(budget)
        )
    else:  # choose based on available system memory
        budget = psutil.virtual_memory().available * 0.8 / 1000000
        logging.debug("[checkmate] No GPU detected, using system DRAM on CPU")
        logging.info("[checkmate] No budget specified; defaulting to {0:.2f}MB".format(budget))
    return budget
print(_get_gpu_memory())
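
As an aside, the deprecated tf.test.is_gpu_available() call in this script (and in wrapper.py, per the warnings above) can be replaced as TensorFlow suggests; a minimal sketch:

import tensorflow as tf

def _using_gpu_check():
    # non-deprecated equivalent of tf.test.is_gpu_available()
    return len(tf.config.list_physical_devices("GPU")) > 0 and tf.test.is_built_with_cuda()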

Hi, the error message I posted above was actually produced by the TF 2.0 tutorial in Colab.
Meanwhile, on my GPU-equipped machine, which raises the same error message as the tutorial, the output of the script is 10976.4.

Looking at the values in cost_ram in ILPSolverCVXPY, it looks like the solver expects the budget in bytes.

(Pdb) self.g.cost_ram
{53: 4, 35: 471859200, 77: 8, 16: 235929600, 140: 4096, 46: 1, 30: 4, 69: 4096, 62: 8192, 8: 1, 94: 4, 65: 4, 49: 4, 34: 4, 64: 4, 135: 4, 67: 8192, 38: 4, 23: 471859200, 75: 40960, 24: 512, 96: 4, 73: 4096, 86: 4, 63: 4, 101: 4, 71: 4, 139: 4, 68: 40960, 3: 4, 115: 40, 57: 4, 107: 4, 113: 5120, 74: 4096, 41: 8, 50: 4, 6: 1, 120: 4, 9: 6912, 76: 40960, 51: 4096, 4: 4, 54: 4, 37: 4, 14: 1, 61: 8, 29: 471859200, 45: 4, 78: 40960, 18: 32768, 12: 235929600, 72: 45056, 59: 4, 19: 1, 79: 524288, 56: 4, 11: 256, 2: 4, 132: 40960, 10: 235929600, 126: 16, 1: 4, 121: 40960, 123: 16, 5: 4, 31: 471859200, 36: 8, 97: 235929600, 58: 40, 102: 4, 98: 235929600, 108: 235929600, 92: 32, 114: 32768, 47: 524288, 55: 4, 0: 0, 142: 16, 99: 235929600, 106: 4, 70: 4096, 80: 16, 84: 4096, 128: 16, 82: 4, 83: 524288, 105: 4, 127: 256, 111: 8, 66: 4096, 81: 16, 7: 12582912, 21: 471859200, 133: 12582912, 117: 8, 88: 471859200, 110: 32, 42: 5120, 15: 4, 134: 6912, 137: 256, 20: 32768, 122: 6912, 85: 4, 104: 4, 119: 32768, 95: 471859200, 52: 40960, 89: 471859200, 33: 471859200, 44: 4, 27: 4, 124: 8, 129: 4, 109: 4, 91: 471859200, 138: 32768, 22: 4, 40: 471859200, 87: 4, 26: 4, 116: 512, 131: 512, 25: 512, 100: 8, 60: 40960, 90: 4, 17: 1, 125: 16, 118: 512, 143: 32768, 39: 1, 144: 512, 28: 4, 93: 32, 13: 1, 130: 16, 136: 5120, 43: 4, 103: 16, 48: 4, 141: 40, 112: 4, 32: 8}
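
If these values are bytes, a quick check of their total against the budget the wrapper passed in makes the infeasibility unsurprising:

# If cost_ram is in bytes, the raw sum of all output tensors is several GB,
# far above the ~9.6e3 budget value passed to the solver. (Peak live memory is
# smaller than this raw sum, but the scale mismatch alone looks problematic.)
total_bytes = sum(self.g.cost_ram.values())
print(total_bytes / 1e9)  # roughly 6-7 GB for the dump above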

The RAM cost seems to come from the following code, where it is computed as a number of bytes.

def dfgraph_from_tf_function(fn) -> DFGraph:
...
...
    for op in ops:
        out_elem_count = [np.prod([i or 1 for i in list(out.shape or [])]) for out in op.outputs]
        out_dtype_len = [out.dtype.size or 4 for out in op.outputs]
        op_ram_cost = int(np.dot(out_dtype_len, out_elem_count))
        gb.add_node(op.name, cpu_cost=1, ram_cost=op_ram_cost, backward=op.name in grad_nodes)
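
As a worked example of that computation (the shape here is hypothetical, chosen only to reproduce one of the 471859200-byte entries in the dump above):

import numpy as np

out_shape = [256, 960, 480]            # hypothetical float32 output shape
elem_count = int(np.prod(out_shape))   # 117,964,800 elements
dtype_size = 4                         # float32 -> 4 bytes per element
print(elem_count * dtype_size)         # 471859200 bytes, i.e. 450 MiB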

I tried passing the budget in bytes by changing _get_gpu_memory (a sketch of the change is below), and then hit the following error. I set verbose=True for lpsolver.solve.
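
Roughly, the change was the following (a reconstructed sketch; the 0.9 headroom factor is kept from the original code, and the MiB-to-bytes conversion is my guess at what the solver wants):

def _get_gpu_memory():
    # nvidia-smi reports memory.total in MiB; convert to bytes for the solver
    gpu_ram = nvidiasmi_query("memory.total")
    budget = min(gpu_ram.values()) * 0.9 * 1024 ** 2
    return budget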

==== BUDGET:  10398833049.6

ECOS 2.0.7 - (C) embotech GmbH, Zurich Switzerland, 2012-15. Web: www.embotech.com/ECOS

It     pcost       dcost      gap   pres   dres    k/t    mu     step   sigma     IR    |   BT
 0  +2.761e+05  -2.952e+14  +3e+14  5e-04  4e-01  1e+00  1e+09    ---    ---    1  1  - |  -  - 
 1  +1.302e+06  -2.952e+14  +3e+14  5e-04  6e-03  1e+03  1e+09  0.0000  1e+00   0  0  0 |  0  0
No further progress possible, recovering best iterate (0) and stopping.
NUMERICAL PROBLEMS (reached feastol=4.1e-01, reltol=-nan, abstol=3.0e+14).
Runtime: 59.242615 seconds.

Solve: 61.614s
Traceback (most recent call last):
  File "test_tutorial.py", line 55, in <module>
    label_spec=element_spec[1]
  File "/home/sejongo/workspaces/cuda/checkmate/checkmate/tf2/wrapper.py", line 131, in compile_tf2
    sched_result = scheduler(g, budget, **kwargs)
  File "/home/sejongo/workspaces/cuda/checkmate/checkmate/core/solvers/cvxpy_solver.py", line 104, in solve_checkmate_cvxpy
    r, s, u, free_e = lpsolver.solve(solver_override=solver_override, verbose=True)
  File "/home/sejongo/workspaces/cuda/checkmate/checkmate/core/solvers/cvxpy_solver.py", line 94, in solve
    self.problem.solve(verbose=verbose)
  File "/home/sejongo/venv/tf2/lib/python3.6/site-packages/cvxpy/problems/problem.py", line 290, in solve
    return solve_func(self, *args, **kwargs)
  File "/home/sejongo/venv/tf2/lib/python3.6/site-packages/cvxpy/problems/problem.py", line 575, in _solve
    self.unpack_results(solution, full_chain, inverse_data)
  File "/home/sejongo/venv/tf2/lib/python3.6/site-packages/cvxpy/problems/problem.py", line 718, in unpack_results
    "Try another solver, or solve with verbose=True for more "
cvxpy.error.SolverError: Solver 'ECOS' failed. Try another solver, or solve with verbose=True for more information.

Once I installed the Ubuntu packages coinor-cbc and coinor-libcbc-dev, as well as the cylp Python package, the CBC solver was picked up and the tutorial works!
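
For anyone hitting the same thing, you can confirm that cvxpy has picked up CBC after installing those packages:

import cvxpy
print(cvxpy.installed_solvers())  # should now include 'CBC'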

@sarangsable Thank you so much for the fix! I've merged it in #147

@hy00nc Please let me know if you have any further issues with the tutorial. Please pull the latest copy of checkmate and make sure CyLP is installed. I've updated the tutorial; on a fresh Google Colab instance, CyLP took 67s to solve.