FMInference/FlexLLMGen

[bug]? Did we forget to add the timing code lines in the hf_ds folder?

oujieww opened this issue · 2 comments

When I run the Hugging Face benchmark code, I get this error:
Traceback (most recent call last):
File "hf_opt.py", line 360, in
run_generation(args.model, args.batch_size, args.prompt_len, args.gen_len,
File "hf_opt.py", line 278, in run_generation
prefill_latency = costs[0]
IndexError: list index out of range

Here, costs is an empty list ([]), so indexing costs[0] fails.

I found that the timer class in flexgen.timer is:
import time
from typing import Callable

class _Timer:
    """An internal timer."""

    def __init__(self, name: str):
        self.name = name
        self.started = False
        self.start_time = None

        # start-stop timestamp pairs
        self.start_times = []
        self.stop_times = []
        self.costs = []

    def start(self, sync_func: Callable = None):
        """Start the timer."""
        assert not self.started, f"timer {self.name} has already been started."
        if sync_func:
            sync_func()

        self.start_time = time.perf_counter()
        self.start_times.append(self.start_time)
        self.started = True

    def stop(self, sync_func: Callable = None):
        """Stop the timer."""
        assert self.started, f"timer {self.name} is not started."
        if sync_func:
            sync_func()

        stop_time = time.perf_counter()
        self.costs.append(stop_time - self.start_time)
        self.stop_times.append(stop_time)
        self.started = False

    def reset(self):
        """Reset timer."""
        self.started = False
        self.start_time = None
        self.start_times = []
        self.stop_times = []
        self.costs = []

    def elapsed(self, mode: str = "average"):
        """Calculate the elapsed time."""
        if not self.costs:
            return 0.0
        if mode == "average":
            return sum(self.costs) / len(self.costs)
        elif mode == "sum":
            return sum(self.costs)
        else:
            raise RuntimeError("Supported mode is: average | sum")
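
For reference, here is how I understand this timer is meant to be used (a minimal sketch; it assumes timers is the module-level registry in flexgen.timer that hf_opt.py calls as timers("generate-forward")):

import time

from flexgen.timer import timers  # assumption: module-level registry of _Timer objects

timers("demo").reset()
for _ in range(3):
    timers("demo").start()
    time.sleep(0.1)  # stand-in workload
    timers("demo").stop()

print(timers("demo").costs)               # three ~0.1 s start-stop durations
print(timers("demo").elapsed("average"))  # mean of the costs
print(timers("demo").elapsed("sum"))      # total of the costs

Each start/stop pair appends one entry to costs, which is why costs stays empty if the benchmark never reaches the timer.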

Should we change hf_opt.py as follows?


print("benchmark")
timers("generate-forward").reset()
timers("generate-forward").start()
generate_kwargs = dict(max_new_tokens=execute_gen_len, do_sample=False)
with torch.no_grad():
    output_ids = model.generate(input_ids=input_ids, **generate_kwargs)
timers("generate-forward").stop()
costs = timers("generate-forward").costs

But I think the result I get with this change is still not right. There is only one start-stop pair around the whole generate() call, so costs[0] is the total generation time; the script then reports it as the prefill latency, and the decode latency comes out as roughly zero:
[2023-06-11 01:34:46,533] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
load model
wamup
benchmark
[0.8706504209985724]
<flexgen.timer._Timer object at 0x7fbdd3ff50d0>
Outputs:

0: Paris is the capital city of

15: Paris is the capital city of

model size: 2.443 GB cache size: 1.594 GB hidden size (p): 0.033 GB
peak gpu mem: 6.232 GB projected: False
prefill latency: 0.871 s prefill throughput: 9409.058 token/s
decode latency: 0.000 s decode throughput: 4960000000000.000 token/s
total latency: 0.871 s total throughput: 588.066 token/s
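
If hf_opt.py computes the prefill latency from costs[0] and the decode latency from costs[1:], the timer has to be started and stopped once per forward call, not once around the whole generate() call. Below is a sketch of what I think the intended instrumentation looks like (my assumption, built on PyTorch forward hooks; add_timing_hooks is a hypothetical helper, not code from the hf_ds folder):

import torch
from flexgen.timer import timers

def add_timing_hooks(model):
    # Time every forward call: with use_cache=True, the first call is the
    # prefill over the whole prompt and each later call is one decode step.
    def pre_hook(module, inputs):
        timers("generate-forward").start(sync_func=torch.cuda.synchronize)

    def post_hook(module, inputs, outputs):
        timers("generate-forward").stop(sync_func=torch.cuda.synchronize)

    model.register_forward_pre_hook(pre_hook)
    model.register_forward_hook(post_hook)

timers("generate-forward").reset()
add_timing_hooks(model)
with torch.no_grad():
    output_ids = model.generate(input_ids=input_ids, **generate_kwargs)  # same kwargs as above

costs = timers("generate-forward").costs
prefill_latency = costs[0]        # first forward = prompt prefill
decode_latency = sum(costs[1:])   # one entry per generated token

With per-step costs like this, the decode latency would no longer collapse to zero and the reported throughputs should look sane.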

