Performance on Xeon Scalable
Hello everyone, we are seeing slower than expected inference times on one of our CPU nodes with an Intel(R) Xeon(R) Platinum 8362 CPU @ 2.80GHz, which reports the following instruction sets:
```
fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves wbnoinvd arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg rdpid fsrm md_clear flush_l1d arch_capabilities
```
We are running the latest versions of neuralchat_server and neural-speed in combination with intel-extension-for-transformers, with the following config:
```yaml
host: "0.0.0.0"
port: 8000
model_name_or_path: "/root/Intel/neural-chat-7b-v3-3"
device: cpu
tasks_list: ["textchat"]
optimization:
  use_neural_speed: true
  optimization_type: weight_only
  compute_dtype: fp32
  weight_dtype: int8
```
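For reference, these settings should correspond to loading the model directly through the intel-extension-for-transformers Python API. Here is a minimal sketch, assuming the WeightOnlyQuantConfig class shown in the project README (newer releases may expose a differently named config class, e.g. RtnConfig):

```python
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import (
    AutoModelForCausalLM,
    WeightOnlyQuantConfig,  # assumption: class name as in the README; newer releases may rename it
)

model_path = "/root/Intel/neural-chat-7b-v3-3"

# Mirror the server config: weight-only quantization, int8 weights, fp32 compute.
woq_config = WeightOnlyQuantConfig(weight_dtype="int8", compute_dtype="fp32")

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=woq_config,
    trust_remote_code=True,
)

inputs = tokenizer("Tell me about Intel Xeon Scalable Processors.", return_tensors="pt").input_ids
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```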
We are seeing an extremely slow time to first token with example prompts like "Tell me about Intel Xeon Scalable Processors."
We measured the following response times:
| Weight Precision | Max Tokens | Response Time |
|---|---|---|
| Int8 | unset | 73s |
| Int8 | 128 | 69s |
| Int4 | unset | 73s |
| Int4 | 128 | 65s |
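A minimal way to reproduce such a measurement against the running server could look like the sketch below. The endpoint path and payload shape here are assumptions (an OpenAI-style /v1/chat/completions streaming API) and should be adjusted to whatever neuralchat_server actually exposes:

```python
import time

import requests

# Assumed endpoint and payload; adjust to the actual neuralchat_server REST API.
URL = "http://localhost:8000/v1/chat/completions"
payload = {
    "model": "neural-chat-7b-v3-3",
    "messages": [{"role": "user", "content": "Tell me about Intel Xeon Scalable Processors."}],
    "max_tokens": 128,
    "stream": True,
}

start = time.perf_counter()
first_chunk_at = None
with requests.post(URL, json=payload, stream=True, timeout=600) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        # The first non-empty streamed line approximates the time to first token.
        if line and first_chunk_at is None:
            first_chunk_at = time.perf_counter()
total = time.perf_counter() - start

print(f"time to first token: {first_chunk_at - start:.1f}s")
print(f"total response time: {total:.1f}s")
```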
Without neural-speed compression of said model, we get inference times of only around 20s.
Is there any misconfiguration on our part?
I would love to hear your feedback and appreciate any help.
Could you try neural-speed alone with this model? The slowdown may not be an issue in neural-speed itself.
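For example, something like the following, based on the usage example in the neural-speed README (adjust weight_dtype/compute_dtype to match your server config; the exact init arguments may differ between releases):

```python
from transformers import AutoTokenizer, TextStreamer
from neural_speed import Model

model_path = "/root/Intel/neural-chat-7b-v3-3"
prompt = "Tell me about Intel Xeon Scalable Processors."

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)  # prints tokens as they are generated

# Quantize and run through neural-speed directly, bypassing neuralchat_server.
model = Model()
model.init(model_path, weight_dtype="int8", compute_dtype="fp32")
model.generate(inputs, streamer=streamer, max_new_tokens=128)
```

If this standalone run is fast, the bottleneck is more likely in the serving layer than in neural-speed.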