Intel MKL optimized TensorFlow performance degradation
patelprateek opened this issue · 29 comments
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Google Deep Learning VM
Version: m10
Based on: Debian GNU/Linux 9.5 (stretch) (GNU/Linux 4.9.0-8-amd64 x86_64)
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
- TensorFlow installed from (source or binary): deep-learning image
- TensorFlow version (use command below): 1.11
- Python version: 2.7
- Bazel version (if compiling from source):
- GCC/Compiler version (if compiling from source):
- CUDA/cuDNN version:
- GPU model and memory: N/A
Describe the current behavior
Running a deep model and some wide linear models. Inference performance is very poor: 2-4x slower relative to running inference without MKL.
Describe the expected behavior
Performance should actually improve with the Intel MKL optimizations.
Code to reproduce the issue
Code for a deep and wide linear model, or the logistic regression example code from the TensorFlow examples.
Other info / logs
This happens when running the Google deep learning image version M9 on a GPU machine (image: tf-latest-cu92, version M9). Note: inference runs only on the CPU because I turn off visibility for the CUDA devices, so the TensorFlow code runs on the CPU only. The image family claims the packages are Intel optimized, but when I run the benchmarks with verbosity on, I do not observe any MKL-related output.
I started another deep learning image (tf-latest-cpu, version M10). Running the exact same code on this machine with the environment variable set (export MKL_VERBOSE=1), I can observe a lot of OpenMP thread settings, KMP_xxx settings, and MKL instructions logged with timing information. I didn't observe any such thing in the M9 GPU image, even though in both places I observe the following logs when I execute the command:
M9 GPU image:
Numpy + Intel(R) MKL: THREADING LAYER: (null)
Numpy + Intel(R) MKL: setting Intel(R) MKL to use INTEL OpenMP runtime
Numpy + Intel(R) MKL: preloading libiomp5.so runtime
MKL_VERBOSE Intel(R) MKL 2019.0 Product build 20180829 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.20GHz lp64 intel_thread
MKL_VERBOSE SDOT(2,0x55fd25117d40,1,0x55fd25117d40,1) 1.61ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:16
1.11.0
M10 CPU image:
Numpy + Intel(R) MKL: THREADING LAYER: (null)
Numpy + Intel(R) MKL: setting Intel(R) MKL to use INTEL OpenMP runtime
Numpy + Intel(R) MKL: preloading libiomp5.so runtime
User settings:
KMP_AFFINITY=granularity=fine,verbose,compact,1,0
KMP_BLOCKTIME=0
KMP_SETTINGS=1
OMP_NUM_THREADS=32
Effective settings:
KMP_ABORT_DELAY=0
KMP_ADAPTIVE_LOCK_PROPS='1,1024'
KMP_ALIGN_ALLOC=64
KMP_ALL_THREADPRIVATE=128
KMP_ATOMIC_MODE=2
KMP_BLOCKTIME=0
KMP_CPUINFO_FILE: value is not defined
KMP_DETERMINISTIC_REDUCTION=false
KMP_DEVICE_THREAD_LIMIT=2147483647
KMP_DISP_HAND_THREAD=false
KMP_DISP_NUM_BUFFERS=7
KMP_DUPLICATE_LIB_OK=false
KMP_FORCE_REDUCTION: value is not defined
KMP_FOREIGN_THREADS_THREADPRIVATE=true
KMP_FORKJOIN_BARRIER='2,2'
KMP_FORKJOIN_BARRIER_PATTERN='hyper,hyper'
KMP_FORKJOIN_FRAMES=true
KMP_FORKJOIN_FRAMES_MODE=3
KMP_GTID_MODE=3
KMP_HANDLE_SIGNALS=false
KMP_HOT_TEAMS_MAX_LEVEL=1
KMP_HOT_TEAMS_MODE=0
KMP_INIT_AT_FORK=true
KMP_INIT_WAIT=2048
KMP_ITT_PREPARE_DELAY=0
KMP_LIBRARY=throughput
KMP_LOCK_KIND=queuing
KMP_MALLOC_POOL_INCR=1M
KMP_NEXT_WAIT=1024
KMP_NUM_LOCKS_IN_BLOCK=1
KMP_PLAIN_BARRIER='2,2'
KMP_PLAIN_BARRIER_PATTERN='hyper,hyper'
KMP_REDUCTION_BARRIER='1,1'
KMP_REDUCTION_BARRIER_PATTERN='hyper,hyper'
KMP_SCHEDULE='static,balanced;guided,iterative'
KMP_SETTINGS=true
KMP_SPIN_BACKOFF_PARAMS='4096,100'
KMP_STACKOFFSET=64
KMP_STACKPAD=0
KMP_STACKSIZE=4M
KMP_STORAGE_MAP=false
KMP_TASKING=2
KMP_TASKLOOP_MIN_TASKS=0
KMP_TASK_STEALING_CONSTRAINT=1
KMP_TEAMS_THREAD_LIMIT=32
KMP_TOPOLOGY_METHOD=all
KMP_USER_LEVEL_MWAIT=false
KMP_VERSION=false
KMP_WARNINGS=true
OMP_AFFINITY_FORMAT='OMP: pid %P tid %T thread %n bound to OS proc set {%a}'
OMP_ALLOCATOR=omp_default_mem_alloc
OMP_CANCELLATION=false
OMP_DEBUG=disabled
OMP_DEFAULT_DEVICE=0
OMP_DISPLAY_AFFINITY=false
OMP_DISPLAY_ENV=false
OMP_DYNAMIC=false
OMP_MAX_ACTIVE_LEVELS=2147483647
OMP_MAX_TASK_PRIORITY=0
OMP_NESTED=false
OMP_NUM_THREADS='32'
OMP_PLACES: value is not defined
OMP_PROC_BIND='intel'
OMP_SCHEDULE='static'
OMP_STACKSIZE=4M
OMP_TARGET_OFFLOAD=DEFAULT
OMP_THREAD_LIMIT=2147483647
OMP_TOOL=enabled
OMP_TOOL_LIBRARIES: value is not defined
OMP_WAIT_POLICY=PASSIVE
KMP_AFFINITY='verbose,warnings,respect,granularity=fine,compact,1,0'
OMP: Info #212: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #210: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: 0-31
OMP: Info #156: KMP_AFFINITY: 32 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 1 packages x 16 cores/pkg x 2 threads/core (16 total cores)
OMP: Info #214: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 16 maps to package 0 core 0 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 0 core 1 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 17 maps to package 0 core 1 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 2 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 18 maps to package 0 core 2 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 0 core 3 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 19 maps to package 0 core 3 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 4 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 20 maps to package 0 core 4 thread 1
OMP: Info #250: KMP_AFFINITY: pid 8331 tid 8331 thread 0 bound to OS proc set 0
MKL_VERBOSE Intel(R) MKL 2019.0 Product build 20180829 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.20GHz lp64 intel_thread
MKL_VERBOSE SDOT(2,0x5622b7736500,1,0x5622b7736500,1) 2.54ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:16
1.11.0
So I assume Intel MKL is being used in the M10 image, whereas MKL is not being used in the M9 image (note: I have turned off visibility for the CUDA devices, so only CPU inference is being compared). I observe a 2-4x performance degradation with Intel MKL.
The MKL-suggested flags are set as recommended:
KMP_AFFINITY=granularity=fine,verbose,compact,1,0
KMP_BLOCKTIME=0
KMP_SETTINGS=1
OMP_NUM_THREADS=32
Any ideas on how to debug the root cause and get the maximum performance for my models?
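For reference, here is a minimal sketch (not my actual benchmark script) of how I force the CPU-only comparison and the MKL logging above; it relies only on the standard CUDA_VISIBLE_DEVICES and MKL_VERBOSE environment variables:
import os

# Hide all CUDA devices so TensorFlow falls back to CPU kernels only.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
# Ask MKL to log every call it services (the MKL_VERBOSE lines above);
# must be set before TensorFlow (and thus MKL) is loaded.
os.environ["MKL_VERBOSE"] = "1"

import tensorflow as tf  # imported after the environment variables are set

print(tf.GIT_VERSION, tf.VERSION)  # prints 1.11.0 here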
It is "deep and wide linear model", can you do "export OMP_NUM_THREADS=1" as a first step?
And can you please try inter_op_parallelism_threads and intra_op_parallism_threads similar to https://github.com/NervanaSystems/tensorflow-models/commit/55d55abc71483723743c0273b9c1fd8e0c7d8391#diff-00c5d001cb14a21f6d7dbf16d4e55032R90 if you haven't?
@wei-v-wang: the link you mentioned doesn't work for me. Can you please share the link again, or maybe let me know what config for inter and intra op parallelism I should try? I will post back the results here.
Also, it is not just the wide and linear models; I am observing similar 2-3x worse inference latency for a deep cross network model as well. Could you please also explain the reasoning behind OMP_NUM_THREADS=1? This will help us better understand the internal workings.
Sorry, here is the updated link: https://github.com/tensorflow/models/blob/master/official/wide_deep/wide_deep_run_loop.py#L87-L88
If some application is not bound by compute, changing OMP_NUM_THREADS might help.
I think for wide/deep models, inter_op/intra_op has been providing some help. Please definitely enable it in your model and give it a try.
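In TF 1.x this can be wired into an estimator roughly like this (a sketch with a dummy feature column and placeholder thread counts, not tuned values):
import tensorflow as tf

session_config = tf.ConfigProto(
    inter_op_parallelism_threads=2,   # e.g. number of sockets
    intra_op_parallelism_threads=16,  # e.g. number of physical cores
)
run_config = tf.estimator.RunConfig(session_config=session_config)

feature_columns = [tf.feature_column.numeric_column("x")]  # placeholder
estimator = tf.estimator.LinearClassifier(
    feature_columns=feature_columns,
    config=run_config,
)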
@wei-v-wang: the link you provided changes the inter and intra op thread settings, but when I run the code it still prints out:
KMP_AFFINITY=granularity=fine,verbose,compact,1,0
KMP_BLOCKTIME=0
KMP_SETTINGS=1
OMP_NUM_THREADS=32
so I am not sure it is taking effect. Are those two different settings?
In order to change OMP_NUM_THREADS, please use "export OMP_NUM_THREADS=". The link I provided only changes the inter and intra op settings.
OK, so I tried a bunch of parameters. Machine type: 32 cores, 2 logical threads per core.
I tried: number of intra op threads = OMP threads: [4, 8, 16, 32, 64]
inter op threads = number of physical cores and number of sockets: [2, 8, 16, 32]
The best performance I could get for a batch size of 1k: 48 microseconds.
The best I get without MKL, without much tuning (number of inter and intra op threads being the same: 16/32/64): 23 microseconds.
Any other settings I need to try?
Can we tell whether the MKL library and ISA are even being taken advantage of by looking at some ops that should definitely perform better?
I definitely found that setting the number of OMP threads to a lower count helped, and the same for inter op parallelism.
But the performance for the current model is still 2-3x worse in general.
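For context, the sweep itself was just relaunching the benchmark in a fresh process per setting, so that OMP_NUM_THREADS is picked up before MKL loads; roughly like this (the script name and flag names are placeholders, not my actual ones):
import os
import subprocess

for omp in [4, 8, 16, 32, 64]:
    for inter in [2, 8, 16, 32]:
        env = dict(os.environ,
                   OMP_NUM_THREADS=str(omp),
                   KMP_BLOCKTIME="0",
                   KMP_SETTINGS="1",
                   KMP_AFFINITY="granularity=fine,verbose,compact,1,0")
        # Each run parses --inter_op/--intra_op and puts them into ConfigProto.
        subprocess.check_call(
            ["python", "benchmark.py",
             "--inter_op", str(inter), "--intra_op", str(omp)],
            env=env)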
Since it is inference, I have one last suggestion:
Could you please prefix your runs with "numactl -c 1 -m 1 python ..."? The rest of the configuration can remain the same. This is to use just one socket, to rule out memory access overhead across two sockets.
If you still observe ~2X slowness with TF w/MKLDNN, can you please share your model script with us?
Sorry, I should have given out all the BKMs (best known methods) in one batch, but here is another important one that I missed.
export OMP_NUM_THREADS=x
export KMP_BLOCKTIME=1
numactl -c 1 -m 1 python ... <inter_op> <intra_op>
numactl -c 1 -m 1 python ...
libnuma: Warning: node argument 1 is out of range
<1> is invalid
Here is the machine config:
lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU @ 2.20GHz
Stepping: 0
CPU MHz: 2200.000
BogoMIPS: 4400.00
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 56320K
NUMA node0 CPU(s): 0-31
I tried numactl -c 0 -m 0 python; still, the best I get is around 48 microseconds, with OMP threads and inter op threads = 6 and KMP_BLOCKTIME=1.
@patelprateek I see, it is a single-socket system, so numactl does not help here. Is it possible for us to get your customized model?
@wei-v-wang: I will try to get you that if it really helps debugging, but I would need privacy and legal approval.
Are there any steps you want me to take to help debug this? Basically I want to understand what ops are being used in my model (both with MKL and without MKL) and see if that helps us understand why the MKL optimization degrades performance.
As for the model: I have wide and deep linear models using the tf.estimator and Dataset APIs.
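For the op-level inspection I mentioned, this is roughly what I have in mind (a TF 1.x sketch on a toy graph, not my actual model; MKL-rewritten ops should show up with an _Mkl prefix in the trace labels if the rewrite kicked in):
import tensorflow as tf

# Toy stand-in graph; the real check would run the estimator's predict path.
a = tf.random_normal([256, 256])
b = tf.random_normal([256, 256])
c = tf.matmul(a, b)

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    sess.run(c, options=run_options, run_metadata=run_metadata)

# Dump the executed kernels per device from the collected step stats.
for dev in run_metadata.step_stats.dev_stats:
    for node in dev.node_stats:
        print(dev.device, node.node_name, node.timeline_label)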
OK, I see. To simplify things, as you said, Wide and Deep (wide only) is a good proxy for your model. I will double-check the performance comparison just using this wide and deep linear model. Hopefully the learnings can be applied to your custom model.
BTW, are you using a private dataset or a public dataset? The performance may vary depending on the dataset size you are using.
The dataset is private. I can get more details about the types of features and the number of crosses if that helps, but this is all for inference, not training.
@wei-v-wang: I am trying to rewrite the model graph to anonymize the features. This works quite well except for a few sparse features for which I also have an embedding. Do you happen to know a tool/library that can help do this and take care of the edge case I am missing?
My graph rewrite code is pretty trivial: it iterates over all nodes, searches for some feature names, and replaces them with ids (roughly as sketched below). For some reason I can't get the model scores to match when I apply this translation to the sparse features that use an embedding layer. Any caveats you know of?
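A stripped-down version of that rename pass (the mapping, file path, and feature names are placeholders; my real code does a bit more):
from tensorflow.core.framework import graph_pb2

def anonymize(graph_def, name_map):
    for node in graph_def.node:
        for old, new in name_map.items():
            if old in node.name:
                node.name = node.name.replace(old, new)
        # Inputs must be rewritten consistently, including control inputs
        # ("^name") and output slots ("name:1").
        for i, inp in enumerate(node.input):
            for old, new in name_map.items():
                if old in inp:
                    node.input[i] = inp.replace(old, new)
        # Note: node attrs (e.g. shared_name) and string constants inside
        # Const tensors are not touched here, which may matter for
        # embedding/hash-table lookups.
    return graph_def

graph_def = graph_pb2.GraphDef()
with open("model.pb", "rb") as f:  # placeholder path
    graph_def.ParseFromString(f.read())
anonymize(graph_def, {"feature_a": "f1", "feature_b": "f2"})  # placeholder names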
@wei-v-wang: any updates on how this issue can be resolved? Did you find a perf regression in the new DL image?
@patelprateek Sorry for the delay, can you please try this PR #24272 ?
Before it is merged, you can use: https://github.com/Intel-tensorflow/tensorflow/tree/sriniva2/small_alloc_fix
Still monitoring. Waiting for PR #24777.
I think I have a similar issue.
Same here:
4x slowdown using MKL.
So far I have tried the Anaconda version and the Google container (both latest releases).
HW is Xeon 6132, 2 sockets, HT on.
@dare0021 Apologies for the issue; we are starting to address this. I will provide more frequent updates here.
Hey, I'm facing the same issue. Inference is 3-4x slower. Is there any update or any solution?
@patelprateek
@wangcj05
@dare0021
@aashay96
This topic has been open for a very long time.
Intel Optimization for TensorFlow has improved a lot since this issue was created.
To unlock the performance potential of TensorFlow with MKL, users need to set the optimization parameters.
Example for an Intel Core CPU (4 cores/socket, 1 socket):
export TF_ENABLE_MKL_NATIVE_FORMAT=1
export TF_NUM_INTEROP_THREADS=1
export TF_NUM_INTRAOP_THREADS=4
export OMP_NUM_THREADS=4
export KMP_BLOCKTIME=1
export KMP_AFFINITY=granularity=fine,compact,1,0
TF_ENABLE_MKL_NATIVE_FORMAT is the key optimization and is friendly to Keras model inference.
TF_NUM_INTEROP_THREADS can be set to 1, the number of sockets, or another number no larger than the number of cores.
TF_NUM_INTRAOP_THREADS and OMP_NUM_THREADS are set to the number of cores per socket, or another value no larger than that.
KMP_BLOCKTIME can be set to 0, 1, or another number.
PS: the recommended values may not be the right values for your model; the best way is to test the performance to find the right values.
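If you prefer to set these from Python rather than the shell, a minimal sketch (TF 2.x; same example values as above, to be tuned per machine):
import os

# OpenMP/MKL variables must be set before TensorFlow (and MKL) is imported.
os.environ["TF_ENABLE_MKL_NATIVE_FORMAT"] = "1"
os.environ["OMP_NUM_THREADS"] = "4"
os.environ["KMP_BLOCKTIME"] = "1"
os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"

import tensorflow as tf

# Equivalent to TF_NUM_INTEROP_THREADS / TF_NUM_INTRAOP_THREADS.
tf.config.threading.set_inter_op_parallelism_threads(1)
tf.config.threading.set_intra_op_parallelism_threads(4)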
Now, users can install TensorFlow with MKL via pip or conda:
python -m pip install intel-tensorflow
conda install tensorflow-mkl
Or build it from source code with '--config=mkl'.
Please refer to the Intel® Optimization for TensorFlow* Installation Guide
@patelprateek
What do you think of the suggestion?
If there is still an issue, could you share it?
Thank you!
@patelprateek Could you please refer to this comment and try with the latest TF versions (2.4 or later), as older TF versions (1.x) are not actively supported. Please let us know if it helps. Thanks!
This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you.
Closing as stale. Please reopen if you'd like to work on this further.