Intel MKL optimized TensorFlow performance degradation
patelprateek opened this issue · 29 comments
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Google Deep Learning VM
Version: m10
Based on: Debian GNU/Linux 9.5 (stretch) (GNU/Linux 4.9.0-8-amd64 x86_64)
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
- TensorFlow installed from (source or binary): deep-learning image
- TensorFlow version (use command below): 1.11
- Python version: 2.7
- Bazel version (if compiling from source):
- GCC/Compiler version (if compiling from source):
- CUDA/cuDNN version:
- GPU model and memory: N/A
Describe the current behavior
Running a deep model and some wide linear models. Inference performance is very poor: 2-4x slower relative to running inference without MKL.
Describe the expected behavior
Performance should actually improve with the Intel MKL optimizations.
Code to reproduce the issue
Code for a deep and wide linear model, or the logistic regression example code from the TensorFlow examples.
Other info / logs
This happens when running the Google deep learning image version M9 on a GPU machine (image: tf-latest-cu92, version M9). Note: inference runs only on the CPU because I turn off visibility for the CUDA devices, so the TensorFlow code runs on the CPU only. The image family claims the packages are Intel optimized, but when I run the benchmarks with verbosity on, I do not observe any MKL-related output.
I started another deep learning image (tf-latest-cpu, version M10). Running the exact same code on this machine with the environment variable set (export MKL_VERBOSE=1), I can observe a lot of OpenMP thread settings, KMP_xxx settings, and MKL instructions logged with timing information. I didn't observe any such thing in the M9 GPU image, even though in both places I observe the following logs when I execute the command:
M9 GPU image:
Numpy + Intel(R) MKL: THREADING LAYER: (null)
Numpy + Intel(R) MKL: setting Intel(R) MKL to use INTEL OpenMP runtime
Numpy + Intel(R) MKL: preloading libiomp5.so runtime
MKL_VERBOSE Intel(R) MKL 2019.0 Product build 20180829 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.20GHz lp64 intel_thread
MKL_VERBOSE SDOT(2,0x55fd25117d40,1,0x55fd25117d40,1) 1.61ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:16
1.11.0
M10 CPU image:
Numpy + Intel(R) MKL: THREADING LAYER: (null)
Numpy + Intel(R) MKL: setting Intel(R) MKL to use INTEL OpenMP runtime
Numpy + Intel(R) MKL: preloading libiomp5.so runtime
User settings:
KMP_AFFINITY=granularity=fine,verbose,compact,1,0
KMP_BLOCKTIME=0
KMP_SETTINGS=1
OMP_NUM_THREADS=32
Effective settings:
KMP_ABORT_DELAY=0
KMP_ADAPTIVE_LOCK_PROPS='1,1024'
KMP_ALIGN_ALLOC=64
KMP_ALL_THREADPRIVATE=128
KMP_ATOMIC_MODE=2
KMP_BLOCKTIME=0
KMP_CPUINFO_FILE: value is not defined
KMP_DETERMINISTIC_REDUCTION=false
KMP_DEVICE_THREAD_LIMIT=2147483647
KMP_DISP_HAND_THREAD=false
KMP_DISP_NUM_BUFFERS=7
KMP_DUPLICATE_LIB_OK=false
KMP_FORCE_REDUCTION: value is not defined
KMP_FOREIGN_THREADS_THREADPRIVATE=true
KMP_FORKJOIN_BARRIER='2,2'
KMP_FORKJOIN_BARRIER_PATTERN='hyper,hyper'
KMP_FORKJOIN_FRAMES=true
KMP_FORKJOIN_FRAMES_MODE=3
KMP_GTID_MODE=3
KMP_HANDLE_SIGNALS=false
KMP_HOT_TEAMS_MAX_LEVEL=1
KMP_HOT_TEAMS_MODE=0
KMP_INIT_AT_FORK=true
KMP_INIT_WAIT=2048
KMP_ITT_PREPARE_DELAY=0
KMP_LIBRARY=throughput
KMP_LOCK_KIND=queuing
KMP_MALLOC_POOL_INCR=1M
KMP_NEXT_WAIT=1024
KMP_NUM_LOCKS_IN_BLOCK=1
KMP_PLAIN_BARRIER='2,2'
KMP_PLAIN_BARRIER_PATTERN='hyper,hyper'
KMP_REDUCTION_BARRIER='1,1'
KMP_REDUCTION_BARRIER_PATTERN='hyper,hyper'
KMP_SCHEDULE='static,balanced;guided,iterative'
KMP_SETTINGS=true
KMP_SPIN_BACKOFF_PARAMS='4096,100'
KMP_STACKOFFSET=64
KMP_STACKPAD=0
KMP_STACKSIZE=4M
KMP_STORAGE_MAP=false
KMP_TASKING=2
KMP_TASKLOOP_MIN_TASKS=0
KMP_TASK_STEALING_CONSTRAINT=1
KMP_TEAMS_THREAD_LIMIT=32
KMP_TOPOLOGY_METHOD=all
KMP_USER_LEVEL_MWAIT=false
KMP_VERSION=false
KMP_WARNINGS=true
OMP_AFFINITY_FORMAT='OMP: pid %P tid %T thread %n bound to OS proc set {%a}'
OMP_ALLOCATOR=omp_default_mem_alloc
OMP_CANCELLATION=false
OMP_DEBUG=disabled
OMP_DEFAULT_DEVICE=0
OMP_DISPLAY_AFFINITY=false
OMP_DISPLAY_ENV=false
OMP_DYNAMIC=false
OMP_MAX_ACTIVE_LEVELS=2147483647
OMP_MAX_TASK_PRIORITY=0
OMP_NESTED=false
OMP_NUM_THREADS='32'
OMP_PLACES: value is not defined
OMP_PROC_BIND='intel'
OMP_SCHEDULE='static'
OMP_STACKSIZE=4M
OMP_TARGET_OFFLOAD=DEFAULT
OMP_THREAD_LIMIT=2147483647
OMP_TOOL=enabled
OMP_TOOL_LIBRARIES: value is not defined
OMP_WAIT_POLICY=PASSIVE
KMP_AFFINITY='verbose,warnings,respect,granularity=fine,compact,1,0'
OMP: Info #212: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #210: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: 0-31
OMP: Info #156: KMP_AFFINITY: 32 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 1 packages x 16 cores/pkg x 2 threads/core (16 total cores)
OMP: Info #214: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 16 maps to package 0 core 0 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 0 core 1 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 17 maps to package 0 core 1 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 2 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 18 maps to package 0 core 2 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 0 core 3 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 19 maps to package 0 core 3 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 4 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 20 maps to package 0 core 4 thread 1
OMP: Info #250: KMP_AFFINITY: pid 8331 tid 8331 thread 0 bound to OS proc set 0
MKL_VERBOSE Intel(R) MKL 2019.0 Product build 20180829 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.20GHz lp64 intel_thread
MKL_VERBOSE SDOT(2,0x5622b7736500,1,0x5622b7736500,1) 2.54ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:16
1.11.0
So I assume Intel MKL is being used in the M10 image, whereas MKL is not being used in the M9 image (note: I have turned off visibility for the CUDA devices, so only CPU inference is being compared). I observe a 2-4x performance degradation with Intel MKL.
The MKL-suggested flags are set as recommended:
KMP_AFFINITY=granularity=fine,verbose,compact,1,0
KMP_BLOCKTIME=0
KMP_SETTINGS=1
OMP_NUM_THREADS=32
Any ideas on how to debug the root cause and get the maximum performance for my models?
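For reference, here is a minimal sketch (not my actual benchmark script) of how I force the CPU-only comparison and the MKL logging above; it relies only on the standard CUDA_VISIBLE_DEVICES and MKL_VERBOSE environment variables:
import os

# Hide all CUDA devices so TensorFlow falls back to CPU kernels only.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
# Ask MKL to log every call it services (the MKL_VERBOSE lines above);
# must be set before TensorFlow (and thus MKL) is loaded.
os.environ["MKL_VERBOSE"] = "1"

import tensorflow as tf  # imported after the environment variables are set

print(tf.GIT_VERSION, tf.VERSION)  # prints 1.11.0 here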
It is "deep and wide linear model", can you do "export OMP_NUM_THREADS=1" as a first step?
And can you please try inter_op_parallelism_threads and intra_op_parallism_threads similar to https://github.com/NervanaSystems/tensorflow-models/commit/55d55abc71483723743c0273b9c1fd8e0c7d8391#diff-00c5d001cb14a21f6d7dbf16d4e55032R90 if you haven't?
@wei-v-wang: the link you mentioned doesn't work for me. Can you please share the link again, or maybe let me know what config for inter and intra op parallelism I should try? I will post back the results here.
Also, it is not just the wide and linear models; I am observing similar 2-3x worse inference latency for a deep cross network model as well. Could you please also explain the reasoning behind OMP_NUM_THREADS=1? This will help us better understand the internal workings.
Sorry, here is the updated link: https://github.com/tensorflow/models/blob/master/official/wide_deep/wide_deep_run_loop.py#L87-L88
If some application is not bound by compute, changing OMP_NUM_THREADS might help.
I think for wide/deep models, inter_op/intra_op has been providing some help. Please definitely enable it in your model and give it a try.
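In TF 1.x this can be wired into an estimator roughly like this (a sketch with a dummy feature column and placeholder thread counts, not tuned values):
import tensorflow as tf

session_config = tf.ConfigProto(
    inter_op_parallelism_threads=2,   # e.g. number of sockets
    intra_op_parallelism_threads=16,  # e.g. number of physical cores
)
run_config = tf.estimator.RunConfig(session_config=session_config)

feature_columns = [tf.feature_column.numeric_column("x")]  # placeholder
estimator = tf.estimator.LinearClassifier(
    feature_columns=feature_columns,
    config=run_config,
)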
@wei-v-wang: the link you provided changes the inter and intra op thread settings, but when I run the code it still prints out:
KMP_AFFINITY=granularity=fine,verbose,compact,1,0
KMP_BLOCKTIME=0
KMP_SETTINGS=1
OMP_NUM_THREADS=32
so I am not sure it is taking effect. Are those two different settings?
In order to change OMP_NUM_THREADS, please use "export OMP_NUM_THREADS=". The link I provided only changes the inter and intra op settings.
OK, so I tried a bunch of parameters. Machine type: 32 cores, 2 logical threads per core.
I tried: number of intra op threads = OMP threads: [4, 8, 16, 32, 64]
inter op threads = number of physical cores and number of sockets: [2, 8, 16, 32]
The best performance I could get for a batch size of 1k: 48 microseconds.
The best I get without MKL, without much tuning (number of inter and intra op threads being the same: 16/32/64): 23 microseconds.
Any other settings I need to try?
Can we tell whether the MKL library and ISA are even being taken advantage of by looking at some ops that should definitely perform better?
I definitely found that setting the number of OMP threads to a lower count helped, and the same for inter op parallelism.
But the performance for the current model is still 2-3x worse in general.
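For context, the sweep itself was just relaunching the benchmark in a fresh process per setting, so that OMP_NUM_THREADS is picked up before MKL loads; roughly like this (the script name and flag names are placeholders, not my actual ones):
import os
import subprocess

for omp in [4, 8, 16, 32, 64]:
    for inter in [2, 8, 16, 32]:
        env = dict(os.environ,
                   OMP_NUM_THREADS=str(omp),
                   KMP_BLOCKTIME="0",
                   KMP_SETTINGS="1",
                   KMP_AFFINITY="granularity=fine,verbose,compact,1,0")
        # Each run parses --inter_op/--intra_op and puts them into ConfigProto.
        subprocess.check_call(
            ["python", "benchmark.py",
             "--inter_op", str(inter), "--intra_op", str(omp)],
            env=env)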
Since it is inference, I have one last suggestion:
Could you please prefix your runs with "numactl -c 1 -m 1 python ..."? The rest of the configuration can remain the same. This is to use just one socket, to rule out memory access overhead across two sockets.
If you still observe ~2X slowness with TF w/MKLDNN, can you please share your model script with us?
Sorry, I should have given out all the BKMs (best known methods) in one batch, but here is another important one that I missed.
export OMP_NUM_THREADS=x
export KMP_BLOCKTIME=1
numactl -c 1 -m 1 python ... <inter_op> <intra_op>
numactl -c 1 -m 1 python ...
libnuma: Warning: node argument 1 is out of range
<1> is invalid
Here is the machine config:
lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU @ 2.20GHz
Stepping: 0
CPU MHz: 2200.000
BogoMIPS: 4400.00
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 56320K
NUMA node0 CPU(s): 0-31
I tried numactl -c 0 -m 0 python; still, the best I get is around 48 microseconds, with OMP threads and inter op threads = 6 and KMP_BLOCKTIME=1.
@patelprateek I see, it is a single-socket system, so numactl does not help here. Is it possible for us to get your customized model?
@wei-v-wang: I will try to get you that if it really helps debugging, but I would need privacy and legal approval.
Are there any steps you want me to take to help debug this? Basically I want to understand what ops are being used in my model (both with MKL and without MKL) and see if that helps us understand why the MKL optimization degrades performance.
As for the model: I have wide and deep linear models using the tf.estimator and Dataset APIs.
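For the op-level inspection I mentioned, this is roughly what I have in mind (a TF 1.x sketch on a toy graph, not my actual model; MKL-rewritten ops should show up with an _Mkl prefix in the trace labels if the rewrite kicked in):
import tensorflow as tf

# Toy stand-in graph; the real check would run the estimator's predict path.
a = tf.random_normal([256, 256])
b = tf.random_normal([256, 256])
c = tf.matmul(a, b)

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    sess.run(c, options=run_options, run_metadata=run_metadata)

# Dump the executed kernels per device from the collected step stats.
for dev in run_metadata.step_stats.dev_stats:
    for node in dev.node_stats:
        print(dev.device, node.node_name, node.timeline_label)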
OK, I see. To simplify things, as you said, Wide and Deep (wide only) is a good proxy for your model. I will double-check the performance comparison just using this wide and deep linear model. Hopefully the learnings can be applied to your custom model.
BTW, are you using a private dataset or a public dataset? The performance may vary depending on the dataset size you are using.
The dataset is private. I can get more details about the types of features and the number of crosses if that helps, but this is all for inference, not training.
@wei-v-wang: I am trying to rewrite the model graph to anonymize the features. This works quite well except for a few sparse features for which I also have an embedding. Do you happen to know a tool/library that can help do this and take care of the edge case I am missing?
My graph rewrite code is pretty trivial: it iterates over all nodes, searches for some feature names, and replaces them with ids (roughly as sketched below). For some reason I can't get the model scores to match when I apply this translation to the sparse features that use an embedding layer. Any caveats you know of?
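A stripped-down version of that rename pass (the mapping, file path, and feature names are placeholders; my real code does a bit more):
from tensorflow.core.framework import graph_pb2

def anonymize(graph_def, name_map):
    for node in graph_def.node:
        for old, new in name_map.items():
            if old in node.name:
                node.name = node.name.replace(old, new)
        # Inputs must be rewritten consistently, including control inputs
        # ("^name") and output slots ("name:1").
        for i, inp in enumerate(node.input):
            for old, new in name_map.items():
                if old in inp:
                    node.input[i] = inp.replace(old, new)
        # Note: node attrs (e.g. shared_name) and string constants inside
        # Const tensors are not touched here, which may matter for
        # embedding/hash-table lookups.
    return graph_def

graph_def = graph_pb2.GraphDef()
with open("model.pb", "rb") as f:  # placeholder path
    graph_def.ParseFromString(f.read())
anonymize(graph_def, {"feature_a": "f1", "feature_b": "f2"})  # placeholder names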
@wei-v-wang: any updates on how this issue can be resolved? Did you find a perf regression in the new DL image?
@patelprateek Sorry for the delay, can you please try this PR #24272 ?
Before it is merged, you can use: https://github.com/Intel-tensorflow/tensorflow/tree/sriniva2/small_alloc_fix
Still monitoring. Waiting for PR #24777.
I think I have a similar issue.
Same here:
4x slowdown using MKL.
So far I have tried the Anaconda version and the Google container (both latest releases).
HW is Xeon 6132, 2 sockets, HT on.
@dare0021 Apologies for the issue; we are starting to address this. I will provide more frequent updates here.
Hey, I'm facing the same issue. Inference is 3-4x slower. Is there any update or any solution?
@patelprateek
@wangcj05
@dare0021
@aashay96
This topic has been open for a very long time.
Intel Optimization for TensorFlow has improved a lot since this issue was created.
To unlock the performance potential of TensorFlow with MKL, users need to set the optimization parameters.
Example for an Intel Core CPU (4 cores/socket, 1 socket):
export TF_ENABLE_MKL_NATIVE_FORMAT=1
export TF_NUM_INTEROP_THREADS=1
export TF_NUM_INTRAOP_THREADS=4
export OMP_NUM_THREADS=4
export KMP_BLOCKTIME=1
export KMP_AFFINITY=granularity=fine,compact,1,0
TF_ENABLE_MKL_NATIVE_FORMAT is the key optimization and is friendly to Keras model inference.
TF_NUM_INTEROP_THREADS can be set to 1, the number of sockets, or another number no larger than the number of cores.
TF_NUM_INTRAOP_THREADS and OMP_NUM_THREADS are set to the number of cores per socket, or another value no larger than that.
KMP_BLOCKTIME can be set to 0, 1, or another number.
PS: the recommended values may not be the right values for your model; the best way is to test the performance to find the right values.
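If you prefer to set these from Python rather than the shell, a minimal sketch (TF 2.x; same example values as above, to be tuned per machine):
import os

# OpenMP/MKL variables must be set before TensorFlow (and MKL) is imported.
os.environ["TF_ENABLE_MKL_NATIVE_FORMAT"] = "1"
os.environ["OMP_NUM_THREADS"] = "4"
os.environ["KMP_BLOCKTIME"] = "1"
os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"

import tensorflow as tf

# Equivalent to TF_NUM_INTEROP_THREADS / TF_NUM_INTRAOP_THREADS.
tf.config.threading.set_inter_op_parallelism_threads(1)
tf.config.threading.set_intra_op_parallelism_threads(4)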
Now, users can install TensorFlow with MKL via pip or conda:
python -m pip install intel-tensorflow
conda install tensorflow-mkl
Or build it from source code with '--config=mkl'.
Please refer to the Intel® Optimization for TensorFlow* Installation Guide
@patelprateek
What do you think of the suggestion?
If there is still an issue, could you share it?
Thank you!
@patelprateek Could you please refer to this comment and try with the latest TF versions (2.4 or later), as older TF versions (1.x) are not actively supported. Please let us know if it helps. Thanks!
This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you.
Closing as stale. Please reopen if you'd like to work on this further.