# Memory-Efficient Training for Extremely Large Output Spaces

This repository provides supplementary material for the ECML-PKDD 2023 submission "Towards Memory-Efficient Training for Extremely Large Output Spaces – Learning with 500k Labels on a Single Commodity GPU".

## Improved Code

If you are not interested in the exact code for that paper, we recommend using

instead, which provide improved TensorFlow/PyTorch bindings for the CUDA kernels described in this work. The kernels themselves are shared between the two implementations and can be found at https://version.aalto.fi/gitlab/xmc/xmc-kernels.

This repository contains implementations of several variations of a sparse layer for use with large output spaces.
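For orientation, the fixed-fan-in (FFI) setup evaluated below can be sketched in a few lines of NumPy. This is an illustration only, not the repository's API; all names and sizes (`ffi_forward`, `fan_in`, the toy dimensions) are invented for the example. The idea: each output label is connected to exactly `connectivity` input features, so the layer stores a (labels × fan-in) weight matrix plus an index matrix of the same shape instead of a full dense weight matrix.

```python
import numpy as np

# Illustrative sketch only (not the repository's API): a fixed-fan-in (FFI)
# sparse output layer. Every output label is connected to exactly `fan_in`
# input features, so the parameters are an (L, fan_in) weight matrix plus an
# (L, fan_in) index matrix instead of a dense (L, num_features) matrix.
num_features, num_labels, fan_in = 1024, 10_000, 32  # toy sizes; the paper targets ~500k labels

rng = np.random.default_rng(0)
indices = rng.integers(0, num_features, size=(num_labels, fan_in))      # which inputs each label reads
weights = rng.standard_normal((num_labels, fan_in)).astype(np.float32)  # one weight per connection

def ffi_forward(x):
    """x: (batch, num_features) -> logits: (batch, num_labels)."""
    gathered = x[:, indices]                    # (batch, num_labels, fan_in)
    return np.einsum("blk,lk->bl", gathered, weights)

logits = ffi_forward(rng.standard_normal((4, num_features)).astype(np.float32))
print(logits.shape)  # (4, 10000)
```

The tables below compare this structured variant (FFI) against unstructured sparsity, a dense baseline, and a dense bottleneck layer.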

## Additional Results

Results for AmazonCat-670k with CascadeXML features.

| setup | connectivity | intermediate | train-p@1 | train-p@3 | train-p@5 | test-p@1 | test-p@3 | test-p@5 | memory | epochs | time-per-epoch |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Dense | 768 | -- | 99.79 | 94.5 | 89.01 | 47.52 | 42.26 | 38.28 | 13.43 | 28.4 | 624 |
| Unstructured | 32 | -- | 88.83 | 71.44 | 55.62 | 30.42 | 23.75 | 18.96 | 6.27 | 95 | 1369 |
| FFI | 32 | -- | 92.42 | 83.96 | 74.75 | 37.09 | 31.62 | 27.56 | 0.97 | 76.2 | 234 |
| Unstructured | 32 | 32k | 99.68 | 94.32 | 88.66 | 42.5 | 37.06 | 33.05 | 6.5 | 36 | 1512 |
| FFI | 32 | 16k | 99.46 | 94 | 88.02 | 41.32 | 35.91 | 31.91 | 1.25 | 34 | 270 |
| FFI | 32 | 32k | 99.67 | 94.32 | 88.66 | 42.59 | 37.12 | 33.13 | 1.45 | 36.4 | 271 |
| FFI | 32 | 65k | 99.7 | 94.37 | 88.77 | 43.7 | 38.43 | 34.42 | 1.74 | 39 | 305 |
| FFI | 32 | 100k | 99.67 | 94.32 | 88.67 | 44.71 | 39.3 | 35.29 | 2.38 | 34 | 334 |
| FFI | 64 | 16k | 99.74 | 94.43 | 88.89 | 43.36 | 38.05 | 34.18 | 2.14 | 27 | 290 |
| FFI | 64 | 32k | 99.75 | 94.43 | 88.9 | 44.26 | 38.93 | 35.04 | 2.41 | 31 | 306 |
| FFI | 64 | 65k | 99.74 | 94.41 | 88.86 | 45.26 | 39.81 | 35.88 | 2.55 | 33 | 391 |
| FFI | 64 | 100k | 99.72 | 94.39 | 88.81 | 45.65 | 40.33 | 36.39 | 2.91 | 31 | 435 |
| FFI | 72 | 65k | 99.75 | 94.44 | 88.9 | 45.25 | 39.9 | 35.99 | 2.7 | 31 | 440 |
| Bottleneck | 64 | 64 | 99.08 | 93.28 | 86.39 | 38 | 33.74 | 30.39 | 1.13 | 31.6 | 232 |

Results for AmazonCat-670k with Slice features.

| setup | connectivity | intermediate | train-p@1 | train-p@3 | train-p@5 | test-p@1 | test-p@3 | test-p@5 | memory | epochs | time-per-epoch |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Dense | 512 | -- | 99.18 | 93.91 | 88.39 | 33.76 | 29.62 | 26.58 | 8.96 | 27.2 | 472 |
| Unstructured | 32 | -- | 64.77 | 49.37 | 38.98 | 14.45 | 11.54 | 9.51 | 6.36 | 73 | 1357 |
| FFI | 32 | -- | 16.17 | 13.9 | 12.42 | 7.12 | 6.3 | 5.64 | 0.97 | 24.8 | 223 |
| Unstructured | 32 | 32k | 98.85 | 93.41 | 87.53 | 32.65 | 28.68 | 25.79 | 6.42 | 45 | 1618 |
| FFI | 32 | 16k | 98.22 | 92.45 | 85.96 | 31.73 | 27.86 | 25.02 | 1.11 | 42 | 259 |
| FFI | 32 | 32k | 98.7 | 93.24 | 87.35 | 32.8 | 28.75 | 25.91 | 1.23 | 38 | 244 |
| FFI | 32 | 65k | 98.96 | 93.61 | 87.98 | 33.7 | 29.69 | 26.84 | 1.35 | 36 | 309 |
| FFI | 32 | 100k | 99.04 | 93.72 | 88.15 | 34.25 | 30.2 | 27.34 | 1.77 | 35 | 302 |
| FFI | 64 | 16k | 99.1 | 93.8 | 88.19 | 33.16 | 29.17 | 26.41 | 1.95 | 33 | 301 |
| FFI | 64 | 32k | 99.1 | 93.79 | 88.25 | 33.93 | 29.87 | 27.07 | 2.22 | 32 | 314 |
| FFI | 64 | 65k | 99.09 | 93.79 | 88.26 | 34.56 | 30.5 | 27.69 | 2.54 | 30 | 396 |
| FFI | 64 | 100k | 99.07 | 93.76 | 88.22 | 35.02 | 30.98 | 28.09 | 2.58 | 29 | 411 |
| Bottleneck | 64 | 64 | 96.36 | 88.88 | 80.08 | 30.69 | 27.33 | 24.56 | 1.13 | 33.6 | 219 |

Results for Wiki500k with CascadeXML features.

| setup | connectivity | intermediate | train-p@1 | train-p@3 | train-p@5 | test-p@1 | test-p@3 | test-p@5 | memory | epochs | time-per-epoch |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Dense | 768 | -- | 96.7 | 79.72 | 64.24 | 77.17 | 58.55 | 45.11 | 10.04 | 25.6 | 1744 |
| Unstructured | 32 | -- | 78.27 | 54.69 | 39.82 | 65.25 | 43.73 | 31.4 | 4.79 | 100 | 3870 |
| FFI | 32 | -- | 69.09 | 51.79 | 40.44 | 58.67 | 41.98 | 32.2 | 0.72 | 59.4 | 715 |
| Unstructured | 32 | 32k | 92.41 | 73.53 | 58.13 | 73.7 | 54.75 | 42.01 | 4.91 | 58 | 4423 |
| FFI | 32 | 16k | 90.71 | 71.44 | 56.32 | 73.12 | 54.15 | 41.52 | 0.92 | 68 | 746 |
| FFI | 32 | 32k | 92.96 | 74.31 | 58.89 | 73.65 | 54.78 | 42.06 | 1.02 | 67.4 | 842 |
| FFI | 32 | 65k | 94.35 | 76.31 | 60.84 | 74.05 | 55.42 | 42.63 | 1.57 | 56 | 928 |
| FFI | 32 | 100k | 94.45 | 76.57 | 61.14 | 74.32 | 55.76 | 42.95 | 2.36 | 49 | 1262 |
| FFI | 64 | 16k | 94.08 | 76.01 | 60.4 | 74.39 | 55.63 | 42.75 | 1.67 | 56 | 878 |
| FFI | 64 | 32k | 94.77 | 77.05 | 61.48 | 74.39 | 55.81 | 42.93 | 1.9 | 48 | 929 |
| FFI | 64 | 65k | 95.28 | 77.83 | 62.32 | 74.51 | 56.05 | 43.19 | 2.03 | 43 | 1167 |
| FFI | 64 | 100k | 95.84 | 78.64 | 63.14 | 74.63 | 56.24 | 43.37 | 2.65 | 45 | 1530 |
| Bottleneck | 64 | 64 | 86.43 | 64.81 | 49.5 | 71.9 | 50.7 | 37.93 | 0.96 | 47.6 | 678 |

Results for Wiki500k with Slice features.

| setup | connectivity | intermediate | train-p@1 | train-p@3 | train-p@5 | test-p@1 | test-p@3 | test-p@5 | memory | epochs | time-per-epoch |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Dense | 512 | -- | 97.33 | 77.47 | 60.41 | 58.25 | 37.91 | 28.03 | 6.7 | 39.4 | 1249 |
| Unstructured | 32 | -- | 58.33 | 37.7 | 28.07 | 45.49 | 27.29 | 19.86 | 4.79 | 78 | 3612 |
| FFI | 32 | -- | 42.64 | 28.01 | 21.95 | 37.53 | 23.22 | 17.64 | 0.72 | 54.8 | 659 |
| Unstructured | 32 | 32k | 83.74 | 61.42 | 47.69 | 59.01 | 38.48 | 28.9 | 4.81 | 40 | 3977 |
| FFI | 32 | 16k | 80.26 | 58.19 | 45.12 | 58 | 37.74 | 28.36 | 0.92 | 59 | 946 |
| FFI | 32 | 32k | 84.24 | 62.19 | 48.39 | 58.86 | 38.44 | 28.87 | 1.04 | 45.8 | 723 |
| FFI | 32 | 65k | 88.46 | 66.83 | 52.23 | 59.83 | 39.24 | 29.48 | 1.18 | 37 | 821 |
| FFI | 32 | 100k | 89.83 | 68.51 | 53.69 | 60.56 | 39.79 | 29.93 | 1.59 | 34 | 1106 |
| FFI | 64 | 16k | 86.5 | 64.55 | 50.19 | 59.38 | 38.68 | 29.01 | 1.54 | 52 | 807 |
| FFI | 64 | 32k | 89.06 | 67.44 | 52.62 | 59.97 | 39.25 | 29.45 | 1.78 | 43 | 843 |
| FFI | 64 | 65k | 92.19 | 71.21 | 55.81 | 60.51 | 39.75 | 29.83 | 1.92 | 38 | 1035 |
| FFI | 64 | 100k | 93.29 | 72.76 | 57.21 | 61.02 | 40.17 | 30.18 | 2.06 | 38 | 1335 |
| Bottleneck | 64 | 64 | 71.79 | 50.01 | 38.48 | 56.5 | 36.52 | 27.5 | 0.97 | 41.8 | 639 |

Preliminary results for Amazon3M with CascadeXML features.

| setup | connectivity | intermediate | train-p@1 | train-p@3 | train-p@5 | test-p@1 | test-p@3 | test-p@5 | memory | epochs | time-per-epoch |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Dense | 768 | -- | 89.89 | 83.94 | 79.44 | 53.36 | 50.65 | 48.38 | 56.36 | 37 | 4228 |
| FFI | 32 | 65k | 68.08 | 62.01 | 58.03 | 48.07 | 44.11 | 41.44 | 4.47 | 100 | 1901 |
| FFI | 32 | 131k | 71.18 | 64.84 | 60.64 | 49.12 | 45 | 42.3 | 5.01 | 100 | 1902 |
| FFI | 64 | 131k | 78.12 | 71.96 | 67.64 | 50.4 | 46.74 | 44.21 | 8.67 | 100 | 2337 |
| FFI | 96 | 131k | 82.99 | 76.8 | 72.3 | 51.02 | 47.64 | 45.22 | 12.61 | 100 | 2776 |

## Building the library

First, create a conda environment from the provided `environment.yml`, e.g. through:

```bash
conda env create -f environment.yml
```

Activate the environment. Then configure CMake for a build directory (e.g. `build`) and run the build:

```bash
cmake -S . -B build
cmake --build build --target sparseops
```

After this, there should be a file `build/libsparseops.so` containing the compiled parts of the library.
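As a quick sanity check, the shared library should be loadable as a TensorFlow custom-op library. The snippet below is a sketch under that assumption; the path is the build output from above, and whether the project itself loads the kernels exactly this way is not documented here.

```python
# Sketch: verify the build artifact can be loaded by TensorFlow.
# Assumption: build/libsparseops.so is a TensorFlow custom-op library;
# adjust the path if you used a different build directory.
import tensorflow as tf

sparseops = tf.load_op_library("build/libsparseops.so")
print(sparseops)  # fails above if the library or its CUDA dependencies do not resolve
```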

## Running the Python code

The `sparse` subdirectory contains the glue code that makes the custom kernels usable in TensorFlow (the `ops` subdirectory), as well as several implementations of sparse multiplication layers and corresponding utilities (the `layers` subdirectory). To use the fast sparse layer, the `libsparseops.so` file needs to be placed alongside `ops/fixed_fan_in_ops.py`.
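For example, copying the freshly built library into place could look like the following sketch (the destination path is an assumption based on the directory layout described above; adjust it to your checkout):

```python
# Sketch: put the compiled kernels next to the op wrapper so it can be found at import time.
# Assumption: the layout is sparse/ops/fixed_fan_in_ops.py, as described above.
import shutil

shutil.copy("build/libsparseops.so", "sparse/ops/libsparseops.so")
```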

An example script is given in `run.py`, which runs a (sparse) training experiment specified in a JSON file. The task files used for the paper are given in the `tasks` subdirectory.