stream-scaling: A C repository from schmiddy

Introduction

stream-scaling automates running the STREAM memory bandwidth test on Linux systems. It detects the number of CPUs and how large each of their caches are. The program then downloads STREAM, compiles it, and runs it with an array size large enough to not fit into cache. The number of threads is varied from 1 to the total number of cores in the server, so that you can see how memory speed scales as cores involved increase.

Installation/Usage

Just run stream-scaling:

./stream-scaling

And it should do the rest. Note that a stream.c and stream binary will be left behind afterwards.

Note that the program is only expected to work on systems using gcc 4.2 or later, as the OpenMP libraries are required.

Sample result

This sample is from an Intel i7 860 processor, featuring 4 real cores with Hyper Threading for a total of 8 virtual cores. It also features the Turbo feature to accelerate running with low core counts. Memory is 4 X 2GB DDR-1600:

$ ./stream-scaling
=== CPU cache information ===
CPU /sys/devices/system/cpu/cpu0 Level 1 Cache: 32K (Data)
CPU /sys/devices/system/cpu/cpu0 Level 1 Cache: 32K (Instruction)
CPU /sys/devices/system/cpu/cpu0 Level 2 Cache: 256K (Unified)
CPU /sys/devices/system/cpu/cpu0 Level 3 Cache: 8192K (Unified)
CPU /sys/devices/system/cpu/cpu1 Level 1 Cache: 32K (Data)
CPU /sys/devices/system/cpu/cpu1 Level 1 Cache: 32K (Instruction)
CPU /sys/devices/system/cpu/cpu1 Level 2 Cache: 256K (Unified)
CPU /sys/devices/system/cpu/cpu1 Level 3 Cache: 8192K (Unified)
CPU /sys/devices/system/cpu/cpu2 Level 1 Cache: 32K (Data)
CPU /sys/devices/system/cpu/cpu2 Level 1 Cache: 32K (Instruction)
CPU /sys/devices/system/cpu/cpu2 Level 2 Cache: 256K (Unified)
CPU /sys/devices/system/cpu/cpu2 Level 3 Cache: 8192K (Unified)
CPU /sys/devices/system/cpu/cpu3 Level 1 Cache: 32K (Data)
CPU /sys/devices/system/cpu/cpu3 Level 1 Cache: 32K (Instruction)
CPU /sys/devices/system/cpu/cpu3 Level 2 Cache: 256K (Unified)
CPU /sys/devices/system/cpu/cpu3 Level 3 Cache: 8192K (Unified)
CPU /sys/devices/system/cpu/cpu4 Level 1 Cache: 32K (Data)
CPU /sys/devices/system/cpu/cpu4 Level 1 Cache: 32K (Instruction)
CPU /sys/devices/system/cpu/cpu4 Level 2 Cache: 256K (Unified)
CPU /sys/devices/system/cpu/cpu4 Level 3 Cache: 8192K (Unified)
CPU /sys/devices/system/cpu/cpu5 Level 1 Cache: 32K (Data)
CPU /sys/devices/system/cpu/cpu5 Level 1 Cache: 32K (Instruction)
CPU /sys/devices/system/cpu/cpu5 Level 2 Cache: 256K (Unified)
CPU /sys/devices/system/cpu/cpu5 Level 3 Cache: 8192K (Unified)
CPU /sys/devices/system/cpu/cpu6 Level 1 Cache: 32K (Data)
CPU /sys/devices/system/cpu/cpu6 Level 1 Cache: 32K (Instruction)
CPU /sys/devices/system/cpu/cpu6 Level 2 Cache: 256K (Unified)
CPU /sys/devices/system/cpu/cpu6 Level 3 Cache: 8192K (Unified)
CPU /sys/devices/system/cpu/cpu7 Level 1 Cache: 32K (Data)
CPU /sys/devices/system/cpu/cpu7 Level 1 Cache: 32K (Instruction)
CPU /sys/devices/system/cpu/cpu7 Level 2 Cache: 256K (Unified)
CPU /sys/devices/system/cpu/cpu7 Level 3 Cache: 8192K (Unified)
Total CPU system cache: 69468160 bytes
Suggested minimum array elements needed: 31576436
Array elements used: 31576436

=== CPU Core Summary ===
processor   : 7
model name  : Intel(R) Core(TM) i7 CPU         860  @ 2.80GHz
cpu MHz             : 2898.023
siblings    : 8

=== Check and build stream ===
--2010-09-19 21:41:46--  http://www.cs.virginia.edu/stream/FTP/Code/stream.c
Resolving www.cs.virginia.edu... 128.143.137.29
Connecting to www.cs.virginia.edu|128.143.137.29|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11918 (12K) [text/plain]
Saving to: `stream.c'

100%[======================================>] 11,918      --.-K/s   in 0.03s

2010-09-19 21:41:46 (373 KB/s) - `stream.c' saved [11918/11918]


=== Testing up to 8 cores ===

-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 31576436, Offset = 0
Total memory required = 722.7 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Number of Threads requested = 1
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 38888 microseconds.
   (= 38888 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        9663.6238       0.0524       0.0523       0.0527
Scale:       9315.7724       0.0545       0.0542       0.0558
Add:        10429.7390       0.0729       0.0727       0.0732
Triad:      10108.3413       0.0753       0.0750       0.0758
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------

Number of Threads requested = 2
Function      Rate (MB/s)   Avg time     Min time     Max time
Triad:      13095.9151       0.0579       0.0579       0.0580

Number of Threads requested = 3
Function      Rate (MB/s)   Avg time     Min time     Max time
Triad:      13958.5017       0.0545       0.0543       0.0547

Number of Threads requested = 4
Function      Rate (MB/s)   Avg time     Min time     Max time
Triad:      14293.3696       0.0532       0.0530       0.0537

Number of Threads requested = 5
Function      Rate (MB/s)   Avg time     Min time     Max time
Triad:      13663.0608       0.0563       0.0555       0.0571

Number of Threads requested = 6
Function      Rate (MB/s)   Avg time     Min time     Max time
Triad:      13757.0249       0.0559       0.0551       0.0567

Number of Threads requested = 7
Function      Rate (MB/s)   Avg time     Min time     Max time
Triad:      13463.7445       0.0564       0.0563       0.0566

Number of Threads requested = 8
Function      Rate (MB/s)   Avg time     Min time     Max time
Triad:      13230.8312       0.0575       0.0573       0.0583

Like many of the post-Nehalem Intel processors, this system gets quite good memory bandwidth even when running a single thread, with the Turbo feature helping a bit too. And it's almost reached saturation of all available bandwidth with only two threads active, which is good for a system with this many cores; they don't all have to be doing something to take advantage of all the memory on this server.

Results database

Eventually it's hoped that this program can help build a database of per-core scaling information for STREAM similar to the the core STREAM project maintains for peak throughput. Guidelines for submission to such a project are still being worked on. Please contact the author if you have any ideas for helping organize this work.

In general the following information is needed:

Output from the stream-scaling command
CPU information
List of memory banks in the system, what size of RAM they have, and what technology/speed it runs at.

Common places you might assemble this info from include:

/proc/cpuinfo
lspci -v
dmidecode

Since CPU performance data of this sort is very generic, many submissions are sent to help this project without wanting the company or individual's name dislosed. Accordingly, unless credit for your submission is specifically requested, the source of reported results will remain private. So far all contributions have been anonymous.

Preliminary Samples

Here are some sample results from the program, showing how memory speeds have marched forward as the industry moved from slower DDR2 to increasingly fast DDR3. They also demonstrate why AMD was able to limp along with slower RAM for so long in their multi-socket configurations. While no single core gets great bandwidth, when the server is fully loaded the aggregate amount can be impressive.

T7200: Intel Core2 T7200. Dual core. 32K Data and Instruction L1 caches, 4096K L2 cache.
E5420: Intel Xeon E5420. Quad core. 16K Data and Instruction L1 caches, 6144MB L2 cache. 8 X 4GB DDR2-667.
2 X E5405: Dual Intel Xeon E5405. Quad core. 32K Data and Instruction L1 caches, 6144K L2 cache. 8 X 4GB DDR2-667.
4 X 8347: AMD Opteron 8347 HE. Quad core, 4 sockets. 64K Data and Instruction L1 caches, 512K L2 cache, 2048K L3 cache. 32 X 2GB DDR2-667.
E2180: Intel Pentium E2180. Dual core. 32K Data and Instruction L1 caches, 1024K L2 cache. 2 X 1GB DDR2-800.
X2 4600+: AMD Athlon 64 X2 4600+. Dual core. 64K Data and Instruction L1 caches, 512K L2 cache. 4 X 2GB RAM.
2 X 280: Amd Opteron 280. Dual core, 2 sockets. 64K Data and Instruction L1 caches, 1024K L2 cache. 8 X 1GB DDR2-800.
Q6600: Intel Q6600. Quad core. 32KB Data and Instruction L1 caches, 4096K L2 cache. 4 X 2GB RAM.
8 X 8431: AMD Opteron 8431. 6 cores each, 8 sockets. 64K Data and Instruction L1 caches, 512K L2 cache, 5118K L3 cache. 256GB RAM.
E5506: Intel Xeon E5506 2.13GHz. Quad core. 32K Data and Instruction L1 caches, 256K L2 cache, 4096K L3 cache.
E5520: Dual Intel Xeon E5520. Quad core with Turbo and Hyper Threading for 8 virtual cores. 32K Data and Instruction L1 caches, 256K L2 cache, 8192K L3 cache. 18 X 4GB RAM.
X4 955: AMD Phenon II X4 955. 64K Data and Instruction L1 caches, 512K L2 cache, 6144K L3 cache. 4GB DDR3-1333.
X6 1055T: AMD Phenon II X6 1055T. 64K Data and Instruction L1 caches, 512K L2 cache, 6144K L3 cache. 8GB DDR3-1333.
i860: Intel Core i7 860. Quad core with Turbo and Hyper Threading for 8 virtual cores. 32K Data and Instruction L1 caches, 256K L2 cache, 8192K L3 cache. 4 X 2GB RAM.
i870: Intel Core i7 870. Quad core with Turbo and Hyper Threading for 8 virtual cores. 32K Data and Instruction L1 caches, 256K L2 cache, 8192K L3 cache. 2 X 2GB RAM.
i870[2]: Intel Core i7 870, as above, except with 4 X 4GB RAM.
2 X E5620: Dual Intel Xeon E5620. Quad core with Turbo and Hyper Threading for 16 virtual cores. 32K Data and Instruction L1 cache, 256K L2 cache, 12288K L3 cache. 12 X 8GB DDR3/1333.
2 X X5560: Dual Intel Xeon X5560. Quad core with Turbo and Hyper Threading for 8 virtual cores. 32K Data and Instruction L1 caches, 256K L2 cache, 8192K L3 cache. 6 X 2GB DDR3/1333.
4 x E7540: Quad Intel Xeon E7540. Six cores with Turbo and Hyper Threading for 48 virtual cores, 32K Data and Instruction L1 caches, 256K L2 cache, 18432K L3 cache. 32 x 4096MB DDR3/1066.
4 x X7550: Quad Intel Xeon X7550. Eight cores with Turbo, Hyper Threading disabled for 32 total. 32K Data and Instruction L1 caches, 256K L2 cache, 18432K L3 cache. 32 X 4096 DDR3/1333.
4 X 6168: Quad AMD Opteron 6168. Twelve cores for 48 total, 64K Data and Instruction L1 caches, 512K L2 cache, 5118K L3 cache. 16 X 8192MB DDR3/133.
4 X 6172: Quad AMD Opteron 6172. Twelve cores for 48 total, 64K Data and Instruction L1 caches, 512K L2 cache, 5118K L3 cache. 32 X 4096MB DDR3/1333.
4x X7560: Quad Intel X7560. Eight cores with Turbo and Hyper Threading for 64 virtual cores. 32K Data and Instruction L1 caches, 256K L2 cache, 24576K L3 cache. 32 X 4096 DDR3/1066.
X7560[2]: Quad Intel X7560. Eight cores with Turbo and Hyper Threading disabled, for 32 virtual cores. 32K Data and Instruction L1 caches, 256K L2 cache, 24576K L3 cache. 32 X 4096 DDR3/1066.
4 X 4850: Quad Intel E7-4850. Ten cores with Turbo and Hyper Threading for 80 virtual cores. 32K Data and Instruction L1 caches, 256K L2 cache, 24576K L3 cache. 64 X 8192MB DDR3/1333.

Processor	Cores	Clock	Memory	1 Core	2	3	4	8	16	24	32	48
T7200	2	2.0GHz	DDR2/667	2965	3084
E5420	4	2.5GHz	DDR2/667	3596	3992	4305	4365	4452
2 X E5405	8	2.0GHz	DDR2/667	3651	3830	4941	5774	5773
4 X 8347	16	1.9GHz	DDR2/667	2684	5212	7542	8760	9389	14590
E2180	2	2.0GHz	DDR2/800	2744	2784
X2 4600+	2	2.4GHz	DDR2/800	3657	4460
2 X 280	4	2.4GHz	DDR2/800	3035	3263	3130	6264
Q6600	4	2.4GHz	DDR2/800	4383	4537	4480	4390
8 X 8431	48	2.4GHz	DDR2/800	4038	7996	11918	13520	23658	22801	23688	24522	27214
E5506	4	2.13GHz	DDR3/800	7826	9016	9273	9297
2 X E5520	8	2.27GHz	DDR3/1066	7548	9841	9377	9754	12101	13176
X4 955	4	3.2GHz	DDR3/1333	6750	7150	7286	7258
X6 1055T	6	3.2GHz	DDR3/1333	7207	8657	9873	9772	9932*
i860	8	2.8GHz	DDR3/1600	9664	13096	13959	14293	13231
i870	8	2.93GHz	DDR3/1600	10022	12714	13698	13909	12787
i870[2]	8	2.93GHz	DDR3/1600	9354	11935	13145	13853	12598
2 X E5620	16	2.4GHz	DDR3/1333	9514	16845	17960	22544	21744	19083
2 X X5560	16	2.8GHz	DDR3/1333	11658	18382	19918	24546	23407	29215
4 X E7540	48	2.0GHz	DDR3/1066	4992	9967	14926	18727	31685	35566	35488	35973	35284
4 X X7550	32	2.0GHz	DDR3/1333	5236	10482	15723	20963	32557	35941	35874	35819
4 X 6168	48	1.90GHz	DDR3/1333	5611	11148	15819	20943	34327	52206	67560	69517	65617
4 X 6172	48	2.1GHz	DDR3/1333	4958	9903	14493	19469	37613	51625	40611	47361	32301
4 X X7560	64	2.26GHz	DDR3/1066	4356	7710	13028	14561	18702	19761	19938	20011	15964
X7560[2]	32	2.26GHz	DDR3/1066	4345	8679	12970	16315	25293	27378	27368	28654
4 X 4850	80	2.0GHz	DDR3/1333	5932	11571	17404	16000	41932	72351	58657	71384	65395

The result for 6-core processors with 6 threads is shown in the 8-core column. Only so much space to work with here...

Multiple runs

Since significant run to run variation is often observed in stream results, a set of tools to help average this data out are included. The programs require the Ruby programming language be installed. Using them looks like this, where we're using the server hostname "grace" to label the files and averaging across 10 runs:

./multi-stream-scaling 10 grace
./multi-averager grace > stream.txt
gnuplot stream-plot

A stream.png file will be produced with a graph showing the average of the values from the multiple runs. If you are interested in analyzing the run to run variation, the stream.txt file also includes the standard deviation of the results at each core count.

Todo

Adding compatibility with more operating systems than Linux would be nice. Some results have been submitted from FreeBSD that look correct, but the automatic cache validation code hasn't been validated on that OS.
A results processor that took the verbose output shown and instead produced a compact version for easy comparison with other systems, similar to the CSV output mode of bonnie++, would make this program more useful.

Limitations

On systems with many processors and large caches, most commonly AMD systems with 24 or more cores, the results at high core counts will vary significantly. This is theorized to come from two causes:

Thread scheduling will move the running stream processees between processors in a way that impacts results.
Despite attempting to use a large enough data set to avoid it, some amount of processor caching will inflate results.

If the variation of results at high core counts is high, running the program multiple times and considering the worst results seen at higher thread counts is recommended. Results listed above have included some work to try and eliminate incorrect data from these processors. That may not have been entirely successful. For example, the 4 X 6172 results show extremely high results from 16 to 32 cores. Determing whether those are accurate is still a work in progress.

Bugs

On some systems, the amount of memory selected for the stream array ends up exceeding how large of a block of RAM the operatin system (or in some cases the compiler) is willing to allocate at once. This seems a particular issue on 32-bit operating systems, but even 64-bit ones are not immune.

If your system fails to compile stream with an error such as this:

stream.c:(.text+0x34): relocation truncated to fit: R_X86_64_32S against `.bss'

stream-scaling will try to compile stream using the gcc "-mcmodel=large" option after hitting this error. That will let the program use larger data structures. If you are using a new enough version of the gcc compiler, believed to be at least verison 4.4, the program will run normally after that; you can ignore these "relocation truncated" warnings.

If you have both a large amount of cache--so a matching large block of memory is needed--and an older version of gcc, the second compile attempt will also fail, with the following error:

stream.c:1: sorry, unimplemented: code model ‘large’ not supported yet

In that case, it is unlikely you will get accurate results from stream-scaling. You can try it anyway by manually decreasing the size of the array until the program will compile and link. Manual compile can be done like this:

gcc -O3 -DN=130000000 -fopenmp stream.c -o stream

And then reducing the -DN value until compilation is successful. After that upper limit is determined, adjust the setting for MAX_ARRAY_SIZE at the beginning of the stream-scaling program to reflect it. An upper limit on the stream array size of 130M as shown here allocates approximately 3GB of memory for the test array, with 4GB being the normal limit for 32-bit structures.

The fixes for this issue are new, and it is still possible a problem here still exists. If you have a gcc version >=4.4 but stream-scaling still won't compile correctly, a problem report to the author would be appreciated. It's not clear yet why the exact cut-off value varies on some systems, or if there are systems where the improved dynamic allocation logic may not be sufficient.

Documentation

The documentation README.rst for the program is in ReST markup. Tools that operate on ReST can be used to make versions of it formatted for other purposes, such as rst2html to make a HTML version.

Contact

The project is hosted at http://github.com/gregs1104/stream-scaling

If you have any hints, changes or improvements, please contact:

Greg Smith greg@2ndQuadrant.com

Credits

The sample results given in this file have benefitted from private contributions all over the world. Most submissions ask to remain anonymous.

The multiple run averaging programs were originally contributed by Ben Bleything <ben@bleything.net>

License

stream-scaling is licensed under a standard 3-clause BSD license.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

Neither the name of the author nor the names of contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

schmiddy/stream-scaling