During this lab you will:
- Benchmark a processor with arbitrary subroutines that you have coded yourself,
- Compare several machines according to the benchmarks, and
- Code in Make and C-language.
This lab assumes you have read or are familiar with the following topics:
- Chapter 1.6 (Clock rate, CPI and program performance)
- Knowledge of the Linux command line interface (CLI) and gcc
- Difference between execution time (also known as wall time), user time and system time
- Linear algebra operations: dot product, matrix multiplication (the book has C code and calls it dgemm()), and AXPY
- Some experience with C language
- C-language: Heap allocation via malloc() and free()
- Some experience with make
Please study these topics if you are not familiar with them so that the lab can be completed in a timely manner.
The following is a list of requirements to complete the lab. Some labs can be completed on any machine, whereas some require you to use a specific departmental server. You will find this information here.
This lab requires the following software:
- gcc version 8.3.0
- make
- git

odin.cs.csubak.edu has these already installed. If you're on your own machine running Ubuntu/Debian and you're not certain if these things are installed, run:
$ sudo apt install build-essential git
which should install these three things. This course will use Makefiles to automate compiling your code. For this lab manual I'm assuming you've worked with Make before. If you haven't, the makefile included in this repository is simple enough to learn from. Take time to read the comments if this is your first time with Make.
Linux | Mac | Windows |
---|---|---|
Yes | Yes | Yes, with WSL |
For Mac and Windows see the Appendix.
Processors can vary quite a bit in their clock rate, CPI and program performance. Many PC builders pay attention only to the clock rate of a CPU, but this is the tip of the iceberg. Consider two processors with the same clock rate. There is no guarantee they complete an instruction in the same number of cycles. This is due to:
- Instruction types taking a varying number of clock cycles;
- How manufacturers design the hardware (logic gates); and
- A varying number and type of instructions for a program.
Thus, clock rate alone is a flawed way to compare two processors. A benchmark program is often required. This defines the number and type of instructions.
The execution time of a program is (instr./program) x (cycles/instr.) x (s/cycle), where (s/cycle) is the inverse of the clock rate. The (cycles/instr.) and (s/cycle) vary from microprocessor to microprocessor. It is common practice to use a single program, thus fixing the (instr./program). This is often called benchmarking. The program used is called a benchmark program. There are industry and commercial standard benchmarks:
Non-standard benchmarks by individuals:
- Dhrystone
- Whetstone
Programs not intended as benchmarks but often used as one:
- Folding@home
- SETI@home
- John the Ripper
- Prime95 [a]
The purpose of this lab is to create your own benchmark programs. Benchmark programs are costly operations that are ripe for optimization. We will revisit and improve these throughout the course. This repository has a starting framework to test on different machines. Perhaps the results will surprise you.
If you're on odin.cs.csubak.edu skip this step. git, make and gcc should be installed. The following will indicate if your machine needs these things to be installed. Open a terminal, change the directory to your intended working directory and download this repository:
$ git clone https://github.com/DrAlbertCruz/CMPS-3240-Benchmarking.git
...
$ cd CMPS-3240-Benchmarking
Running make all compiles the benchmark into test_iaxpy.out, as well as its not-yet-linked object file test_iaxpy.o.
$ make
gcc -Wall -O0 -c test_iaxpy.c -o test_iaxpy.o
gcc -Wall -O0 -o test_iaxpy.out test_iaxpy.o
By default Make will execute the first target in the file. The -Wall flag enables a broad set of compiler warnings. The -O0 flag turns off compiler optimizations. We do not want the compiler to introduce unintended optimizations. [b] If you got to this point without any issues, you are clear to proceed to the next part of the lab.
The goal of this lab is to implement three benchmark programs:
void iaxpy( int length, int a, int *X, int *Y, int *Result );
float fdot( int length, float *X, float *Y );
void dgemm ( int length, double *X, double *Y, double *Result );
iaxpy() has been provided. Feel free to read up on these operations using whatever resources are at your disposal. These operations are costly array operations:
- iaxpy - an operation called A times X plus Y, abbreviated as AXPY. The prefix i indicates integers. It multiplies each x[i] by the scalar A and adds y[i]. It is a linear cost operation. This tests the integer multiplication and addition operations of a processor. [c]
- fdot - an operation called dot product. The prefix f indicates single-precision floating point values (float). It multiplies x[i] and y[i] element-wise and cumulatively sums the results. It is linear cost. fdot() tests the floating point multiplication speed of a processor.
- dgemm - an operation called General Matrix Multiplication (DGEMM). The d indicates double-precision floating point. It carries out a matrix multiplication. It tests floating point operations. It also tests the cache with many re-references of the same index. This is a polynomial cost operation: for n x n matrices it performs on the order of n^3 multiply-add operations over n^2 elements per matrix.
Study test_iaxpy.c until you have a general understanding of it before proceeding. The idea is to create test programs for each of these operations, and test_iaxpy.c is one example. Each benchmark should contain the following:
- The code for the operation
- In main(), allocate space for arrays on the heap with malloc()
- Call the operation/function from main()
- Free the memory with free()
The program will run the function on a very large array, large enough to test the performance of a processor. Take a look at test_iaxpy.c by opening the file with your favorite text editor:
$ vim test_iaxpy.c
The code declares and defines iaxpy():
void iaxpy( int length, int A, int *X, int *Y, int *Result ) {
for( int i = 0; i < length; i++ )
Result[i] = A * X[i] + Y[i];
}
It then allocates some test arrays dynamically on the heap:
const int N = 200000000;
printf( "Running IAXPY operation of size %d x 1", N );
int A = 13;
int *X = (int *) malloc( N * sizeof(int) );
int *Y = (int *) malloc( N * sizeof(int) );
int *Result = (int *) malloc( N * sizeof(int) );
iaxpy( N, A, X, Y, Result );
The size of the arrays is defined at compile-time as N. malloc() is used rather than defining an array the standard way via TYPE[N] because you cannot dynamically declare an array the latter way. malloc() returns a pointer to a block of memory of the size, in bytes, passed as its argument. However, when allocating memory this way you must always free the memory via free() when done:
free( X );
free( Y );
free( Result );
We also want to use malloc() because the system limits the size of an array that can be allocated on the stack in the traditional way via TYPE[N], and we will definitely be exceeding this limit. Before proceeding to the next section, study test_iaxpy.c. Do the following:
- Create a test program for fdot from test_iaxpy.c, and make appropriate targets for it in the makefile.
- Repeat for dgemm. Note that each matrix for dgemm holds n^2 elements, so you need to modify your allocation as follows: (double *) malloc( N * N * sizeof(double) ). The dgemm code is given as an example in the textbook.
Now to the benchmarking. You can use the time command to time the performance of the benchmark. For the input size N, we want some arbitrarily large value so that we can really see the difference in run times for varying instruction types. When running any experiment, you want to run it at least three times and take the average, so we use a bit of scripting to call a timing operation on ./test_iaxpy.out three times. Insert the following into the command line: [d]
$ for i in {1..3}; do time ./test_iaxpy.out; done;
This loop runs time ./test_iaxpy.out three times. Each run executes the iaxpy operation and reports how long it took. On my own Dell Latitude E5470 laptop I get the following:
$ for i in {1..3}; do time ./test_iaxpy.out; done;
Running IAXPY operation of size 200000000 x 1
real 0m1.397s
user 0m0.937s
sys 0m0.434s
Running IAXPY operation of size 200000000 x 1
real 0m1.398s
user 0m0.884s
sys 0m0.504s
Running IAXPY operation of size 200000000 x 1
real 0m1.365s
user 0m0.887s
sys 0m0.476s
Recall from the text that real (wall) time also includes time spent by the operating system, for example allocating memory and doing I/O. We want to focus on the user time. So, for iaxpy, my Dell Latitude E5470 averages ~0.9 seconds of user time. You should run this benchmark on odin.cs.csubak.edu for each of the three operations. This means you must make benchmark programs for fdot and dgemm because they are not provided with the repo. You should get faster results because I have a slower processor. You want to run the experiment several times because factors out of your control may skew your measurement. For example, many people may be running the same benchmark at that exact moment.
To determine what processor you are running via the command line execute:
$ cat /proc/cpuinfo | grep "model name"
model name : Intel(R) Core(TM) i5-6440HQ CPU @ 2.60GHz
You can also get the cache size with the following:
$ cat /proc/cpuinfo | grep "cache size"
cache size : 6144 KB
You will get something different on odin.cs.csubak.edu, sleipnir.cs.csubak.edu and the other machines you intend to benchmark. Carry out a benchmark of the three operations:
- iaxpy - For N = 200000000
- fdot - For N = 200000000
- dgemm - For N = 1024. Do not try to run this for N = 200000000; the operation is too large to run even on odin.cs.csubak.edu.
Run each benchmark on at least one more computer (other than odin). Some suggestions: the local machine you're using to ssh to odin.cs.csubak.edu (if Linux), sleipnir.cs.csubak.edu (if you have a login for that), your MacBook, etc.
On Mac, cat /proc/cpuinfo does not work. To get the CPU information from the command line, execute:
$ sysctl -n machdep.cpu.brand_string
Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
For check off, do the following:
- Show your version of the DGEMM test program to the instructor
- Aggregate your results into a table, and show your results to the instructor. It should look something like:
Machine | iaxpy | fdot | dgemm |
---|---|---|---|
Albert's Dell Latitude E5470 w/ Intel Core i5-6440HQ | 0.771 | 0.790 | 4.110 |
Albert's 2014 Macbook Pro w/ Intel Core i7-5557U | 0.925 | 0.836 | 12.776 |
odin.cs.csubak.edu w/ Intel Xeon E5-2630 v4 | | | |
Local machine | | | |
My linux laptop | | | |
Etc. etc. | | | |
Mac is a POSIX operating system and should be mostly compatible with the labs, which were created on Debian 6.3.0 with GCC version 6.3.0. However, Mac actually uses the clang compiler, which is different from gcc. Apple aliases gcc to clang, so even if you call gcc you are not using it. In the future, there will be labs that look at assembly code, and the resulting assembly mnemonics will be wildly different between gcc and clang. You should be OK using your Mac for this lab. If you plan to use your laptop, you should install the Xcode command line tools, which contain clang/gcc:
$ xcode-select --install
Verify with the following:
$ gcc --version
Configured with: --prefix=/Library/Developer/CommandLineTools/usr --with-gxx-include-dir=/Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk/usr/include/c++/4.2.1
Apple LLVM version 10.0.0 (clang-1000.10.44.4)
Target: x86_64-apple-darwin18.2.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
You will get something similar depending on what version of Mac OS you are using. If you're getting errors with gcc on Mac, sometimes doing this will help:
$ xcode-select --reset
This is not a complete manual for how to install a C compiler on your machine, and if things go off the rails you may want to just use a departmental computer/server.
It is possible for you to continue this lab on Windows if you install the Windows Subsystem for Linux (WSL). Please make sure you install Debian 10 for consistency with the lab manual. Once this is done refer to the Linux subsection for installation/checking of appropriate software. Keep in mind that WSL maintains a separate home directory from the local Windows user, so you may want to use symbolic links to save time when navigating to things you download/edit.
[a] Often used for thermal testing to see if you attached your heatsink properly.
[b] Actually, in some environments, if you enable optimization flags, the benchmark code will be optimized out of the program entirely because the for loop is doing work on arrays that were never initialized, and the result is never used. However, we want the CPU to do the work. We do not care about the result because we are just measuring the amount of time it takes a CPU to do the work.
[c] Integer arrays are important to test, even if unsophisticated, because they are 'normal' work, and designs have tended to favor floating point optimization at the expense of regular arithmetic.
[d] It is possible for you to do this within Make as well.