\documentclass[11pt]{article}
\usepackage{graphicx}
\usepackage{listings}
\usepackage{courier}
\usepackage[pdfborder=0 0 0]{hyperref}
\usepackage{url}
\usepackage[cm]{fullpage}
\newcommand{\HRule}{\rule{\linewidth}{0.5mm}}
\setlength{\parindent}{0cm}
\setlength{\parskip}{0.3cm}
\begin{document}
\title{DRAMSim2}
\author{Elliott Cooper-Balis \\
			Paul Rosenfeld \\
			Bruce Jacob \\
			University of Maryland \\
			\texttt{\footnotesize dramninjas \textit{[at]} gmail \textit{[dot]} com}
}
\date{}
\maketitle
\HRule
\tableofcontents
\HRule
\lstset{basicstyle={\scriptsize\ttfamily},tabsize=2,frame=single}
\section{Why Do We Need to Simulate DRAM Systems This Accurately?}
Modern computer system performance is increasingly limited
by the performance of DRAM-based memory systems. As a result, there is great
interest in providing accurate simulations of DRAM based memory systems as part
of architectural research. Unfortunately, there is great difficulty associated
with the study of modern DRAM memory systems, arising from
the fact that DRAM-system performance depends on many
independent variables such as workload characteristics of memory access rate
and request sequence, memory-system architecture, memory-system configuration, DRAM
access protocol, and DRAM device timing parameters. As a result, system
architects and design engineers often disagree on the usefulness of a given
performance-enhancing feature, since the performance impact of that feature typically 
depends on the characteristics of specific workloads, memory-system architecture, 
memory-system configuration, DRAM access protocol and DRAM device timing parameters. 
\subsection{DRAM Scheduling Complexity is Growing}

\begin{figure}[h]
\begin{center}
\includegraphics[width=\linewidth]{docs/why1.gif}
\caption{Timing diagram showing complexity of DRAM scheduling}
\label{timingcomplex}
\end{center}
\end{figure}
Figure \ref{timingcomplex} shows the pipelined scheduling of a DDR2 SDRAM device. Despite
the fact that the simulated memory system uses a closed-page policy and 
rotates through available banks on the DRAM device, which should simplify
scheduling considerably, the scheduling of this system is actually more
complex than earlier DRAM systems: for instance, new timing parameters
such as t_{RRD} and t_{FAW} are contributing to the growing
set of timing constraints placed on each successive generation of DRAM 
devices.

\subsection{DRAM performance characteristics changes every generation}
DRAM based memory systems are impacted primarily by two attributes: 
row cycle time and device datarate. Presently, DRAM row cycle times are 
decreasing at a rate of approximately 7\% per year, and DRAM device
datarates are increasing with each new generation of DRAM devices at the rate
of 100\% every three years, 
\begin{figure}[h]
\begin{center}
\includegraphics[width=\linewidth]{docs/why2.gif}
\caption{DRAM row cycle time trends}
\label{classes}
\end{center}
\end{figure}

\begin{figure}[h]
\begin{center}
\includegraphics[width=\linewidth]{docs/why3.gif}
\caption{DRAM device data rate trends}
\label{classes}
\end{center}
\end{figure}
The difference in the scaling trends of the DRAM device means that 
fundamental DRAM device performance characteristics are changing 
every single generation, and the changing performance characteristics
cannot be accurately predicted by linear extrapolations. The result is 
that no computer architect can rest easy knowing that he or she has 
obtained X\% of performance improvement with a set of microarchitectural
techniques on the current generation memory system, because the same set 
of microarchitectural techniques may not to be as effective in a 
future memory system due to the differences in the scaling attributes 
of DRAM devices, 

\subsection{The Sales Pitch}
Our DRAM-system simulation work enables system architects not
only to explore the impact of a set of microarchitectual techniques on a 
given memory system but also to examine the effectiveness of those 
microarchitectural techniques on a future generation memory system with
future generations of DRAM devices.

\section{About DRAMSim2}
	DRAMSim2 is a cycle accurate model of a DRAM memory controller, the DRAM
	modules which comprise system storage, and the buses by which they
	communicate. 

	The overarching goal is to have a simulator that is small,
	portable, and accurate. The simulator core has a simple interface 
	which allows it to be CPU simulator agnostic and should to work with any simulator (see section \ref{library}).  This core has no external run
	time or build time dependencies and has been tested with g++ on Linux
	as well as g++ on Cygwin on Windows.  

\section{Getting DRAMSim2}

DRAMSim2 is available on \href{http://github.com/dramninjasUMD}{github}. If you have git installed you can clone our repository by typing:\\

\texttt{\$ git clone git://github.com/dramninjasUMD/DRAMSim2.git }



\section{Building DRAMSim2}
	To build an optimized standalone trace-based simulator called \texttt{DRAMSim} simply type:

	\texttt{\$ make}

	For a debug build which contains debugging symbols and verbose output, run:

	\texttt{\$ make DEBUG=1}

	To build the DRAMSim2 library, type: 

	\texttt{\$ make libdramsim.so }

\section{Running DRAMSim2}
\begin{minipage}{\textwidth}
\subsection{Trace-Based Simulation}

In standalone mode, DRAMSim2 can simulate memory system traces. While traces are not as accurate
as a real CPU model driving the memory model, they are convenient since they can be generated in a number of different
ways (instrumentation, hardware traces, CPU simulation, etc.) and reused. 

We've provided a few small sample traces in the \texttt{traces/} directory. These gzipped
traces should first be pre-processed before running through the simulator. 
To run the preprocessor (the preprocessor requires python): 
\begin{lstlisting}
cd traces/
./traceParse.py k6_aoe_02_short.trc.gz
\end{lstlisting}
	This should produce the file \texttt{traces/k6\_aoe\_02\_short.trc}. Then, go back to the DRAMSim2 directory and run the trace based simulator:

\begin{lstlisting}
cd .
./DRAMSim -t traces/k6_aoe_02_short.trc -s system.ini -d ini/DDR3_micron_64M_8B_x4_sg15.ini -c 1000
\end{lstlisting}
	This will run a 1000 cycle simulation of the \texttt{k6\_aoe\_02\_short} trace using 
	the specified DDR3 part. The -s, -d, and -t flags are required to run a simulation.

	A full list of the command line arguments can be obtained by typing:
\begin{lstlisting}
$ ./DRAMSim --help
DRAMSim2 Usage: 
DRAMSim -t tracefile -s system.ini -d ini/device.ini [-c #] [-p pwd] -q
	-t, --tracefile=FILENAME 	specify a tracefile to run  
	-s, --systemini=FILENAME 	specify an ini file that describes the memory system parameters  
	-d, --deviceini=FILENAME 	specify an ini file that describes the device-level parameters
	-c, --numcycles=# 		specify number of cycles to run the simulation for [default=30] 
	-q, --quiet 			flag to suppress simulation output (except final stats) [default=no]
	-o, --option=OPTION_A=234			overwrite any ini file option from the command line
	-p, --pwd=DIRECTORY		Set the working directory

\end{lstlisting}

	Some traces include timing information, which can be used
	by the simulator or ignored. The benefit of ignoring timing information is that requests
	will stream as fast as possible into the memory system and can serve as a good stress
	test. To toggle the use of clock cycles, please change the \texttt{useClockCycle} flag in \texttt{TraceBasedSim.cpp}.

	If you have a custom trace format you'd like to use, you can modify the \texttt{parseTraceFileLine()} function ton add
	support for your own trace formats. 

	The prefix of the filename determines which type of trace this function will use (ex: k6\_foo.trc) will use the k6 format 
	in \texttt{parseTraceFileLine()}.
\end{minipage}

\subsection{Library Interface}\label{library}
In addition to simulating memory traces, DRAMSim2 can also be built as a dynamic
shared library which is convenient for connecting it to CPU simulators or other
custom front ends.  A \texttt{MemorySystem} object encapsulates the
functionality of the memory system (i.e., the memory controller and DIMMs). The 
classes that comprise DRAMSim2 can be seen in figure \ref{classes}. A
simple example application is provided in the \texttt{example\_app/} directory.
At this time we have plans to provide code to integrate DRAMSim2 into 
\href{http://www.marss86.org/index.php/Home}{MARSSx86},
\href{http://www.cs.sandia.gov/sst/}{SST}, and (eventually)
\href{http://www.m5sim.org/}{M5}.

\begin{figure}[h]
\begin{center}
\includegraphics[width=\linewidth]{docs/classes.png}
\caption{Block diagram of DRAMSim2. The \texttt{\footnotesize recv()} functions are actually called
\texttt{\footnotesize receiveFromBus()} but were abbreviated to save sapce.}
\label{classes}
\end{center}
\end{figure}

\section{Example Output}

\noindent\begin{minipage}{\textwidth}
The verbosity of the DRAMSim2 can be customized in the system.ini file by turning the
various debug flags on or off. 

Below, we have provided a detailed explanation of the simulator output.  With
all DEBUG flags enabled, the following output is displayed for each cycle
executed.  

   \textbf{NOTE} : BP = Bus Packet, T  = Transaction \\ 
					MC = MemoryController, R\# = Rank (index \#)
  
\begin{lstlisting}
 ----------------- Memory System Update ------------------
 ---------- Memory Controller Update Starting ------------ [8]
 -- R0 Receiving On Bus    : BP [ACT] pa[0x5dec7f0] r[0] b[3] row[1502] col[799]
 -- MC Issuing On Data Bus    : BP [DATA] pa[0x7edc7e0] r[0] b[2] row[2029] col[799] data[0]=
 ++ Adding Read energy to total energy
 -- MC Issuing On Command Bus : BP [READ_P] pa[0x5dec7f8] r[1] b[3] row[1502] col[799]
== New Transaction - Mapping Address [0x5dec800] (read)
  Rank : 0
  Bank : 0
  Row  : 1502
  Col  : 800
 ++ Adding IDD3N to total energy [from rank 0]
 ++ Adding IDD3N to total energy [from rank 1]
== Printing transaction queue
  8]T [Read] [0x45bbfa4]
  9]T [Write] [0x55fbfa0] [5439E]
  10]T [Write] [0x55fbfa8] [1111]
== Printing bank states (According to MC)
[idle] [idle] [2029] [1502] 
[idle] [idle] [1502] [1502] 

== Printing Per Rank, Per Bank Queue
 = Rank 0
    Bank 0   size : 2
       0]BP [ACT] pa[0x5dec800] r[0] b[0] row[1502] col[800]
       1]BP [READ_P] pa[0x5dec800] r[0] b[0] row[1502] col[800]
    Bank 1   size : 2
       0]BP [ACT] pa[0x5dec810] r[0] b[1] row[1502] col[800]
       1]BP [READ_P] pa[0x5dec810] r[0] b[1] row[1502] col[800]
    Bank 2   size : 2
       0]BP [ACT] pa[0x5dec7e0] r[0] b[2] row[1502] col[799]
       1]BP [READ_P] pa[0x5dec7e0] r[0] b[2] row[1502] col[799]
    Bank 3   size : 1
       0]BP [READ_P] pa[0x5dec7f0] r[0] b[3] row[1502] col[799]
 = Rank 1
    Bank 0   size : 2
       0]BP [ACT] pa[0x5dec808] r[1] b[0] row[1502] col[800]
       1]BP [READ_P] pa[0x5dec808] r[1] b[0] row[1502] col[800]
    Bank 1   size : 2
       0]BP [ACT] pa[0x5dec818] r[1] b[1] row[1502] col[800]
       1]BP [READ_P] pa[0x5dec818] r[1] b[1] row[1502] col[800]
    Bank 2   size : 1
       0]BP [READ_P] pa[0x5dec7e8] r[1] b[2] row[1502] col[799]
    Bank 3   size : 0
\end{lstlisting}
\end{minipage}

\begin{minipage}{\textwidth}
Anything sent on the bus is encapsulated in an BusPacket (BP) object. When
printing, they display the following information:
\begin{lstlisting}
BP [ACT] pa[0x5dec818] r[1] b[1] row[1502] col[800]
\end{lstlisting}
The information displayed  is (in order): command type, physical address, rank
\#, bank \#, row \#, and column \#.
\end{minipage}


\begin{minipage}{\textwidth}
Lines beginning with " -- " indicate bus traffic, ie, 
\begin{lstlisting}
     -- R0 Receiving On Bus       : BP [ACT] pa[0x5dec7f0] r[0] b[3] row[1502] col[799]
     -- MC Issuing On Data Bus    : BP [DATA] pa[0x7edc7e0] r[0] b[2] row[2029] col[799] data[0]=
     -- MC Issuing On Command Bus : BP [READ_P] pa[0x5dec7f8] r[1] b[3] row[1502] col[799]
\end{lstlisting}
Sender and receiver are indicated and the packet being sent is detailed.
\end{minipage}

Lines beginning with " ++ " indicate power calculations, ie, 
\begin{lstlisting}
		 ++ Adding Read energy to total energy
 		 ++ Adding IDD3N to total energy [from rank 0]
 		 ++ Adding IDD3N to total energy [from rank 1]
\end{lstlisting}
The state of the system and the actions taken determine which current draw is used. For further detail about each current value, see Micron datasheet.  


If a pending transaction is in the transaction queue, it will be printed, as seen below:
\begin{lstlisting}
	 == Printing transaction queue		
		1]T [Read] [0x45bbfa4]
		2]T [Write] [0x55fbfa0] [5439E]
		3]T [Write] [0x55fbfa8] [1111]
\end{lstlisting}
Currently, at the start of every cycle, the head of the transaction
queue is removed, broken up into DRAM commands and placed in the
appropriate command queues.  To do this, an address mapping scheme
is applied to the transaction's physical address, the output of 
which is seen below:
\begin{lstlisting}
	== New Transaction - Mapping Address [0x5dec800] (read)
	 Rank : 0
		 Bank : 0
		 Row  : 1502
		 Col  : 800
\end{lstlisting}

If there are pending commands in the command queue, they will be
printed.  The output is dependent on the designated structure for
the command queue.  For example, per-rank/per-bank queues are 
shown below:
\begin{lstlisting}
   = Rank 1
    Bank 0   size : 2
       0]BP [ACT] pa[0x5dec808] r[1] b[0] row[1502] col[800]
       1]BP [READ_P] pa[0x5dec808] r[1] b[0] row[1502] col[800]
    Bank 1   size : 2
       0]BP [ACT] pa[0x5dec818] r[1] b[1] row[1502] col[800]
       1]BP [READ_P] pa[0x5dec818] r[1] b[1] row[1502] col[800]
    Bank 2   size : 1
       0]BP [READ_P] pa[0x5dec7e8] r[1] b[2] row[1502] col[799]
    Bank 3   size : 0
\end{lstlisting}

The state of each bank in the system is also displayed:
\begin{lstlisting}
    == Printing bank states (According to MC)
    [idle] [idle] [2029] [1502] 
    [idle] [idle] [1502] [1502] 
\end{lstlisting}
Banks can be in many states, including idle, row active (shown
with the row that is active), refreshing, or precharging.  These
states will update based on the commands being sent by the 
controller.  

\section{Results Output}

In addition to printing memory statistics and debug information to standard out, DRAMSim2 also produces
a 'vis' file in the \texttt{results/} directory. A vis file is essentially a summary of relevant statistics that is generated
per epoch (the number of cycles per epoch can be set by changing the \texttt{EPOCH\_COUNT} parameter in the \texttt{system.ini} file). 

We are currently working on DRAMVis, which is a cross-platform viewer which parses the vis file and generates graphs that can be used
to analyze and compare results.

\end{document}