An optimized C library for math, parallel processing and data movement

PAL: The Parallel Architectures Library

The Parallel Architectures Library (PAL) is a compact C library with optimized routines for math, synchronization, and inter-processor communication.

Content

  1. Why?

  2. Design goals

  3. License

  4. Contribution Wanted!

  5. A Simple Example

  6. Build Instructions

  7. Library API reference
    7.0 Syntax
    7.1 Program Flow
    7.2 Data Movement
    7.3 Synchronization
    7.4 Basic Math
    7.5 Basic DSP
    7.6 Image Processing
    7.7 FFT (FFTW)
    7.8 Linear Algebra (BLAS)
    7.9 System Calls

  8. Status Report

  9. Benchmarking


##Why?

Any sane and informed person knows that the future of computing is massively parallel. Unfortunately the energy needed to escape the current "von Neumann potential well" seems to be approaching infinity. The legacy programming stack is so effective and so easy to use that developers and companies simply cannot afford to choose the better (parallel) solution. To make parallel computing ubiquitous our only choice is to rewrite the whole software stack from scratch, including: algorithms, run-times, libraries, and applications. The goal of the Parallel Architectures Library project is to establish the lowest layer of this brave new programming stack.

##Design Goals

  • Fast (Super fast but no "belt AND suspenders")
  • Compact (Small enough to work for memory limited processors with <32KB RAM)
  • Scalable (Thread and data scalable)
  • Portable (Portable across different ISAs and systems)
  • Permissive (Apache 2.0 license to maximize industry adoption)

##License

The PAL source code is licensed under the Apache License, Version 2.0. See LICENSE for full license text unless otherwise specified.

##Contribution

Our goal is to make PAL a broad community project from day one. If just 100 people contribute one function each, we'll be done in a couple of days! If you know C, you are ready to contribute!

Instructions for contributing can be found HERE.

##Build Instructions

Install Pre-requisites:

$ sudo apt-get install libtool build-essential pkg-config autoconf doxygen check

Build Sequence:

$ ./bootstrap
$ ./configure
$ make

##A Simple Example

The following sample shows how to use PAL to launch a simple task on a remote processor within the system. The program flow should be familiar to anyone who has used accelerator programming frameworks.

Manager Code

#include <pal.h>
#include <stdio.h>
#define N 16
int main(int argc, char *argv[])
{

    // Stack variables
    char *file = "./hello_task.elf";
    char *func = "main";
    int status, i, all, nargs = 1;
    char *args[nargs];
    char argbuf[20];

    // References as opaque structures
    p_dev_t dev0;
    p_prog_t prog0;
    p_team_t team0;
    p_mem_t mem[4];

    // Execution setup
    dev0 = p_init(P_DEV_DEMO, 0);        // initialize device and team
    prog0 = p_load(dev0, file, func, 0); // load a program from file system
    all = p_query(dev0, P_PROP_NODES);   // find number of nodes in system
    team0 = p_open(dev0, 0, all);        // create a team

    // Running program
    for (i = 0; i < all; i++) {
        snprintf(argbuf, sizeof argbuf, "%d", i); // string args needed to run main as is
        args[0] = argbuf;
        status = p_run(prog0, team0, i, 1, nargs, args, 0);
    }
    p_wait(team0);    // not needed
    p_close(team0);   // close team
    p_finalize(dev0); // clean up run time

    return 0;
}

Worker Code (hello_task.elf)

#include <stdio.h>
#include <stdlib.h> // atoi()
int main(int argc, char *argv[])
{
    int pid = 0;
    pid = atoi(argv[2]);
    printf("--Processor %d says hello!--\n", pid);
    return 0;
}

PAL LIBRARY API REFERENCE

##SYNTAX

##PROGRAM FLOW
These program flow functions are used to manage the system and to execute programs. All PAL objects are referenced via handles (opaque objects).

| FUNCTION     | NOTES                               |
|--------------|-------------------------------------|
| p_init()     | initialize the run time             |
| p_query()    | query a device object               |
| p_load()     | load binary elf file into memory    |
| p_run()      | run a program on a team of processors |
| p_open()     | open a team of processors           |
| p_append()   | add members to team                 |
| p_remove()   | remove members from team            |
| p_close()    | close a team of processors          |
| p_barrier()  | team barrier                        |
| p_wait()     | wait for team to finish             |
| p_fence()    | memory fence                        |
| p_finalize() | clean up the run time               |
| p_get_err()  | get error code (if any)             |

##MEMORY ALLOCATION
These functions are used for creating memory objects. The functions return a unique PAL handle for each new memory object. This handle can then be used by functions like p_read() and p_write() to access data within the memory object.

| FUNCTION    | NOTES                               | STATUS |
|-------------|-------------------------------------|--------|
| p_malloc()  | allocate memory on local processor  |        |
| p_rmalloc() | allocate memory on remote processor |        |
| p_free()    | free memory                         |        |
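
To make the handle idea concrete, here is a minimal sketch in plain C of a memory object that callers touch only through a handle, with bounds-checked read and write helpers. The names `demo_mem_t`, `demo_malloc`, `demo_write`, `demo_read`, and `demo_free` are hypothetical stand-ins for illustration; they are not PAL's types or signatures, and a real library would keep the struct definition private to its implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical memory object; callers only hold the handle (a pointer). */
typedef struct demo_mem {
    unsigned char *base; /* backing storage */
    size_t size;         /* capacity in bytes */
} *demo_mem_t;

/* Allocate a zeroed memory object of `size` bytes; returns NULL on failure. */
static demo_mem_t demo_malloc(size_t size)
{
    demo_mem_t m = malloc(sizeof *m);
    if (m == NULL)
        return NULL;
    m->base = calloc(size ? size : 1, 1);
    if (m->base == NULL) {
        free(m);
        return NULL;
    }
    m->size = size;
    return m;
}

/* Copy `n` bytes from `src` into the object at `offset`.
   Returns the number of bytes written, or -1 if the range is out of bounds. */
static long demo_write(demo_mem_t m, const void *src, size_t n, size_t offset)
{
    assert(src != NULL);
    if (m == NULL || offset > m->size || n > m->size - offset)
        return -1;
    memcpy(m->base + offset, src, n);
    return (long)n;
}

/* Copy `n` bytes from the object at `offset` into `dst`.
   Returns the number of bytes read, or -1 if the range is out of bounds. */
static long demo_read(demo_mem_t m, void *dst, size_t n, size_t offset)
{
    assert(dst != NULL);
    if (m == NULL || offset > m->size || n > m->size - offset)
        return -1;
    memcpy(dst, m->base + offset, n);
    return (long)n;
}

/* Release the object and its storage; a NULL handle is ignored. */
static void demo_free(demo_mem_t m)
{
    if (m != NULL) {
        free(m->base);
        free(m);
    }
}
```

A caller would allocate a handle once, pass it to the read and write helpers along with an offset, and release it with the matching free call. Because every access goes through the handle, the library stays free to place the storage wherever it likes.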

##DATA MOVEMENT
The data movement functions move blocks of data between opaque memory objects and locations specified by pointers. The memory object is specified by a PAL handle returned by a previous API call. The exception is the p_memcpy function which copies blocks of bytes within a shared memory architecture only.

| FUNCTION    | NOTES                     |
|-------------|---------------------------|
| p_gather()  | gather operation          |
| p_memcpy()  | fast memcpy()             |
| p_read()    | read from a memory object |
| p_scatter() | scatter operation         |
| p_write()   | write to a memory object  |
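
The README does not spell out the semantics of the gather and scatter operations. As a rough illustration only, the sketch below assumes the common MPI-style meaning: scatter splits one buffer into equal chunks, one per team member, and gather reassembles them. `demo_scatter` and `demo_gather` are hypothetical names operating on ordinary local buffers, not PAL calls.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Scatter (assumed semantics): member i receives elements
   [i*chunk, (i+1)*chunk) of src in members[i]. */
static void demo_scatter(const int *src, int *const members[],
                         int nmembers, size_t chunk)
{
    assert(src != NULL && members != NULL && nmembers >= 0);
    for (int i = 0; i < nmembers; i++)
        memcpy(members[i], src + (size_t)i * chunk, chunk * sizeof *src);
}

/* Gather (assumed semantics): the inverse of scatter; the chunk held by
   member i is copied back to position i*chunk of dst. */
static void demo_gather(int *dst, int *const members[],
                        int nmembers, size_t chunk)
{
    assert(dst != NULL && members != NULL && nmembers >= 0);
    for (int i = 0; i < nmembers; i++)
        memcpy(dst + (size_t)i * chunk, members[i], chunk * sizeof *dst);
}
```

In this sketch, scattering a six-element array across three members with a chunk size of two hands each member one pair of values, and gathering restores the original order.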

##SYNCHRONIZATION
The synchronization functions are useful for program sequencing and resource locking in shared memory systems.

| FUNCTION            | NOTES                       |
|---------------------|-----------------------------|
| p_mutex_lock()      | lock a mutex                |
| p_mutex_trylock()   | try locking a mutex once    |
| p_mutex_unlock()    | unlock (clear) a mutex      |
| p_mutex_init()      | initialize a mutex          |
| p_atomic_add()      | atomic fetch and add        |
| p_atomic_sub()      | atomic fetch and sub        |
| p_atomic_and()      | atomic fetch and 'and'      |
| p_atomic_xor()      | atomic fetch and 'xor'      |
| p_atomic_or()       | atomic fetch and 'or'       |
| p_atomic_swap()     | atomic exchange             |
| p_atomic_compswap() | atomic compare and exchange |
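
For readers unfamiliar with these primitives, the following sketch shows the same ideas using the standard C11 `<stdatomic.h>` header: a fetch-and-add, a compare-and-exchange, and a try-once spin lock. The `demo_` names are hypothetical and the code does not reflect how PAL implements or declares its functions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Fetch-and-add: adds n and returns the value held before the addition. */
static uint32_t demo_atomic_add(_Atomic uint32_t *atom, uint32_t n)
{
    return atomic_fetch_add(atom, n);
}

/* Compare-and-exchange: stores `desired` only if *atom equals `expected`.
   Returns the value observed in *atom before the operation. */
static uint32_t demo_atomic_compswap(_Atomic uint32_t *atom,
                                     uint32_t expected, uint32_t desired)
{
    /* On failure, `expected` is overwritten with the observed value;
       on success it already equals the old value. */
    atomic_compare_exchange_strong(atom, &expected, desired);
    return expected;
}

/* A minimal spin lock built on atomic_flag. */
typedef struct {
    atomic_flag held;
} demo_mutex_t;

static void demo_mutex_init(demo_mutex_t *m)
{
    assert(m != NULL);
    atomic_flag_clear(&m->held);
}

/* Try once: returns true if the lock was acquired. */
static bool demo_mutex_trylock(demo_mutex_t *m)
{
    return !atomic_flag_test_and_set(&m->held);
}

/* Spin until the lock is acquired. */
static void demo_mutex_lock(demo_mutex_t *m)
{
    while (!demo_mutex_trylock(m))
        ; /* busy-wait */
}

static void demo_mutex_unlock(demo_mutex_t *m)
{
    atomic_flag_clear(&m->held);
}
```

The "fetch and" wording in the table matches the C11 convention used here: each operation returns the value that was in memory before the update.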

##MATH
The math functions replace the traditional math lib functions and extend them to include support for data as well as task parallelism.

| FUNCTION     | NOTES                         |
|--------------|-------------------------------|
| p_abs()      | absolute value                |
| p_absdiff()  | absolute difference           |
| p_add()      | add                           |
| p_acos()     | arc cosine                    |
| p_acosh()    | arc hyperbolic cosine         |
| p_asin()     | arc sine                      |
| p_asinh()    | arc hyperbolic sine           |
| p_cbrt()     | cube root                     |
| p_cos()      | cosine                        |
| p_cosh()     | hyperbolic cosine             |
| p_div()      | division                      |
| p_dot()      | dot product                   |
| p_exp()      | exponential                   |
| p_ftoi()     | float to integer conversion   |
| p_itof()     | integer to float conversion   |
| p_inv()      | inverse                       |
| p_invcbrt()  | inverse cube root             |
| p_invsqrt()  | inverse square root           |
| p_ln()       | natural log                   |
| p_log10()    | denary (base-10) log          |
| p_max()      | finds max val                 |
| p_min()      | finds min val                 |
| p_mean()     | mean operation                |
| p_median()   | finds middle value            |
| p_mode()     | finds most common value       |
| p_mul()      | multiplication                |
| p_popcount() | count the number of bits set  |
| p_pow()      | element raised to a power     |
| p_rand()     | random number generator       |
| p_randinit() | init random number generator  |
| p_sort()     | heap sort                     |
| p_sin()      | sine                          |
| p_sinh()     | hyperbolic sine               |
| p_sqrt()     | square root                   |
| p_sub()      | subtract                      |
| p_sum()      | sum of all vector elements    |
| p_sumsq()    | sum of all squared elements   |
| p_tan()      | tangent                       |
| p_tanh()     | hyperbolic tangent            |
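
As a small illustration of what "support for data parallelism" can look like, the sketch below writes three of the listed operations as plain loops over vectors: element-wise add, dot product, and sum of squares. The function names, the `float` element type, and the argument order are assumptions made for this example, not PAL's signatures.

```c
#include <assert.h>
#include <stddef.h>

/* Element-wise add: c[i] = a[i] + b[i] for i in [0, n). */
static void demo_add(const float *a, const float *b, float *c, int n)
{
    assert(a != NULL && b != NULL && c != NULL && n >= 0);
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Dot product: sum of a[i] * b[i] over the vector. */
static float demo_dot(const float *a, const float *b, int n)
{
    assert(a != NULL && b != NULL && n >= 0);
    float acc = 0.0f;
    for (int i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}

/* Sum of squares: sum of a[i] * a[i] over the vector. */
static float demo_sumsq(const float *a, int n)
{
    assert(a != NULL && n >= 0);
    float acc = 0.0f;
    for (int i = 0; i < n; i++)
        acc += a[i] * a[i];
    return acc;
}
```

Each loop iteration in `demo_add` is independent of the others, which is what lets an optimized version vectorize it or split it across processors. The two reductions need a combining step when parallelized.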

##DSP
The digital signal processing (DSP) functions follow the same convention as the math function set.

| FUNCTION   | NOTES                                                                    |
|------------|--------------------------------------------------------------------------|
| p_acorr()  | autocorrelation: `r[j] = sum(x[j+k] * x[k]), k=0..(n-j-1)`               |
| p_conv()   | convolution: `r[j] = sum(h[k] * x[j-k]), k=0..(nh-1)`                    |
| p_xcorr()  | correlation: `r[j] = sum(x[j+k] * y[k]), k=0..(nx+ny-1)`                 |
| p_fir()    | FIR filter direct form: `r[j] = sum(h[k] * x[j-k]), k=0..(nh-1)`         |
| p_firdec() | FIR filter with decimation: `r[j] = sum(h[k] * x[j*D-k]), k=0..(nh-1)`   |
| p_firint() | FIR filter with interpolation: `r[j] = sum(h[k] * x[j*D-k]), k=0..(nh-1)` |
| p_firsym() | FIR symmetric form                                                       |
| p_iir()    | IIR filter                                                               |
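
The convolution and decimating FIR formulas above translate directly into nested loops. The sketch below is a straightforward, unoptimized reference under two assumptions that the table does not state: samples outside the input are treated as zero, and the output lengths are `nx + nh - 1` for convolution and `nx / D` for decimation. The names `demo_conv` and `demo_firdec` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Convolution: r[j] = sum(h[k] * x[j-k]), k = 0..(nh-1).
   r must hold nx + nh - 1 elements; x is zero outside [0, nx). */
static void demo_conv(const float *x, const float *h, float *r,
                      int nx, int nh)
{
    assert(x != NULL && h != NULL && r != NULL && nx > 0 && nh > 0);
    for (int j = 0; j < nx + nh - 1; j++) {
        float acc = 0.0f;
        for (int k = 0; k < nh; k++) {
            int i = j - k;
            if (i >= 0 && i < nx)
                acc += h[k] * x[i];
        }
        r[j] = acc;
    }
}

/* FIR with decimation by D: r[j] = sum(h[k] * x[j*D-k]), k = 0..(nh-1).
   r must hold nx / D elements; x is zero for negative indices. */
static void demo_firdec(const float *x, const float *h, float *r,
                        int nx, int nh, int D)
{
    assert(x != NULL && h != NULL && r != NULL && nx > 0 && nh > 0 && D > 0);
    for (int j = 0; j < nx / D; j++) {
        float acc = 0.0f;
        for (int k = 0; k < nh; k++) {
            int i = j * D - k;
            if (i >= 0)
                acc += h[k] * x[i];
        }
        r[j] = acc;
    }
}
```

With `x = {1, 2, 3}` and `h = {1, 1}`, `demo_conv` produces `{1, 3, 5, 3}`: each output is the sum of two neighbouring inputs, with zeros assumed past the edges.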

##IMAGE PROCESSING
The image processing functions follow the same convention as the math function set.

| FUNCTION       | NOTES                                |
|----------------|--------------------------------------|
| p_box3x3()     | box filter (3x3)                     |
| p_conv2d()     | 2d convolution                       |
| p_gauss3x3()   | Gaussian blur filter (3x3)           |
| p_median3x3()  | median filter (3x3)                  |
| p_laplace3x3() | Laplace filter (3x3)                 |
| p_prewitt3x3() | Prewitt filter (3x3)                 |
| p_sad8x8()     | sum of absolute differences (8x8)    |
| p_sad16x16()   | sum of absolute differences (16x16)  |
| p_sobel3x3()   | Sobel filter (3x3)                   |
| p_scharr3x3()  | Scharr filter (3x3)                  |
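
As an example of the 3x3 filters listed above, here is an unoptimized reference for a box filter, which replaces each pixel with the mean of its 3x3 neighbourhood. The sketch assumes a row-major `float` image and writes only the interior, so the output is `(rows - 2) x (cols - 2)`. The border handling, the name `demo_box3x3`, and the argument layout are all assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* 3x3 box filter over a row-major image x of size rows x cols.
   r receives (rows - 2) x (cols - 2) values, each the mean of the
   3x3 window whose top-left corner is at (i, j) in the input. */
static void demo_box3x3(const float *x, float *r, int rows, int cols)
{
    assert(x != NULL && r != NULL && rows >= 3 && cols >= 3);
    int out_cols = cols - 2;
    for (int i = 0; i < rows - 2; i++) {
        for (int j = 0; j < out_cols; j++) {
            float acc = 0.0f;
            for (int di = 0; di < 3; di++)
                for (int dj = 0; dj < 3; dj++)
                    acc += x[(i + di) * cols + (j + dj)];
            r[i * out_cols + j] = acc / 9.0f;
        }
    }
}
```

The other 3x3 filters in the table follow the same sliding-window pattern and differ only in the weights applied to the nine samples, or in the case of the median filter, in the selection rule.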

##FFT

  • An FFTW like interface

##BLAS

  • A port of the BLIS library?

##SYSTEM CALLS

  • Bionic libc implementation as starting point..

STATUS REPORT

LINK

BENCHMARKING

LINK

  • TBD

E=Epiphany
X=x86
A=ARM
CC=Clock cycles

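| FUNCTION | E-CC | E-SIZE | A-CC | A-SIZE | X-CC | X-SIZE |
|----------|------|--------|------|--------|------|--------|
| p_add()  | TBD  | TBD    | TBD  | TBD    | TBD  | TBD    |
| ...      | ...  | ...    | ...  | ...    | ...  | ...    |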