/corosync

A copy of the Corosync git repo, hosted at git.corosync.org

Copyright (c) 2002-2004 MontaVista Software, Inc.
Copyright (c) 2006, 2009 Red Hat, Inc.

All rights reserved.

This software licensed under BSD license, the text of which follows:

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

- Redistributions of source code must retain the above copyright notice,
  this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright notice,
  this list of conditions and the following disclaimer in the documentation
  and/or other materials provided with the distribution.
- Neither the name of the MontaVista Software, Inc. nor the names of its
  contributors may be used to endorse or promote products derived from this
  software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF
THE POSSIBILITY OF SUCH DAMAGE.

-------------------------------------------------------------------------------
This file provides a map for developers to understand how to contribute
to the corosync project.  The purpose of this document is to prepare a
developer to write a service for corosync, or understand the architecture
of corosync.

The following is described in this document:

 * all files, purpose, and dependencies
 * architecture of corosync
 * taking advantage of virtual synchrony
 * adding libraries
 * adding services

-------------------------------------------------------------------------------
 all files, purpose, and dependencies.
-------------------------------------------------------------------------------

*----------------*
*- AIS INCLUDES -*
*----------------*

include/saAmf.h
-----------------
	Definitions for AMF interface.

include/saCkpt.h
------------------
	Definitions for CKPT interface.

include/saClm.h
-----------------
	Definitions for CLM interface.

include/saEvt.h
-----------------
	Definitions for the EVT interface.

include/saLck.h
-----------------
	Definitions for the LCK interface.

include/cfg.h
	Definitions for the CFG interface.

include/cpg.h
	Definitions for the CPG interface.

include/evs.h
	Definitions for the EVS interface.

include/ipc_amf.h
	IPC interface between client and server for AMF service.

include/ipc_cfg.h
	IPC interface between client and server for CFG service.

include/ipc_ckpt.h
	IPC interface between client and server for CKPT service.

include/ipc_clm.h
	IPC interface between client and server for CLM service.

include/ipc_cpg.h
	IPC interface between client and server for CPG service.

include/ipc_evs.h
	IPC interface between client and server for EVS service.

include/ipc_evt.h
	IPC interface between client and server for EVT service.

include/ipc_gen.h
	IPC interface for generic operations.

include/ipc_lck.h
	IPC interface between client and server for LCK service.

include/ipc_msg.h
	IPC interface between client and server for MSG service.

include/hdb.h
	Handle database implementation.

include/list.h
	Linked list implementation.

include/swab.h
	Byte swapping implementation.

include/queue.h
	FIFO queue implementation.

include/sq.h
	Sort queue where items are ordered by sequence number.  Avoids sorting,
	hence installing a new element is O(1).  Inline implementation.

	depends on list.

*---------------*
* AIS LIBRARIES *
*---------------*
lib/amf.c
---------
	AMF user library linked into user application.

lib/cfg.c
---------
	CFG user library linked into user application.

lib/ckpt.c
---------
	CKPT user library linked into user application.

lib/clm.c
---------
	CLM user library linked into user application.

lib/cpg.c
---------
	CPG user library linked into user application.

lib/evs.c
---------
	EVS user library linked into user application.

lib/evt.c
---------
	EVT user library linked into user application.

lib/lck.c
---------
	LCK user library linked into user application.

lib/msg.c
---------
	MSG user library linked into user application.

lib/util.c
----------
	Utility functions used by all libraries.

*-----------------*
*- AIS EXECUTIVE -*
*-----------------*

exec/aisparser.{h|c}
	Parser plugin for default configuration file format.

exec/aispoll.{h|c}
	Poll abstraction interface.

exec/amfapp.c
	AMF application handling.

exec/amfcluster.c
	AMF cluster handling.

exec/amfcomp.c
	AMF component level handling.

exec/amf.h
	Defines all AMF symbol names.

exec/amfnode.c
	AMF node level handling.

exec/amfsg.c
	AMF service group handling.

exec/amfsi.c
	AMF Service instance handling.

exec/amfsu.c
	AMF service unit handling.

exec/amfutil.c
	AMF utility functions.

exec/cfg.c
	Server side implementation of CFG service which is used to display
	redundant ring status and reenable redundant rings.

exec/ckpt.c
	Server side implementation of Checkpointing (CKPT API).

exec/clm.c
	Server side implementation of Cluster Membership (CLM API).

exec/cpg.c
	Server side implementation of closed process groups (CPG API).

exec/crypto.{c|h}
	Cryptography functions used by corosync.

exec/evs.c
	Server side implementation of extended virtual synchrony passthrough
	(EVS API).

exec/evt.c
	Server side implementation of Event Service (EVT API).

exec/ipc.{c|h}
	All IPC operations used by corosync.

exec/jhash.h
	A hash routine.

exec/keygen.c
	Secret key generator used by corosync encryption tools.

exec/lck.c
	Server side implementation of the distributed lock service (LCK API).

exec/main.{c|h}
	Main function which connects all components together.

exec/mainconfig.{c|h}
	Reads main configuration that is set in the configuration parser.

exec/mempool.{c|h}
	Currently unused.

exec/msg.c
	Server side implementation of message service (MSG API).

exec/objdb.{c|h}
	Object database used to configure services.

exec/corosync-instantiate.c
	instantiates a component by forking and exec'ing it and writing its
	pid to a pid file.

exec/print.{c|h}
	Non-blocking thread-based logging service with overflow protection.

exec/service.{c|h}
	Service handling routines including the default service handler
	description.

exec/sync.{c|h}
	The synchronization service implementation.

exec/timer.{c|h}
	Thread-based timer service.

exec/tlist.h
	Timer list used to expire timers.

exec/totemconfig.{c|h}
	Configures totem from the data that aisparser parsed out of the
	configuration file.

exec/totem.h
	General definitions for the totem protocol used by the totem stack.

exec/totemip.{c|h}
	IP handling functions for totem - lowest on stack.

exec/totemmrp.{c|h}
	The totem multi ring protocol, currently unimplemented.  Between
	totemsrp and totempg.

exec/totemnet.{c|h}
	Network handling functions for totem - between totemip and totemrrp.

exec/totempg.{c|h}
	Process groups interface which is used by all applications - highest on
	stack.

exec/totemrrp.{c|h}
	Redundant ring functions for totem - between totemnet and totemsrp.

exec/util.{c|h}
	Utility functions used by corosync executive.

exec/version.h
	Defines build version.

exec/vsf.h
	Virtual Synchrony plugin API.

exec/vsf_ykd.c
	Virtual Synchrony YKD Dynamic Linear Voting algorithm.

exec/wthread.{c|h}
	Worker threads API.

loc
---
Counts the lines of code in the AIS implementation.

-------------------------------------------------------------------------------
 architecture of corosync
-------------------------------------------------------------------------------

The corosync standards based cluster framework is a generic cluster plugin
architecture used to create cluster APIs and services.  Usually there are
libraries which implement APIs and are linked into the end user application.
The libraries request services from the aisexec process, called the AIS
executive.  The AIS executive uses the Totem protocol stack to communicate
within the cluster and execute operations on behalf of the user.  Finally the
response of the API is delivered once the operation has completed.


           --------------------------------------------------
           |   AMF and more services libraries              |
           --------------------------------------------------
           |                      IPC API                   |
           --------------------------------------------------
           |                 corosync Executive             |
           |                                                |
           |     +---------+ +--------+ +---------+         |
           |     | Object  | |  AIS   | | Service |         |
           |     |Database | | Config | | Handler |         |
           |     | Service | | Parser | | Manager |         |
           |     +---------+ +--------+ +---------+         |
           |     +-------+ +-------+                        |
           |     |  AMF  | | more  |                        |
           |     |Service| |svcs...|                        |
           |     +-------+ +-------+                        |
           |                 +---------+                    |
           |                 |  Sync   |                    |
           |                 | Service |                    |
           |                 +---------+                    |
           |                 +---------+                    |
           |                 |   VSF   |                    |
           |                 | Service |                    |
           |                 +---------+                    |
           | +--------------------------------+ +--------+  |
           | |                 Totem          | | Timers |  |
           | |                 Stack          | |  API   |  |
           | +--------------------------------+ +--------+  |
           |                +-----------+                   |
           |                |   Poll    |                   |
           |                | Interface |                   |
           |                +-----------+                   |
           |                                                |
           --------------------------------------------------

                    Figure 1: corosync Architecture

Every application that intends to use corosync links with the libais library.
This library uses IPC, or more specifically BSD unix sockets, to communicate
with the executive.  The library is a small program responsible only for
packaging the request into a message.  This message is sent, using IPC, to
the executive which then processes it.  The library then waits for a response.

The library itself contains very little intelligence.  Some utility services
are provided:

 * create a connection to the executive
 * send messages to the executive
 * retrieve messages from the executive
 * poll on a fd
 * create a handle instance
 * destroy a handle instance
 * get a reference to a handle instance
 * release a reference to a handle instance

When a library connects, it sends the service type in a message.  The
service type is stored and used later to reference the message handlers
for both the library message handlers and executive message handlers.
Every message sent contains an integer identifier, which is used to index
into an array of message handlers to determine the correct message handler
to execute for the library.  Hence a message is uniquely identified by the
message handler ID number and the service handler ID number.
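
The lookup described above can be pictured with a minimal sketch.  It is
not the executive's dispatch code; every name in it (sketch_service,
sketch_dispatch, and so on) is hypothetical, and the handler signature is
an assumption chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical handler signature: a connection and the received message. */
typedef int (*sketch_handler_fn) (void *conn, void *msg);

struct sketch_service {
	const sketch_handler_fn *handlers;	/* indexed by message id */
	unsigned int handler_count;
};

static int sketch_last_called = -1;	/* records which handler ran */

static int sketch_trackstart (void *conn, void *msg)
{
	(void)conn; (void)msg;
	sketch_last_called = 0;
	return (0);
}

static int sketch_trackstop (void *conn, void *msg)
{
	(void)conn; (void)msg;
	sketch_last_called = 1;
	return (0);
}

static const sketch_handler_fn sketch_clm_handlers[] = {
	sketch_trackstart,	/* message id 0 */
	sketch_trackstop	/* message id 1 */
};

/* Indexed by service type; slot 0 is left empty in this sketch. */
static const struct sketch_service sketch_services[] = {
	{ NULL, 0 },
	{ sketch_clm_handlers, 2 }
};

/* Select the handler from the (service id, message id) pair. */
static int sketch_dispatch (unsigned int service_id, unsigned int msg_id,
	void *conn, void *msg)
{
	const struct sketch_service *service;

	if (service_id >= sizeof (sketch_services) / sizeof (sketch_services[0])) {
		return (-1);
	}
	service = &sketch_services[service_id];
	if (msg_id >= service->handler_count) {
		return (-1);
	}
	assert (service->handlers != NULL);
	return (service->handlers[msg_id] (conn, msg));
}
```

Both identifiers are range checked before the table is indexed, so an
unknown service or message id is rejected instead of calling through a
bad pointer.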

When a library sends a message via IPC, the delivery of the message occurs
to the proper library message handler.  The library message handler is
responsible for sending the message via the totem process groups API to all
nodes in the system.

This simplifies the library handler significantly.  The main purpose of the
library handler should be to package the library request into a message that
can be sent to all nodes.

The totem process groups API sends the message according to the extended
virtual synchrony model.  The group messaging interface also delivers the
message according to the extended virtual synchrony model.  This has several
advantages which are described in the virtual synchrony section.  One
advantage that must be described now is that messages are self-delivered;
if a node sends a message, that same message is delivered back to that
node.

When the executive message is delivered, it is processed by the executive
message handler.  The executive message handler contains the brains of
AIS and is responsible for making all decisions relating to the request
from the libais library user.

-------------------------------------------------------------------------------
 taking advantage of virtual synchrony
-------------------------------------------------------------------------------

definitions:
processor: a system responsible for executing the virtual synchrony model
configuration: the list of processors under which messages are delivered
partition: one or more processors leave the configuration
merge: one or more processors join the configuration
group messaging: sending a message from one sender to many receivers

Virtual synchrony is a model for group messaging.  This is often confused
with particular implementations of virtual synchrony.  Try to focus on
what virtual synchrony provides, not how it provides it, unless interested
in working on the group messaging interface of corosync.

Virtual synchrony provides several advantages:

 * integrated membership
 * strong membership guarantees
 * agreed ordering of delivered messages
 * same delivery of configuration changes and messages on every node
 * self-delivery
 * reliable communication in the face of unreliable networks
 * recovery of messages sent within a configuration where possible
 * use of network multicast using standard UDP/IP

Integrated membership allows the group messaging interface to give
configuration change events to the API services.  This is obviously beneficial
to the cluster membership service (and its respective API), but is helpful
to other services as described later.

Strong membership guarantees allow a distributed application to make decisions
based upon the configuration (membership).  Every service in corosync registers
a configuration change function.  This function is called whenever a
configuration change occurs.  The information passed is the current processors,
the processors that have left the configuration, and the processors that have
joined the configuration.  This information is then used to make decisions
within a distributed state machine.  One example usage is that an AMF component
running on a specific processor has left the configuration, so failover actions
must now be taken with the new configuration (and known components).
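
As a rough sketch of such a configuration change function, the code below
marks every component hosted on a departed processor for failover.  The
parameter names follow the member, left, and joined lists described above;
the component table and all sketch_ names are hypothetical and stand in
for whatever state a real service keeps.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_COMPONENT_COUNT 4

struct sketch_component {
	unsigned int nodeid;	/* processor the component runs on */
	int needs_failover;
};

/* Hypothetical component table: two components run on processor 2. */
static struct sketch_component sketch_components[SKETCH_COMPONENT_COUNT] = {
	{ 1, 0 }, { 2, 0 }, { 2, 0 }, { 3, 0 }
};

static void sketch_confchg_fn (
	const unsigned int *member_list, size_t member_list_entries,
	const unsigned int *left_list, size_t left_list_entries,
	const unsigned int *joined_list, size_t joined_list_entries)
{
	size_t i;
	size_t j;

	/* Only the left list matters for this decision. */
	(void)member_list; (void)member_list_entries;
	(void)joined_list; (void)joined_list_entries;
	assert (left_list != NULL || left_list_entries == 0);

	for (i = 0; i < left_list_entries; i++) {
		for (j = 0; j < SKETCH_COMPONENT_COUNT; j++) {
			if (sketch_components[j].nodeid == left_list[i]) {
				sketch_components[j].needs_failover = 1;
			}
		}
	}
}
```

Because every processor is handed the same lists, every processor marks
the same components.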

Virtual synchrony requires that messages be delivered in agreed order.
FIFO order indicates that one sender and one receiver agree on the order of
messages sent.  Agreed ordering takes this requirement to groups, requiring that
one sender and all receivers agree on the order of messages sent.

Consider a lock service.  The service is responsible for arbitrating locks
between multiple processors in the system.  With FIFO ordering, this is very
difficult because a request at about the same time for a lock from two separate
processors may arrive at all the receivers in different order.  Agreed ordering
ensures that all the processors are delivered the message in the same order.
In this case the first lock message will always be from processor X, while the
second lock message will always be from processor Y.   Hence the first request
is always honored by all processors, and the second request is rejected (since
the lock is taken).  This is how race conditions are avoided in distributed
systems.
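
A minimal sketch of the lock example follows.  It assumes each processor
keeps its own copy of the lock state and runs the same function for every
delivered request, in delivery order.  The names sketch_lock and
sketch_lock_request_deliver are hypothetical and are not taken from the
LCK service.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NO_OWNER 0U	/* assumes 0 is never a real node id */

struct sketch_lock {
	unsigned int owner;	/* node id holding the lock, or no owner */
};

/*
 * Run on every processor for each delivered lock request.
 * Returns 1 if the request is granted, 0 if it is rejected.
 */
static int sketch_lock_request_deliver (struct sketch_lock *lock,
	unsigned int nodeid)
{
	assert (lock != NULL);
	if (lock->owner == SKETCH_NO_OWNER) {
		lock->owner = nodeid;
		return (1);
	}
	return (0);
}
```

If two processors are delivered the requests from X and then Y in that
same order, both grant the lock to X and reject Y without exchanging any
further messages.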

Every processor is delivered a configuration change and messages within a
configuration in the same order.  This ensures that any distributed state
machine will make the same decisions on every processor within the
configuration.  This also allows the configuration and the messages to be
considered when making decisions.

Virtual synchrony requires that every node is delivered messages that it
sends.  This enables the logic to be placed in one location (the handler
for the delivery of the group message) instead of two separate places.  This
also allows messages that are sent to be ordered in the stream of other
messages within the configuration.

Certain guarantees are required by virtual synchrony.  If a message is sent,
it must be delivered by every processor unless that processor fails.  If a
particular processor fails, a configuration change occurs creating a new
configuration under which a new set of decisions may be made.  This implies
that even unreliable networks must reliably deliver messages.  The
implementation in corosync works on unreliable as well as reliable networks.

Every message sent must be delivered, unless a configuration change occurs.
In the case of a configuration change, every message that can be recovered
must be recovered before the new configuration is installed.  Some systems
during partition won't continue to recover messages within the old
configuration even though those messages can be recovered.  Virtual synchrony
makes that impossible, except for those members that are no longer part
of a configuration.

Finally, virtual synchrony takes advantage of hardware multicast to avoid
duplicated packets and scale to large transmit rates.  On a 100 Mbit network,
corosync can approach wire speeds depending on the number of messages queued
for a particular processor.

What does all of this mean for the developer?

 * messages are delivered reliably
 * messages are delivered in the same order to all nodes
 * configuration and messages can both be used to make decisions

-------------------------------------------------------------------------------
 adding libraries
-------------------------------------------------------------------------------

The first stage in adding a library to the system is to develop the library.

Library code should follow these guidelines:

 * use SA Forum coding style for SA Forum APIs to aid in debugging
 * use corosync coding guidelines for APIs that are not SA Forum that
   are to be merged into the corosync tree.
 * implement all library code within one file named after the api.
   examples are ckpt.c, clm.c, amf.c.
 * use parallel structure as much as possible between different APIs
 * make use of utility services provided by util.c.
 * if something is needed that is generic and useful by all services,
   submit patches for other libraries to use these services.
 * use the reference counting handle manager for handle management.

------------------
 Version checking
------------------

struct saVersionDatabase {
	int versionCount;
	SaVersionT *versionsSupported;
};

The versionCount number describes how many entries are in the version database.
The versionsSupported member is an array of SaVersionT describing the acceptable
versions this API supports.

An api developer specifies versions supported by adding the following C
code to the library file:

/*
 * Versions supported
 */
static SaVersionT clmVersionsSupported[] = {
	{ 'B', 1, 1 },
	{ 'b', 1, 1 }
};

static struct saVersionDatabase clmVersionDatabase = {
	sizeof (clmVersionsSupported) / sizeof (SaVersionT),
	clmVersionsSupported
};

After this is specified, the following API is used to check versions:

SaErrorT
saVersionVerify (
	struct saVersionDatabase *versionDatabase,
	const SaVersionT *version);

An example usage of this is:
	SaErrorT error;

	error = saVersionVerify (&clmVersionDatabase, version);

	where version is a pointer to an SaVersionT passed into the API.

error will return SA_OK if the version is valid as specified in the
version database.
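
To make the table lookup concrete, here is a minimal sketch of a version
check.  It is not the saVersionVerify implementation: the types are local
stand-ins with hypothetical names, and exact matching of all three fields
is an assumption (the real function may accept a wider range of versions).

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for SaVersionT; the field names are assumptions. */
typedef struct {
	char releaseCode;
	unsigned char major;
	unsigned char minor;
} SketchVersionT;

struct sketchVersionDatabase {
	int versionCount;
	const SketchVersionT *versionsSupported;
};

static const SketchVersionT sketchVersionsSupported[] = {
	{ 'B', 1, 1 },
	{ 'b', 1, 1 }
};

static const struct sketchVersionDatabase sketchVersionDb = {
	sizeof (sketchVersionsSupported) / sizeof (SketchVersionT),
	sketchVersionsSupported
};

/* Returns 0 if version appears in the database, -1 otherwise. */
static int sketchVersionVerify (
	const struct sketchVersionDatabase *versionDatabase,
	const SketchVersionT *version)
{
	int i;

	assert (versionDatabase != NULL);
	if (version == NULL) {
		return (-1);
	}
	for (i = 0; i < versionDatabase->versionCount; i++) {
		const SketchVersionT *supported =
			&versionDatabase->versionsSupported[i];

		if (version->releaseCode == supported->releaseCode &&
			version->major == supported->major &&
			version->minor == supported->minor) {
			return (0);
		}
	}
	return (-1);
}
```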

------------------
 Handle Instances
------------------

Every handle instance is stored in a handle database.  The handle database
stores instance information for every handle used by libraries.  The system
includes reference counting and is safe for use in threaded applications.

The handle database structure is:

struct saHandleDatabase {
	unsigned int handleCount;
	struct saHandle *handles;
	pthread_mutex_t mutex;
	void (*handleInstanceDestructor) (void *);
};

handleCount is the number of handles
handles is an array of handles
mutex is a pthread mutex used to mutually exclude access to the handle db
handleInstanceDestructor is a callback that is called when the handle
	should be freed because its reference count has dropped to zero.

The handle database is defined in a library as follows:

static void clmHandleInstanceDestructor (void *);

static struct saHandleDatabase clmHandleDatabase = {
	.handleCount			= 0,
	.handles			= 0,
	.mutex		 		=  PTHREAD_MUTEX_INITIALIZER,
	.handleInstanceDestructor	= clmHandleInstanceDestructor
};

There are several APIs to access the handle database:

SaErrorT
saHandleCreate (
	struct saHandleDatabase *handleDatabase,
	int instanceSize,
	int *handleOut);

Creates an instance of size instanceSize in the handleDatabase parameter
returning the handle number in handleOut.  The handle instance reference
count starts at the value 1.

SaErrorT
saHandleDestroy (
	struct saHandleDatabase *handleDatabase,
	unsigned int handle);

Destroys further access to the handle.  Once the handle reference count
drops to zero, the database destructor is called for the handle.  The handle
instance reference count is decremented by 1.

SaErrorT
saHandleInstanceGet (
	struct saHandleDatabase *handleDatabase,
	unsigned int handle,
	void **instance);

Gets the instance for the specified handle from the handleDatabase and returns
it in the instance parameter.  If the handle is valid, SA_OK is returned;
otherwise an error is returned.  This is used to ensure a handle is
valid.  Every get call increases the reference count on a handle instance
by one.

SaErrorT
saHandleInstancePut (
	struct saHandleDatabase *handleDatabase,
	unsigned int handle);

Decrements the reference count by 1.  If the reference count indicates
the handle has been destroyed, it will then be removed from the database
and the destructor called on the instance data.  The put call takes care
of freeing the handle instance data.

Create a data structure for the instance, and use it within the libraries
to store state information about the instance.  This information can be
the handle, a mutex for protecting I/O, a queue for queueing async messages
or whatever is needed by the API.
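
The reference counting rules above can be sketched as follows.  This is
not the saHandle implementation: it uses a fixed-size table, omits the
mutex, and returns 0 or -1 instead of SaErrorT.  All sketch_ names are
hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define SKETCH_MAX_HANDLES 16

enum sketch_handle_state {
	SKETCH_HANDLE_EMPTY = 0,
	SKETCH_HANDLE_ACTIVE,
	SKETCH_HANDLE_PENDING_REMOVAL
};

struct sketch_handle {
	enum sketch_handle_state state;
	int ref_count;
	void *instance;
};

struct sketch_handle_db {
	struct sketch_handle handles[SKETCH_MAX_HANDLES];
	void (*destructor) (void *instance);	/* may be NULL */
};

static void sketch_handle_db_init (struct sketch_handle_db *db,
	void (*destructor) (void *instance))
{
	unsigned int i;

	for (i = 0; i < SKETCH_MAX_HANDLES; i++) {
		db->handles[i].state = SKETCH_HANDLE_EMPTY;
		db->handles[i].ref_count = 0;
		db->handles[i].instance = NULL;
	}
	db->destructor = destructor;
}

/* New handles start with a reference count of 1. */
static int sketch_handle_create (struct sketch_handle_db *db,
	size_t instance_size, unsigned int *handle_out)
{
	unsigned int i;

	for (i = 0; i < SKETCH_MAX_HANDLES; i++) {
		struct sketch_handle *h = &db->handles[i];

		if (h->state != SKETCH_HANDLE_EMPTY) {
			continue;
		}
		h->instance = calloc (1, instance_size);
		if (h->instance == NULL) {
			return (-1);
		}
		h->state = SKETCH_HANDLE_ACTIVE;
		h->ref_count = 1;
		*handle_out = i;
		return (0);
	}
	return (-1);
}

/* Drop one reference; the last put runs the destructor and frees. */
static int sketch_handle_put (struct sketch_handle_db *db, unsigned int handle)
{
	struct sketch_handle *h;

	if (handle >= SKETCH_MAX_HANDLES ||
		db->handles[handle].state == SKETCH_HANDLE_EMPTY) {
		return (-1);
	}
	h = &db->handles[handle];
	assert (h->ref_count > 0);
	h->ref_count -= 1;
	if (h->ref_count == 0) {
		if (db->destructor != NULL) {
			db->destructor (h->instance);
		}
		free (h->instance);
		h->instance = NULL;
		h->state = SKETCH_HANDLE_EMPTY;
	}
	return (0);
}

/* Take one reference; fails once the handle has been destroyed. */
static int sketch_handle_get (struct sketch_handle_db *db,
	unsigned int handle, void **instance)
{
	if (handle >= SKETCH_MAX_HANDLES ||
		db->handles[handle].state != SKETCH_HANDLE_ACTIVE) {
		return (-1);
	}
	db->handles[handle].ref_count += 1;
	*instance = db->handles[handle].instance;
	return (0);
}

/* Block new gets and drop the reference taken by create. */
static int sketch_handle_destroy (struct sketch_handle_db *db,
	unsigned int handle)
{
	if (handle >= SKETCH_MAX_HANDLES ||
		db->handles[handle].state != SKETCH_HANDLE_ACTIVE) {
		return (-1);
	}
	db->handles[handle].state = SKETCH_HANDLE_PENDING_REMOVAL;
	return (sketch_handle_put (db, handle));
}
```

In this sketch a destroy while another caller still holds a reference
only marks the handle; the instance is freed by that caller's final put.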

-----------------------------------
 communicating with the executive
-----------------------------------

A service connection is created with the following API:

SaErrorT
saServiceConnect (
	int *responseOut,
	int *callbackOut,
	enum service_types service);


The responseOut parameter specifies the file descriptor where response messages
will be delivered.  The callbackOut parameter specifies the file descriptor
where callback messages are delivered.

The service specifies the service to use.

Messages are sent and received from the executive with the following functions:

SaAisErrorT saSendMsgRetry (
	int s,
	struct iovec *iov,
	unsigned int iov_len);

the s parameter is the socket to use, retrieved with saServiceConnect
the iov parameter is the iovector used to send a message
the iov_len parameter is the number of elements in iov

This sends an IO-vectorized message.

SaErrorT
saSendRetry (
	int s,
	const void *msg,
	size_t len,
	int flags);

the s member is the socket to use retrieved with saServiceConnect
the msg member is a pointer to the message to send to the service
the len member is the length of the message to send
the flags parameter is the flags to use with the sendmsg system call


This sends a data blob to the executive.

A message is received from the executive with the function:

SaErrorT
saRecvRetry (
	int s,
	void *msg,
	size_t len,
	int flags);

the s member is the socket to use retrieved with saServiceConnect
the msg member is a pointer to the buffer where the received message is stored
the len member is the length of the message to receive
the flags parameter is the flags to use with the recvmsg system call
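
The "Retry" in these names suggests a loop around the underlying system
call.  The sketch below shows one plausible shape for the receive side,
assuming the goal is to read exactly len bytes and to retry when the call
is interrupted.  It is not the saRecvRetry source; sketch_recv_retry is a
hypothetical name and it returns 0 or -1 instead of SaErrorT.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Receive exactly len bytes, retrying on EINTR and on short reads. */
static int sketch_recv_retry (int s, void *msg, size_t len, int flags)
{
	char *buf = msg;
	size_t done = 0;

	assert (msg != NULL || len == 0);
	while (done < len) {
		ssize_t res = recv (s, buf + done, len - done, flags);

		if (res == -1 && errno == EINTR) {
			continue;	/* interrupted by a signal, try again */
		}
		if (res <= 0) {
			return (-1);	/* error, or the peer closed the socket */
		}
		done += (size_t)res;
	}
	return (0);
}
```

A stream socket may hand back fewer bytes than requested, so the loop
keeps reading until the whole message has arrived.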

A message may be sent and a reply waited for with the following function:
SaAisErrorT saSendMsgReceiveReply (
        int s,
        struct iovec *iov,
        unsigned int iov_len,
        void *responseMessage,
        int responseLen);

s is the socket to send and receive the response.
iov is the iovector to send.
iov_len is the number of elements in iov.
responseMessage is the data block used to store the response.
responseLen is the length of the data block that is expected to be received.

Waiting for a file descriptor using the poll system call is done with the API:

SaErrorT
saPollRetry (
	struct pollfd *ufds,
	unsigned int nfds,
	int timeout);

where the parameters are the standard poll parameters.

Messages can be received out of order searching for a specific message id with:

----------
 messages
----------
Please follow the style of the messages.  It makes debugging much easier
if parallel style is used.

A service should be added to the service_types enumeration in ipc_gen.h or,
in the case of an external project, a number should be registered with the
project.

enum service_types {
        EVS_SERVICE = 0,
        CLM_SERVICE = 1,
        AMF_SERVICE = 2,
        CKPT_SERVICE = 3,
        EVT_SERVICE = 4,
        LCK_SERVICE = 5,
        MSG_SERVICE = 6,
        CFG_SERVICE = 7,
        CPG_SERVICE = 8
};

Each library should have an ipc_APINAME.h file in include.  It should define
request types and response types.

These are the request CLM message identifiers:

enum req_clm_types {
	MESSAGE_REQ_CLM_TRACKSTART = 0,
	MESSAGE_REQ_CLM_TRACKSTOP = 1,
	MESSAGE_REQ_CLM_NODEGET = 2,
	MESSAGE_REQ_CLM_NODEGETASYNC = 3
};

These are the response CLM message identifiers:

enum res_clm_types {
        MESSAGE_RES_CLM_TRACKCALLBACK = 0,
        MESSAGE_RES_CLM_TRACKSTART = 1,
        MESSAGE_RES_CLM_TRACKSTOP = 2,
        MESSAGE_RES_CLM_NODEGET = 3,
        MESSAGE_RES_CLM_NODEGETASYNC = 4,
        MESSAGE_RES_CLM_NODEGETCALLBACK = 5
};

A request header should be placed at the front of every message sent by
the library.

typedef struct {
        int size __attribute__((aligned(8)));
        int id __attribute__((aligned(8)));
} mar_req_header_t __attribute__((aligned(8)));

There is also a response message header which should start every response
message:

typedef struct {
        int size __attribute__((aligned(8)));
        int id __attribute__((aligned(8)));
        SaAisErrorT error __attribute__((aligned(8)));
} mar_res_header_t __attribute__((aligned(8)));

The error member is used to pass errors from the executive to the library,
including SA_ERR_TRY_AGAIN for flow control, which is described later.

This is described later:

typedef struct {
        mar_uint32_t nodeid __attribute__((aligned(8)));
        void *conn __attribute__((aligned(8)));
} mar_message_source_t __attribute__((aligned(8)));

This is the MESSAGE_REQ_CLM_TRACKSTART message id above:

struct req_clm_trackstart {
	mar_req_header_t header;
	SaUint8T trackFlags;
	SaClmClusterNotificationT *notificationBufferAddress;
	SaUint32T numberOfItems;
};

The saClmClusterTrackStart api should create this message and send it to the
executive.

responses should be of:

struct res_clm_trackstart
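
As an illustration of how a library call might fill in such a request,
the sketch below packs a header and payload and points an iovec at them,
ready for a function like saSendMsgRetry.  The types are local stand-ins
with hypothetical names, the pointer member is left out, and the aligned
attribute assumes GCC or Clang.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/* Stand-in for mar_req_header_t. */
typedef struct {
	int size __attribute__((aligned(8)));
	int id __attribute__((aligned(8)));
} sketch_req_header_t __attribute__((aligned(8)));

enum { SKETCH_REQ_CLM_TRACKSTART = 0 };

/* Stand-in for struct req_clm_trackstart. */
struct sketch_req_trackstart {
	sketch_req_header_t header;
	unsigned char track_flags;
	unsigned int number_of_items;
};

/* Fill in the request and describe it with a single iovec element. */
static void sketch_trackstart_pack (struct sketch_req_trackstart *req,
	struct iovec *iov, unsigned char track_flags,
	unsigned int number_of_items)
{
	assert (req != NULL && iov != NULL);
	memset (req, 0, sizeof (*req));
	req->header.size = (int)sizeof (*req);	/* size of the whole message */
	req->header.id = SKETCH_REQ_CLM_TRACKSTART;
	req->track_flags = track_flags;
	req->number_of_items = number_of_items;
	iov->iov_base = req;
	iov->iov_len = sizeof (*req);
}
```

The header carries the total message size and the message id, which is
what the receiving side needs to select a handler and know how much to
read.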

------------
 some notes
------------
* Avoid doing anything tricky in the library itself.  Let the executive
  handler do all of the work of the system.  minimize what the API does.
* Once an api is developed, it must be added to the makefile.  Just add
  a line for the file to EXECOBJS build line.
* protect I/O send/recv with a mutex.
* always look at other libraries when there is a question about how to
  do something.  It has likely been thought out in another library.

-------------------------------------------------------------------------------
 adding services
-------------------------------------------------------------------------------
Services are defined by service handlers and messages described in
include/ipc_SERVICE.h.  These two pieces of information are used by the
executive to dispatch the correct messages to the correct recipients.

-------------------------------
 the service handler structure
-------------------------------

A service is added by defining a structure defined in exec/service.h.  The
structure is a little daunting:

struct libais_handler {
	int (*libais_handler_fn) (void *conn, void *msg);
	int response_size;
	int response_id;
	enum corosync_flow_control flow_control;
};

The response_size, response_id, and flow_control for a library handler are
used for flow control.  A response message will be sent to the library of the
size response_size, with the header id of response_id if the totem message
queue is full.  Some library APIs may not need to block in this condition
(because they don't have to use totem), so they should specify
COROSYNC_FLOW_CONTROL_NOT_REQUIRED in the flow control field.

The libais_handler_fn is a function to be called when the library handler is
requested to be executed.

struct corosync_exec_handler {
	void (*exec_handler_fn) (void *msg, unsigned int nodeid);
	void (*exec_endian_convert_fn) (void *msg);
};

The exec_handler_fn is a function to be called when the executive handler is
requested to execute.

The exec_endian_convert_fn is a function to be called to convert the endianness
of the executive message.  Note messages are not stored in big or little endian
format before transmit.  Instead they are transmitted in either big endian or
little endian depending on the byte order of the transmitter and converted to
the host machine order on receipt of the message.
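
A conversion function of this kind might look like the sketch below.  It
assumes the function is only called when the sender's byte order differs
from the receiver's, and it swaps every multi-byte field in place.  The
message layout and the sketch_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reverse the byte order of a 32 bit value. */
static uint32_t sketch_swab32 (uint32_t x)
{
	return (((x & 0x000000ffU) << 24) |
		((x & 0x0000ff00U) << 8) |
		((x & 0x00ff0000U) >> 8) |
		((x & 0xff000000U) >> 24));
}

/* Hypothetical executive message with two 32 bit fields. */
struct sketch_exec_msg {
	uint32_t nodeid;
	uint32_t lock_id;
};

/* Same shape as exec_endian_convert_fn: converts the message in place. */
static void sketch_exec_endian_convert (void *msg)
{
	struct sketch_exec_msg *exec_msg = msg;

	assert (msg != NULL);
	exec_msg->nodeid = sketch_swab32 (exec_msg->nodeid);
	exec_msg->lock_id = sketch_swab32 (exec_msg->lock_id);
}
```

Single-byte fields need no conversion; each wider field of a message
needs its own swap.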

struct corosync_service_handler {
	unsigned char *name;
	unsigned short id;
	unsigned int private_data_size;
	int (*lib_init_fn) (void *conn);
	int (*lib_exit_fn) (void *conn);
	struct corosync_lib_handler *lib_service;
	int lib_service_count;
	struct corosync_exec_handler *exec_service;
	int (*exec_init_fn) (struct objdb_iface_ver0 *);
	int (*config_init_fn) (struct objdb_iface_ver0 *);
	void (*exec_dump_fn) (void);
	int exec_service_count;
	void (*confchg_fn) (
		enum totem_configuration_type configuration_type,
		const unsigned int *member_list, size_t member_list_entries,
		const unsigned int *left_list, size_t left_list_entries,
		const unsigned int *joined_list, size_t joined_list_entries,
		const struct memb_ring_id *ring_id);
	void (*sync_init) (void);
	int (*sync_process) (void);
	void (*sync_activate) (void);
	void (*sync_abort) (void);
};

name is the name of the service.

id is the identifier of the service.

private_data_size is the size of the private data used by the connection
which the library and executive handlers can reference.

lib_init_fn is the function executed when a library connection is made to
the service handler.

lib_exit_fn is the function executed when a library connection is exited
either because the application closed the file descriptor, or the OS
closed the file descriptor.

lib_service is an array of corosync_lib_handler data structures which define
the library service handler.

lib_service_count is the number of elements in lib_service.

exec_service is an array of corosync_exec_handler data structures which define
the executive service handler.

exec_init_fn is a function used to initialize the executive service.  This
is only called once.

config_init_fn is called to parse config files and populate the object
database.

exec_dump_fn is called when SIGUSR2 is sent to the executive to dump the
current state of the service.

exec_service_count is the number of entries in the exec_service array.

confchg_fn is called every time a configuration change occurs.

sync_init is called when the service should begin synchronization.

sync_process is called to process synchronization messages.

sync_activate is called to activate the current service synchronization.

sync_abort is called to abort the current service synchronization.

--------------
 flow control
--------------
The totem protocol includes flow control so that it doesn't send too many
messages when the network is completely full.  But the library can
still send messages to the executive much faster than the executive can send
them over totem.  So the library relies on the group messaging flow control to
control flow of messages sent from the library.  If the totem queues are full,
no more messages may be sent, so the executive in ipc.c automatically detects
this scenario and returns an SA_ERR_TRY_AGAIN error.

When a library gets SA_ERR_TRY_AGAIN, the library may either retry, or return
this error to the user if the error is allowed by the API definitions.

The other information is critical to ensuring that the library reads the correct
message and size of message.  Make sure the libais_handler matches the messages
used in the handler function.
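
The retry option for SA_ERR_TRY_AGAIN can be sketched as a bounded loop.  This
is only an illustration: the enum example_error values stand in for the real
error codes, and example_send_fn and example_fake_send are hypothetical names
that do not exist in corosync.

```c
#include <stddef.h>

/* Stand-ins for the real error codes; the values here are arbitrary. */
enum example_error {
	EXAMPLE_OK = 0,
	EXAMPLE_ERR_TRY_AGAIN = 1
};

/* Hypothetical send callback; returns EXAMPLE_ERR_TRY_AGAIN while queues are full. */
typedef enum example_error (*example_send_fn) (void *ctx, const void *msg,
	size_t len);

/*
 * Try to send up to max_attempts times.  Returns EXAMPLE_OK on success, or the
 * last error so the caller can hand TRY_AGAIN back to the user.
 */
static enum example_error example_send_with_retry (
	example_send_fn send_fn, void *ctx,
	const void *msg, size_t len,
	unsigned int max_attempts)
{
	enum example_error err = EXAMPLE_ERR_TRY_AGAIN;
	unsigned int i;

	for (i = 0; i < max_attempts; i++) {
		err = send_fn (ctx, msg, len);
		if (err != EXAMPLE_ERR_TRY_AGAIN) {
			break;
		}
		/* A real library would wait or poll here before retrying. */
	}
	return (err);
}

/* Fake sender for illustration: fails *ctx times, then succeeds. */
static enum example_error example_fake_send (void *ctx, const void *msg,
	size_t len)
{
	unsigned int *failures_left = (unsigned int *)ctx;

	(void)msg;
	(void)len;
	if (*failures_left > 0) {
		(*failures_left)--;
		return (EXAMPLE_ERR_TRY_AGAIN);
	}
	return (EXAMPLE_OK);
}
```

Bounding the number of attempts keeps the library from spinning when the queues
stay full, and leaves the decision to retry further with the application.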

------------------------------------------------
 dynamically linking the service handler plugin
------------------------------------------------

The service handler needs some special magic to dynamically be linked into
corosync.

/*
 * Dynamic loader definition
 */
static struct corosync_service_handler *clm_get_service_handler_ver0 (void);

static struct corosync_service_handler_iface_ver0 clm_service_handler_iface = {
        .corosync_get_service_handler_ver0       = clm_get_service_handler_ver0
};

static struct lcr_iface corosync_clm_ver0[1] = {
        {
                .name                   = "corosync_clm",
                .version                = 0,
                .versions_replace       = 0,
                .versions_replace_count = 0,
                .dependencies           = 0,
                .dependency_count       = 0,
                .constructor            = NULL,
                .destructor             = NULL,
                .interfaces             = NULL
        }
};

static struct lcr_comp clm_comp_ver0 = {
        .iface_count                    = 1,
        .ifaces                         = corosync_clm_ver0
};

static struct corosync_service_handler *clm_get_service_handler_ver0 (void)
{
        return (&clm_service_handler);
}

__attribute__ ((constructor)) static void clm_comp_register (void) {
        lcr_interfaces_set (&corosync_clm_ver0[0], &clm_service_handler_iface);

        lcr_component_register (&clm_comp_ver0);
}

Once this code is added (substitute clm for the service being implemented),
the service will be loaded if it is in the default services list.

The default service list is specified in service.c:default_services.  If
creating an external plugin, there are configuration parameters which may
be used to add your plugin into the corosync scanning of plugins.

---------------------------------
 Connection specific information
---------------------------------
Every connection may have specific connection information if private_data_size
is greater than zero for the service handler.  This allows each library
connection to maintain state private to that connection.  The private
data for a connection can be retrieved with:

struct service_pd *service_pd = (struct service_pd *)corosync_conn_private_data_get (conn);

where service is the name of the service implemented and conn is the connection
information likely passed into the library handler or stored in a
message_source structure for later use by an executive handler.
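
A rough sketch of how private data might be used follows.  So that it compiles
on its own, struct example_conn and example_conn_private_data_get are stand-ins
for the executive's connection object and for corosync_conn_private_data_get.
struct example_pd and both functions are hypothetical.  The sketch assumes the
executive has already allocated private_data_size bytes for the connection.

```c
#include <string.h>

/* Hypothetical per-connection state for a service called "example". */
struct example_pd {
	unsigned int requests_seen;
	int tracking_enabled;
};

/* Stand-in for the executive's connection object. */
struct example_conn {
	void *private_data;
};

/* Stand-in for corosync_conn_private_data_get (). */
static void *example_conn_private_data_get (void *conn)
{
	return (((struct example_conn *)conn)->private_data);
}

/* lib_init_fn-style hook: reset the private state for a new connection. */
static int example_lib_init_fn (void *conn)
{
	struct example_pd *pd =
		(struct example_pd *)example_conn_private_data_get (conn);

	memset (pd, 0, sizeof (struct example_pd));
	return (0);
}

/* Library handler-style function: update state private to this connection. */
static void example_lib_handler (void *conn, void *msg)
{
	struct example_pd *pd =
		(struct example_pd *)example_conn_private_data_get (conn);

	(void)msg;
	pd->requests_seen++;
}
```

In a real service, private_data_size in the service handler would be set to
sizeof (struct example_pd).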

------------------------------
 sending responses to the api
------------------------------

A message is sent to the library from the executive message handler using
the function:

extern int corosync_conn_send_response (void *conn_info, void *msg,
	int mlen);

conn_info is passed into the library message handler or stored in the
executive message.  It identifies the connection to send the response to.

msg is the message to send.
mlen is the length of the message to send.

Keep in mind that struct res_message should be at the beginning of the response
message so that it follows the style used in the rest of corosync.
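
As an illustration of placing a header at the beginning of a response, the
sketch below uses a hypothetical struct example_res_header with size, id and
error fields.  The real header layout is defined by corosync and may differ.
All names here are assumptions made for the example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical response header; the real layout is defined by corosync. */
struct example_res_header {
	int size;	/* total size of the response in bytes */
	int id;		/* identifies the type of response */
	int error;	/* result code returned to the library */
};

/* Hypothetical response message: the header is the first member. */
struct example_res_trackstart {
	struct example_res_header header;
	unsigned int member_count;
};

/* Fill in a response so the library can read the header first. */
static void example_res_trackstart_fill (
	struct example_res_trackstart *res,
	int id, int error, unsigned int member_count)
{
	memset (res, 0, sizeof (struct example_res_trackstart));
	res->header.size = (int)sizeof (struct example_res_trackstart);
	res->header.id = id;
	res->header.error = error;
	res->member_count = member_count;
}
```

With the header first, the library can read a fixed-size header, learn the
total size and type, and then read the rest of the message.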

--------------------------------------------
 deferring response to an executive message
--------------------------------------------

The message source structure is used to store information about the source of a
message so a later executive message can respond to a library request.  In
a library handler, the source field should be set up with:

message_source_set (&req_exec_ZZZZZZZ.source, conn);
gmi_mcast (req_exec_ZZZZZZZ);

In this case conn is passed into the library message handler.

Then the executive message handler determines if this processor is responsible
for responding:

if (message_source_is_local (&req_exec_ZZZZZZZ->source)) {
	/* this processor received the library request, so it responds */
	corosync_conn_send_response ();
}
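
The whole pattern can be sketched end to end with stand-in types.  Everything
prefixed with example_ is hypothetical, including the layout of the source
structure (a node id plus a connection pointer) and the fixed local node id.
The real message_source helpers live in the corosync tree.

```c
#include <stddef.h>

/* Hypothetical stand-in for the message source structure. */
struct example_message_source {
	unsigned int nodeid;	/* node that received the library request */
	void *conn;		/* connection on that node */
};

/* Assumed for this sketch: the id of the local node. */
static unsigned int example_local_nodeid = 1;

/* Counts responses instead of really sending them. */
static int example_responses_sent;

static void example_message_source_set (
	struct example_message_source *source, void *conn)
{
	source->nodeid = example_local_nodeid;
	source->conn = conn;
}

static int example_message_source_is_local (
	const struct example_message_source *source)
{
	return (source->nodeid == example_local_nodeid);
}

/* Hypothetical executive message carrying the source of the request. */
struct example_req_exec {
	struct example_message_source source;
	int value;
};

/* Executive handler: runs on every node, only the originator responds. */
static void example_exec_handler (void *msg, unsigned int nodeid)
{
	struct example_req_exec *req = (struct example_req_exec *)msg;

	(void)nodeid;
	/* the real work of the request would be done here on every node */
	if (example_message_source_is_local (&req->source)) {
		/* real code would send the response on req->source.conn */
		example_responses_sent++;
	}
}
```

Only the processor that accepted the library request holds the connection, so
only that processor can respond.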

---------------
 Using totempg
---------------
The totempg interface supports multiple users at one time.  If you need
to use the full totempg interface (defined in totempg.h), please ask for
assistance on the mailing list.

If you simply want to multicast a message to every processor, including the
local processor for self delivery, according to virtual synchrony semantics,
do the following:

       int res = totempg_groups_mcast_joined (corosync_group_handle, &req_exec_clm_iovec, 1, TOTEMPG_AGREED);
       assert (res == 0);

-----------------
 library handler
-----------------
Every library handler has the prototype:

static int message_handler_req_clm_init (void *conn, void *msg);

The start of the handler function should look something like this:

static int message_handler_req_clm_trackstart (void *conn,
	void *msg)
{
        struct req_clm_trackstart *req_clm_trackstart =
		(struct req_clm_trackstart *)msg;

 { package up library handler message into executive message }
 { multicast message using totempg interface }
}

This assigns the void *msg to a structure that can be used by the
library handler.

The conn field indicates the connection to which the response should be sent.
Use the technique described in deferring response to an executive message to
have the executive handler respond to the message.

Avoid doing anything tricky in a library handler.  Do all the work in the
executive handler at first.  If it later becomes possible to optimize, optimize
away.

-------------------
 executive handler
-------------------
Every executive handler has the prototype:

static void message_handler_req_exec_clm_nodejoin (void *msg,
	unsigned int nodeid);

The start of the handler function should look something like this:

static void message_handler_req_exec_clm_nodejoin (void *msg,
	unsigned int nodeid)
{
        struct req_exec_clm_nodejoin *req_exec_clm_nodejoin = (struct req_exec_clm_nodejoin *)msg;

 { do real work of executing request, this is done on every node }
}

The conn_info structure is not available.  If it is needed, it can be stored
in the message sent by the library message handler in a source structure.

The msg field contains the message sent by the library handler.

The nodeid is a unique node identifier of the node that originated the message.

-----------------
 the lib_init_fn
-----------------
This should be used to initialize any state for the connection.

-----------------
 the lib_exit_fn
-----------------
This function is called every time a service connection is disconnected by
the executive.  Free memory, change structures, or whatever work needs to
be done to clean up.

If the exit_fn cannot complete because it is waiting for some event, it may
return -1, which allows the executive to make some forward progress.  Then
exit_fn will be called again.  Return 0 when the exit has completed.  This is
most useful when totem should be used to queue a message, but the queue is
full.  In this case, waiting a few more seconds may open up the queue, so
return -1, and the executive will try again to call exit_fn.  Do NOT
return -1 forever or the corosync executive will spin.

If -1 is returned, ENSURE that the state of the library hasn't changed so much
that exit_fn cannot be called again.  If exit_fn returns -1, it WILL be called
again, so expect it in the code.
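
A minimal sketch of an exit function that can safely be called again follows.
The state structure, the example_queue_slots_free counter, and
example_try_queue_leave_message are hypothetical stand-ins for a service's
private data and for a totem queue that may be full.  For brevity the sketch
treats conn as pointing directly at the state.

```c
/* Hypothetical per-connection state. */
struct example_exit_state {
	int leave_msg_queued;	/* set once the leave message was accepted */
	int exit_attempts;	/* how often the exit function has run */
};

/* Stand-in for free space in a totem queue. */
static int example_queue_slots_free;

/* Stand-in for queueing a message: returns 0 on success, -1 if full. */
static int example_try_queue_leave_message (void)
{
	if (example_queue_slots_free <= 0) {
		return (-1);
	}
	example_queue_slots_free--;
	return (0);
}

/* Exit function: returns -1 to be called again, 0 when the exit completed. */
static int example_lib_exit_fn (void *conn)
{
	struct example_exit_state *state = (struct example_exit_state *)conn;

	state->exit_attempts++;
	if (!state->leave_msg_queued) {
		if (example_try_queue_leave_message () != 0) {
			/* nothing was torn down yet, so a later call is safe */
			return (-1);
		}
		state->leave_msg_queued = 1;
	}
	/* free memory and other resources here, after the last retry */
	return (0);
}
```

Recording progress in the state and releasing resources only on the final pass
keeps repeated calls harmless.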

----------------
 the confchg_fn
----------------
This function is called whenever a configuration change occurs.  Some
services may not need this function, while others may.  This is a good way
to sync up joining nodes with the current state of the information stored
on a particular processor.
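
A small sketch of a configuration change function is shown below.  To stay
self-contained it uses stand-ins: enum example_configuration_type for
totem_configuration_type and struct example_ring_id for memb_ring_id.  The
decision to ignore every configuration type except the regular one is an
assumption of this sketch, not a rule taken from corosync.

```c
#include <stddef.h>

/* Stand-in for enum totem_configuration_type. */
enum example_configuration_type {
	EXAMPLE_CONFIGURATION_REGULAR,
	EXAMPLE_CONFIGURATION_TRANSITIONAL
};

/* Stand-in for struct memb_ring_id. */
struct example_ring_id {
	unsigned long long seq;
};

/* State a hypothetical service keeps about the membership. */
static size_t example_member_count;
static size_t example_nodes_to_sync;

/* Has the same shape as confchg_fn in struct corosync_service_handler. */
static void example_confchg_fn (
	enum example_configuration_type configuration_type,
	const unsigned int *member_list, size_t member_list_entries,
	const unsigned int *left_list, size_t left_list_entries,
	const unsigned int *joined_list, size_t joined_list_entries,
	const struct example_ring_id *ring_id)
{
	(void)member_list;
	(void)left_list;
	(void)left_list_entries;
	(void)joined_list;
	(void)ring_id;

	/* Assumption: only act on the regular configuration. */
	if (configuration_type != EXAMPLE_CONFIGURATION_REGULAR) {
		return;
	}
	example_member_count = member_list_entries;
	/* joining nodes are the ones that need the current state */
	example_nodes_to_sync = joined_list_entries;
}
```

A real service would walk joined_list and left_list to update its own tables
rather than only counting entries.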

-------------------------------------------------------------------------------
Final comments
-------------------------------------------------------------------------------
GDB is your friend, especially the "where" command.  But it stops execution.
This has a nasty side effect of killing the current configuration.  In this
case GDB may become your enemy.

printf is your friend when GDB is your enemy.

If stuck, ask on the mailing list and send your patches.  A lot of time has been
spent designing corosync, and even more time debugging it.  There are people
who can help you debug problems, especially around things like message
delivery.

Submit patches early to get feedback, especially around things like parallel
style.  Parallel style is very important to ensure maintainability by the
corosync community.

If this document is wrong or incomplete, complain so we can get it fixed
for other people.

Have fun!