A preprocessor is a tool that reads a text file and, if it finds special instructions called *preprocessing directives* in the input, produces output that may not be an identical copy of the input, for example because some parts were skipped. We document here the preprocessor.
Let us assume here that the standalone preprocessor has been built with `dune`, therefore the executable is `PreprocMain.exe`. We can obtain the available command-line options like so:

```
$ PreprocMain.exe --help
```

resulting in the output

```
Usage: PreprocMain.exe [<option> ...] -- [<input>]
where <input> is the source file (default: stdin),
and each <option> (if any) is one of the following:
  -I <paths>       Inclusion paths (colon-separated)
  -h, --help       This help
  -v, --version    Commit hash on stdout
      --cli        Print given options (debug)
      --columns    Columns for source locations
      --show-pp    Print result of preprocessing
```
First, we see that the preprocessor follows the GNU `getopt` conventions on short and long option names. The input is an anonymous argument that must be preceded by `--`. The options are

- `-I <paths>` for finding files given to the directive `#include` which are not found by appending their path to the current directory;

- `--columns` for using column numbers instead of horizontal offsets when reporting errors in the input (the first character on a line is in column 1, but has offset 0);

- `--show-pp` for actually printing on `stdout` the result of preprocessing the input. If this option is not given, the preprocessor will only print errors, if any, on `stderr`.
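For instance, a typical invocation on a file `input.txt` with an inclusion path `vendor/std` (both names hypothetical) would be:

```
$ PreprocMain.exe -I vendor/std --show-pp -- input.txt
```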
This preprocessor features different kinds of directives:

- directives found in the standard preprocessor for the language C#;

- a directive from `cpp`, the C preprocessor, enabling the textual inclusion of files.

Importantly, strings are handled the way `cpp` does, not C#.
In the following subsections, we shall briefly present those directives. Here, we state some properties which hold for all of them.
- They must start with a `#` symbol at the beginning of a line.

- Wrongly spelled or unsupported directives are ignored without warning, and therefore will appear in the output.

- They can have arguments in the form of free text or strings. (Anything after the directive name is considered a potential argument.)

- String arguments must be enclosed between double quotes and cannot span two or more lines.

- The valid preprocessing of a directive leaves in its place an empty line (that is, a newline character) or another directive, to be picked up by other tools, like lexers.

- Newline characters are never discarded, in order to preserve the line numbers of the copied text.
When debugging or putting in place directives in an already existing input, it is sometimes useful to force the preprocessor to stop and emit an error. This is possible thanks to the `#error` directive, which is followed by an error message as free text until the end of the line, like so:

```
#error Not implemented/tested yet
```
The preprocessor enables the conditional copying of its input. At the start, its default mode is *copy*, meaning that characters in the input are copied to the output. Conditional directives enable another mode, *skip*, by means of which the following characters are discarded, and only newline characters are copied.
Conditional directives follow the familiar syntax of some of their cousins in programming languages. At the very least,

- they start with the `#if` directive, followed by a Boolean expression as argument,

- and they are closed by `#endif`.
It is also possible to use

- one `#else` directive before `#endif`;

- a series of `#elif` directives after `#if` (as a shorthand for an `#else` immediately followed by an `#if`, except that only one `#endif` will close the conditional).
A trivial example would be:

```
#if false
This is NOT copied to the output, except the newline character
#endif
```

where `false` is a predefined symbol acting like a Boolean value. The output is

```
# 1 "Tests/test.txt"



```

Note what looks like an anonymous preprocessing directive `# 1 "Tests/test.txt"`. We will explain its meaning when presenting The Inclusion Directive. (Remark: `cpp` would not output blank lines followed by the end of the file.) The use of conditionals is clearer if we add text before and after the conditional, like so:
```
---
#if false
A
#endif
---
```

whose preprocessed output is then

```
# 1 "Tests/test.txt"
---



---
```
Here is how to use the `#else` directive:

```
#if false
This is NOT copied to the output, except the newline character.
#else
This IS copied to the output.
#endif
```
The Booleans `false` and `true` are predefined. The way to define *symbols* (that is the traditional name of those identifiers) consists in using the `#define` directive, followed by the symbol, like so:

```
#define SYM
#if SYM
This IS copied to the output.
#else
This is NOT copied to the output, except the newline character.
#endif
```
This opens the possibility to use Boolean expressions made of

- `true` and `false`, already mentioned;

- `||` for the disjunction ("or");

- `&&` for the conjunction ("and");

- `==` for equality;

- `!=` for inequality;

- `!` for negation;

- `(` and `)` around expressions to specify priorities.
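For example, here is a compound condition (a sketch of ours; note that a symbol which has never been defined evaluates like `false`, as the `#undef` example below suggests):

```
#define A
#if (A || B) && !(A && B)
This IS copied to the output, since A is true and B is false.
#endif
```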
Directives are processed in sequence in the input file. This preprocessor, like that of C#, allows us to undefine a symbol, that is, to give it the Boolean value `false`, like so:

```
#define SYM
#undef SYM
#if SYM
This is NOT copied to the output, except the newline character.
#else
This IS copied to the output.
#endif
```

The result is

```
# 1 "Tests/undef.txt"





This IS copied to the output.
```
Note: If you wish to redefine a symbol, you must undefine it first.
When we want to write a cascade of conditionals, the preprocessor enables a shortcut by means of the `#elif` directive, like so:

```
#if ...
...
#elif ...
...
#elif ...
...
#endif
```
Basically, an `#elif` directive is equivalent to an `#else` followed by an `#if`, but we only need to close the whole conditional with one `#endif`.
The rationale for using conditional directives is to enable several versions of a standard in a single file:

```
#if STANDARD_1
...
#elif STANDARD_2
...
#else
#error Standard not implemented
#endif
```
A real-life example could be Dexter. It provides another interesting use of a conditional directive, where a record type depends on the version of the standard.
The solution provided by the conditional directives with symbol definition to manage several standards is improved upon by physically separating the input into different files. This is where the `#include` directive comes in handy. Basically, it takes as argument a string containing a path to the file to be textually included, like so:

```
#include "path/to/standard_1.ligo"
```

and the preprocessor replaces the directive with the contents of the file `path/to/standard_1.ligo`, whose contents are then preprocessed as well. This can in theory create a loop, for example if two files try to include each other.
In fact, the preprocessor does more than simply include the given file. To enable the consumer of the output to keep track of inclusions, in particular to maintain the line numbers of the input that has been copied, the preprocessor inserts two special directives in the output, called *linemarkers*: one in place of the `#include` directive and one after the inclusion. Let us consider the following example, where `a.txt` includes `b.txt`, which, in turn, includes `c.txt`. Here is the contents of `a.txt`:
```
Start of "a.txt"
#include "b.txt"
End of "a.txt"
```
Then `b.txt`:

```
Start of "b.txt"
#include "c.txt"
End of "b.txt"
```
and, finally, `c.txt`:

```
Start of "c.txt"
End of "c.txt"
```
If we gather the files in a `Tests` directory and run the standalone preprocessor built with `dune` like so:

```
$ PreprocMain.exe --show-pp -- Tests/a.txt
```

we obtain on `stdout`:
```
# 1 "Tests/a.txt"
Start of "a.txt"

# 1 "Tests/b.txt" 1
Start of "b.txt"

# 1 "Tests/c.txt" 1
Start of "c.txt"
End of "c.txt"
# 3 "Tests/b.txt" 2
End of "b.txt"
# 3 "Tests/a.txt" 2
End of "a.txt"
```
There are three forms of linemarkers:

```
# <line number> "path/to/file"
# <line number> "path/to/file" 1
# <line number> "path/to/file" 2
```
The first kind is used only at the start of the output file and states that the line after the linemarker has number `<line number>` and belongs to the file `path/to/file`. Therefore `Start of "a.txt"` has line number 1 in file `Tests/a.txt`.
The second kind is used when including a file. The `#include` directive is discarded in the output, except its newline character, which explains the empty line after `Start of "a.txt"`. Then a linemarker ending in `1` is printed, which means that we went to the file `path/to/file` when processing the input.
The third kind is inserted in the output upon returning from an included file. For example, `# 3 "Tests/b.txt" 2` means that the next line has number `3` and that we return to the file `Tests/b.txt`.
Linemarkers need to be handled by the consumer of the output. In the context of a compiler, the lexer reads the output of the preprocessor and therefore scans for linemarkers.
When using the preprocessor with a compiler, the `#include` directive should only occur at the top level according to the grammar, that is, either at the beginning of the input file, between file-level declarations, or at the end. (This property is checked by the parser.) The rationale for this restriction is to avoid fragments of input files that are syntactically incorrect in isolation, and yet are assembled into a syntactically correct whole.
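For instance, here is a sketch of valid placements in a CameLIGO-like input (declarations and file names are hypothetical):

```
#include "prelude.mligo"

let one = 1

#include "more_decls.mligo"

let two = 2
```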
Strings and comments are recognised by the preprocessor, even in pieces of the input that are not copied. (This last point is a difference between `cpp` and the C# preprocessor.) The rationale for doing so when copying the input is twofold:
- We do not want the preprocessor to interpret a directive that is actually in a comment. This can happen when commenting out a piece of the source code that contains a preprocessing directive: we do not want that directive to be interpreted.

- We do not want the preprocessor to interpret a directive that is actually in a string. This can happen if the source code is that of a bootstrapped compiler, that is, a compiler for its own language. Another scenario is that of a test: the source code is actually printing what is happening.
When the preprocessor is in skip mode, that is, the input is not copied, strings and comments are also recognised. This ensures that a string or a comment containing a conditional directive, for example `#endif`, does not interact with previous directives, like `#if`, and does not raise an error when switching from copy mode to skip mode. In other words, the interpretation of strings and comments should always be the same. For example, we want the input

```
#if true
"#endif"
#endif
```

and

```
#if false
"#endif"
#endif
```

to not raise an error.
Strings are enclosed between double quotes.
Comments can follow one of the following combinations:

- `(*` and `*)` for blocks, and `//` for lines;

- `/*` and `*/` for blocks, and `//` for lines;

- `/*` and `*/` for blocks, and `#` for lines.
See section Adding a Comment or String for new combinations.
As we explained in section Contents, the module `CLI` has a twofold purpose: to deal with command-line options for the standalone preprocessor, and also to export data structures about the configuration meant for the library client.
The module signature `COMMENTS` is

```ocaml
module type COMMENTS =
  sig
    type line_comment  = string (* Opening of a line comment *)
    type block_comment = <opening : string; closing : string>

    val block : block_comment option
    val line  : line_comment option
  end
```
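As an illustration, a module satisfying `COMMENTS` for OCaml-style block comments might look as follows (this instance is our own sketch, not quoted from the source):

```ocaml
module OCamlStyle : COMMENTS =
  struct
    type line_comment  = string
    type block_comment = <opening : string; closing : string>

    (* Block comments delimited as in OCaml *)
    let block = Some (object
                        method opening = "(*"
                        method closing = "*)"
                      end)

    (* No line comments in this convention *)
    let line = None
  end
```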
It exports optional values denoting two kinds of comments: one-line comments and block comments (that is, multi-line comments). Due to the use of metaprogramming by means of `ocamllex` to implement the module `API`, where comments are defined as regular expressions, the client of `CLI` must make sure to choose line and block comments that are actually recognised by `API`.
The module signature `S` in `CLI` gathers data structures parameterising the behaviour of the preprocessor. If building the standalone preprocessor, those data will come from the command line. They are as follows:

```ocaml
val input     : string option (* input file *)
val extension : string option (* file extension *)
val dirs      : string list   (* -I *)
val show_pp   : bool          (* --show-pp *)
val offsets   : bool          (* negation of --columns *)
```
So, for example, if `offsets` is `true`, error messages will report regions of the input with horizontal offsets (à la Emacs) instead of column numbers (à la Vim).
Information about how the parsing of the command-line options went is given by the type and value `status`:

```ocaml
type status = [
  `Done
| `Version of string
| `Help of Buffer.t
| `CLI of Buffer.t
| `SyntaxError of string
| `FileNotFound of string
]

val status : status
```
- The constructor `` `Done `` means that no error occurred.

- The constructor `` `Version `` carries the commit hash of the source code, as shown by the `--version` option of the standalone preprocessor.

- The constructor `` `Help `` carries the help displayed by `--help`.

- The constructor `` `CLI `` carries a string buffer containing the internal values of the command-line options which were given. This is used for debugging the module `CLI`. See `--cli`.

- The constructor `` `SyntaxError `` reports an error when parsing the command-line options.

- The constructor `` `FileNotFound `` denotes the input file not being found.
The rationale for this structure is to enable the CLI to be augmented when composing the preprocessor with another tool that consumes its output. For example, the lexing library in `vendored-dune/LexerLib` features its own `CLI` module, exporting its own type `status`:

```ocaml
type status = [
  `Done
| `Version of string
| `Help of Buffer.t
| `CLI of Buffer.t
| `SyntaxError of string
| `FileNotFound of string
]
```

which reuses the one in `vendored-dune/Preprocessor/CLI.mli`.
One word on the implementation `CLI.ml`. We designed the CLI of the preprocessor library so it can be composed with other tools, like a lexer based on the lexer library in `vendored-dune/LexerLib`. In order to achieve this goal, we do not want the exception `Getopt.Error` to be raised when finding an unknown option, one that is destined for another tool, like the lexer. Therefore, we designed a module `Argv` in `vendored-dune/ligo-utils/simple-utils` that filters out unknown options but leaves correct ones, even if their syntax is invalid and will result in an exception `Getopt.Error` raised by `Getopt.parse_cmdline`. Importantly, we assume that there are no concatenated short options (here, the only possible combinations are `-hv` and `-vh`) and that the anonymous argument (here, a unique text file) is given after `--`.
```ocaml
let opt_wo_arg =
  let open SSet in
  empty
  |> add "--show-pp"
  |> add "--columns"

  (* The following options are present in all CLI *)
  |> add "--cli" (* For debugging *)
  |> add "--help" |> add "-h"
  |> add "--version" |> add "-v"

let opt_with_arg =
  let open SSet in
  empty
  |> add "-I"

let argv_copy = Array.copy Sys.argv

let () = Argv.filter ~opt_wo_arg ~opt_with_arg
```
First, we make a backup copy of `Sys.argv`. Second, we filter it into a list by calling `Argv.filter`. That function performs a side effect on `Sys.argv`: unknown options are removed and the remaining entries compacted. That is why `Sys.argv` has to be restored from the backup after parsing the command line: another parse is then possible by another client.
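A minimal sketch of that restore step, assuming `argv_copy` holds the backup made above (the actual code in `CLI.ml` may differ):

```ocaml
(* Restore the original command line after parsing the
   filtered [Sys.argv]; both arrays have the same length. *)
let () = Array.blit argv_copy 0 Sys.argv 0 (Array.length Sys.argv)
```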
The `API` module is the heart of the preprocessor. Perhaps it is best to start from its interface. The type of a preprocessor is

```ocaml
type 'src preprocessor = State.config -> 'src -> result
```

Clearly, the type parameter `'src` is the kind of input. The type `State.config` is an object type gathering information about the input and about how the preprocessor is parameterised by the user. See section The State. It could be that this information comes from the module `CLI` if building the standalone preprocessor, but it could also come from a compiler, which uses the preprocessor as a library and has its own command-line interface. The type `result` is

```ocaml
type result = (success, Buffer.t option * message) Stdlib.result
```
In case of error, the preprocessor (as a function) returns an error message and an optional string buffer. That buffer contains the output up to the point when the error occurred, which can be used to debug the preprocessor. In case of success,

```ocaml
type module_deps = (file_path * module_name) list
type success     = Buffer.t * module_deps
```

the preprocessor returns a string buffer containing the output and a list of module dependencies, directly gathered from the `#import` directives.
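The syntax of `#import` is not detailed in this document but, judging from the values returned by `scan_import` quoted further below, it pairs a file path with a module name, along these (assumed) lines:

```
#import "path/to/module.mligo" "ModuleName"
```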
Finally, different preprocessing functions are exported, according to the type parameter `'src`:

```ocaml
val from_lexbuf  : Lexing.lexbuf preprocessor
val from_channel : in_channel preprocessor
val from_string  : string preprocessor
val from_file    : file_path preprocessor
```
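For instance, a library client could preprocess a string as sketched below (assuming `config` is a valid `State.config` obtained elsewhere; error reporting is elided, since the exact `message` type is not shown here):

```ocaml
let preprocess_string config source =
  match API.from_string config source with
    Stdlib.Ok (out, _deps) -> print_string (Buffer.contents out)
  | Stdlib.Error _ -> prerr_endline "Preprocessing failed."
```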
Let us now turn our focus to the implementation. We shall not explain how `ocamllex` works; for that, please refer to its official documentation.
First of all, we decided to have one automaton. A preprocessor has basically two fundamental states: at a given position in the input, either the current character is copied to the output, or it is discarded. This fact could have led to two automata, each state having a dual in the other (copy versus skip). We decided not to do that and to have only one automaton, and this is why we have the type `mode`:

```ocaml
type mode = Copy | Skip
```
The mode will be threaded along the states of the automaton, which will therefore not be duplicated.
In the same vein, checking the well-formedness of conditional directives requires remembering what kind of clause is valid at any time. For example, when finding an `#elif` directive, it can only be correct if the previous conditional directive was another `#elif` or an `#if`. This could be checked by an automaton that duplicates its states according to what is valid (that is, what was read before). Just like with the modes above, we decided to have unique states and instead keep track of the syntax of the conditional directives with an argument to the scanner, of type `trace`:

```ocaml
type cond  = If of mode | Elif of mode | Else
type trace = cond list
```
It is a stack of previously encountered conditional directives, together with their modes (because even if they occur in regions of the input that are skipped, they have to be well formed).
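As an illustration (our own, not from the source), consider scanning the following input, starting in copy mode:

```ocaml
(* Hypothetical input scanned so far:
     #if true
     #if false
   Each #if pushes the mode that was current when it was read,
   so the trace is now *)
let example_trace : trace = [If Copy; If Copy]
(* and the current mode is Skip, since the inner condition
   evaluated to false while in copy mode. *)
```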
Already, we have seen two kinds of information (modes, conditionals) that need to be threaded along the states of the automaton. There is actually more that needs threading, all gathered in the type `State.t`:

```ocaml
type state = {
  config : config;
  env    : E_AST.Env.t;
  mode   : mode;
  trace  : trace;
  out    : Buffer.t;
  chans  : in_channel list;
  incl   : file_path list;
  import : (file_path * module_name) list
}
```
We find the fields `config`, `mode` and `trace` that we considered earlier, but also additional information.
- The field `env` records the symbols defined by `#define` and not undefined by `#undef`. It acts as a value environment when evaluating the Boolean expressions of `#if` and `#elif`.

- The field `out` is the output buffer.

- The field `chans` is the list of input channels opened when processing `#include` directives, in order to close them when we are done, and thus avoid a memory leak (channels are not collected by the OCaml garbage collector).

- The field `incl` is isomorphic to the file system path to the current input file, and it is changed to that of any included file. More precisely, it is a stack on top of which directories are pushed by means of `API.push_dir`, and from which a path can be obtained back with `API.mk_path`.

- The field `import` is the list of modules with their defining files, as given by the `#import` directives.

The entry point of the automaton is `API.scan`:
```ocaml
rule scan state = parse
```

We can see the parameter `state` that will be threaded along the scanning rules.
There are two functions to copy input characters to the output buffer. The first is `API.copy`:

```ocaml
let copy state buffer =
  Buffer.add_string state.out (Lexing.lexeme buffer)
```

and the other is `proc_nl` ("process newline character"):

```ocaml
let proc_nl state buffer =
  Lexing.new_line buffer; copy state buffer
```
It is necessary to call `proc_nl` when copying newline characters. Remember, too, that newline characters are always copied, independently of the mode.
When an `#if` directive is found, the trace is extended by calling

```ocaml
extend (If state.mode) state region
```

during the evaluation of which the syntactic validity of having encountered an `#if` is checked (for example, it would be invalid had an `#elif` been read last). Note that the current mode is stored in the trace with the current directive; that mode may be restored later (see below for some examples). Moreover, the directive is deemed invalid if it is preceded on its line by anything other than blanks. Otherwise, the function `API.expr` is called to scan the Boolean expression associated with the `#if`: if it evaluates to `true`, then the resulting mode is `Copy`, meaning that we may copy what follows; otherwise we skip it, the actual decision depending on the current mode. That new mode is used if we were in copy mode, and the offset is reset to the start of a new line (as we read a new line in `expr`); otherwise we were in skip mode, the value of the conditional expression must be ignored (but not its syntax), and we continue skipping the input.
When an `#else` is matched, the trace is extended by calling `API.extend` with `Else` as an argument:

```ocaml
extend Else state region
```

and then the rest of the line is scanned and discarded by means of the rule `skip_line`:

```ocaml
and skip_line state = parse
  nl  { proc_nl state lexbuf   }
| eof { rollback lexbuf        }
| _   { skip_line state lexbuf }
```
(Keep in mind that newline characters are always copied.) If we were in copy mode, the new mode toggles to skip mode; otherwise, the trace is searched for the last encountered `#if` or `#elif` and the associated mode is restored.
The case of `#elif` is the result of the fusion (in the technical sense of functional programming) of the code dealing with an `#else` followed by an `#if`.
When an `#endif` is matched, the trace is reduced, that is, all conditional directives are popped until an `If mode` is found, and its `mode` is restored as the current mode.
Consider the following four cases, where the modes (`Copy`/`Skip`) are located between the lines:
```
           Copy ────────┐
#if true                │
           Copy         │
#else                   │
          ┌── Skip ──┐  │
#if true  │          │  │
          │   Skip   │  │
#else     │          │  │
          └─> Skip   │  │
#endif               │  │
           Skip <────┘  │
#endif                  │
           Copy <───────┘


           Copy ────────┐
#if true                │
           Copy         │
#else                   │
          ┌── Skip ──┐  │
#if false │          │  │
          │   Skip   │  │
#else     │          │  │
          └─> Skip   │  │
#endif               │  │
           Skip <────┘  │
#endif                  │
           Copy <───────┘


          ┌── Copy ─────┐
#if false │             │
          │   Skip      │
#else     │             │
          └─> Copy ──┐  │
#if true             │  │
              Copy   │  │
#else                │  │
              Skip   │  │
#endif               │  │
              Copy <─┘  │
#endif                  │
              Copy <────┘


                Copy ───┬──┐
#if false               │  │
                Skip    │  │
#else                   │  │
          ┌─┬── Copy <──┘  │
#if false │ │              │
          │ │   Skip       │
#else     │ │              │
          │ └─> Copy       │
#endif    │                │
          └───> Copy       │
#endif                     │
                Copy <─────┘
```
The following four cases feature `#elif`. Note that we put between brackets the mode saved for the `#elif`, which is sometimes restored later.
```
             Copy ───┐
#if true             │
             Copy    │
#elif true ┌──[Skip] │
           │  Skip   │
#else      │         │
           └─> Skip  │
#endif               │
             Copy <──┘


              Copy ───┐
#if true              │
              Copy    │
#elif false ┌──[Skip] │
            │  Skip   │
#else       │         │
            └─> Skip  │
#endif                │
              Copy <──┘


           ┌── Copy ──┬─┐
#if false  │          │ │
           │  Skip    │ │
#elif true └─>[Copy]  │ │
               Copy <─┘ │
#else                   │
               Skip     │
#endif                  │
               Copy <───┘


            ┌── Copy ────┐
#if false   │            │
            │   Skip     │
#elif false └─>[Copy]──┐ │
                Skip   │ │
#else                  │ │
                Copy <─┘ │
#endif                   │
                Copy <───┘
```
Note how `#elif` indeed behaves like an `#else` followed by an `#if`, and how the mode stored with the data constructor `Elif` corresponds to the mode before the virtual `#if`.
For a step-by-step explanation of how file inclusion works, please refer to the source code.
Across the semantic actions of the `ocamllex` specification `API.mll`, we see calls to the function `rollback`:

```ocaml
let rollback buffer =
  let open Lexing in
  let len = String.length (lexeme buffer) in
  let pos_cnum = buffer.lex_curr_p.pos_cnum - len in
  buffer.lex_curr_pos <- buffer.lex_curr_pos - len;
  buffer.lex_curr_p <- {buffer.lex_curr_p with pos_cnum}
```
That function restores the logical view of the lexer engine over the lexing buffer, so that it appears as if its contents had not been read. When called in a semantic action, it applies to the part of the lexing buffer that was matched by the corresponding regular expression.
As we saw in the section about preprocessing strings and comments, several combinations of block and line comments are possible. We also saw above how the type `API.config`, in the section The Interface, gathers parametric information about the behaviour of the preprocessor. In particular, we saw the signature `COMMENTS` in the section CLI signature, which gathers the comment opening and closing markers from the client's perspective. It is therefore important that what the client requests is actually possible according to the regular expressions.
The regular expression `block_comment_openings` in `API.mll`:

```ocaml
let block_comment_openings =
  pascaligo_block_comment_opening
| cameligo_block_comment_opening
| michelson_block_comment_opening
```

matches all the statically possible opening markers for block comments.
The regular expression `line_comments`:

```ocaml
let line_comments =
  pascaligo_line_comment
| cameligo_line_comment
| michelson_line_comment
```

gathers the valid opening markers for line comments.
After matching all possible block comment openings (`block_comment_openings`), we need to check whether the matched opening is the one requested by the client:

```ocaml
| block_comment_openings {
    let lexeme = Lexing.lexeme lexbuf in
    match state.config#block with
```
If not, we roll back the state of the lexing buffer and scan again the characters that were matched (`scan_n_char n state lexbuf`), and then resume scanning with `scan`:

```ocaml
| Some _ | None ->
    let n = String.length lexeme in
    begin
      rollback lexbuf;
      assert (n > 0);
      scan (scan_n_char n state lexbuf) lexbuf
    end }
```
For example, if the client specifies that block comments start with the sequence `(*` and we matched `/*`, we cannot consider the string `/*` as a whole (a comment opening) but, instead, as normal input. It has length `2`, so `scan_n_char 2` will scan two characters without trying to match a comment opening.
If the matched comment opening is the one expected by the client, then, if in copy mode, the opening is copied and a scanning rule is called, depending on the kind of comment: `in_block` for block comments, or `in_line` for line comments.
If we consider the definition of `in_block`, we see that it tries to match three regular expressions on the lexing buffer:

- a string delimiter (it is an opening, but opening and closing are expected to be the same);

- a block comment opening;

- a block comment closing.
Actually, string delimiters and block comment openings share the same semantic action, because comments can contain strings, and strings can contain comments. The idea here is to scan a string or a sub-comment, or else to treat the matched text as regular comment contents. Since comments can be nested, we need, in case of an error like an open comment, to refer the user to the opening of the comment. This is the purpose of the `opening` parameter:

```ocaml
and in_block block opening state = parse
```
The module `PreprocMainGen` exports a functor which consumes a module of signature `S` in `CLI` and returns a module whose signature is

```ocaml
val check_cli  : unit -> unit
val config     : API.config
val preprocess : unit -> API.result
```
This signature packages the command-line options as defined by the module `CLI` into an `API` version, suitable both for the standalone preprocessor and for any library client, for example a compiler. This signature also provides two ready-made functions, `check_cli` and `preprocess`, to check the command-line options and to run the preprocessor (on the input specified in the argument `CLI`), respectively.
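For instance, instantiating the functor could look like the following sketch (the functor name `Make` is our assumption, not quoted from the source):

```ocaml
module Main = PreprocMainGen.Make (CLI)

let () =
  Main.check_cli ();
  match Main.preprocess () with
    Stdlib.Ok _    -> ()     (* Output printed if --show-pp was given *)
  | Stdlib.Error _ -> exit 1
```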
The module `PreprocMain` is only used as the entry point of the standalone preprocessor, that is, `PreprocMain.exe` if using `dune`, or `PreprocMain.byte` if using the Makefile. As such, it assumes that there are no comments and no strings, so, for testing purposes, it may be run on any text, not only on programs.
This section explains how to perform some maintenance tasks aimed at adding new features. Whatever you modify in this directory, you should re-read this file to make sure that the quoted code is still current.
When using the standalone preprocessor, we may want to add another command-line option, for example to dump debug information. For that purpose, the module `CLI` needs to be considered. The new option needs to be added to the signature `CLI.S`. If it is a debug option, it should not lead to an error by being incompatible with other options, therefore there is no need to extend the type `status`: the variant `` `Done `` will do. The function `make_help` needs to be amended so `--help` displays the new option. If it is for debugging, the description could be "For debugging." or "Not documented." A global reference needs to be created, which will hold the OCaml value corresponding to the option, and it must be initialised with a default value, so an `option` type may be handy sometimes. See for example the reference `input`. If your option is without arguments, add it to the value `opt_wo_arg`, like `--show-pp` is added, for instance. If it has an argument, then add it to `opt_with_arg`, like `-I`. Then re-export the OCaml value without a reference, for example

```ocaml
let dirs = !dirs
```

to comply with the signature `CLI.S`, where no reference escapes to the client. You may want the new option to be debugged itself, by adding a string to the value `options`:

```ocaml
let options = [
  "CLI options";
  sprintf "input = %s" (string_of_opt (fun x -> x) !input);
  sprintf "dirs = %s" string_of_dirs;
  sprintf "show-pp = %b" show_pp;
  sprintf "columns = %b" (not offsets)
] in
```
If the new option could cause new errors to occur, then the type `status` may have to be extended with another variant. Those variants are polymorphic precisely because they enable such an extension.
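For example, a client module could extend the preprocessor's variants along these lines (a sketch of ours, with a hypothetical new variant):

```ocaml
type status = [
  CLI.status              (* Reuse the preprocessor's variants *)
| `DumpFailed of string   (* Hypothetical new error case *)
]
```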
It is more likely that you may want to add a new preprocessing directive. In that case, you need first to think of its arguments and ask yourself: How many? Any optional? Of what OCaml type? Can it fail?
The first step is to insert the name of the directive in the list `directives`, in alphabetical order. Next, you need to add the corresponding case to the pattern matching `match id with`. If the new directive needs an argument, you can follow the example of another directive that has a similar one. For example, if a string is expected, you might consider how `#include` works. Anyway, the general idea is that you want an error if the string is invalid or missing (assuming it is not optional here). That is why you need a scanning rule for the arguments. This is why, for instance, the handling of `#include` needs calling `scan_include`:

```ocaml
let reg, incl_file = scan_include state lexbuf in
```
and `#import` features a call to `scan_import`, like so:

```ocaml
let reg, import_file, imported_module = scan_import state lexbuf in
```

In those rules, always have a catch-all clause and, if the argument is the last one, a clause matching `eof`, like so:

```ocaml
and new_rule state = parse
  ...
| eof { ... } (* If parsing the last argument *)
| _   { ... }
```
Assuming the new rule might raise an error, you need to add it to the type `Error.t`, both in the implementation and the interface. You then need to add a case for it in `Error.to_string`. In the semantic actions of `new_rule`, always fail by calling either the function `fail` or `stop`. In particular, do not raise your own exception or even the local exception `Error`. That exception should not escape the module `API`. It is caught when calling the entry point of the lexer, in the function `from_lexbuf`:

```ocaml
match preproc state buffer with
  state ->
    List.iter close_in state.chans;
    Stdlib.Ok (state.out, state.import)
| exception Error (buffer, msg) ->
    Stdlib.Error (Some buffer, msg)
```
and in the utility function `apply`:

```ocaml
let apply transform region state =
  match transform state with
    Stdlib.Ok state -> state
  | Stdlib.Error err -> fail state region err
```
The internal exception `Error` is only raised by the function `fail` and by `expr`, which parses the Boolean expressions of the directives `#if` and `#elif`.
If the new directive is expected to gather information during the preprocessing, you then need to extend the type `State.t` with dedicated accumulators, and add functions to `State` to operate on those accumulators. See the simplest examples, like `push_import`. The rationale for the modules `Error` and `State` is to limit the clutter of `API`, which is the heart of the preprocessor. When the preprocessor has finished scanning the input, you need a way to export the new information. Consider then the module `API` and the type `result` there. At the least, you need to augment the type `success` to return the data gathered by the new directive in case of success, but you might also return it in case of error, in which case you need to augment the type `error`.
If the new directive opens input channels, make sure to register them (push them) in the field `chans` of the `state`, so they are closed and no memory leak occurs (see the function `from_lexbuf`, which is called by all the exported `from_` functions).
If you use the preprocessor for a language whose comments follow a convention different from those already implemented in the preprocessor, here is what to do. First, create your own combination of block and line comments. Perhaps it is useful to look at how the convention for Michelson is implemented:

```ocaml
let michelson_block_comment_opening = "/*"
let michelson_block_comment_closing = "*/"
let michelson_line_comment = "#"
```
Then

```ocaml
let block_comment_openings =
  ...
| michelson_block_comment_opening
```

and

```ocaml
let block_comment_closings =
  ...
| michelson_block_comment_closing
```

and

```ocaml
let line_comments =
  ...
| michelson_line_comment
```
That is all, as far as the library is concerned. The caller needs to be made aware of the new convention in the documentation, so that values of the type `State.config` are valid (see the methods `block` and `line`).
The preprocessor is aware of strings as being enclosed between double quotes. If you consider adding a new kind of string, for example enclosed between single quotes, a substantial change is needed. Please talk to the maintainer.