Refactorings and Redesigns to improve maintainability and portability
Opened this issue · 0 comments
mtullsen commented
What
- the objective of this issue is to discuss and decide yes/no/modify on the below proposal
Motivations
The key motivations of the below "proposal" are
- make the *.ddl code that specifies PDF easier to debug by removing/reducing primitives.
- make the pdf spec (spec/*.ddl) and its interactions with the Haskell (at the moment) driver code a little simpler.
- if possible, make changes that make porting to C++ easier.
The Current State of Play
Here's a slight oversimplification of the Haskell driver code:
-- pdf-cos/src/PdfMonad/Transformer.hs
type ObjIndex = Map R ObjLoc -- Maps opbject ids to their locations.
-- pdf-driver/src/driver/Main.hs:
driverValidate =
do
idx <- findStartXRef ...
(refs, ...) <- parseXRefsVersion1 topInput idx
res <- liftIO (runParser refs Nothing (pCatalogIsOK root) topInput)
mapM_ (... parseDecl ...) refs
Notes:
parseXRefsV1
processes all the updates to create a final/merged xref table.- This is a Haskell function (200-ish lines)
- it calls Daedaelus generated code to parse dictionaries, streams, values, and traditional xref tables.
- it does not (cannot, may not!) call the
resolve
primitive directly or indirectly.
- This is a Haskell function (200-ish lines)
refs (:: ObjIndex)
is not exposed to the rest of the ddl code, but rather is used to implement theresolve
primitive, this is whyrefs
is passed torunParser
Proposing an Alternative Design
An alternative design is to
- remove the ObjIndex (akin to a state variable) from the Parser monad:
- during many of the parser calls, an object index does not exist.
- it's really not general enough for all our needs - remove the
resolve
primitive from daedalus
- it's quite ad hoc, it's PDF specific and again, not general enough for all our needs. - The
parseXRefsV1
will return adaedalus
Map (not a native Haskell Map), and the code will change thus:
driverValidate =
do
idx <- findStartXRef ...
(refs, ...) <- parseXRefsV1 topInput idx
- res <- liftIO (runParser refs Nothing (pCatalogIsOK root) topInput)
+ res <- liftIO (runParser Nothing (pCatalogIsOK refs root) topInput)
mapM_ (... parseDecl ...) refs
- All win/win so far, but we will also need to refactor the
spec/*.ddl
code to
- replaceresolve
with lookups in the explicit Xref map.
- pass around the xref map (to the few parsers that do need this), of course, implicit parameters might make this easier. - Rewrite the Haskell driver to not use the
ObjIndex
- Easy, this is how the pdf-dom code does it.
- and we should just remove the ObjIndex from the Parser monad
Assumptions
- We can, in Haskell, create a daedalus Map that could be passed to a daedalus generated parser for which a Map was its argument.
Assessments
- con : guessing that the refactor of spec/*.ddl could take @mtullsen 2-3 days, the other changes all together, maybe 1 day.
- pro : we get to remove the
resolve
primitive. - pro : the Haskell specific code becomes just
parseXRefsV1
, thus a smaller amount that needs to be ported to C++ - pro : this completely fits in with the approach in our Trust Chain Paper where we refine the Xref table to the Object table (DOM) in multiple steps, this can be done in daedalus (need not be done in Haskell).
- con : We need to do this step before things are working.
- pro : we'll have improved our implementation: removed a current DOS vulnerability and will also reject PDFs that we should be rejecting.
- con : We need to do this step before things are working.
- the resulting spec/*.ddl code
- con : will be a little klunkier, passing the xref around.
- pro : will not have the
resolve
primitive - pro : will be clearer when, and when not, a PDF parser assumes it has a DOM at hand to lookup values from.