Refactorings and Redesigns to improve maintainability and portability

Question

Refactorings and Redesigns to improve maintainability and portability

Opened this issue 2 years ago · 0 comments

mtullsen commented 2 years ago

What

the objective of this issue is to discuss and decide yes/no/modify on the below proposal

Motivations

The key motivations of the below "proposal" are

make the *.ddl code that specifies PDF easier to debug by removing/reducing primitives.
make the pdf spec (spec/*.ddl) and its interactions with the Haskell (at the moment) driver code a little simpler.
if possible, make changes that make porting to C++ easier.

The Current State of Play

Here's a slight oversimplification of the Haskell driver code:

-- pdf-cos/src/PdfMonad/Transformer.hs
type ObjIndex = Map R ObjLoc -- Maps opbject ids to their locations.

-- pdf-driver/src/driver/Main.hs:
driverValidate = 
  do
  idx <- findStartXRef ...
  (refs, ...) <- parseXRefsVersion1 topInput idx
  res <- liftIO (runParser refs Nothing (pCatalogIsOK root) topInput)
  mapM_ (... parseDecl ...) refs

Notes:

parseXRefsV1 processes all the updates to create a final/merged xref table.
- This is a Haskell function (200-ish lines)
  - it calls Daedaelus generated code to parse dictionaries, streams, values, and traditional xref tables.
  - it does not (cannot, may not!) call the resolve primitive directly or indirectly.
refs (:: ObjIndex) is not exposed to the rest of the ddl code, but rather is used to implement the resolve primitive, this is why refs is passed to runParser

Proposing an Alternative Design

An alternative design is to

remove the ObjIndex (akin to a state variable) from the Parser monad:
- during many of the parser calls, an object index does not exist.
- it's really not general enough for all our needs
remove the resolve primitive from daedalus
- it's quite ad hoc, it's PDF specific and again, not general enough for all our needs.
The parseXRefsV1 will return a daedalus Map (not a native Haskell Map), and the code will change thus:

  driverValidate = 
    do
    idx <- findStartXRef ...
    (refs, ...) <- parseXRefsV1 topInput idx
-   res <- liftIO (runParser refs Nothing (pCatalogIsOK root) topInput)
+   res <- liftIO (runParser Nothing (pCatalogIsOK refs root) topInput)
    mapM_ (... parseDecl ...) refs

All win/win so far, but we will also need to refactor the spec/*.ddl code to
- replace resolve with lookups in the explicit Xref map.
- pass around the xref map (to the few parsers that do need this), of course, implicit parameters might make this easier.
Rewrite the Haskell driver to not use the ObjIndex
- Easy, this is how the pdf-dom code does it.
- and we should just remove the ObjIndex from the Parser monad

Assumptions

We can, in Haskell, create a daedalus Map that could be passed to a daedalus generated parser for which a Map was its argument.

Assessments

con : guessing that the refactor of spec/*.ddl could take @mtullsen 2-3 days, the other changes all together, maybe 1 day.
pro : we get to remove the resolve primitive.
pro : the Haskell specific code becomes just parseXRefsV1, thus a smaller amount that needs to be ported to C++
pro : this completely fits in with the approach in our Trust Chain Paper where we refine the Xref table to the Object table (DOM) in multiple steps, this can be done in daedalus (need not be done in Haskell).
- con : We need to do this step before things are working.
  - pro : we'll have improved our implementation: removed a current DOS vulnerability and will also reject PDFs that we should be rejecting.
the resulting spec/*.ddl code
- con : will be a little klunkier, passing the xref around.
- pro : will not have the resolve primitive
- pro : will be clearer when, and when not, a PDF parser assumes it has a DOM at hand to lookup values from.