/ADES-Master

ADES implementation based on a master XML file

Primary Language: Python

01-Sep-2023 - Modified python code in Python/bin
       # The Python scripts now have the .py suffix in their names
       # The initial 'python' statement has been removed from the beginning of each script
       # More tests have been added in the new_tests directory (the tests can be run using 'pytest')
       # The scripts in Python/bin can now be called as scripts or as routines inside your Python code

25-Apr-2022 - Incremented ADES version from v2017 to v2022

03-Feb-2022 - Added a few new fields and other minor revisions
       # Added shapeOcc, obsSubID and trkMPC elements.
       # obsID can be up to 25 alphanumeric characters
       # Minor typographical and layout corrections

15-Jan-2019 - Some changes to the schema were made to reflect historical data
       # PermIDType for permID needs to accept `1I' and any more `I' objects
       # ProvIDType should restrict to P-L, T-1, T-2 and T-3 only and not allow T-L or P-3
       # CatType for astCat and photCat needs to accept the '.' character (e.g., GSC1.2)
       # ObsIDType for obsID should allow up to 25 characters
       # TrkIDType for trkID should allow the hyphen `-`
       # TrkSubType for trkSub should allow the hyphen `-`
       # (Not for submissions) TrkSubType for trkSub should allow these characters: "/",
       "\", "(", ")", "@", "?", ".", "+"
       # (Not for submissions) ProvIDType for provID should allow pre-1925 values of the
       form "A902 AA"
       # TimePrecType for precTime should allow additional values (prec not in submissions)
            41667 (integer hours)
            4167 (tenths of an hour)
            694 (integer minutes)
            69 (tenths of a minute)
       # Expand length of remarks to 300 characters
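
The additional precTime values listed above are consistent with a precision
expressed in millionths of a day. That unit is an assumption inferred from the
numbers, not something this file states. A minimal sketch of the arithmetic,
with hypothetical names:

```python
# Assumption: precTime is expressed in millionths of a day (microdays).
MICRODAYS_PER_DAY = 1_000_000


def prec_time(fraction_of_day):
    """Round a time precision, given as a fraction of a day, to whole microdays."""
    return round(fraction_of_day * MICRODAYS_PER_DAY)


# Reproduce the four additional values from the list above.
PREC_TIME_VALUES = {
    "integer hours": prec_time(1 / 24),
    "tenths of an hour": prec_time(0.1 / 24),
    "integer minutes": prec_time(1 / 1440),
    "tenths of a minute": prec_time(0.1 / 1440),
}
```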

13-Jul-2018 - Minor fixes were applied to the documentation and schema. 
              See ADES_Description.pdf for details.






   CONTENTS:
     xml/     The adesmaster.xml file lives here.  This is not
              the place for example xml files

               adesmaster.xml


               The adesmaster.xml file is transformed by various .xslt
               files into .xsd files and .tex files for ps and pdf documentation

     xslt/util/  location for xslt files used by the /bin files as helpers
          Currently only has adestables.xslt

          adestables.xslt
   
     xslt/xsd/   location for xslt files used to create xsd files.  I didn't
                 include the xsd files themselves since they can be made
                 with applyxslt.py.  They'd go in a top-level xsd/
                 directory anyway

          distribhumanxsd.xslt #currently not used
          distribxsd.xslt #currently not used
          generalhumanxsd.xslt #currently not used
          generalxsd.xslt
          submithumanxsd.xslt #currently not used
          submitxsd.xslt


     xslt/latex/  Location for xslt files used to translate adesmaster.xml
                  into latex input.

          docades.xslt
          docelementstable.xslt
          docgrouptypestable.xslt
          docsimpletypestable.xslt


     tests/    Location of test files.   It has its own README.
               The runtests script must be run when in the 
               tests/ directory -- it creates some extra dirs
               and knows about the sub-directories.
                                                               

     xsd/    Contains generated xsd files and makexsdfiles
         
          makexsdfiles generates xsd files if run in this directory

          Currently only submit.xsd and general.xsd are needed
      
    doc/ contains pdf and ps files documenting ADES tables
          ades.ps  # generated ades documentation file
          ades.pdf # generated ades documentation file
          docsrc contains code to build these in latex.  It uses
                 xslt to generate the tex files from adesmaster.  
                 You'll need to edit the makedoc file to point to 
                 your tex installation.
     
                 ./makedoc will generate ades.ps and ades.pdf
                           in this directory.   Copy those to doc/
                           to update the documentation if adesmaster.xml
                           or the xslt files have changed.

                 ./cleanum removes the evidence since the latex temp
                           files should not be in github.
                


 

     There are example programs demonstrating how to read
     and write xml files with an xml library in 

     Fortran/readxmlfox.f90
     Fortran/writexmlfox.f90
     C/src/readxmlc.c
     C/src/writexmlc.
     Python/bin/readxmlpy
     Python/bin/writexmlpy
  
     These all use the xml library.  

     Python: install lxml
     C: make sure libxml2 is available
     Fortran: install FoX

     The Python and FoX libraries use libxml2

 
   INSTALLATION and PREREQUISITES:
      Untar this tarball.  

      Python: Ensure you have a correctly installed
      python 2 or 3 and know its path.  You can have both.  

         You'll have to install the python lxml module for 
         your python separately; the best way to do that is 
         to build from source using a compatible C compiler.  
         See google for instructions, which change regularly.

         Alternatively, install Python package requirements
         using pip: 
         $ python -m pip install -r ./Python/requirements.txt 

      C: Ensure you have a correctly installed C/C++ compiler 
      and you know its path.  

         You will need libxml2.a and libxml2.so, which normally 
         come installed as part of the compiler installation.  If 
         not, you'll need to obtain and install this library.

      Fortran: Ensure you have a correctly installed Fortran
      compiler and you know its path


         You will need to install FoX, a Fortran XML library 
         (or something similar).  This is available (it has 
         a FreeBSD-like license) from:

         https://github.com/andreww/fox

         You'll retrieve fox-master.zip.  Unzip that into
         the Fortran directory


    BUILD C Examples:

      To build the C programs, go to the C/ directory, configure
      to build Makefile.config, and then cd into src and type 'make'.
      The README file in C/ has more details.  If you're on Mac OS X,
      you'll need to read it since the instructions are different.


    BUILD Fortran Examples:

      First, build FoX.  Go to the fox-master directory and
      run ./configure, which may pick up the wrong 
      Fortran compiler.  If it does, edit the "configure" file and
      change the two lines containing "gfortran" so that your
      Fortran compiler is *first* in the list.  Then make
      sure your Fortran compiler is in your PATH and run
      ./configure again.
 
      Then run "make" and "make check" to build FoX.  Documentation
      for FoX is in FoX/DoX as html.

      After that, go to the Fortran directory and run "make" to 
      build writexmlf90 and readxmlf90 using FoX.

   USAGE:  


   The following are the main executables available from Python.
   All of these work in python 2 and 3 although they pick 
   /usr/bin/env python
   if run as commands.   

   These require the Python lxml library, available both for Python 2 and 3

   adestest/Python/bin/

       psvtoxml <psvfile> <xmlfile>  # converts psv file to xml file
       xmltopsv <xmlfile> <psvfile>  # converts xml file to psv file

       
       # the mpc80col converters are incomplete.  They do not translate
       # header records or Satellite observations.
       mpc80coltoxml <mpc80colfile> <xml file>
       xmltompc80col <xmlfile> <mpc80colfile>  

       valall <xml file>     # validates against all possible formats
                             #    using both human-readable and non-   
                             #    human-readable xslt-generated xsd files
       valsubmit <xml file>  # validates against submit format
       valgeneral <xml file> # validates against general format

       applyxslt      # <xml file> <xslt file>  > <output file>
          # example to create the submit schema
          Python/bin/applyxslt xml/adesmaster.xml xslt/xsd/submitxsd.xslt > submit.xsd

       writexml       # example script to write xml file


   There is code in C for all of the above except mpc80coltoxml and
   xmltompc80col, in adestest/C/src.  To build it, 
   run "./configure" and then "cd src; make".   
   If you are on a Mac, source the forMacOS... file first before running 
   configure.  

        mpc80coltoxml and xmltompc80col are not yet in C, but the above
        programs all work the same way.


   TEST CASES:

      The "adestest/tests" directory contains numerous correct and incorrect 
      test cases.  To run them, "cd tests" and run 


      ./runtests prog_python2   # to test python 2
      ./runtests prog_python3   # to test python 3, if python3 is in your path
      ./runtests prog_c         # to test in C, if you built the C programs


      Also, the tests/mpc/ directory has some mpc 80-column examples.  The
      test cases for these are not yet finished.

   DOCUMENTATION:

      adestest/doc/ contains pdf and ps files documenting ADES tables
      adestest/doc/src contains code to build these in latex.  It uses
                       xslt to generate the tex files.  You'll need to
                       edit the makedoc file to point to your tex
                       installation.
 

-----------------------------------------
This is the README file from some previous distribution tests.  Some
of the information may be useful but some may be obsolete.

2016 Dec  GMH --- older notes
This is a not-quite-ready-for-prime-time attempt at a distribution.

Known Issues:
   1) xmltopsv produces different header orders on different systems for
      the headers whose order is not specified.  This round-trips OK
      but shows diffs in the tests.   I'm not sure what the right order
      should be.

   2) The WINDOWS-1252 codec is broken on some systems in the library

   3) Different xml libraries use ' or " for attribute quoting of the
      <? xml version="1.0"    or  '1.0'   line.  This is fine and
      legal but makes testing hard.  Other legal differences are possible

*  4) I've decided to make the main interface the DOM and not some
      C struct.   This is mainly because most use cases fill less
      than half of the struct and memory management is tricky.  
    
      I've written an example program (writexml) in both C and Python for writing
      a new xml file using the ElementStack interface.   I don't
      have a design for reading yet but we need to know what we want.

   5) This code works on complete documents.  Using SAX/iterparse for
      large files is possible with pretty much the same interface.  

   6) The timings are dominated by program launch times for the
      100-item examples.   I'm not sure how much performance is
      needed.

   7) The code needs some organization.   I wanted to put out something
      working.
     


Specific distribution notes:
   1)   This uses the python lxml module, which is not part of
      the default python.  There are numerous clever ways to 
      try to do binary installs but the most reliable thing
      to do is obtain a source tarball (such as lxml-3.6.4.tar.gz)
      and run "python setup.py build" and make and so forth on 
      your machine.  Just Google "python packages lxml" and
      poke around until you find the source tarball.   
         This is important because all the web sites try 
      to help you by guessing what your configuration is, 
      and they guess wrong all the time.  Find the source tarball 
      and go from that.  This is especially important if you 
      want to make both a python 2 and python 3 installation.

   2)  The runtests script sources a script for picking
       up the executables it uses.  This makes it easy
       to test your own executables.

Several issues remain:

   The tests are incomplete.  You can help by expanding them :-)

   The runtests script points out that between python2, python3 and C there
   is a disagreement about the order of fields in PSV.  The ones
   we specify are all fine, but the order of extra ones can be
   arbitrary.  All the files round-trip just fine, so this may
   only be a problem for testing.
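
Since only the order of the extra fields differs, a test could compare PSV
records by column name instead of by position. The following is a minimal
sketch, not the project's test code. It assumes a simplified PSV layout (one
pipe-separated line of column names followed by pipe-separated data lines);
the function names and the column `extra1` are hypothetical.

```python
def psv_records(text):
    """Parse a simplified PSV block into a list of {column: value} dicts."""
    lines = [line for line in text.splitlines() if line.strip()]
    # First non-blank line holds the column names.
    columns = [name.strip() for name in lines[0].split("|")]
    records = []
    for line in lines[1:]:
        values = [value.strip() for value in line.split("|")]
        records.append(dict(zip(columns, values)))
    return records


def same_psv_content(text_a, text_b):
    """True when both blocks hold the same records, whatever the column order."""
    return psv_records(text_a) == psv_records(text_b)


# Same data, with the extra column written in a different position.
psv_a = "trkSub|stn|extra1\nabc123|F51|x\n"
psv_b = "trkSub|extra1|stn\nabc123|x|F51\n"
```

Comparing dictionaries ignores column order while still catching any change
in a field's value.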
   
   xmlUTF8Strlen does not return the *width* of a unicode string
   but rather just the number of unicode characters (I *think*
   it handles the combining characters correctly).  This means
   padding to achieve justification in Chinese etc. will be wrong.

   NOTE:  although the maximum allowed field width is 200, that
          means 200 unicode characters.  This may even be longer
          than 200 unicode code points because of combining
          characters.  Python handles memory management properly;
          in C you're on your own. 
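
The difference between character count and display width can be seen from
Python's standard library. This is a rough sketch under stated assumptions: it
treats East Asian wide and fullwidth characters as two columns and combining
characters as zero columns, which is an approximation and not what the C code
does. The function names are hypothetical.

```python
import unicodedata


def display_width(text):
    """Approximate the number of terminal columns `text` occupies."""
    width = 0
    for char in text:
        if unicodedata.combining(char):
            continue  # combining marks sit on the previous character
        wide = unicodedata.east_asian_width(char) in ("W", "F")
        width += 2 if wide else 1
    return width


def pad_to_width(text, columns):
    """Left-justify `text` by display width rather than by character count."""
    return text + " " * max(0, columns - display_width(text))


name = "小行星"  # three characters, six columns under this approximation
```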

   
Usage:
   The executables in the various bin/ directories (should) have
   the same interface.   To run tests, go to the tests directory
   and run 
     ./runtests prog_python2
     ./runtests prog_python3
     ./runtests prog_c

   Run these into a file since the output can be long.

   prog_python2 assumes #!/usr/bin/env python is python 2.7
   prog_python3 needs to point to your python3 not mine
   prog_c script uses python for the encoding check. xmltopsv
                 and psvtoxml are in C.   Note that the C
                 code in my version seems to use single quotes
                 instead of double quotes on the version line
                 <?xml version="1.0" encoding="UTF-8"?>
                 vs.
                 <?xml version='1.0' encoding='UTF-8'?>

                 This confuses diff.  The attributes in the
                 doc are coded the same way.  Notice the EBCDIC
                 and UTF-7 encodings are fine, but the quote
                 differences make them look different.


Notes:

   For now, all the executables start by transforming the 
   xml/adesmaster.xml file into the internal tables using
   xslt/util/tableades.xslt.  This is hard-coded into
   the executables.  Eventually we may want to have the
   tables hard-coded into the executables instead once
   things stabilize.

   For now, all the xsd files are generated from adesmaster.xml
   using xslt/xsd/<name>xsd.xslt files.  We could create 
   external xsd files once we know what the final format will be.

   Those two above items add surprisingly little to program start 
   overhead.


   Everything works by converting input files, including psv
   files, into an internal xml etree and doing operations on 
   that.   We may want to use iterparse to handle large files
   but so far this is not an issue.  I'm not sure what large
   means.

   It's really important for performance to not have memory
   leaks.  Memory management is tested with the C executables
   through some commented-out code using the "nMemoryTest" 
   #define in ades.h.


-----------------------

This directory has several sub-directories:


C/

  ./configure creates Makefile.config.
  cd src; make clean; make # builds and puts executables in bin
  cd src; make realclean; # removes executables from bin

  README
  configure.ac
  configure
  install.sh # what a mess
  aclocal.m4 # yup, a mess
  forMacOSXwithout_pkg_config # did I say a mess
  Makefile.config.in
  src/  # make puts executables in bin
  include/
  bin/  # same interface as Python.  At least they're supposed to :=)
           Executables:
              psvtoxml  # psvtoxml <psv file> <xml file>
              xmltopsv  # xmltopsv <xml file> <psv file>
              valall    # valall <xml file>
              valades   # see tests/runtests
              unittest  # this is woefully incomplete
              writexml  # writexml myfile

              The encoding flags for PSV files do not work.  They
              always assume the PSV encoding is UTF-8
Python
  bin/     python executable files and modules.  The modules are
           not executable and are in bin because I didn't want to
           bother with setting pythonpath yet.  

           All the python scripts are good with python2 and python3
               <script>   # runs a script with #!/usr/bin/env python
               <python2> <script>   # runs a script with python2
               <python3> <script>   # runs a script with python3

               Python/bin/xmltopsv <args>
               python xmltopsv <args>
               python3 xmltopsv <args>

           Executables:
              applyxslt
              validate
              encoding

              psvtoxml  # psvtoxml <psv file> <xml file>
              xmltopsv  # xmltopsv <xml file> <psv file>
              valall    # valall <xml file>
              valades   # see tests/runtests
              unittest  # this is woefully incomplete
              writexml  # writexml myfile

writexml myfile <encoding> works in both Python and C++.  
The C and Python conversions don't match, at least on my
machine, because one of them says
<?xml version='1.0' encoding='UTF-8'?>
and the other
<?xml version="1.0" encoding="UTF-8"?>
Both of these are legal.
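
One way to keep this quoting difference out of test results is to compare
canonical forms instead of raw text. The sketch below uses the standard
library's ElementTree (Python 3.8 or later) rather than lxml, and the sample
documents are illustrative only; it is not how runtests currently compares
files.

```python
import xml.etree.ElementTree as ET


def same_xml(text_a, text_b):
    """Compare two XML documents by canonical form instead of raw text."""
    # Canonicalization drops the XML declaration and normalizes attribute quotes.
    return ET.canonicalize(text_a) == ET.canonicalize(text_b)


single = "<?xml version='1.0' encoding='UTF-8'?>\n<ades version='2022'/>"
double = '<?xml version="1.0" encoding="UTF-8"?>\n<ades version="2022"/>'
```

The two strings differ under a plain diff, but `same_xml(single, double)`
treats them as the same document.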


"writexml myfile UTF-7" is interesting.  



---------------------------------
Some other thoughts:

A) Use iterparse to process documents as a stream

   Both the Python and C code work on xml documents, which means the entire
   input is in memory as an xml tree (even psv input is converted to 
   an xml tree).

   Larger documents may require an iterparse structure. 
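
A minimal sketch of what such a streaming structure could look like, using
iterparse from the standard library's ElementTree (lxml provides a similar
function). The element name `optical` and the flat sample document are
assumptions for illustration, and `count_records` is a hypothetical helper.

```python
import io
import xml.etree.ElementTree as ET


def count_records(stream, tag="optical"):
    """Count elements named `tag` while parsing the input incrementally."""
    count = 0
    for _event, element in ET.iterparse(stream, events=("end",)):
        if element.tag == tag:
            count += 1
            element.clear()  # release the record's text and children
    return count


SAMPLE = (
    b"<ades version='2022'>"
    b"<optical><stn>F51</stn></optical>"
    b"<optical><stn>G96</stn></optical>"
    b"</ades>"
)
total = count_records(io.BytesIO(SAMPLE))
```

Each record is handled as soon as its closing tag is seen, so the per-record
work does not depend on the whole document being loaded first.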


B) User interface

    Right now I don't have much for this.   The basic idea
    is to use xml documents for everything and supply
    to walk through them.   

    To make a new document, build an xml document, 
    validate it, and then write it either as xml or psv.

    To read a document, read it into an xml tree and
    use methods on the tree.
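
The build-then-read idea above can be sketched with the standard library's
ElementTree. This is an illustration under stated assumptions, not the
project's interface: the tree is flattened compared with real ADES documents,
the helper names are hypothetical, and validation is left out because the
standard library has no XSD validator.

```python
import xml.etree.ElementTree as ET


def build_document(records):
    """Build a small tree: one element per record, one sub-element per field."""
    root = ET.Element("ades", version="2022")
    for record in records:
        optical = ET.SubElement(root, "optical")
        for name, value in record.items():
            ET.SubElement(optical, name).text = value
    return root


def read_field(root, name):
    """Return the text of every element called `name`, in document order."""
    return [element.text for element in root.iter(name)]


# Write: build the tree, then serialize it.
root = build_document([{"trkSub": "abc123", "stn": "F51"}])
xml_text = ET.tostring(root, encoding="unicode")

# Read: parse the text back into a tree and query it by element name.
stations = read_field(ET.fromstring(xml_text), "stn")
```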


    Obviously we can build a layer on top of this but I haven't
    given that much work yet.  I think it is not a good idea
    to make a big struct of xmlChar* pointers, since that's 
    going to 

      1) be a recipe for memory leaks
      2) be slow because it's mostly going to be empty

    I think going through the node interface by strings is better.
    In C++ and Python that's easy.  In C and Fortran this is harder
    but I think we should be dealing with the xml directly or 
    indirectly (but conceptually)  in all cases.



C) Unicode handling

  -> Use native UTF-8 whenever possible

  Note python3 will not write UTF-8 to stdout unless the right 
  environment variables are set.  This is going to be a bigger
  problem in the future.  While C/C++ will write bytes, having
  improper terminal settings can create surprises.

      Recommendation:  Transform from file to file.  View files
                       with an editor that supports utf-8 or
                       use file:// in your web browser, which 
                      is happy with utf-8.
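
A minimal sketch of the file-to-file approach in Python 3, naming the encoding
explicitly so the result does not depend on terminal or environment settings.
The helper name, file names, and sample text are hypothetical.

```python
import os
import tempfile


def copy_as_utf8(source_path, target_path, source_encoding="utf-8"):
    """Read text in `source_encoding` and write it back out as UTF-8."""
    with open(source_path, "r", encoding=source_encoding) as source:
        text = source.read()
    with open(target_path, "w", encoding="utf-8") as target:
        target.write(text)
    return text


# Usage: convert a UTF-16 file to UTF-8 without going through stdout.
with tempfile.TemporaryDirectory() as workdir:
    source_path = os.path.join(workdir, "in.txt")
    target_path = os.path.join(workdir, "out.txt")
    with open(source_path, "w", encoding="utf-16") as handle:
        handle.write("Ondřejov")
    copied = copy_as_utf8(source_path, target_path, source_encoding="utf-16")
    with open(target_path, "rb") as handle:
        result = handle.read()
```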

-----------------------------------