avwx-rest/avwx-engine

Refactor parsing

Closed this issue · 6 comments

I think a quick(ish) refactor of the parsing modules could pay dividends in the future. Currently the parsing logic looks through a string and extracts based on string manipulation. Then in the same context, the parsing function determines what type of translation should be used for each of the plucked strings (lets call them ‘atoms’).

What I propose is a lightweight framework that predefines ‘atoms’ as a regular expressions. This allows for multi-worded atoms to be searched for with the same amount of effort as any normal encoded/non-encoded atom. Each atom can be predefined and a translator is responsible for translating/decoding each atom into whatever. This way, each module can just build the atoms it expects to find in whatever it’s going to translate, add them to translators that decode even the most complicated ones (this will allow for those awful ones like tornados and what not), and then in the parser modules, you’d never have to change anything and can add as many handlers as you want.

An example in the remarks could be

PEAK_WIND = r”””
\bPK WND 
(?<direction>\d{3})
(?<velocity>\d{2,3})
/
(?<hours>\d{2})?  # rarely present
(?<minutes>\d{2})
\b”””

Using that, some class...

class PeakWindAtom(Atom):
    pattern = PEAK_WIND
    search(str) -> match or None
    __contains__(str) -> bool

would be used by

class PeakWindHandler(AtomHandler):
    atom: some Atom
    <<abstract>> can_translate(str) -> bool : atom in is string
    <<abstract>> translate(str) -> str: decode the data

would be used by

class RemarksParser(Parser):
    - handlers: List[handlers]
    + parse(remarks_string) -> Dict[str, str]: raw-decoded pairs
    + add_handler

The patterns and atoms could be reused. And simple ones like ACFT MSHP could be build using a simple factory (so the API doesn’t have to change for simple ones or complex ones). All of your existing decoding logic could stay in place, or slowly be changed to regex group dicts for easier changes.

No tests would need to be changed as far as I can tell. Added tests will be in their own module that you can place wherever you like best before any merge.

I think starting with the remarks module is a good place to start. Before I’d get started and submit a PR, is this something you’d be open to?

Interesting. The parser handling is roughly how I started organizing the PIREP parser, but that's function-based. I'd be interested to see how effective regex will be in extracting remarks elements as those are by far the most open-ended.

My one recommendation with the code above is to compile the regex expression that is assigned to the class. I'm interested to see how that turns out. I've been meaning to split up the avwx._core module into parts and refactor it anyway.

I was tooling around with the idea last night and came to a similar conclusion. I think that only a single class would be needed if you wanted to commit to regex based internals, but a function-based class could be subclassed to share the interface that the handlers would use as well. If that is the case, we should ensure they are implemented to return something immutable when they look for the presence of an atom in a string. And that should reveal the atom's location in the string, so that it can be extracted if you wanted destructive parsing, and the actual raw match from the input string that a handler would use for the translation.

# all subclass searching should be able to make this
class AtomSpan(NamedTuple):
    """
    Finding within a string. Can be used for slicing as `end` is non-inclusive.
    """
    match: Optional[str]
    start: Optional[int]
    end: Optional[int]


class BaseAtom(abc.ABC):
    """Atom represents a single encoded entity in an encoded string"""

    @abc.abstractmethod
    def to_atom_span(self, item: str) -> AtomSpan:
        """Return an `AtomSpan`. If no match found, return `AtomSpan(None, None, None)`"""
        pass

    def is_in(self, string) -> bool:
        """Return True if the atom is in the string"""
        return self.to_atom_span(string).match is not None

Some subclass that uses regex instead of callables

class RegexAtom(BaseAtom):
    """Atom defined by regex pattern"""

    def __init__(self, regex_pattern: re.Pattern, name: Optional[str] = None):
        self.regex = regex_pattern
        self.name = name

    @property
    def regex(self) -> re.Pattern:
        return self._regex

    @regex.setter
    def regex(self, regex: re.Pattern):
        if not isinstance(regex, re.Pattern):
            raise TypeError("regex must be a compiled `re.Pattern`")
        self._regex = regex
    
    @classmethod
    def from_pattern_string(
        cls, pattern: str, *flags: re.RegexFlag, name: Optional[str] = None
    ) -> "RegexAtom":
        """
        Return a new instance of `RegexAtom` from a string pattern.
        :param pattern: string to be compiled into regex pattern
        :param flags: flags to pass to `re.compile`
        :param name: name for repr
        """

        if flags:
            regex = re.compile(pattern, sum(flags))
        else:
            regex = re.compile(pattern)

        out = cls(regex, name=name)

        return out

    def to_atom_span(self, string: str) -> AtomSpan:
        """
        Search the string for the atom
        """
        match = self.regex.search(string)
        if match:
            start, stop = match.span()
            return AtomSpan(match=match.group(), start=start, end=stop)
        return AtomSpan(None, None, None)

    def search(self, string: str) -> Optional[re.Match]:
        """
        Return `re.Match` object or None
        """
        return self.regex.search(string)

and a handler

class RegexAtomHandler:
    """Handle the individual translation of an atom"""

    def __init__(
        self, atom: "RegexAtom", translation_callable: Callable[["Match"], str]
    ):
        self.atom = atom
        self.translation_callable = translation_callable

    def __repr__(self):
        return f"{type(self).__name__}(atom={self.atom!r})"

    def __call__(self, *args, **kwargs):
        return self.translate_atom(*args, **kwargs)

    @property
    def translation_callable(self):
        return self._translation_callable

    @translation_callable.setter
    def translation_callable(self, new_callable: Callable[["Match"], str]):
        if not callable(new_callable):
            raise TypeError("translation_callable must be a callable")
        self._translation_callable = new_callable

    def can_handle(self, atom_string: str) -> bool:
        """Return True if handler is qualified to handle translation"""
        return self.atom.is_in(atom_string)

    # todo: allow for default error handling option
    def translate_atom(self, string: str) -> str:
        """Perform translation on the atom string"""
        match = self.atom.search(atom_string)
        if not match:
            raise CanNotHandleError(
                f"{self.atom!r} has nothing to translate from {atom_string!r}"
            )
        result = self.translation_callable(match)

        return result

Quick one-one handlers could still use regex with simple strings and keep the regex internals for parsing.

# atom instance
ACFT_MSHP = RegexAtom.from_pattern_string(' ACFT MSHP ')

# handler class code
...
    @classmethod
    def new_simple_handler(cls, atom, translation: str):
        _callable = lambda match: return translation
        return cls(atom, _callable)
...

...
# handler instance code
acft_mshp_handler = RegexAtomHandler.new_simple_handler(ACFT_MSHP, "Aircraft Mishap")
...

# some parser in the remarks.py module
parser.add_handler(acft_mshp_handler)

translations = parser.parse_into_translations(long_remarks_string)  # Dict[str, str]

But with that same client code...

# some patterns module
RVR_PATTERN = re.compile(r"""\b
R(?P<runway>\d{2}[LCR])
/
(?P<plus>\?P)
(?P<visibility>\d{4})
(?P<units>\w{2})?
V?
(?P<upper>\d{4})
(?P<units>\w{2})?
""", re.M)

# remarks module
rvr_atom = RegexAtom(RVR_PATTERN, some_callable_that_uses_the_match)

...

parser.add_handler(rvr_atom)

I think the benefit would be that even though regex is evil incarnate, you could test that pattern with as many variations as you could muster without affecting the parser code. All the translation callable would need is the regex.match object it could call match.groupdict() on and get all of the parsed units out of.

So I did a little tooling around and drummed up a rough stub of a parsing implementation that could sit on top of any new translations you were thinking up. Since you were looking to start parsing the airep soon, would you want to try this out? And/or, would you want to open a branch for the refactoring of the remarks module? I added a few translations for things like Tornado activity and CIG/VIS as secondary locations for a proof of concept.

The AIREP parsing is on hold for now. The PIREP parser is in place, but it is very error prone when it encounters more descriptive report elements. I can put together a list of reports encountered that break the current version.

I can open a refactor branch in the source we can use.

That sounds like a good place to start. If you open one, I can open a PR against it with the rough work I did and we can poke around at it from there. The changes I made to the existing code are really only impactful in that it adds two sub packages. Nothing else is hooked up to the existing parsing code. The tests I wrote were all pytest based, would you prefer them as unittest testcases?

As long as running pytest can discover them, doesn't matter to the CI. I'm not using unittest.TestCase in an optimal fashion right now anyway.

I've added the refactor branch.