[Proposal] Optics language rework
SekoiaTree opened this issue · 2 comments
The optics language, as it is, is usable, but not extremely useful as a DSL: it could be entirely expressed as JSON. Additionally, the current syntax is strangely restrictive in some aspects. Actions always need to be at the end of a rule, for example. Commas are... semi-optional. These are mostly quirks of how the LR(1) grammar is written, because Optics is constructed to not really be compiled at all, just parsed.
A future objective is to make the language actually work like a language, to allow for more flexible filtering. I would like this issue to serve as a central repository for this rework, serving both as a place to think this out cleanly and as a sort of initial documentation. I'm going to write out the background, the proposed internal architecture, and a possible language below.
Background
The current optics pipeline parses `RawRule`s and similar into their parsed versions. The parsed versions are mostly identical to their raw versions, with a few simplifications and reductions to make them easier to manage. Optics are then converted into Tantivy queries every time a request is made, and this conversion is done in core. Several other places in core convert things to optics as well.
Internal architecture
Optics should be kept in their own crate as much as possible. The crate itself should only output Tantivy queries, and optics should be opaque from a core crate perspective. This allows a much more flexible system, with Optic language reworks, alternative syntaxes etc. only being another trait implementation away, and not requiring touching the core at all.
My proposed pipeline is:
Optic "program" -Lexing/Parsing-> Executable -Execution-> Unrealized Query -Realization-> Query
Lexing/parsing and execution are done beforehand, and only depend on the optic. Realization is done by the core, calling into the unrealized query and turning it into a Tantivy query. This is necessary because a `Matching` requires a schema as well as a `FastFieldReader` to be turned into a Tantivy query. Realization should be close to simply cloning data directly and filling in some blanks.
I haven't fully investigated, but it seems like neither `Schema`s nor `FastFieldReader`s change... pretty much ever (`FastFieldReader`s get cloned every time, but they only have an inner `Arc`, so... idk?), so potentially realization could be done once, which would allow realization to be slower.
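To make the boundary concrete, here's a minimal sketch of what `core` could see, with the Tantivy types stubbed out and all names hypothetical:

```rust
// Stand-ins for tantivy's Schema, fast-field readers and Query types;
// only the shape of the crate boundary matters for this sketch.
pub struct Schema;
pub struct FastFieldReaders;
pub struct TantivyQuery;

/// What `core` sees after the optics crate has already lexed/parsed and
/// executed the optic: an opaque value that only needs a schema and
/// fast-field readers to become a concrete query.
pub trait UnrealizedQuery {
    /// Realization should be close to cloning data and filling in blanks.
    fn realize(&self, schema: &Schema, readers: &FastFieldReaders) -> TantivyQuery;
}
```

Swapping in a reworked language or an alternative syntax then only means providing another implementation of this trait.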
The advantage of this structure is that it allows the Optic language to be much more complex, and affords a large margin in the implementation of the execution (meaning said implementation is infinitely easier, as it doesn't require nearly as much thinking). Since it removes optic compilation from the code executed during a search, searches using optics can also be faster.
A possible language
Basing off of the existing language:
- A `Matching` is a condition that can either be true or false for a specific page.
- A `Matching` can either be a composite of several other `Matching`s (more on that later), or a condition on a field of a page.
- An `Action` is something that impacts a search result, such as boosting/demoting or discarding.
- A `Rule` is a `Matching` and an `Action`. Only one action, because two different actions can always be combined into one rather easily.
- A program (mostly) outputs a list of `Rule`s. Rules are applied independently.
These definitions effectively cover all of the non-standard types of almost any possible Optic language: `Matching`, `Action`, `Rule`.
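As a Rust sketch of those three types (the variants just mirror the standard library described below; this is not a final design):

```rust
/// A condition that is either true or false for a specific page.
pub enum Matching {
    Site(String),
    Domain(String),
    And(Vec<Matching>),
    Or(Vec<Matching>),
    Not(Box<Matching>),
    Always,
    Never,
}

/// Something that impacts a search result.
pub enum Action {
    Boost(f64),
    Downrank(f64),
    Discard,
    Keep,
}

/// Exactly one matching paired with exactly one action.
pub struct Rule {
    pub matching: Matching,
    pub action: Action,
}
```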
Notice that effectively all complex programs are essentially iterations over a list, usually an array of strings (URLs, for instance). Often, these are "iterated" over and turned into a single matching with a straightforward mapping and combination. This is why I propose a functional-ish language for Optic, based off of Rust's functional features. Here's a syntax example:
```
let urls = ["https://google.com", "https://yahoo.com"];
let search_engines = urls
    map url -> Match.Site(url)
    combine Match.And;
publish search_engines, Action.Discard;
```
Please note that I am very much open to better syntaxes.
This language has bools, ints, floats, strings, and lists. Since it's functional, functions are also a type.
Note that all variables are immutable.
Standard library:
`List<I>`:
- `map(apply: I -> G) -> List<G>`
- `zip(other: List<I2>) -> List<(I, I2)>` (the tuple will really be a list, but for docs this is clearer)
- `combine(combiner: List<I> -> G) -> G`: call `combiner` on this list. Mostly syntax sugar.
- ???
There are a lot of other iterator functions, but these three work for almost all optics. More can be added later.
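To pin down the semantics, here is a sketch of how an interpreter could implement these three builtins over a hypothetical `Value` type; note that each builtin walks its input exactly once, so evaluation always terminates:

```rust
/// Hypothetical runtime value covering the language's types; functions
/// are elided since the host language can represent them as closures.
#[derive(Clone)]
pub enum Value {
    Bool(bool),
    Int(i64),
    Float(f64),
    Str(String),
    List(Vec<Value>),
}

/// `map(apply: I -> G) -> List<G>`
pub fn map(list: Vec<Value>, apply: impl Fn(Value) -> Value) -> Vec<Value> {
    list.into_iter().map(apply).collect()
}

/// `zip(other: List<I2>) -> List<(I, I2)>`; the "tuple" is really a
/// two-element list, as noted above.
pub fn zip(list: Vec<Value>, other: Vec<Value>) -> Vec<Value> {
    list.into_iter()
        .zip(other)
        .map(|(a, b)| Value::List(vec![a, b]))
        .collect()
}

/// `combine(combiner: List<I> -> G) -> G`: mostly syntax sugar.
pub fn combine(list: Vec<Value>, combiner: impl Fn(Vec<Value>) -> Value) -> Value {
    combiner(list)
}
```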
`Match`:
- `Site`, `Domain` etc.: Return a new `Match` that matches that field. These are the base `Match`es.
- `And(in: List<Match>) -> Match`: Returns a new `Match` that only matches when all of its interior `Match`es match. If the input list is empty, it always matches (NOTE TO IMPLEMENTORS: `BooleanQuery` by default never matches when empty).
- `Or(in: List<Match>) -> Match`: Returns a new `Match` that matches when any of its interior `Match`es match. If the input list is empty, it never matches.
- `Not(self) -> Match`: Returns a new `Match` that only matches when its interior match doesn't match.
- `Never`: A `Match` that never matches.
- `Always`: A `Match` that matches everything.
Implementation-wise, many combinations of these can be simplified to simpler Tantivy queries, since a `BooleanQuery` is more general than just And, Or or Not.
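Here's a sketch of that mapping, reusing the `Matching` enum from above and current Tantivy building blocks; leaf matches are left out because they need the schema from the realization step:

```rust
use tantivy::query::{AllQuery, BooleanQuery, EmptyQuery, Occur, Query};

/// Turn composite matches into Tantivy queries, handling the empty-list
/// cases explicitly since an empty `BooleanQuery` never matches.
fn realize_composite(m: &Matching) -> Box<dyn Query> {
    match m {
        Matching::And(inner) if inner.is_empty() => Box::new(AllQuery),
        Matching::And(inner) => Box::new(BooleanQuery::new(
            inner.iter().map(|m| (Occur::Must, realize_composite(m))).collect(),
        )),
        Matching::Or(inner) if inner.is_empty() => Box::new(EmptyQuery),
        Matching::Or(inner) => Box::new(BooleanQuery::new(
            inner.iter().map(|m| (Occur::Should, realize_composite(m))).collect(),
        )),
        Matching::Not(inner) => Box::new(BooleanQuery::new(vec![
            // `MustNot` alone matches nothing, so pair it with `AllQuery`.
            (Occur::Must, Box::new(AllQuery) as Box<dyn Query>),
            (Occur::MustNot, realize_composite(inner)),
        ])),
        Matching::Always => Box::new(AllQuery),
        Matching::Never => Box::new(EmptyQuery),
        // `Site`, `Domain` etc. need a schema; see the realization step.
        _ => unimplemented!("leaf matches are realized with the schema"),
    }
}
```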
`Action`: `Boost(value: float)`, `Downrank(value: float)`, `Discard`, `Keep`.
`publish` is a global function/syntax sugar which takes in a `Match` and an `Action`, and applies them as a `Rule`. Boosts and downranks combine, and `Keep` always takes priority over `Discard`.
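A sketch of those combination rules (representing the net effect as a single signed adjustment is an assumption here; a negative net boost is just a downrank):

```rust
/// Combine every action that applies to a page into one final effect.
/// Boosts and downranks sum, and a single `Keep` overrides any number
/// of `Discard`s.
fn resolve(actions: &[Action]) -> Action {
    let mut adjustment = 0.0;
    let mut kept = false;
    let mut discarded = false;
    for action in actions {
        match action {
            Action::Boost(v) => adjustment += *v,
            Action::Downrank(v) => adjustment -= *v,
            Action::Keep => kept = true,
            Action::Discard => discarded = true,
        }
    }
    if discarded && !kept {
        Action::Discard
    } else if adjustment >= 0.0 {
        Action::Boost(adjustment)
    } else {
        Action::Downrank(-adjustment)
    }
}
```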
NOTE: this means `DiscardNonMatching` goes from magic to just `publish Match.Always, Action.Discard;` at the start. However, it doesn't seem easy to define as a `BooleanQuery`. Maybe a custom Query type is necessary for this kind of linear combination. Hey, maybe that can be another kind of `Match`, linearly applying one rule after another. Food for thought, idk.
Something to think about is how to allow large amounts of data to be inputted without mixing code and data. One possibility is to have some kind of "data section" at the end of the optic, which can contain constants, to dump lists of websites into. Something like `data urls;`, which promises that we'll define `urls` in the data section.
Liking and disliking websites become global functions like `publish`, and take a string. I don't like it much, but internally liking stuff is really distinct from rules, so... yeah.
First of all, this is an incredibly well-thought-out proposal! I very much agree that the ergonomics of optics could be much better.
We have a bit of an odd requirement for optics in the sense that we need to make sure it is not Turing complete, and that we can prove that any program written in the language halts. Otherwise, someone could easily take down the production servers. This imposes a fair number of constraints on the features we can allow directly in the language.
I would therefore propose that we essentially use the current structure as 'assembly instructions' and build some nice libraries in existing languages (Python, Prolog, Haskell etc.) that compile the optic, in a way. The compiled optic file can then be changed to be a gzipped binary encoding of the current optic structure.
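For illustration, the decode side could be as small as this sketch (bincode and flate2 are stand-ins; nothing here mandates a specific binary encoding):

```rust
use flate2::read::GzDecoder;
use std::io::Read;

/// Decode a compiled optic: gunzip, then deserialize into the current
/// `Optic` structure (which would need to derive `serde::Deserialize`).
fn decode_optic(bytes: &[u8]) -> anyhow::Result<Optic> {
    let mut buf = Vec::new();
    GzDecoder::new(bytes).read_to_end(&mut buf)?;
    Ok(bincode::deserialize(&buf)?)
}
```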
I think this approach would have multiple benefits over the approach we use today:
- The gzipped binary encoded format would reduce size and parse times substantially.
- We can remove the `optics-lsp` crate and the associated extension, as people would use existing languages to compile the optic, which would make it easier to maintain.
- We can maintain backwards compatibility with the current optics for some time.
- After the backwards-compatibility window, we can remove the lalrpop parser.
- It would be way easier to create optics, as one doesn't have to learn the DSL but can instead use a library in an existing language.
Perhaps we should also go ahead and remove the ranking signal adjustments from optics and expose them as a `HashMap<Signal, f64>` parameter in the API instead. I think this is something that hardly anyone would use in optics anyway, and it makes the optic too tightly coupled with the search engine that executes it (which signals are allowed? what are their value distributions? etc.).
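Sketching that API shape (the `Signal` variants and field names below are placeholders):

```rust
use std::collections::HashMap;

/// Placeholder for the engine's existing ranking-signal enum.
#[derive(Hash, PartialEq, Eq)]
pub enum Signal {
    Bm25,
    HostCentrality,
    // etc.
}

/// Signal adjustments move out of optics and into the request itself.
pub struct SearchRequest {
    pub query: String,
    pub optic: Option<String>,
    pub signal_coefficients: HashMap<Signal, f64>,
}
```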
I agree that we should internally move as much as possible out from the `core` crate into `optics`, so that `core` just gets a set of Tantivy queries to execute. This would indeed make it way easier to maintain, and would remove the magic around `DiscardNonMatching` as you proposed.
What do you think about this? I don't really know what the syntax for the libraries should be yet, but I think we can come up with something that's really nice to use.
(huge credits to @oeb25 here)
> We have a bit of an odd requirement for optics in the sense that we need to make sure it is not turing complete
This sounds a lot like configuration languages like Nickel, Dhall, Starlark and others. Would it be worth adopting one of these existing languages that already support many/all of the features described in the top post?