Introduce a Lightweight Formatter

Question

Problem

The current Arbori program is based on the default provided in SQL Developer (SQLcl). After a product release the changes have to be identified and incorporated. This is time-consuming and cumbersome. It's difficult to identify the root of the change and to find an example to test the effect of the change.

The current formatting approach replaces all existing whitespace between tokens. The default is a single space between tokens. Then the formatter handles all cases where a single space between tokens is "not good enough". However, there are a lot of such cases and the formatter does not handle all of them; covering every case is nearly impossible. This leads to an unsatisfactory result, especially when using less common SQL constructs.

Idea

The current guidelines v4.0 contain only the following formatting rules:

Rule	Description
1	Keywords and names are written in lowercase.
2	3 space indention.
3	One command per line.
4	Keywords `loop`, `else`, `elsif`, `end if`, `when` on a new line.
5	Commas in front of separated elements.
6	Call parameters aligned, operators aligned, values aligned.
7	SQL keywords are right aligned within a SQL command.
8	Within a program unit only line comments `--` are used.
9	Brackets are used when needed or when helpful to clarify a construct.

These rules allow the developer a lot of freedom. For example, all variants of the following SQL statement follow these rules:

-- variant 1
create or replace view v as select empno, deptno, ename from emp;

-- variant 2
create or replace view v as 
select empno, deptno, ename 
  from emp;

-- variant 3
create or replace view v
as 
select empno
     , deptno
     , ename
  from emp;

So the idea of this "Lightweight Formatter" is to do the following:

Keep all existing whitespace between tokens
For each formatter rule, check if it is violated and if yes, automatically fix the whitespace
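The two steps above can be sketched in a few lines of Python. This is only an illustration of the principle, not the Arbori implementation: the whitespace-based tokenizer `split_tokens`, the rule function `rule_else_on_new_line` and the driver `format_source` are hypothetical names, and a real formatter would work on the parser's tokens rather than on whitespace-separated chunks.

```python
import re


def split_tokens(source):
    """Split into alternating whitespace and non-whitespace runs (lossless)."""
    return re.findall(r"\s+|\S+", source)


def rule_else_on_new_line(tokens):
    """Fix the whitespace before `else` only if it does not contain a newline."""
    result = list(tokens)
    for i in range(1, len(result)):
        before = result[i - 1]
        if result[i].lower() == "else" and before.isspace() and "\n" not in before:
            result[i - 1] = "\n"  # rule violated: replace only this whitespace
    return result


def format_source(source, rules):
    """Keep all existing whitespace, then let each rule fix its own violations."""
    tokens = split_tokens(source)
    for rule in rules:
        tokens = rule(tokens)
    return "".join(tokens)


print(format_source("if a then\n   b;   else c;\nend if;", [rule_else_on_new_line]))
```

Input that already satisfies the rule passes through unchanged, which is the main difference from an approach that first normalizes all whitespace.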

So let's evaluate the approach. Step by step.

Evaluate approach

First of all, I have to say that I gained some experience applying this approach in a customer project. We have implemented about a dozen rules so far. It's a perfect approach for incremental development of a formatter. The only disadvantage I see is that you cannot apply the formatter to ensure conformity regardless of the input. That's the expected downside. There are a lot of undefined areas where no rule exists, and in such cases the developer is free to choose a fitting formatting style. That's okay IMO. In fact, I think most developers will love it.

Preprocessing: Keep existing whitespace. ✅

That's doable. The code for that already exists (and can be reused).

Rule 1: Keywords and names are written in lowercase. ✅ ⛔️

SQLDev provides an option for that and the formatter implements this feature outside of the Arbori program. The only thing the Arbori program has to provide is the list of identifiers. We should do that.

However, the default configuration will be to keep the case of identifiers and keywords as is. Changing the case of keywords/identifiers is more than formatting. This can break the code. See #1.

Rule 2: 3 space indention. ✅

SQLDev provides an option for that. You can choose if you want to use tabs or spaces for indentation. If you use spaces you can define the number of spaces for an indentation. The Arbori program should honor this configuration. To simplify the approach I would not support tabs.

The real work in this rule is to calculate the indentation. The rule does not define what exactly needs an indentation. This has to be defined during the implementation. The grammar of every statement needs to be consulted to identify where an indentation starts and ends. Expressions with parentheses need to be considered as well. It's a lot of work, but doable.

I think that existing indentations should be removed to ensure that the indentations are consistent. However, this will lead to some special indentation treatment of select lists, conditions in where clauses etc. And that could make the formatter heavyweight again.

Another option is to keep the existing indentation and add missing spaces when necessary. This is much simpler. Let's start with this approach.
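A minimal sketch of this simpler option, in Python for illustration only. The function name `ensure_indent` and the `depth` parameter are assumptions; in the real formatter the nesting depth would have to come from the grammar analysis described above.

```python
INDENT_SIZE = 3  # assumption: mirrors the "3 space" rule; configurable in SQLDev


def ensure_indent(line, depth, indent_size=INDENT_SIZE):
    """Add missing leading spaces; never reduce an existing indentation."""
    body = line.lstrip(" ")
    current = len(line) - len(body)
    required = depth * indent_size
    if not body or current >= required:
        return line  # blank line or already indented enough: keep as is
    return " " * required + body


print(repr(ensure_indent("x := 1;", 1)))        # indentation added
print(repr(ensure_indent("      x := 1;", 1)))  # deeper indentation kept
```

The second call shows the trade-off: indentation that is deeper than required is accepted, so the result is not fully normalized.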

Rule 3: One command per line. ✅

This means that a

SQL*Plus (SQLcl) command
SQL statement
PL/SQL statement

must start on a new line.

That's easy to enforce. The indentation calculated for rule 2 is used here.

Rule 4: Keywords `loop`, `else`, `elsif`, `end if`, `when` on a new line. ✅

Similar to rule 3.

Rule 5: Commas in front of separated elements. ✅

SQLDev provides an option for that. However, this option must be applied only when elements are separated on multiple lines.

This is relatively easy to enforce. The only problem is comments between separated elements. The newlines after a multi-line comment cannot be modified (limitation of SQLDev 20.4.1 / SQLcl 21.1.0), and the newline after a single-line comment must not be removed. It's possible to identify comments, but if changes are necessary around them, the formatting result will most probably not be satisfactory.
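As an illustration, here is a line-based Python sketch of this rule. It is a simplification under stated assumptions: `commas_to_front` is a hypothetical helper, it works on text lines instead of parser tokens, it simply skips any line that contains a comment marker, and it does not detect commas inside string literals or inside the middle lines of a multi-line comment.

```python
def commas_to_front(lines):
    """Move a trailing comma to the front of the following line."""
    result = list(lines)
    for i in range(len(result) - 1):
        current = result[i].rstrip()
        if not current.endswith(","):
            continue  # nothing to move
        if any(marker in current for marker in ("--", "/*", "*/")):
            continue  # comments involved: leave the line untouched
        following = result[i + 1]
        body = following.lstrip(" ")
        if not body:
            continue  # next line is blank: leave as is
        indent = len(following) - len(body)
        result[i] = current[:-1].rstrip()
        # Reuse two columns of the existing indentation for ", " if available.
        result[i + 1] = " " * max(indent - 2, 0) + ", " + body
    return result


statement = ["select empno,", "       deptno,", "       ename", "  from emp;"]
print("\n".join(commas_to_front(statement)))
```

A statement written on a single line contains no trailing commas at line ends, so it is left alone, which matches the condition that the option applies only to elements spread over multiple lines.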

Rule 6: Call parameters aligned, operators aligned, values aligned. ✅

I think this affects all types of parameter definitions, assignments and procedure/function calls using named arguments. So the SQLDev options highlighted in the original screenshot are relevant and should be honored.

This is doable.
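For named arguments, the alignment could look roughly like the following Python sketch. The helper `align_operator` is hypothetical and deliberately naive: it aligns the first occurrence of `=>` on each line and does not check whether that occurrence sits inside a string literal or a comment.

```python
def align_operator(lines, op="=>"):
    """Pad the text left of the operator so that the operator lines up."""
    positions = [line.find(op) for line in lines]
    lefts = [line[:p].rstrip() if p >= 0 else None
             for line, p in zip(lines, positions)]
    width = max((len(left) for left in lefts if left is not None), default=0)
    aligned = []
    for line, p, left in zip(lines, positions, lefts):
        if left is None:
            aligned.append(line)  # no operator on this line: keep as is
        else:
            aligned.append(left.ljust(width) + " " + line[p:])
    return aligned


call = [
    "do_something(",
    "     p_id => 1",
    "   , p_name => 'x'",
    "   , p_description => 'y'",
    ");",
]
print("\n".join(align_operator(call)))
```

The same idea would apply to `:=` in assignments by passing a different `op`; which constructs are grouped together for alignment is a design decision that this sketch leaves open.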

Rule 7: SQL keywords are right aligned within a SQL command. ✅

SQLDev provides an option for that. When disabled the keywords should be left-aligned.

Some thought is needed about which keywords to consider and whether this rule should be applied to the merge statement as well (I don't think so). Nevertheless, implementing that is doable as well.
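To make the right-alignment concrete, here is a small Python sketch. The keyword set `CLAUSE_KEYWORDS` is purely an assumption for illustration, since the list of keywords to consider is exactly the open question, and the hypothetical helper `right_align_keywords` assumes the statement starts in column 0.

```python
# Assumption: an illustrative subset, not an agreed list of keywords.
CLAUSE_KEYWORDS = {"select", "from", "where", "group", "having", "order"}


def right_align_keywords(lines):
    """Right-align the leading clause keyword of each line of one statement."""
    firsts = []
    for line in lines:
        words = line.split()
        is_clause = bool(words) and words[0].lower() in CLAUSE_KEYWORDS
        firsts.append(words[0] if is_clause else None)
    width = max((len(kw) for kw in firsts if kw), default=0)
    result = []
    for line, kw in zip(lines, firsts):
        if kw is None:
            result.append(line)  # continuation line: keep as is
        else:
            rest = line.lstrip()[len(kw):]
            result.append(kw.rjust(width) + rest)
    return result


query = ["select empno", "from emp", "where deptno = 10", "order by empno"]
print("\n".join(right_align_keywords(query)))
```

Lines that do not start with one of the listed keywords, such as `, deptno` continuation lines, are not touched by this sketch.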

Rule 8: Within a program unit only line comments `--` are used. ⛔️

This rule must be applied while writing code. The formatter must not change existing multi-line comments to single-line comments. That's changing code and definitely more than formatting (see also rule 1).

Rule 9: Brackets are used when needed or when helpful to clarify a construct. ⛔️

This rule must be applied while writing code. Missing brackets/parentheses will lead to a compile error. Other cases cannot be handled by a formatter.

Max char line width ✅

This is a SQLDev option. The serializer component of the formatter inserts a line break when a line exceeds the defined width. However, we should override this behavior to at least add an indentation for the subsequent lines.

That's doable, since it is possible to calculate the column of each token in an Arbori program.
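The column-based wrapping can be sketched as follows, again in Python and only as an illustration. The helper `wrap_line` and its `extra_indent` parameter are assumptions; the sketch splits on whitespace, so it would also break inside string literals and comments, which a real implementation must avoid.

```python
import re


def wrap_line(line, max_width, extra_indent=3):
    """Break a line before the token that would exceed max_width and
    indent the continuation lines relative to the original line."""
    body = line.lstrip(" ")
    base = len(line) - len(body)
    continuation = " " * (base + extra_indent)
    wrapped = []
    current = " " * base
    has_token = False
    for token, gap in re.findall(r"(\S+)(\s*)", body):
        # len(current) is the column at which this token would start.
        if has_token and len(current) + len(token) > max_width:
            wrapped.append(current.rstrip())
            current = continuation
        current += token + gap
        has_token = True
    wrapped.append(current)
    return wrapped


long_line = "select empno, deptno, ename from emp where deptno = 10;"
print("\n".join(wrap_line(long_line, 30)))
```

A single token that is longer than the maximum width is still emitted on its own line, since it cannot be split without changing the code.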

Other SQLDev options

This has to be analyzed option by option. By default I would ignore them. This means the code is kept "as is".