elastic/elasticsearch

Input validation

Closed this issue · 11 comments

In #6736 I started trying to define specs for valid IDs, index names, field names etc, to avoid problems such as conflicts created by having an ID called _mapping.

I think this is the wrong approach - a significant number of users will find that they have used identifiers which are no longer legal. Instead, we should identify where conflicts are possible and implement escaping to allow conflicts to be avoided unambiguously.

Possible conflicts can arise in:

  • directory or file names (eg index "C:/")
  • URL paths (eg an ID called _mapping or _create)
  • query string (eg a routing value containing a comma)
  • field paths (eg a field containing a .)
  • scripting (eg a fieldname called _id)

We should provide a simple consistent escaping mechanism with minimal restrictions on allowable characters.

Directory or file names

Today we use the actual index name to create a directory on the filesystem. Instead, the original index name should be stored in the index metadata, and the directory name should be escaped as printable ASCII only (eg percent encoding?).
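The escaping idea above could work roughly as follows - a minimal sketch, not Elasticsearch's actual implementation; the safe character set and function name are illustrative assumptions:

```python
# Sketch: percent-encode an index name so the on-disk directory
# contains only printable ASCII. The SAFE set here is an assumption,
# not the rule set Elasticsearch would actually use.
SAFE = set("abcdefghijklmnopqrstuvwxyz0123456789-")

def encode_dir_name(index_name):
    """Percent-encode every UTF-8 byte outside a safe ASCII subset."""
    out = []
    for b in index_name.encode("utf-8"):
        ch = chr(b)
        if ch in SAFE:
            out.append(ch)
        else:
            out.append("%%%02X" % b)  # e.g. NUL byte -> "%00"
    return "".join(out)

print(encode_dir_name("logs-2015"))                      # logs-2015
print(encode_dir_name("whatever\u0000\u0001whatever"))   # whatever%00%01whatever
```

Because the original name is kept in the index metadata, the encoding only needs to be deterministic and filesystem-safe, not reversible by eye.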

There are more restrictions that need to be applied to filesystem names, most of which are already enforced today:

  • must be lowercase
  • must NOT start with an underscore, a + or a -
  • index names starting with '.' are reserved
  • . and .. are forbidden as complete names, though they may occur within a name
  • \, /, *, ?, ", <, >, |, ,, are forbidden
  • illegal windows filenames are forbidden: CON, PRN, AUX, NUL, COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9, LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, and LPT9. Also avoid these names followed immediately by an extension; for example, NUL.txt is not recommended.
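The rules above could be collected into a single validation routine - a sketch under the assumption that we return a list of human-readable errors; the function name and messages are hypothetical:

```python
# Sketch: validate an index name against the restrictions listed above.
WINDOWS_RESERVED = {
    "CON", "PRN", "AUX", "NUL",
    *(f"COM{i}" for i in range(1, 10)),
    *(f"LPT{i}" for i in range(1, 10)),
}
FORBIDDEN_CHARS = set('\\/*?"<>|,')

def validate_index_name(name):
    errors = []
    if name != name.lower():
        errors.append("must be lowercase")
    if name[:1] in {"_", "+", "-"}:
        errors.append("must not start with _, + or -")
    if name in (".", ".."):
        errors.append("'.' and '..' are forbidden as names")
    elif name.startswith("."):
        errors.append("names starting with '.' are reserved")
    if FORBIDDEN_CHARS & set(name):
        errors.append("contains a forbidden character")
    # Windows also reserves these names with any extension, e.g. NUL.txt
    if name.split(".")[0].upper() in WINDOWS_RESERVED:
        errors.append("reserved Windows filename")
    return errors
```

For example, `validate_index_name("nul.txt")` flags the reserved Windows name, while `validate_index_name("logs-2015")` returns no errors.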

The same should apply to snapshot names.

Filename length should be checked after escaping, not before.

URL paths and query strings

Unicode characters and / are already handled by standard URI encoding. The only special characters are:

  • a leading underscore _
  • a comma ,

These can be escaped automatically by prepending \ (so we would need to escape literal \'s as well).
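A sketch of that escaping scheme, assuming backslash is the escape character as proposed (function names are illustrative):

```python
# Sketch: escape a leading underscore and commas with backslash.
# Literal backslashes must be escaped first to avoid double-escaping.
def escape_id(identifier):
    s = identifier.replace("\\", "\\\\")
    s = s.replace(",", "\\,")
    if s.startswith("_"):
        s = "\\" + s
    return s

def unescape_id(s):
    out, i = [], 0
    while i < len(s):
        if s[i] == "\\" and i + 1 < len(s):
            out.append(s[i + 1])  # take the escaped character literally
            i += 2
        else:
            out.append(s[i])
            i += 1
    return "".join(out)

print(escape_id("_mapping"))  # \_mapping
print(escape_id("a,b"))       # a\,b
```

An ID literally named `_mapping` would then be sent as `\_mapping`, which can no longer collide with the `_mapping` URL endpoint.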

The * is used as a wildcard for various URL parameters, eg index and type names. We cannot automatically escape the wildcard (because usually it should be a plain wildcard). Similarly, the user can't escape it as my_literal_\*, because the \ would itself be escaped as: my_literal_\\*, meaning that the wildcard would still function as a wildcard.

The * should not be allowed in any identifiers which support wildcards (eg index, type, node names etc).

Index wildcards can use a leading + or - to include/exclude patterns, but these characters should already be excluded by the index naming rules above.

Field paths and Scripting

A field containing a . can be escaped as foo\.bar (and a field containing \ as foo\\bar).
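Resolving such a field path means splitting on unescaped dots only - a minimal sketch, with an illustrative function name:

```python
# Sketch: split a field path on '.', honouring backslash escapes,
# so "foo\.bar" is one field named "foo.bar" while "foo.bar" is
# field "bar" nested under "foo".
def split_field_path(path):
    parts, cur, i = [], [], 0
    while i < len(path):
        c = path[i]
        if c == "\\" and i + 1 < len(path):
            cur.append(path[i + 1])  # escaped char is literal
            i += 2
        elif c == ".":
            parts.append("".join(cur))
            cur = []
            i += 1
        else:
            cur.append(c)
            i += 1
    parts.append("".join(cur))
    return parts

print(split_field_path("foo.bar"))    # ['foo', 'bar']
print(split_field_path("foo\\.bar"))  # ['foo.bar']
```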

Fields beginning with _ can clash with metadata fields like _id or _timestamp in queries, aggregations, sorting, and scripting.

There are four options:

  • Forbid fieldnames beginning with an underscore - this is too stringent and will impact existing data
  • Reserve existing metadata field names - this makes it difficult to add more
  • Allow \ escaping to distinguish the metadata _id field from the body \_id field
  • Add a single namespace for metadata fields, eg _root._id or _doc._id (_meta is already taken unfortunately)

Also see #5558

Forbid fieldnames beginning with an underscore - this is too stringent and will impact existing data

Why is this too stringent? I can't imagine there are a ton of people who have fields beginning with underscores, but even for those who do, it doesn't seem like this solves the issue. How would these escaped field names show up in results: as unescaped or escaped? If unescaped, then you have the same problem, just on output instead of input. If escaped, then the user still has to change their code to handle the escaped values, so why is that any different from changing to a name not beginning with an underscore?

Forbid fieldnames beginning with an underscore - this is too stringent and will impact existing data

Why is this too stringent?

We're trying to avoid breaking existing applications, eg see #6736 (comment)

I can't imagine there are a ton of people who have fields beginning with underscores, but even for those who do, it doesn't seem like this solves the issue. How would these escaped field names show up in results: as unescaped or escaped? If unescaped, then you have the same problem, just on output instead of input.

On output, metadata fields are displayed at the same level as the _source field, which is why there is no conflict. It is only when we refer to fields by name (eg in queries, aggs, sorting, scripts) that there is potential for conflict.

That said, I still don't know which of the four options I mentioned is the best solution.

As described in #8042, we should also consider aggregation names

As per my original ticket: #6589

Index names that are physically unwritable on the filesystem could be detected up front with a simple write test, rejecting the index creation if the test fails. This would avoid the awful scenario where Elasticsearch attempts to create an index but is unable to allocate primary shards due to the basic inability to even create the directory.

So: run a simple test of file path validity first, and reject the request if it fails, rather than surfacing "java.io.IOException: Invalid file path" later.
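The write test described above could look something like this - a sketch only; the function name is hypothetical and real validation would also need to run on every data node, since filesystems differ:

```python
# Sketch: probe whether the OS will actually accept `name` as a
# directory name by creating and removing it in a scratch location.
import os
import tempfile

def is_writable_name(name):
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, name)
        try:
            os.mkdir(path)
            os.rmdir(path)
            return True
        except (OSError, ValueError):
            # ValueError covers embedded NUL bytes, OSError covers
            # everything else the filesystem rejects.
            return False
```

On a typical Linux filesystem, `is_writable_name("good-index")` succeeds while `is_writable_name("bad\x00name")` is rejected before any shard allocation is attempted.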

The result when this is not handled can be a cluster stuck in red state until the user/admin deciphers what special encoding to use (for example, for an index name containing control codes) in order to delete the index and restore the cluster to green.

An example of an index with bad chars as described is:

whatever\u0000\u0001whatever

which shows up this way under /_aliases?pretty and can only be removed with a curl -XDELETE against:

localhost:9200/whatever%00%01whatever

These characters are control codes that need to be properly escaped to be dealt with.

http://www.fileformat.info/info/unicode/char/0001/index.htm
http://www.fileformat.info/info/unicode/char/0000/index.htm

So my general point is: some really low-level tests should be added first that validate whether a given string is even usable all the way through, failing fast with a simple check of whether the operation would succeed before continuing to the next step. Then the reason for the failure matters less, because the operation simply won't proceed if it isn't possible on the given platform/combination. You can lay these sorts of tests down first, rather than trying to guess every scenario in which it might fail and validating each one manually.

@clintongormley I think we should consider doing strict validation as a start, and adding support for crazy names as a follow up. I think it is more important to get validation into 2.0 than work out how to support crazy names.

@clintongormley I think we should consider doing strict validation as a start, and adding support for crazy names as a follow up. I think it is more important to get validation into 2.0 than work out how to support crazy names.

+1 for simple, strict rules like those proposed in the original #6736 ... this would have prevented horrible issues like #13858.

+1 for simple strict rules

@clintongormley is there anything left to do from the original issue? Can we close it so long as we promise to live with its spirit in our hearts?

@nik9000 this is so out of date that I don't think the thread is really relevant anymore. It is probably still worth writing up rules for what is and is not allowed - these don't exist today, except in code. And there are still places that you can name yourself into trouble, although users seldom get into trouble with these anymore. Let's close.

although users seldom get into trouble with these anymore. Let's close.

It seems to still be a problem; it still confuses users, and I cannot find where it is documented at the moment: https://discuss.elastic.co/t/legal-character-set-for-field-names/190796

Should this be filed as a docs issue now then? It seems like quite a large scope, though.
Our current tests presumably try invalid characters and words in various places to ensure we don't regress; maybe we could more easily sift the collection of what's invalid out of those tests?
This is a scenario where using something like Swagger would make it much easier to collect such definitive lists for (automatic) documentation.