[API Proposal]: System.Diagnostics.CodeAnalysis.StringSyntaxAttribute
stephentoub opened this issue Β· 70 comments
Background and motivation
VS provides nice colorization of regular expressions when passed to known methods on Regex, e.g.
It also supports putting a comment before any string to mark the language being used in that string:
But there's currently no way to define an arbitrary API and annotate a string parameter as to what language is expected, e.g. you don't get this colorization today for RegexGenerator, because VS hasn't been taught about this API specifically:
While these examples are about regex, more generally this applies to arbitrary domain-specific languages. If VS adds colorization for JSON, for example, it'd be nice if every API that accepts JSON could be annotated appropriately.
API Proposal
namespace System.Diagnostics.CodeAnalysis
{
[AttributeUsage(AttributeTargets.Parameter | AttributeTargets.Field | AttributeTargets.Property, AllowMultiple = false, Inherited = false)]
public sealed class StringLanguageAttribute : Attribute
{
public StringLanguageAttribute(string language);
public string Language { get; }
}
}
API Usage
public sealed class RegexGeneratorAttribute : Attribute
{
public RegexGeneratorAttribute([StringLanguage("regex")] string pattern);
}
Alternative Designs
No response
Risks
No response
I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.
Tagging subscribers to this area: @tommcdon
See info in area-owners.md if you want to be subscribed.
Issue Details
Background and motivation
VS provides nice colorization of regular expressions when passed to known methods on Regex, e.g.
It also supports putting a comment before any string to mark the language being used in that string:
But there's currently no way to define an arbitrary API and annotate a string parameter as to what language is expected, e.g. you don't get this colorization today for RegexGenerator, because VS hasn't been taught about this API specifically:
While these examples are about regex, more generally this applies to arbitrary domain-specific languages. If VS adds colorization for JSON, for example, it'd be nice if every API that accepts JSON could be annotated appropriately.
API Proposal
namespace System.Diagnostics.CodeAnalysis
{
[AttributeUsage(AttributeTargets.Parameter)]
public sealed class StringLanguageAttribute : Attribute
{
public StringLanguageAttribute(string language);
public string Language { get; }
}
}
API Usage
public sealed class RegexGeneratorAttribute : Attribute
{
public RegexGeneratorAttribute([StringLanguage("regex")] string pattern);
}
Alternative Designs
No response
Risks
No response
Author: | stephentoub |
---|---|
Assignees: | - |
Labels: |
|
Milestone: | 7.0.0 |
cc: @CyrusNajmabadi, @jasonmalinowski, @terrajobst
Related to dotnet/roslyn#34640
cc: @joperezr
FWIW, i would love this. Note: we're also leaving design space open with Raw String Literals to allow for a direct way in the language to do it. Specifically:
var x = """regex
\G(\w+\s?\w*),?
""";
Embedded languages being a first class concept would be great. We would also like to do dotnet/roslyn#42634 so that 3rd parties could then implement the LS support for a particular embedded language (instead of Roslyn being responsible for it).
AttributeTargets.Parameter
Though I can't think of any concrete use cases off-hand, my gut reaction is that Field
and Property
seem like reasonable additions since this is meant to be more general than just Regex
.
Though I can't think of any concrete use cases off-hand, my gut reaction is that Field and Property seem like reasonable additions since this is meant to be more general than just Regex.
Yeah, I had that initially, but it seemed like there would be various places where it'd be challenging to reflect that in a tool, e.g. passing such a field by ref would lose that information, and potentially ambiguities as to whether it should be applied, e.g. on a property, is that meant to impact assigning to the property and/or the code in the property returning a string. Combined with my not having scenarios in mind for it, I left it off. But, I'd leave this up to @CyrusNajmabadi and @jasonmalinowski and friends, if they expect it would be useful and feasible initially. It could always be added later.
so that 3rd parties could then implement the LS support for a particular embedded language
Cool. We have APIs in runtime that accept regexes, XML, and JSON; I expect we'd want to annotate many of those with this, e.g.
We'll just need to standardize on a set of names for at least the languages we use, although maybe we would just snap to something like what GitHub uses for markdown language names. Thankfully, in general, the naming choice is obvious.
potentially ambiguities as to whether it should be applied, e.g. on a property, is that meant to impact assigning to the property and/or the code in the property returning a string. Combined with my not having scenarios in mind for it, I left it off.
Allowing it to be set on a property/field so the IDE can directly use that for the assignment/initializer. I'm guessing there's an API out there somewhere like:
rules.Add(new Rule { RegEx = "..." });
And the IDE could process that too. I don't see a huge risk in allowing the attribute against field/property even if we don't immediately support it, in the IDE; there's absolutely ecosystem risk of not allowing it against field/property until "later" and the forcing libraries to have to wait until later .NET versions before they can target that, or they have to start #ifdefing more stuff to deal with downlevel.
Should we have a place where we document known languages as const fields?
Should we have a place where we document known languages as const fields?
You mean like adding:
public sealed class StringLanguageAttribute
{
...
public const string Regex = "regex";
public const string Xml = "xml";
public const string Json = "json";
}
?
Yes. Not sure I'd want them on this particular class (because the class name is quite lengthy and AFAIK the field wouldn't be in scope when applying the attribute, so users would have to qualify the field references).
Yes. Not sure I'd want them on this particular class (because the class name is quite lengthy and AFAIK the field wouldn't be in scope when applying the attribute, so users would have to qualify the field references).
How about System.Diagnostics.CodeAnalysis.KnownStringLanguages
? That way it's closely related to StringLanguageAttribute, but still separate.
Not sure I'd want them on this particular class (because the class name is quite lengthy
I don't have a strong opinion here. But for the number of places this will be used, I personally would be fine with e.g. [StringLanguage(StringLanguageAttribute.Json)]
. If we want to spend another type on it and can make it much shorter and still discoverable, that's fine, too.
I agree with @jasonmalinowski here. Let the attribute appear in these places. Ide can then add support as necessary.
- We think we should rename "language" to "syntax" to avoid any connotations to localization
- We should only expose the language constants as we're adding support for them
namespace System.Diagnostics.CodeAnalysis
{
[AttributeUsage(AttributeTargets.Parameter |
AttributeTargets.Field |
AttributeTargets.Property,
AllowMultiple = false,
Inherited = false)]
public sealed class StringSyntaxAttribute : Attribute
{
public StringSyntaxAttribute(string syntax);
public string Syntax { get; }
public const string Regex = "regex";
// As we're adding more support we can add new languages like:
// public const string Xml = "xml";
// public const string Json = "json";
}
}
@stephentoub Would like to get involved with this. Feel free to assign it to me.
In my opinion, we should add a code analyzer for it, which gives a warning if you pass an unsupported language.
In my opinion, we should add a code analyzer for it, which gives a warning if you pass an unsupported language
How do you define "unsupported"? VS isn't the only possible consumer, nor are we the arbiters for all possible languages. Someone should be able to write a library that accepts strings for any language they like and author tooling components to recognize those names.
In my opinion, we should add a code analyzer for it, which gives a warning if you pass an unsupported language
How do you define "unsupported"? VS isn't the only possible consumer, nor are we the arbiters for all possible languages. Someone should be able to write a library that accepts strings for any language they like and author tooling components to recognize those names.
You are right :). I assumed that only mentioned languages should be supported.
@CyrusNajmabadi, @jasonmalinowski, are you happy with the approved api? Any roadblocks to roslyn moving forward with that name/shape?
Looks generally fine to me; I can't say I'm excited for putting the strings in the attribute type due to the repetition but it seems like that's the least bad option. "Language" seems easy for me to understand, but hey, I'm biased and I work on the language teams. π
Something I didn't think about earlier: what does this look like if you also want to carry along language-specific info? For example, would JSON or XML want to have some way the schema could be specified, so we could also squiggle errors in usage? Or Regexes too: there's diferent Regex options that change what various tokens mean -- if we had a way to know the generator example that @stephentoub started with was going to pass a string with RegexOptions.None, then we could tweak the messages accordingly. Those aren't huge dealbreakers, but something to think about is how this might grow if we want to do those things -- is the expectation that the domain-specific info is carried in a separate attribute (maybe one that derives from this attribute), or do we do something else?
(if the assumption is that are doing inheritence, then it means we might just have [RegexSyntax] as an attribute which also is conveniently shorter than [StringLanguage(StringLanguageAttribute.Regex)])
@CyrusNajmabadi, @jasonmalinowski, are you happy with the approved api? Any roadblocks to roslyn moving forward with that name/shape?
Jason raises a good point about inheritance.
I will also mention that this may be non-trivial for us to support any attribute story. Today we operate by looking very specifically for certain APIs with certain names. e.g. new Regex("...")
. This allows us to shortcircuit from doing analysis on most strings.
We would now have to always do teh work with any string-literal parameter to check the call and see if there is such an attribute on the parameter.
I don't anticipate this being a problem. But i also can't promise that it's certain to be ok. Before you guys ship this, would it make sense for roslyn to at least prototype our side of supporting such an API to make sure its feasible?
Before you guys ship this, would it make sense for roslyn to at least prototype our side of supporting such an API to make sure its feasible?
Yes, please, that's why I asked :-)
then it means we might just have [RegexSyntax] as an attribute which also is conveniently shorter
How would that work when Roslyn can't necessarily see the argument passed to the base ctor?
I'm not personally concerned about verbosity with usage of the attribute.
How would that work when Roslyn can't necessarily see the argument passed to the base ctor?
Ah, you're right, that won't.
Yes, please, that's why I asked :-)
:)
@jinujoseph for visibility. A strong ask from Runtime team to have a new attribute they define that can help us light up embedded language scenarios (like telling us that an API takes a regex so that we turn on our special regex features for that API). This takes us away from teh current situation where Roslyn hardcodes in all the APIs of the interest. But it means roslyn will have to do more work dynamically to figure out if we shoudl be running.
I can spend a couple of days next week prototyping our side of it so that i can benchmark and see if there are any concerns here.
@sharwell as well for thoughts.
Something I didn't think about earlier: what does this look like if you also want to carry along language-specific info? For example, would JSON or XML want to have some way the schema could be specified, so we could also squiggle errors in usage? Or Regexes too: there's diferent Regex options that change what various tokens mean -
This is a good point. For example, RegexOptions influence the way a regex is parsed. If we wanted an analyzer that flagged arguments to attributed parameters as being invalid and that same method also took an options argument, the analyzer should really know about those options. Or if the api author hardcoded the options used, the analyzer would ideally be able to pick those up from the attribute. This may not impact VS colorization today, but it certainly could in the future. And definitely other tooling. So, yeah, it needs more thought. Maybe the attribute should just have an additional Arguments property that's interpreted according to rules defined per syntax, e.g. for regex it could either be another parameter name or a RegexOptions value.
This is a good point. For example, RegexOptions influence the way a regex is parsed.
Yup. And it's worth noting that Roslyn respects this. So if you do new Regex("...", RegexOptions.IgnorePatternWhitespace)
then we understand and do the right thing.
Also, for Json, my prototype understood if you're using Newtonsoft and/or using 'strict mode' (which changes a lot of what is allowed).
for regex it could either be another parameter name or a RegexOptions value.
Yup. Those seem to eb the two main scenarios we'd care about. Either the API will hardcode different options itself (and the attribute can let us know), or the user is supposed to pass them along, and we need to know which parameter matters.
We could be a little more loosygoosy and for the latter case just look at the rest of hte arguments to see if any options are passed along, presuming that they would be for the string. It might not be the right thing in all cases, but it would likely be the right thing in most cases.
We could be a little more loosygoosy and for the latter case just look at the rest of hte arguments to see if any options are passed along
That's fine if the options are strongly typed -- do we have any interesting examples where it might just be a string? My schema examples for XML/JSON it'd be hard to know which might be the extra parameter, but I admit those examples are a bit more contrived than I'd like.
Note: another case where Roslyn IDE lights up is in DateTime strings. There we show things like completion for format strings like so:
So it would be good to have a lang name for this @stephentoub
Note: another case where Roslyn IDE lights up is in DateTime strings. There we show things like completion for format strings like so:
So it would be good to have a lang name for this @stephentoub
To extend on this, wouldn't it be possible to include this on the "well known" ToString formats? e.g. int.ToString("E3")
- Maybe the "lang name" should be something like ToString:Type
?
How would that work when Roslyn can't necessarily see the argument passed to the base ctor?
Ah, you're right, that won't.
We could make it convention-based such that the syntax name is inferred from the type name. For instance, XxxSyntaxAttribute
would be considered Xxx
if XxxSyntaxAttribute
extends StringSyntaxAttribute
. This would also get rid of the list of constants as this would be represented in the docs as "types deriving from StringSyntaxAttribute
From a readability perspective, I'd prefer that but I'm not feeling super strongly about that.
How about DateTimeFormat
?
That was my suggestion for the name of the language
MOre info. When it comes to 'regex options' we always did it as a heuristic anyways. Given an argument we detected to be a regex-string, we just look at sibling arguments to see if any have a constant type whose symbol type is RegexOptions. If so, we presume that that must be controlling the regex-arg-string.
So i'd be fine continuing that strategy. We would need some way to still say what the options are if we want to make it so that no extra arg is passed along and instead the attribute just declares this. in that case, we likely would want to have the attribute-subclass approach where the user would then do:
RegexLanguage(RegexOptions.Whatever | ...)
We'd use immo's approach of matcing hte XxxLanguage name. And we'd then use the rest of the args to determine things like options.
DateTimeFormat
That was my suggestion for the name of the language
Ah. yes, i think that's right. We don't want to call it DateTime as that's ambiguous (is this for parsing string literals, or formatting them?). So DateTimeFormat makes sense to me.
We would need some way to still say what the options are if we want to make it so that no extra arg is passed along and instead the attribute just declares this. in that case, we likely would want to have the attribute-subclass approach where the user would then do
Should we have another ctor that just takes a params object[]?
public StringSyntaxAttribute(string syntax, params object[] arguments);
public object[] Arguments { get; }
That way we wouldn't require subclassing, though we could still make subclassing work if desired.
Should we have another ctor that just takes a params object[]?
public StringSyntaxAttribute(string syntax, params object[] arguments); public object[] Arguments;
A theoretical approach of mine would also be to define the type of arguments using the new attribute generics (C# 10 Feature). Then you have a type safety and don't have to cast the objects.
public StringSyntaxAttribute(string syntax, params T[] arguments);
public T[] Arguments;
This, of course, includes a class without a generic type parameter.
public StringSyntaxAttribute(string syntax);
Then you have a type safety and don't have to cast the objects.
That would require the type to be generic, which would then require you to specify the type even when not providing any arguments at all, and it would necessitate every argument is of the same type. I don't see the benefit.
Then you have a type safety and don't have to cast the objects.
That would require the type to be generic, which would then require you to specify the type even when not providing any arguments at all, and it would necessitate every argument is of the same type. I don't see the benefit.
True. Then your suggestion is probably the best choice.
I'll try that out Stephen. As long as we can grab that value out from teh roslyn API (which i think is doable) that should work.
I'll try that out Stephen. As long as we can grab that value out from teh roslyn API (which i think is doable) that should work.
Thanks. Then we would just need to decide, on a syntax-by-syntax basis, what the arguments mean. For regex, we've discussed two possibilities: argument name specifying the RegexOptions, and the hardcoded RegexOptions. With a params object[]
, we could say that if it's a string it's interpreted as an argument name and if it's a RegexOptions it's obvious. If no arguments are specified, Roslyn could continue to employ its heuristic.
Thanks. Then we would just need to decide, on a syntax-by-syntax basis, what the arguments mean. For regex, we've discussed two possibilities: argument name specifying the RegexOptions, and the hardcoded RegexOptions. With a params object[], we could say that if it's a string it's interpreted as an argument name and if it's a RegexOptions it's obvious. If no arguments are specified, Roslyn could continue to employ its heuristic.
I was going to go simpler. If it's a RegexOptions in the attribute, we use it. Otherwise we fallback to the sibling arguments. It might be possible that this might not work in some cases... but those cases seem soooo niche. i feel like the above approach would be a 99.99% solution. Do you envision cases where that would not suffice?
Do you envision cases where that would not suffice?
No, I just like being explicit rather than relying on heurstics, especially when those heuristics might need to be replicated in multiple tools, e.g. Roslyn for colorization, an analyzer that's validating syntax as part of the build, etc.
No, I just like being explicit rather than relying on heurstics,
That's fine with me. Supporting both a string-parameter-name, or RegexOptions being stored in the Arguments would be trivial i believe. Let me validate that. But given likely very little additional effort, and your own preferences here, i think this approach is fine :)
@bartonjs, opinions on the additional members?
#62505 (comment)
No major opinions. Public arrays are always a fun "is it shared, or is it yours?", but it's an attribute, so it's probably not going to be passed around.
The lack of static typing will mean that anything touching an arguments-based syntax highlighter needs to be prepared for garbage... but they're generally dealing with probably-not-compilable code as input, so hopefully they're nice and defensive. (Documentation will, of course, be a pain)
Will/can we add a constant for sql
? I realize that there are multiple SQL dialects out there, but there is enough commonality to make IDE syntax highlighting helpful. This would be great on DbCommand.CommandText
as well as for libraries like Dapper.
@stephentoub Thanks for the demo about this today. You said it is available in VS17.2, and I'm trying with preview 1, and not seeing it working. Is it supposed to work and I'm doing something wrong, or not in the first preview? (I included the attribute code as you mentioned would still work so I get it in versions prior to .NET7)
D'oh! I just made the argument for why the "Json" text should be a constant, because I spelled it all lower-case. Works once properly cased
Will/can we add a constant for sql? I realize that there are multiple SQL dialects out there, but there is enough commonality to make IDE syntax highlighting helpful. This would be great on DbCommand.CommandText as well as for libraries like Dapper.
We have no plan on this currently. Primarily because there's no real way to estimate the cost of this work currently. For one thing, we have no tech available to us to even suitable lex/parse even one dialect of sql, let alone the myriad dialects actually used in practice. We'd need a strong proposal on how to actually achieve this.
Json has hte benefit of being a staggeringly simple language to add support for. And for regex we have hte canonical impl in teh runtime that we were able to ape in order to add this support.
we have no tech available to us to even suitable lex/parse even one dialect of sql, let alone the myriad dialects actually used in practice
@CyrusNajmabadi is it a requirement that there be full lexing/parsing capabilities just to annotate this? Even very trivial support in an IDE (think highlighting common keywords like SELECT, WHERE, and JOIN plus operators) would make blocks of SQL text a lot easier on the eyes.
@stephentoub @bartonjs did we add any other constants for well-known languages? The approved shape only had one for Regex.
Here is my understanding of the final shape:
namespace System.Diagnostics.CodeAnalysis
{
[AttributeUsage(AttributeTargets.Parameter |
AttributeTargets.Field |
AttributeTargets.Property,
AllowMultiple = false,
Inherited = false)]
public sealed class StringSyntaxAttribute : Attribute
{
public StringSyntaxAttribute(string syntax);
public StringSyntaxAttribute(string syntax, params object[] arguments);
public string Syntax { get; }
public object[] Arguments { get; }
public const string Regex = "regex";
// As we're adding more support we can add new languages like:
// public const string Xml = "xml";
// public const string Json = "json";
}
}
The string is Json
(Upper case J). Lower case like above doesnβt work (at least in preview 1) and threw me off for a while.
@madelson I'm not opposed to adding annotations for languages that Roslyn has no plans to support; this would allow other tools such as Rider to add support. However, I think we'll want at least one party that does something useful with it to ensure the annotations are sensible. For example, if you only ever want syntax highlighting, then sure, "SQL"
might be good enough with a large dictionary of the various keywords used in SQL. However, if an IDE wants to do something more, such as code completion, then the string "SQL"
might no longer be good enough. In the context of the BCL, "regex"
basically means System.Text.RegularExpression
, so that's in practice sufficiently unique. However, in .NET today there are various SQL dialects in use (MS SQL Server, Oracle, Postgress, etc). I can be convinced that this isn't a problem, but I'd like to hear this from an implementing party, not from a consumer because we'd likely make faulty assumptions.
did we add any other constants for well-known languages? The approved shape only had one for Regex.
We added Json, which is in the shown final shape as being fine to add when there's support, and we added DateTimeFormat, as you proposed and was discussed here #62505 (comment).
@stephentoub in the Community Standup you mentioned keeping StringSyntaxAttribute
internal
if we implement it for codebases not targeting .NET 7.
For a multi-targeting public API (think nuget package) the best workaround I can think of without risking assembly conflicts is
public class Class2
{
public void MyMethod([StringSyntax("Regex")] string pattern) { }
}
#if !NET7_0_OR_GREATER
internal sealed class StringSyntaxAttribute : Attribute { ... }
#endif
Is there a way to expose StringSyntaxAttribute
to older TFMs without assembly conflicts or is this a technical limitation?
For a multi-targeting public API (think nuget package) the best workaround I can think of without risking assembly conflicts is
What's wrong with that approach? I don't think that's so much a workaround as it is the recommended course of action if you need to use the attribute downlevel.
Making it internal works beautifully for me in my multi targeted project
Sorry, thought I had tested this and thought the attribute had to be public to take effect.
It works brilliantly - this is awesome!
sorry for dumb question: I ask about SQL. If I know which dialect we are using in our project, is there any way to colorize SQL keywords? I mean parse SQL not by VS itself, but I already have a piece of code that can grab SQL from the code (via Roslyn) + can parse it to the tokens, so I have an information about tokens and theoretically can provide this info to VS visualization layer somehow. is it feasible now?
@terrajobst I hear what you're saying about there being different dialects, but in many if not most cases SQL strings are being passed to a more generic SQL library (DbCommand.CommandText
, EF, Dapper, etc) so even if the IDE had the ability to handle specific flavors of SQL in most cases it would probably have to fall back to lowest common denominator anyway.
Waiting to get buy-in from an implementer makes sense; just offering that as a consumer this would be valuable; definitely more so for my use-cases than JSON or date formats. I also suppose that there doesn't need to be an official constant here so long as the library authors and IDEs agree on a value to use for this.
For SQL, see this comment: #65634 (comment)
[...] even if the IDE had the ability to handle specific flavors of SQL in most cases it would probably have to fall back to lowest common denominator anyway.
I'm not sure what "lowest common denominator" means in SQL. Sure, there's a certain subset of statements which work everywhere, but it's really quite restricted. The moment you e.g. want to get only X rows, SQL Server has TOP, PG/SQLite have LIMIT/OFFSET, etc. IMHO for anything to be useful around SQL, there needs to be some setting somewhere that says what the dialect is, which the analyzer or IDE would pick up.
We are making this extensible at the Roslyn layer. But it will be up to a different party entirely to provide any level of SQL support.
I'm not sure what "lowest common denominator" means in SQL. Sure, there's a certain subset of statements which work everywhere, but it's really quite restricted.
I imagine that if VS/VSCode/Rider were to implement this then it would start with just the base keywords (SELECT, JOIN, WHERE, etc) and maybe some operators. I've worked with SQL in editors with this behavior and a little syntax highlighting goes a long way towards making the code easier to read.
IDEs can of course go beyond this by offering configurable dialect highlighting like JetBrains does.
I'm not sure what "lowest common denominator" means in SQL. Sure, there's a certain subset of statements which work everywhere, but it's really quite restricted.
I imagine that if VS/VSCode/Rider were to implement this then it would start with just the base keywords (SELECT, JOIN, WHERE, etc) and maybe some operators. I've worked with SQL in editors with this behavior and a little syntax highlighting goes a long way towards making the code easier to read.
IDEs can of course go beyond this by offering configurable dialect highlighting like JetBrains does.
We could indeed support only ANSI/ISO SQL (ISO/IEC 9075-1) by means of the identifier sql
and then build on it in the long run with sql+mysql
, sql+mssql
/sql+tsql
(maybe with version identifier? - sql+mysql8.0
)...
That would be my approach now, anyway. Feel free to leave feedback.
build on it in the long run with sql+mysql, sql+mssql/sql+tsql...
As mentioned previously, likely most applications of this would be SQL dialect agnostic members like DbCommand.CommandText
, parameters to various methods in Dapper, EF, etc. The value is in the enhanced readability of SQL blocks in an IDE as opposed to perfect parsing. Many text editors/IDEs/wikis/etc offer this sort of agnostic SQL syntax highlighting indicating that it is indeed useful despite the differences in SQL dialects.