codebuddies/backend

[Tagging] Implementation of Tag Validation

Closed this issue ยท 9 comments

Background

While implementing tagging tests, a number of bugs with the slugify() code and tagging were revealed. In particular, issue #123 and #124 have started a discussion around what we want to allow or disallow in tags, how we want validation and error handling to work with tagging, and weather or not we want to include related tags from a particular endpoint (e.g. resources, hangouts, learning paths, discussions, projects, etc.). Since much of that discussion happened inside the PR for tagging tests, we decided we needed to port it to it's own issue, as we work out the intended behavior and archetecture.

Relevant section of PR #121 around the implementation of tag validation is below:


chris48s** 7 days ago

Author Member

In these two cases, the tests pass based on the values from the fixtures but it seems like these aren't generating particularly useful slugs (esp when you consider the objective of slugging is to make the text unique and URL-safe). What should we be doing here?
If a single emoji isn't a valid tag name, shoudl we just be validating here so that trying to save a tag called "๐Ÿธ" throws ValidationError?

BethanyG 7 days ago

Member

I think that if we can't get emojis to work correctly through slugifying, then yes - we should throw a validation error. I can log an issue about that...along with our issues around Hindi and Telugu. I really don't want to have to transliterate, but we may need to go that route if we can't find an alternative to what slugify is doing to those languages.

BethanyG 7 days ago

Member

Logged issue #123 for the script fail and #124 for the emoji.

lpatmo 7 days ago

Member

(fwiw from a product perspective, I think it's totally fine for us not to support emojis in tags. Validating against it sounds reasonable!)

๐Ÿ‘ 1

BethanyG 20 hours ago

Member

So - I think adding in validators (and tossing errors) for emoji and picto-ascii would be a good thing. But we have some behaviors to think through for the api/app. I am seeing multiple scenarios - but also have questions:

  1. Single or repeated unicode or ascii emoji/symbols as the sole content of a tag : reject out of hand.
    • validator for DRF should toss an error here. HOWEVER - does that mean the entire creation of the resource fails? I don't think so, since tags are not technically part of a resource....but that then creates some issues.
    • if a validation error is thrown, do we create the resource, but hand back a message to the effect that one or more of the tags was invalid? HTML 207 with an embedded 400 and a message of "Your tag text contains one or more unsupported characters, and was not saved."? This needs some detailed scoping.
    • if a validation error is thrown, do we create the resource but omit all the tags and hand back a message?
    • if only one of multiple tags has a problem, do we drop the "bad" tag, or do we fail the whole set of tags?
  2. Single or repeated unicode or ascii emoji/symbols as part of a tag : do we drop these silently - creating the tag/slug without them, or do we throw a validation error for the entire tag? And if we do throw a validation error for the entire tag, which scenario from above do we apply when creating the resource ?
  3. When we validate, will we be validating for both the name and the slug?

chris48s 12 hours ago

Author Member

Single or repeated unicode or ascii emoji/symbols as the sole content of a tag : reject out of hand.

Done.

Single or repeated unicode or ascii emoji/symbols as part of a tag : do we drop these silently - creating the tag/slug without them, or do we throw a validation error for the entire tag?
When we validate, will we be validating for both the name and the slug?

So far, the only validation I've done applies to the slug, so basically whichever variation claims the slug first gets it i.e: "Javascript ๐Ÿ™‚" is a legal tag name, and if you create "Javascript ๐Ÿ™‚" first, that tag now owns the slug javascript.

If you want to apply the same rules to the name, I wonder if it is actually useful for the tag name and slug to be different?


The other questions on API behaviour are probably best moved to another issue to avoid trying to boil the ocean in one PR as this is already getting long. Lets just focus on the model behaviour here.

My gut instinct though is that working out these behaviours would probably be much easier if the operations of:

  • creating tags
  • creating resources
  • applying tags to resources

were 3 separate endpoints/operations. Then you can know if your tag is valid before you apply it to a resource. You can still tie those 3 things into what feels like a single 'page' on the frontend for a nice UX.

lpatmo 3 hours ago

Member

Ohh, I see -- so after a user finishes typing a tag, we can fire off a POST request to a /tags endpoint to check whether that tag is valid. Then they submit the form.

applying tags to resources

Can you say more about how this endpoint would work?


HOWEVER - does that mean the entire creation of the resource fails?

From a naive user perspective, if I was submitting to create a resource and had a tag that was invalid in some way (e.g. had an emoji character), then having the entire form submission fail on submit (e.g. no tags or resources is created) is reasonable to me. I'd expect to see an error about some invalid character in one of the tags, change it appropriately, then re-submit the form.

BethanyG (#121 (comment))

Member

Humm.. as @chris48s suggested -- we probably should take this to a separate issue. So I will make that shortly. Especially since this convo will be hidden once the PR is merged in.

I think the web UX could be fairly straightforward, and could include both a UI message (the following characters are not allowed for tags) & blocking the user from typing said emoji/ascii/character sets (a background form/field validation that dropped or highlighted any unicode or ascii characters in that range, similar to what I have seen with some set password fields). I don't think there is even a need there to fire off a POST - we just need to agree which characters/character sets are going to be off limits, and then fail them before they can get to the backend (theoretically).

But the API itself is a bit different (since we're building for more than the website, and can't give feedback to users in the same way) -- and in that case, for clarity, we probably do want to have a whole creation failure of an object with a message about why. ("your tags contain invalid/out of bounds characters. Characters in the following ranges are not allowed. ...").

OR we have a a separate tagging endpoint & action (pass in the GUID of the object to be tagged, along with a list of tags and get back a verification that tagging for that object has succeeded or failed w/ a message about why).

Now - what characters would cause that is still a bit open....

I would disallow emoji and picto ascii altogether in either tag names or slugs. Why go there? Doesn't seem to add significant value to our users at the moment.

On the other hand, having the ability to tag or note something in your native language does have value - so I still want to noodle on that problem.

lpatmo](https://github.com/lpatmo)** 22 minutes ago

Member

Sounds good! (Btw, to clarify my earlier statement, I was talking about the error message from an API call as well -- that is, even though we'd have front-end validation, I'd like to see an error message returned if I say make a POST request in Postman to create a resource with problematic tags included. Then I'd expect the entire request to fail, if that makes sense. :P)

image
--> error


Decisions We've Made So Far:

  1. The backend/api will throw a ValdiationError for any tag that contains what we consider "invalid" characters.
  2. A validation error will cause the POST request to create a new resource to fail with a return message along the lines of "One or more of your tags contain invalid/unsupported characters."

What We Still Need to Decide/Discuss

  1. What sets of characters will cause a ValidationError?
  2. Will both the name of the tag and the slug of a tag be validated?
  3. Will we continue having tags returned as part of a resources endpoint, or will we move toward having tagging as its own endpoint?
  4. If tagging becomes its own endpoint, what does that look like, and what is the expected flow/interaction?

Just picking out one point from the above ;)

applying tags to resources

Can you say more about how this endpoint would work?

My thought is that you want the ability to apply tags to a resource independently of resource creation. For example, if I find a resource about using pandas and numpy which is already tagged as "python" but not "data science", I should be able to tag it as "data science" at a later date.. but maybe that is just implemented as a PATCH request to /resource/<guid> not its own endpoint though?

Ah thanks, got it! Yeah, a PATCH request to /resource/<guid> makes sense to me as the easiest way to update it. :)

A PATCH is the way it is currently implemented....ish. A PATCH right now replaces tags, but we could modify it to do an "add onto" instead.

I don't think this is the behavior we want when a duplicate tag is submitted:

image

Instead, I think we need to think through what the process will be for checking for duplicates and returning an appropriate message or tag. Failing a resource because of a duplicate tag slug is .. not friendly, IMHO.

Hmm. Good thought. Is this new behaviour as a result of applying 558b06d or is this independent?

I think it's a result of 558b06d. There weren't any other changes made to the tagging code (just all the work you did around the tests). I'll play with it...but suspect we'll need extended view/serializer logic that does (off the top of my head):

  1. A validation for "illegal" characters - fail both the resource and tag if any tags have them.
  2. A slugification of each tag if basic validation above passes
  3. A check for those slugified tags in the DB (a basic "are there duplicates" test)
  4. If a duplicate is found .... use the duplicate as the tag (I'll need to dig in the docs for how we want to do that)
  5. Once the list is compiled, "tag" the resource (save the tag refs).

So tempted to bail on the slugs about now. Sigh.

So....went in and commented out the save() method in tagging/models.py in 558b06d:

class CustomTag(TagBase):
    guid = models.UUIDField(default=uuid.uuid1, editable=False)
    slug = models.SlugField(
        verbose_name=_("Slug"),
        unique=True,
        max_length=100,
        allow_unicode=True
    )
    name = models.CharField(verbose_name=_("Name"), unique=True, max_length=100)

    class Meta:
        db_table = "tagging_custom_tag"
        verbose_name = _("Tag")
        verbose_name_plural = _("Tags")
        app_label = 'tagging'

    # save(self, *args, **kwargs):
      #  self.full_clean()
      #  return super().save(*args, **kwargs)

    def slugify(self, tag, i=None):
        return slugify(tag, allow_unicode=True)

Which then stops the rejection based on duplicate or already existing tags...instead, the code appears to use whichever tag + slug is already in the DB for the slug in question (see tagging/serializers.py for the down and dirty) ..but we'll need to do some more poking at it to make sure that is indeed what it is doing ... and also thinking about if that is exactly what we want it to do here...or not. We also need some additional tests around overlapping & duplicate tags...I should have thought of that....but didn't! Eeek.

Aah no, so 558b06d prevents tagging a resource with the same tag twice (which is a constraint that taggit implements, but only at the manager layer, for.. reasons).

cdfe87a is the commit where we made this change.

The original behaviour was that the tag name "Javascript" generates the slug javascript and then "+java/script" also generates the slug javascript so we change it to javascript_1 and save both. We replaced that with throwing a ValidationError if a new tag generates the same slug as another tag that already exists.

Yeah - neither of those feels great as a behavior, but rejecting the saving of a new resource because someone didn't guess the magic slug for their JavaScript vs Javascript vs java/script** tag feels more wrong.

Isn't "hash collision" fun? The "long way" of fixing this would be to map names that slug to the same thing in the DB to their slug but retain the name as entered (so...bucketing based on slug), and letting the user think whatever they enter is totally fine. is it?? That's the big question here. The lazy workaround hack was to slap a number on the end of the slug whenever it collided...but that just makes the queries for related terms harder to glue together.

Have I mentioned how annoying tags are? ๐Ÿ˜‰ Joking, of course...but they are involved.

I did a little digging (not a lot), and found this (see the final comment).

I am interpreting this to mean that while taggit does indeed have most of the functionality described in my previous comment (a lookup to see if the tag name exists already, and if it does -- use the existing tag/slug pair), they didn't take it as far as maybe they should have (there aren't any up-front sanitation settings).

While lookup for deduping can be set to case insensitive (and we have set that), the first "badly capitalized" tag that calls save() wins as the standard...and that's at the root of many of our issues here. So I think we need to precede any of that taggit saving or lookup logic with lowercasing & what we term illegal character(s) checking. Otherwise, we'll never get a proper dedupe hit -- and then we'll have even more slug collisions.

This post from SO from 2011 talks about a branch that made a "force lowercase" setting, but that branch is gone.

So it looks like we'd have to write something ourselves for the lowercasing. Given how this behaves, we'd want the illegal character checks and the lower-casing to happen in the serializers.py file (i.e. on intake, before taggit proper is evoked). That means validation for illegal characters happens on the name passed in. I am also thinking that the slugify code needs to be changed to "pass through" unicode that falls into the abugida scripts range, so that slugs for those languages are not mangled.

Thoughts?


I am going to pull @lpatmo current changes into a branch and take a shot at the following:

  1. Write some code in the serializer.py file that checks for "illegal characters". Right now, I am going to define those as any picto-ascii or Unicode in the emoji range. A validation error will be thrown for the tag (and the resource) if any of the tags contain any illegal characters.

  2. Add some code in the serializer.py file that (if no illegal character validation error is thrown) forces the tags to lower case before they're checked against the DB. This of course will only apply to those languages for which lower() works. I may need more research here, because this may be locale-dependent behavior for python.

  3. Add some code in the slugify() method that bypasses whatever the heck it is doing to abugida scripts. Yes - this makes URIs dangerous. Propose we debate / deal with that later, since we do not yet have a call for or need for tag urls in the first place.

  4. Play with / test if 1 and 2 fix enough of our issue that we can leave the duplicate tags check in place for the save. But I suspect we won't need it if we get 1 & 2 in a better place.

I also propose we limit tag entry on the front end to lowercase only - but as a soft limit. I think failing an API save for case-sensitivity is harsh, and we can express the requirement in our docs or messaging without doing a hard failure (and we can force-lowercase in the processing code). I also think that if someone browses the tag list and only sees lowercased tags, the requirement is fairly clear. We can also do a "look ahead" or "lookup while typing".