Add in normalization for Twitter hashtags

Question

This is needed because Danbooru now normalizes hashtags, following danbooru/danbooru#4243.

I rewrote all of the regexes from that issue so that they work in JavaScript, and changed them to use a capture group instead of substitution to get the normalized tag.

const COMMON_TAG_REGEXES = [
    /(.+?)生誕祭(?:\d*)?$/,
    /(.+?)版もうひとつの深夜の真剣お絵描き60分一本勝負(?:_\d+)?$/,
    /(.+?)版深夜の真剣お絵描き60分一本勝負(?:_\d+)?$/,
    /(.+?)深夜の真剣お絵描き60分一本勝負(?:_\d+)?$/,
    /(.+?)版深夜のお絵描き60分一本勝負(?:_\d+)?$/,
    /(.+?)版真剣お絵描き60分一本勝(?:_\d+)?$/,
    /(.+?)版お絵描き60分一本勝負(?:_\d+)?$/
];

If a regex matches, use its first capture group; otherwise, pass the tag through as is. The following is an example of how to do this.

function normalizeHashtag(hashtag) {
    for (let i = 0; i < COMMON_TAG_REGEXES.length; i++) {
        let match = hashtag.match(COMMON_TAG_REGEXES[i]);
        if (match) {
            return match[1];
        }
    }
    return hashtag;
}
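
As a quick check of the capture-group approach, the sketch below repeats the list and function from above and runs them against a few sample hashtags. The character and series prefixes in the inputs are made up for illustration and are not taken from Danbooru.

```javascript
// Same patterns as above: group 1 captures everything before the common suffix.
const COMMON_TAG_REGEXES = [
    /(.+?)生誕祭(?:\d*)?$/,
    /(.+?)版もうひとつの深夜の真剣お絵描き60分一本勝負(?:_\d+)?$/,
    /(.+?)版深夜の真剣お絵描き60分一本勝負(?:_\d+)?$/,
    /(.+?)深夜の真剣お絵描き60分一本勝負(?:_\d+)?$/,
    /(.+?)版深夜のお絵描き60分一本勝負(?:_\d+)?$/,
    /(.+?)版真剣お絵描き60分一本勝(?:_\d+)?$/,
    /(.+?)版お絵描き60分一本勝負(?:_\d+)?$/
];

function normalizeHashtag(hashtag) {
    for (const regex of COMMON_TAG_REGEXES) {
        const match = hashtag.match(regex);
        if (match) {
            return match[1];
        }
    }
    return hashtag;
}

// Illustrative inputs (assumed, not real Danbooru data).
console.log(normalizeHashtag("博麗霊夢生誕祭2020")); // 博麗霊夢
console.log(normalizeHashtag("艦これ版深夜の真剣お絵描き60分一本勝負_20200110")); // 艦これ
// (.+?) needs at least one character before the suffix,
// so a tag that consists only of the common term is returned unchanged.
console.log(normalizeHashtag("深夜の真剣お絵描き60分一本勝負"));
```

Because the capture group must match at least one character, an exact-term tag never produces an empty result here.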

Of course, you could also drop the first capture group and do a substitution instead, as Danbooru does. Without a negative lookbehind (which isn't available on Firefox), though, that approach has a problem: you have to check for exact term matches to avoid a bad hit, since some tags (like the one for Touhou) use some of those terms in their exact form.
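
For engines that do support lookbehind, one way to keep a substitution from swallowing an exact-term tag is to require that the suffix not start at the very beginning of the string. This is only a sketch: the `(?<!^)` guard and the names `GUARDED_REGEX` and `stripSuffixWithLookbehind` are my own assumptions, not the regexes Danbooru uses, and the script will fail to parse in an engine without lookbehind support.

```javascript
// Hypothetical variant of one pattern.
// (?<!^) rejects any match that would start at index 0 of the string.
const GUARDED_REGEX = /(?<!^)深夜の真剣お絵描き60分一本勝負(?:_\d+)?$/;

function stripSuffixWithLookbehind(hashtag) {
    return hashtag.replace(GUARDED_REGEX, "");
}

// With a prefix (an illustrative one), the suffix is removed.
console.log(stripSuffixWithLookbehind("東方深夜の真剣お絵描き60分一本勝負")); // 東方
// The exact term starts at index 0, so nothing is replaced.
console.log(stripSuffixWithLookbehind("深夜の真剣お絵描き60分一本勝負"));
```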

Answer 1 · 2020-01-10T07:59:31.000Z

I didn't understand what you meant by bad hits.
Also, it looks like substitution performs better.

What about the birthday tag that ends in 誕生祭 instead of 生誕祭?

And is it better to do this for all sites or only for Twitter?

Answer 2 · 2020-01-10T15:23:17.000Z

Yeah, I forgot to include that birthday tag in the pull request I made. I just opened another pull request for it (danbooru/danbooru#4255).

Anyway, what I meant by bad hits is that the regex substitution may match the entire hashtag, leaving you with an empty string.

For example, the following hashtag is used for Touhou (wiki link):

深夜の真剣お絵描き60分一本勝負

That also happens to be one of the common regexes above. So if the script sees that tag and does a substitution, you'll end up with an empty string and the original hashtag won't be checked, meaning that it won't discover the Touhou tag.
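
A minimal sketch of that failure, using just the one pattern and the bare hashtag (the variable names are only for illustration):

```javascript
// One of the substitution-style patterns: no capture group, suffix only.
const suffixPattern = /深夜の真剣お絵描き60分一本勝負(?:_\d+)?$/;

// The hashtag is exactly the common term, with nothing in front of it.
const bareHashtag = "深夜の真剣お絵描き60分一本勝負";

// The pattern consumes the entire string, so nothing is left to look up.
const stripped = bareHashtag.replace(suffixPattern, "");
console.log(stripped === ""); // true
```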

I suppose one thing that could be done would be to check if the result of the substitution is an empty string, and if so, return the original tag.

const COMMON_TAG_REGEXES = [
    /生誕祭(?:\d*)?$/,
    /版もうひとつの深夜の真剣お絵描き60分一本勝負(?:_\d+)?$/,
    /版深夜の真剣お絵描き60分一本勝負(?:_\d+)?$/,
    /深夜の真剣お絵描き60分一本勝負(?:_\d+)?$/,
    /版深夜のお絵描き60分一本勝負(?:_\d+)?$/,
    /版真剣お絵描き60分一本勝(?:_\d+)?$/,
    /版お絵描き60分一本勝負(?:_\d+)?$/
];

function normalizeHashtag(hashtag) {
    for (let i = 0; i < COMMON_TAG_REGEXES.length; i++) {
        let normalized_hashtag = hashtag.replace(COMMON_TAG_REGEXES[i], "");
        if (normalized_hashtag !== hashtag) {
            if (normalized_hashtag !== "") {
                return normalized_hashtag;
            }
            // The whole tag was consumed, so stop and return the original.
            break;
        }
        }
    }
    return hashtag;
}
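
To tie this back to the 誕生祭 question, here is a sketch of the same guarded substitution with that suffix added. The 誕生祭 pattern simply mirrors the 生誕祭 one and is an assumption on my part (it is not copied from danbooru/danbooru#4255). The list is shortened to three patterns, and `SUFFIX_REGEXES`, `stripCommonSuffix`, and the prefixed sample tags are illustrative.

```javascript
// Shortened list for illustration; the second entry is an assumed pattern.
const SUFFIX_REGEXES = [
    /生誕祭(?:\d*)?$/,
    /誕生祭(?:\d*)?$/, // assumption: mirrors the 生誕祭 pattern
    /深夜の真剣お絵描き60分一本勝負(?:_\d+)?$/
];

function stripCommonSuffix(hashtag) {
    for (const regex of SUFFIX_REGEXES) {
        const stripped = hashtag.replace(regex, "");
        if (stripped !== hashtag) {
            // An empty result means the tag was the common term itself.
            return stripped !== "" ? stripped : hashtag;
        }
    }
    return hashtag;
}

console.log(stripCommonSuffix("博麗霊夢誕生祭")); // 博麗霊夢
console.log(stripCommonSuffix("博麗霊夢生誕祭2020")); // 博麗霊夢
// The bare term is returned as is, so it can still be looked up.
console.log(stripCommonSuffix("深夜の真剣お絵描き60分一本勝負"));
```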

Answer 3 · 2020-01-10T15:41:10.000Z

> And is it better to do this for all sites or only for Twitter?

I haven't seen those tags used on other sites. Other sites usually have their own tag patterns, though; for instance, Pixiv has /\d+users入り\z/i (that is Ruby syntax; in JavaScript the \z anchor would be written as $).
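
If other sites were ever handled, one option would be to key the patterns by site. A rough sketch, assuming a hypothetical `SITE_TAG_REGEXES` table and `normalizeSiteTag` helper, and assuming the Pixiv suffix should simply be stripped. The Pixiv pattern is the one quoted above with `\z` replaced by `$`, the Twitter entry is a single pattern from earlier in the thread, and the sample tag is made up.

```javascript
// Hypothetical per-site table; each site gets its own suffix patterns.
const SITE_TAG_REGEXES = new Map([
    ["twitter", [/版お絵描き60分一本勝負(?:_\d+)?$/]],
    ["pixiv", [/\d+users入り$/i]]
]);

function normalizeSiteTag(site, tag) {
    const regexes = SITE_TAG_REGEXES.get(site) || [];
    for (const regex of regexes) {
        const stripped = tag.replace(regex, "");
        // Skip empty results so an exact-term tag is never lost.
        if (stripped !== tag && stripped !== "") {
            return stripped;
        }
    }
    return tag;
}

console.log(normalizeSiteTag("pixiv", "東方5000users入り")); // 東方
// The Pixiv pattern is not applied to Twitter tags.
console.log(normalizeSiteTag("twitter", "東方5000users入り"));
```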