URL normalizer to canonicalize (standardize) the text representation of a URL to determine if differently-formatted URLs are identical

Primary language: C#. License: Apache License 2.0 (Apache-2.0).

Toimik.UrlNormalization

.NET 8 C# URL normalizer.

Features

URL normalization, also known as URL canonicalization, is the process of normalizing (standardizing) the text representation of a URL to determine if differently-formatted URLs are identical.

All URLs

  • Duplicate slashes are removed
    file://example.com/foo//bar.html → file://example.com/foo/bar.html

  • Default port is removed
    ftp://example.com:21/ → ftp://example.com/

  • Dot-segments are removed
    file://example.com/foo/./bar/baz/../qux → file://example.com/foo/bar/qux

  • Empty path is converted to "/"
    ftp://example.com → ftp://example.com/

  • Percent-encoded triplets are uppercased
    ftp://example.com/foo%2a → ftp://example.com/foo%2A

  • Percent-encoded triplets of unreserved characters are decoded
    ftp://example.com/%7Efoo → ftp://example.com/~foo

  • Scheme and host are lowercased
    FTP://User@Example.COM/Foo → ftp://User@example.com/Foo
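
The path-level rules above can be illustrated with a minimal, self-contained sketch. It is not the library's implementation: the class PathNormalizationSketch and all of its method names are hypothetical, the dot-segment handling is a simplified form of RFC 3986 section 5.2.4 that assumes an absolute path, and the order in which the steps are combined is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of Toimik.UrlNormalization.
public static class PathNormalizationSketch
{
    // Collapses every run of adjacent slashes into a single slash.
    public static string CollapseSlashes(string path) =>
        Regex.Replace(path, "/{2,}", "/");

    // Removes "." and ".." segments. Assumes an absolute path (leading "/").
    public static string RemoveDotSegments(string path)
    {
        var segments = path.Split('/');
        var output = new List<string>();
        for (var i = 0; i < segments.Length; i++)
        {
            var segment = segments[i];
            var isDotSegment = segment == "." || segment == "..";

            // ".." drops the previous segment but never the leading empty one.
            if (segment == ".." && output.Count > 1)
            {
                output.RemoveAt(output.Count - 1);
            }

            if (!isDotSegment)
            {
                output.Add(segment);
            }
            else if (i == segments.Length - 1)
            {
                // A trailing dot-segment leaves a trailing slash behind.
                output.Add(string.Empty);
            }
        }

        return string.Join("/", output);
    }

    // Decodes triplets of unreserved characters (ALPHA, DIGIT, "-", ".", "_", "~")
    // and uppercases the hex digits of every other triplet.
    public static string NormalizePercentEncoding(string text) =>
        Regex.Replace(text, "%[0-9A-Fa-f]{2}", match =>
        {
            var c = (char)Convert.ToInt32(match.Value.Substring(1), 16);
            var isUnreserved = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                || (c >= '0' && c <= '9')
                || c == '-' || c == '.' || c == '_' || c == '~';
            return isUnreserved ? c.ToString() : match.Value.ToUpperInvariant();
        });

    // Applies the steps in one pass; an empty path becomes "/".
    public static string NormalizePath(string path)
    {
        if (path.Length == 0)
        {
            return "/";
        }

        return RemoveDotSegments(CollapseSlashes(NormalizePercentEncoding(path)));
    }
}
```

For example, PathNormalizationSketch.NormalizePath("/foo/./bar//baz/../%7Equx%2a") returns "/foo/bar/~qux%2A". Scheme and host lowercasing and default-port removal are left out of the sketch.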

HTTP-specific URLs

  • Directory index can be removed (optional, via removableDirectoryIndexNames)
    http://example.com/default.asp → http://example.com/
    http://example.com/a/index.html → http://example.com/a/

  • Fragment can be removed (optional, via isFragmentIgnored)
    http://example.com/bar.html#section1 → http://example.com/bar.html

  • Scheme can be changed (optional, via preferredScheme)
    https://example.com/ → http://example.com/

  • Query parameters are sorted
    http://example.com/display?lang=en&article=fred → http://example.com/display?article=fred&lang=en

  • User-info can be removed (optional, via isUserInfoIgnored)
    http://user:password@example.com → http://example.com/

  • Empty query is removed
    http://example.com/display? → http://example.com/display
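
The query, fragment, and directory-index rules above can be sketched with plain string handling. This is only an illustration under stated assumptions: HttpRulesSketch and its methods are hypothetical names, the ordinal sort over whole "key=value" pairs is an assumption about ordering, and the default value of the isFragmentIgnored parameter here is chosen for the example rather than taken from the library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration only; not part of Toimik.UrlNormalization.
public static class HttpRulesSketch
{
    // Sorts the "key=value" pairs of a query (given without the leading "?").
    // Returns an empty string when the query has no parameters.
    public static string SortQuery(string query)
    {
        var parameters = query
            .Split('&', StringSplitOptions.RemoveEmptyEntries)
            .OrderBy(parameter => parameter, StringComparer.Ordinal);
        return string.Join("&", parameters);
    }

    // Sorts the query, drops an empty query, and optionally drops the fragment.
    public static string NormalizeQueryAndFragment(string url, bool isFragmentIgnored = true)
    {
        // Split off the fragment first, because "?" may legally appear inside it.
        var fragment = string.Empty;
        var hashIndex = url.IndexOf('#');
        if (hashIndex >= 0)
        {
            fragment = url.Substring(hashIndex);
            url = url.Substring(0, hashIndex);
        }

        var query = string.Empty;
        var queryIndex = url.IndexOf('?');
        if (queryIndex >= 0)
        {
            query = SortQuery(url.Substring(queryIndex + 1));
            url = url.Substring(0, queryIndex);
        }

        var result = url;
        if (query.Length > 0)
        {
            result += "?" + query;
        }

        if (!isFragmentIgnored)
        {
            result += fragment;
        }

        return result;
    }

    // Strips the last path segment when it is one of the given index names.
    public static string RemoveDirectoryIndex(string path, ISet<string> indexNames)
    {
        var lastSlash = path.LastIndexOf('/');
        var fileName = path.Substring(lastSlash + 1);
        return indexNames.Contains(fileName) ? path.Substring(0, lastSlash + 1) : path;
    }
}
```

For example, HttpRulesSketch.NormalizeQueryAndFragment("http://example.com/display?lang=en&article=fred") returns "http://example.com/display?article=fred&lang=en", and HttpRulesSketch.RemoveDirectoryIndex("/a/index.html", new HashSet<string> { "index.html" }) returns "/a/".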

Quick Start

Installation

Package Manager

PM> Install-Package Toimik.UrlNormalization

.NET CLI

> dotnet add package Toimik.UrlNormalization

Usage

UrlNormalizer.cs

// Use default arguments
// var normalizer = new UrlNormalizer();

// Use custom arguments
var normalizer = new UrlNormalizer(isAdjacentSlashesCollapsed: false);

var url = ...
var normalizedUrl = normalizer.Normalize(url);

HttpUrlNormalizer.cs

// Use default arguments
// var normalizer = new HttpUrlNormalizer();

// Use custom arguments
var normalizer = new HttpUrlNormalizer(
    preferredScheme: "https",
    isUserInfoIgnored: false,
    removableDirectoryIndexNames: new HashSet<string>(0), // override the default
    isFragmentIgnored: false);

var url = ...
var normalizedUrl = normalizer.Normalize(url);