qsv: A Rust repository from tmtmtmtm

qsv: Ultra-fast CSV data-wrangling toolkit

qsv is a command line program for indexing, slicing, analyzing, filtering, enriching, validating & joining CSV files. Commands are simple, fast & composable.

Table of Contents

* Available Commands
* Installation Options
* Whirlwind Tour
* Cookbook
* FAQ
* Changelog
* Performance Tuning
* Benchmarks
* Environment Variables
* Feature Flags
* Testing
* NYC School of Data 2022 slides
* Sponsor

ℹ️ NOTE: qsv is a fork of the popular xsv utility, merging several pending PRs since xsv 0.13.0's May 2018 release. On top of xsv's 20 commands, it adds numerous new features; 31 additional commands; 6 apply subcommands & 35 operations; and 5 to subcommands (for a total of 97). See FAQ for more details.

Available commands

Command	Description
apply ❇️🚀🧠	Apply series of string, date, math, currency & geocoding transformations to a CSV column. It also has some basic NLP functions (similarity, sentiment analysis, profanity, eudex & language detection).
applydp 🚀	applydp is a slimmed-down version of `apply` with only Datapusher+ relevant subcommands/operations (`qsvdp` binary variant only).
behead	Drop headers from a CSV.
cat	Concatenate CSV files by row or by column.
count 📇	Count the rows in a CSV file. (Instantaneous with an index.)
dedup 🗜️🚀	Remove duplicate rows (See also `extdedup`, `extsort`, `sort` & `sortcheck` commands).
diff 🚀	Find the difference between two CSVs with ludicrous speed! e.g. compare two CSVs with 1M rows x 9 columns in under 600ms!
enum	Add a new column enumerating rows by adding a column of incremental or uuid identifiers. Can also be used to copy a column or fill a new column with a constant value.
excel	Exports a specified Excel/ODS sheet to a CSV file.
exclude 📇	Removes a set of CSV data from another set based on the specified columns.
explode	Explode rows into multiple ones by splitting a column value based on the given separator.
extdedup	Remove duplicate rows from an arbitrarily large CSV/text file using a memory-mapped, on-disk hash table. Unlike the `dedup` command, this command does not load the entire file into memory nor does it sort the deduped file.
extsort 🚀	Sort an arbitrarily large CSV/text file using a multithreaded external merge sort algorithm.
fetch ❇️🧠	Fetches data from web services for every row using HTTP Get. Comes with HTTP/2 adaptive flow control, jql JSON query language support, dynamic throttling (RateLimit) & caching with optional Redis support for persistent caching.
fetchpost ❇️🧠	Similar to `fetch`, but uses HTTP Post. (HTTP GET vs POST methods)
fill	Fill empty values.
fixlengths	Force a CSV to have same-length records by either padding or truncating them.
flatten	A flattened view of CSV records. Useful for viewing one record at a time. e.g. `qsv slice -i 5 data.csv \| qsv flatten`.
fmt	Reformat a CSV with different delimiters, record terminators or quoting rules. (Supports ASCII delimited data.)
foreach ❇️	Loop over a CSV to execute bash commands. (not available on Windows)
frequency 📇🏎️	Build frequency tables of each column. Uses multithreading to go faster if an index is present.
generate ❇️	Generate test data by profiling a CSV using Markov decision process machine learning.
headers	Show the headers of a CSV. Or show the intersection of all headers between many CSV files.
index	Create an index for a CSV. This is very quick & provides constant time indexing into the CSV file. Also enables multithreading for `frequency`, `split`, `stats` & `schema` commands.
input	Read CSV data with special quoting, trimming, line-skipping & UTF-8 transcoding rules. Typically used to "normalize" a CSV for further processing with other qsv commands.
join 📇	Inner, outer, cross, anti & semi joins. Automatically creates a simple, in-memory hash index to make it fast.
jsonl	Convert newline-delimited JSON (JSONL/NDJSON) to CSV. See `tojsonl` command to convert CSV to JSONL.
luau ❇️	Create a new computed column, filter rows or compute aggregations by executing a Luau script for every row of a CSV file.
partition	Partition a CSV based on a column value.
pseudo	Pseudonymise the value of the given column by replacing them with an incremental identifier.
py ❇️	Create a new computed column or filter rows by evaluating a python expression on every row of a CSV file. Python's f-strings are particularly useful for extended formatting, with the ability to evaluate Python expressions as well.
rename	Rename the columns of a CSV efficiently.
replace	Replace CSV data using a regex.
reverse 🗜️	Reverse order of rows in a CSV. Unlike the `sort --reverse` command, it preserves the order of rows with the same key.
safenames	Modify headers of a CSV to only have "safe" names - guaranteed "database-ready" names.
sample 📇	Randomly draw rows (with optional seed) from a CSV using reservoir sampling (i.e., use memory proportional to the size of the sample).
schema 📇🏎️	Infer schema from CSV data, replete with data type & domain/range validation & output in JSON Schema format. Uses multithreading to go faster if an index is present. See `validate` command to use the generated JSON Schema to validate if similar CSVs comply with the schema.
search	Run a regex over a CSV. Applies the regex to each field individually & shows only matching rows.
searchset	Run multiple regexes over a CSV in a single pass. Applies the regexes to each field individually & shows only matching rows.
select	Select, re-order, duplicate or drop columns.
slice 📇	Slice rows from any part of a CSV. When an index is present, this only has to parse the rows in the slice (instead of all rows leading up to the start of the slice).
sniff	Quickly sniff CSV metadata (delimiter, header row, preamble rows, quote character, flexible, is_utf8, number of records, number of fields, field names & data types).
sort 🚀🗜️	Sorts CSV data in alphabetical (with case-insensitive option), numerical, reverse, unique or random (with optional seed) order (See also `extsort` & `sortcheck` commands).
sortcheck 📇	Check if a CSV is sorted. With the --json options, also retrieve record count, sort breaks & duplicate count.
split 📇🏎️	Split one CSV file into many CSV files of N chunks. Uses multithreading to go faster if an index is present.
stats 📇🗜️🏎️	Compute summary statistics (sum, min/max/range, min/max length, mean, stddev, variance, nullcount, sparsity, quartiles, IQR, lower/upper fences, skewness, median, mode/s, antimode/s & cardinality) & make GUARANTEED data type inferences (Null, String, Float, Integer, Date, DateTime) for each column in a CSV. Uses multithreading to go faster if an index is present.
table 🗜️	Show aligned output of a CSV using elastic tabstops.
to ❇️🚀	Convert CSV files to PostgreSQL, SQLite, XLSX, Parquet and Data Package.
tojsonl 📇🏎️	Smartly converts CSV to a newline-delimited JSON (JSONL/NDJSON). By scanning the CSV first, it "smartly" infers the appropriate JSON data type for each column. See `jsonl` command to convert JSONL to CSV. Uses multithreading to go faster if an index is present.
transpose 🗜️	Transpose rows/columns of a CSV.
validate 📇🚀	Validate CSV data with JSON Schema (See `schema` command) & put invalid records into a separate file & a validation error report file. If no jsonschema file is provided, validates if a CSV conforms to the RFC 4180 standard.

❇️: enabled by a feature flag on qsv. Not available on qsvlite. qsvdp has luau & applydp pre-enabled.
📇: uses an index when available. join creates its own in-memory index automatically.
🗜️: loads entire CSV into memory, though dedup, stats & transpose have "streaming" modes as well.
🧠: expensive operations are memoized (cached) with available inter-session Redis caching for fetch commands.
🏎️: multithreaded when an index is available.
🚀: multithreaded even without an index.
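
As a rough sketch of how the index-aware commands compose, the following creates a tiny CSV, indexes it, then counts, slices and flattens it. The file name and its contents are made up for illustration, and the qsv calls only run when a qsv binary is found on the PATH.

```shell
# Work in a scratch directory; sample.csv and its contents are hypothetical.
workdir="$(mktemp -d)"
printf 'id,name,qty\n1,alpha,10\n2,beta,20\n3,gamma,30\n' > "$workdir/sample.csv"

if command -v qsv >/dev/null 2>&1; then
  # Build the index (written next to the input as sample.csv.idx)
  qsv index "$workdir/sample.csv"
  # Count rows; with an index present this does not need to scan the file
  qsv count "$workdir/sample.csv"
  # Pick the record at index 1 (zero-based) and view it one field per line
  qsv slice -i 1 "$workdir/sample.csv" | qsv flatten
fi
```

Because each command reads CSV on stdin and writes CSV on stdout, the same pattern extends to longer pipelines.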

Installation Options

Option 1: Download Prebuilt Binaries

Full-featured prebuilt binary variants of the latest qsv version for Windows, Linux & macOS are available for download, including binaries compiled with Rust Nightly (more info).

Option 2: Homebrew

For macOS and Linux (64-bit), you can quickly install qsv with Homebrew. However, only the apply feature is enabled.

brew install qsv

Option 3: Install with Rust

If you have Rust installed, you can also install from source using Rust's cargo command¹:

cargo install qsv --locked --features all_full

The binary will be installed in ~/.cargo/bin.

To install different variants and enable optional features, use cargo --features (see Feature Flags for more info):

# to install qsv with all features enabled
cargo install qsv --locked --features apply,generate,luau,fetch,foreach,python,to,self_update,full
# or shorthand
cargo install qsv --locked -F all_full

# or to install qsvlite
cargo install qsv --locked -F lite

# or to install qsvdp
cargo install qsv --locked -F datapusher_plus

Option 4: Compile from Source

Compiling from source also works similarly¹:

git clone https://github.com/jqnatividad/qsv.git
cd qsv
cargo build --release --locked --features all_full

The compiled binary will end up in ./target/release/.

To compile different variants and enable optional features:

# to compile qsv with all features enabled
cargo build --release --locked --features apply,generate,luau,fetch,foreach,python,to,self_update,full
# shorthand
cargo build --release --locked -F all_full

# for qsvlite
cargo build --release --locked -F lite

# for qsvdp
cargo build --release --locked -F datapusher_plus

Variants

There are three binary variants of qsv:

qsv - feature-capable(❇️), with the prebuilt binaries enabling all applicable features except Python ²
qsvlite - all features disabled (~33% of the size of qsv)
qsvdp - optimized for use with DataPusher+ with only DataPusher+ relevant commands; applydp, a slimmed-down version of the apply feature; embedded luau interpreter; the --progressbar option disabled; and the self-update only checking for new releases, requiring an explicit --update (~33% of the size of qsv).

Regular Expression Syntax

The --select option and several commands (apply, schema, search, searchset, select & replace) allow the user to specify regular expressions. We use the regex crate to parse, compile and execute these expressions. ³

"Its syntax is similar to Perl-style regular expressions, but lacks a few features like look around and back references. In exchange, all searches execute in linear time with respect to the size of the regular expression and search text."

The full syntax is described in the regex crate's documentation.
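
To illustrate, here is a minimal sketch of patterns that stay within this syntax (anchors, alternation and repetition, with no look-around or back-references). The data file and its column names are hypothetical, and the qsv calls are skipped when qsv is not installed.

```shell
# Scratch data; fruits.csv and its columns are made up for this example.
workdir="$(mktemp -d)"
printf 'id,name\n1,apple\n2,banana\n3,cherry\n' > "$workdir/fruits.csv"

if command -v qsv >/dev/null 2>&1; then
  # Keep rows where any field matches the alternation
  qsv search '^(apple|cherry)$' "$workdir/fruits.csv"
  # Restrict matching to the name column only
  qsv search --select name 'an+a$' "$workdir/fruits.csv"
fi
```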

File formats

qsv recognizes UTF-8/ASCII encoded, CSV (.csv) & TSV files (.tsv & .tab). CSV files are assumed to have "," (comma) as a delimiter, and TSV files, "\t" (tab) as a delimiter. The delimiter is a single ascii character that can be set either by the --delimiter command-line option or with the QSV_DEFAULT_DELIMITER environment variable or automatically detected when QSV_SNIFF_DELIMITER is set.

When using the --output option, qsv will UTF-8 encode the file & automatically change the delimiter used in the generated file based on the file extension - i.e. comma for .csv, tab for .tsv & .tab files.
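
The delimiter rules above can be sketched as follows. The semicolon-delimited file is a made-up example, and the qsv calls only run when qsv is available.

```shell
# A semicolon-delimited file; name and contents are hypothetical.
workdir="$(mktemp -d)"
printf 'id;name\n1;alpha\n2;beta\n' > "$workdir/semi.csv"

if command -v qsv >/dev/null 2>&1; then
  # State the input delimiter explicitly on the command line
  qsv count --delimiter ';' "$workdir/semi.csv"
  # Or set it through the environment for a single invocation
  QSV_DEFAULT_DELIMITER=';' qsv count "$workdir/semi.csv"
  # The .tsv extension passed to --output selects a tab-delimited result
  qsv select id,name --delimiter ';' "$workdir/semi.csv" --output "$workdir/semi.tsv"
fi
```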

JSONL/NDJSON files are also recognized & converted to/from CSV with the jsonl and tojsonl commands respectively.

The fetch & fetchpost commands also produce JSONL files when invoked without the --new-column option & TSV files with the --report option.

The excel, safenames, sniff, sortcheck & validate commands produce JSON files with their JSON options.

The schema command produces a JSON Schema Validation (Draft 7) file with the ".schema.json" file extension, which can be used with the validate command.
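
A minimal sketch of the schema/validate round trip follows. It assumes the schema is written next to the input as scores.csv.schema.json (input name plus the ".schema.json" extension); the sample file itself is hypothetical.

```shell
# Scratch data; scores.csv is a made-up example file.
workdir="$(mktemp -d)"
printf 'id,score\n1,10\n2,20\n' > "$workdir/scores.csv"

if command -v qsv >/dev/null 2>&1; then
  # Infer a JSON Schema from the data (assumed output: scores.csv.schema.json)
  qsv schema "$workdir/scores.csv"
  # Validate the same (or a similar) CSV against the generated schema
  qsv validate "$workdir/scores.csv" "$workdir/scores.csv.schema.json"
  # With no schema argument, validate checks RFC 4180 conformance instead
  qsv validate "$workdir/scores.csv"
fi
```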

The excel command recognizes Excel & Open Document Spreadsheet(ODS) files (.xls, .xlsx, .xlsm, .xlsb & .ods files).

The to command produces .xlsx, Parquet & Data Package files, and populates PostgreSQL and SQLite databases.

RFC 4180

qsv validates against the RFC 4180 CSV standard. However IRL, CSV formats vary significantly & qsv is actually not strictly compliant with the specification so it can process "real-world" CSV files. qsv leverages the awesome Rust CSV crate to read/write CSV files.

Click here to find out more about how qsv conforms to the standard using this crate.

UTF-8 Encoding

The following commands require UTF-8 encoded input (of which ASCII is a subset) - dedup, exclude, fetch, fetchpost, frequency, join, schema, sort, stats & validate.

For these commands, qsv checks if the input is UTF-8 encoded by scanning the first 8k & will abort if it is not, unless QSV_SKIPUTF8_CHECK is set. On Linux & macOS, UTF-8 encoding is the default.

This was done to increase performance of these commands, as they make extensive use of from_utf8_unchecked so as not to pay the repetitive utf-8 validation penalty, no matter how small, even for already utf-8 encoded files.

Should you need to re-encode CSV/TSV files, you can use the input command to transcode to UTF-8. It will replace all invalid UTF-8 sequences with �. Alternatively, there are several utilities you can use to do so on Linux/macOS & Windows.
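
As an illustration of the transcoding step, the sketch below writes a file containing a single Latin-1 byte that is not valid UTF-8 and passes it through the input command. The file names are hypothetical, and the qsv calls are skipped when qsv is not installed.

```shell
# \351 is the Latin-1 byte for "é", which is not valid UTF-8 on its own.
workdir="$(mktemp -d)"
printf 'name\ncaf\351\n' > "$workdir/latin1.csv"

if command -v qsv >/dev/null 2>&1; then
  # Transcode to UTF-8; the invalid byte should come out as the replacement character
  qsv input "$workdir/latin1.csv" --output "$workdir/utf8.csv"
  # Commands that require UTF-8 input can then read the cleaned file
  qsv stats "$workdir/utf8.csv"
fi
```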

Windows Usage Note

Unlike other modern operating systems, Microsoft Windows' default encoding is UTF16-LE. This will cause problems when redirecting qsv's output to a CSV file & trying to open it with Excel (which ignores the comma delimiter, with everything in the first column):

qsv stats wcp.csv > wcpstats.csv

Which is weird, since you would think Microsoft's own Excel would properly recognize UTF16-LE encoded CSV files. Regardless, to create a properly UTF-8 encoded file on Windows, use the --output option instead:

# so instead of redirecting stdout to a file
qsv stats wcp.csv > wcpstats.csv

# do this instead
qsv stats wcp.csv --output wcpstats.csv

Interpreters

For complex data-wrangling tasks, you can use Luau and Python scripts. The qsv binary variant can embed luau and python interpreters, enabled by identically named feature flags. The qsvdp binary variant has the luau interpreter embedded by default.

Luau

Luau is a fast, small, safe, gradually typed embeddable scripting language derived from Lua. It lies at the heart of Roblox technology - powering all its user generated content, with Roblox's own internal code having more than 2 million lines of Luau.

It has sandboxing, type-checking, additional operators & increased performance while maintaining compatibility with Lua.

Lua is much faster than Python, and Luau is even faster still - more so, as qsv precompiles Luau scripts into bytecode. In addition, luau is embedded into qsv, has debug logging, can do aggregations with its --prologue & --epilogue options & has no external dependencies unlike the py command.

Python

The python feature is NOT enabled by default on the prebuilt binaries, as doing so requires it to dynamically link to python at runtime, which presents distribution issues, as various operating systems have differing bundled Python versions.

If you wish to enable the python feature - you'll just have to install/compile from source, making sure you have the development libraries for the desired Python version (Python 3.6 to 3.11 are supported) installed when doing so.

If you plan to distribute your manually built qsv with the python feature, qsv will look for the specific version of Python shared libraries (libpython* on Linux/macOS, python*.dll on Windows) against which it was compiled starting with the current directory & abort with an error if not found, detailing the Python library it was looking for.

Note that this will happen on qsv startup, even if you're not running the py command.

When building from source - PyO3 - the underlying crate that enables the python feature, uses a build script to determine the Python version & set the correct linker arguments. By default it uses the python3 executable. You can override this by setting PYO3_PYTHON (e.g., PYO3_PYTHON=python3.6), before installing/compiling qsv. See the PyO3 User Guide for more information.

Consider using the luau command instead of the py command if the operation you're trying to do can be done with luau - as luau is faster than py and can do aggregations.

The py command cannot do aggregations because PyO3's GIL-bound memory limitations will quickly consume a lot of memory (see issue 449 for details). To prevent this, the py command processes CSVs in batches (default: 30,000 records), with a GIL pool for each batch, so no globals are available across batches.

Environment Variables

Variable	Description
`QSV_DEFAULT_DELIMITER`	single ascii character to use as delimiter. Overrides `--delimiter` option. Defaults to "," (comma) for CSV files & "\t" (tab) for TSV files when not set. Note that this will also set the delimiter for qsv's output to stdout. However, using the `--output` option, regardless of this environment variable, will automatically change the delimiter used in the generated file based on the file extension - i.e. comma for `.csv`, tab for `.tsv` & `.tab` files.
`QSV_SNIFF_DELIMITER`	if set, the delimiter is automatically detected. Overrides `QSV_DEFAULT_DELIMITER` & `--delimiter` option. Note that this does not work with stdin.
`QSV_NO_HEADERS`	if set, the first row will NOT be interpreted as headers. Supersedes `QSV_TOGGLE_HEADERS`.
`QSV_TOGGLE_HEADERS`	if set to `1`, toggles header setting - i.e. inverts qsv header behavior, with no headers being the default, & setting `--no-headers` will actually mean headers will not be ignored.
`QSV_AUTOINDEX`	if set, automatically create an index when none is detected. Also automatically updates stale indices.
`QSV_COMMENT_CHAR`	set to an ascii character. If set, any lines (including the header) that start with this character are ignored.
`QSV_MAX_JOBS`	number of jobs to use for multithreaded commands (currently `apply`, `dedup`, `extsort`, `frequency`, `schema`, `sort`, `split`, `stats`, `tojsonl` & `validate`). If not set, max_jobs is set to the detected number of logical processors. See Multithreading for more info.
`QSV_NO_UPDATE`	if set, prohibit self-update version check for the latest qsv release published on GitHub.
`QSV_PREFER_DMY`	if set, date parsing will use DMY format. Otherwise, use MDY format (used with `apply datefmt`, `schema`, `sniff` & `stats` commands).
`QSV_REGEX_UNICODE`	if set, makes `search`, `searchset` & `replace` commands unicode-aware. For increased performance, these commands are not unicode-aware by default & will ignore unicode values when matching & will abort when unicode characters are used in the regex. Note that the `apply operations regex_replace` operation is always unicode-aware.
`QSV_SKIPUTF8_CHECK`	if set, skip UTF-8 encoding check. Otherwise, for several commands that require UTF-8 encoded input (see UTF8-Encoding), qsv scans the first 8k.
`QSV_RDR_BUFFER_CAPACITY`	reader buffer size (default (bytes): 16384)
`QSV_WTR_BUFFER_CAPACITY`	writer buffer size (default (bytes): 65536)
`QSV_FREEMEMORY_HEADROOM_PCT`	the percentage of free available memory required when running qsv in "non-streaming" mode (i.e. the entire file needs to be loaded into memory). If the incoming file is greater than the available memory after the headroom is subtracted, qsv will not proceed. (default: (percent) 20 )
`QSV_LOG_LEVEL`	desired level (default - off; `error`, `warn`, `info`, `trace`, `debug`).
`QSV_LOG_DIR`	when logging is enabled, the directory where the log files will be stored. If the specified directory does not exist, qsv will attempt to create it. If not set, the log files are created in the directory where qsv was started. See Logging for more info.
`QSV_PROGRESSBAR`	if set, enable the --progressbar option on the `apply`, `fetch`, `fetchpost`, `foreach`, `luau`, `py`, `replace`, `search`, `searchset`, `sortcheck` & `validate` commands.
`QSV_REDIS_CONNSTR`	the `fetch` command can use Redis to cache responses. Set to connect to the desired Redis instance. (default: `redis://127.0.0.1:6379/1`). For more info on valid Redis connection string formats, click here.
`QSV_FP_REDIS_CONNSTR`	the `fetchpost` command can also use Redis to cache responses (default: `redis://127.0.0.1:6379/2`). Note that `fetchpost` connects to database 2, as opposed to `fetch` which connects to database 1.
`QSV_REDIS_MAX_POOL_SIZE`	the maximum Redis connection pool size. (default: 20).
`QSV_REDIS_TTL_SECONDS`	set time-to-live of Redis cached values (default (seconds): 2419200 (28 days)).
`QSV_REDIS_TTL_REFRESH`	if set, enables cache hits to refresh TTL of cached values.

Several dependencies also have environment variables that influence qsv's performance & behavior:

Memory Management (mimalloc)
When incorporating qsv into a data pipeline that runs in batch mode, particularly with very large CSV files using qsv commands that load entire CSV files into memory, you can fine-tune Mimalloc's behavior using its environment variables.
Network Access (reqwest)
qsv uses reqwest for its fetch, validate & --update functions & will honor proxy settings set through the HTTP_PROXY, HTTPS_PROXY & NO_PROXY environment variables.

ℹ️ NOTE: To get a list of all active qsv-relevant environment variables, run qsv --envlist. Relevant env vars are defined as anything that starts with QSV_ & MIMALLOC_ & the proxy variables listed above.
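
A short sketch of setting a few of these variables for a shell session is shown below. The values and the data file are illustrative only, and the qsv calls run only when qsv is on the PATH.

```shell
# Scratch data; data.csv is a made-up example file.
workdir="$(mktemp -d)"
printf 'id,name\n1,alpha\n2,beta\n' > "$workdir/data.csv"

# Session-wide settings; the chosen values are only examples.
export QSV_AUTOINDEX=1              # create or refresh indices automatically
export QSV_MAX_JOBS=2               # cap the jobs used by multithreaded commands
export QSV_LOG_LEVEL=info           # turn logging on
export QSV_LOG_DIR="$workdir/logs"  # qsv attempts to create this directory

if command -v qsv >/dev/null 2>&1; then
  # List the qsv-relevant variables that are currently active
  qsv --envlist
  # Run a command that honors the settings above
  qsv stats "$workdir/data.csv"
fi
```

Prefixing a single command (for example, QSV_MAX_JOBS=2 qsv stats data.csv) limits a setting to that invocation instead of the whole session.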

Feature Flags

qsv has several features:

mimalloc (default) - use the mimalloc allocator (see Memory Allocator for more info).
apply - enable apply command. This swiss-army knife of CSV transformations is very powerful, but it has a lot of dependencies that increase both compile time and binary size.
fetch - enables the fetch & fetchpost commands.
foreach - enable foreach command (not valid for Windows).
generate - enable generate command.
luau - enable luau command. Embeds a Luau interpreter into qsv. Luau has type-checking, sandboxing, additional language operators, increased performance & other improvements over Lua.
python - enable py command. Note that qsv will look for the shared library for the Python version (Python 3.6 & above supported) it was compiled against & will abort on startup if the library is not found, even if you're not using the py command. Check Python section for more info.
to - enables the to command.
self_update - enable self-update engine, checking GitHub for the latest release. Note that if you manually built qsv, self-update will only check for new releases. It will NOT offer the choice to update itself to the prebuilt binaries published on GitHub. You need not worry that your manually built qsv will be overwritten by a self-update.
full - enable to build qsv binary variant which is feature-capable.
all_full - enable to build qsv binary variant with all features enabled (apply,fetch,foreach,generate,luau,python,to,self_update).
lite - enable to build qsvlite binary variant with all features disabled.
datapusher_plus - enable to build qsvdp binary variant - the DataPusher+ optimized qsv binary.
nightly - enable to turn on nightly/unstable features in the rand, regex, hashbrown, parking_lot & pyo3 crates when building with Rust nightly/unstable.

ℹ️ NOTE: qsvlite, as the name implies, always has non-default features disabled. qsv can be built with any combination of the above features using the cargo --features & --no-default-features flags. The prebuilt qsv binaries have all applicable features valid for the target platform².

Minimum Supported Rust Version

qsv's MSRV policy is to require the latest Rust version that is supported by Homebrew.

Tab Completion

qsv's command-line options are quite extensive. Thankfully, since it uses docopt for CLI processing, we can take advantage of docopt.rs' tab completion support to make it easier to use qsv at the command-line (currently, only bash shell is supported):

# install docopt-wordlist
cargo install docopt

# IMPORTANT: run these commands from the root directory of your qsv git repository
# to setup bash qsv tab completion
echo "DOCOPT_WORDLIST_BIN=\"$(which docopt-wordlist)\"" >> $HOME/.bash_completion
echo "source \"$(pwd)/scripts/docopt-wordlist.bash\"" >> $HOME/.bash_completion
echo "complete -F _docopt_wordlist_commands qsv" >> $HOME/.bash_completion

Testing

qsv has ~1,020 tests in the tests directory. Each command has its own test suite in a separate file with the convention test_<COMMAND>.rs. Apart from preventing regressions, the tests also serve as good illustrative examples, and are often linked from the usage text of each corresponding command.

To test each binary variant:

# to test qsv
cargo test --features all_full

# to test qsvlite
cargo test --features lite

# to test qsvdp
cargo test --features datapusher_plus

# to test a specific command
# here we test only stats and use the
# -F shortcut for --features
cargo test stats -F all_full

License

Dual-licensed under MIT or the UNLICENSE.

Sponsor

qsv was made possible by

Standards-based, best-of-breed, open source solutions to make your Data Useful, Usable & Used.

Naming Collision

This project is unrelated to Intel's Quick Sync Video.

¹ Of course, you'll also need a linker & a C compiler. Linux users should generally install GCC or Clang, according to their distribution's documentation. For example, if you use Ubuntu, you can install the build-essential package. On macOS, you can get a C compiler by running xcode-select --install. For Windows, this means installing Visual Studio 2022. When prompted for workloads, include "Desktop Development with C++", the Windows 10 or 11 SDK & the English language pack, along with any other language packs you require.
² The foreach feature is not available on Windows. The python feature is not enabled on the prebuilt binaries. Compile qsv with a Python 3.6 or later development environment installed if you want to enable the python feature. Luau support is enabled by default on the prebuilt binaries, with preference for luau for platforms that support it.
³ This is the same regex engine used by ripgrep - the blazingly fast grep replacement that powers Visual Studio Code's magical "Find in Files" feature.
