A command-line tool for manipulating tabular data.
This tool is in early development. For production-level work, please use a different tool.
Grab a subset of columns from a table.
tableshaper input table.csv pick 'country, continent, pop1990:pop2000'
Drop rows you don't need.
tableshaper input table.csv filter 'continent == "South America"'
Reshape the table.
tableshaper input table.csv reshape -k year -v population --columns pop1990:pop2000
Do it all in one command.
tableshaper \
input table.csv \
pick 'country, continent, pop1990:pop2000' \
filter 'continent == "South America"' \
reshape -k year -v population --columns pop1990:pop2000
For more complete examples, check out the examples/ folder in this repo.
tableshaper requires Python 3.7 or higher. Pull down this repo and install it with pip.
git clone https://github.com/armollica/TableShaper.git
python -m pip install TableShaper/
The TableShaper program.
The tableshaper command kicks off the program and is generally followed by a series of commands, like pick, filter, or mutate. The first command you'll usually run is the input command and the last will usually be the output command. These read and write tables. Here are some short example commands.
# Read CSV data from a file, perform some transformations and output to a new file.
tableshaper input input.csv pick 'column1:column10' filter 'column1 > 20' output output.csv
# Same thing, but reading from stdin and writing to stdout.
tableshaper input - pick 'column1:column10' filter 'column1 > 20' output - < input.csv > output.csv
For examples going forward, the tableshaper portion of commands will be omitted to keep things concise.
| Table of contents | |
| --- | --- |
| $ input | Read in a table. |
| $ output | Write out a table. |
| $ view | View table. |
| $ pick | Subset columns. |
| $ rename | Rename columns. |
| $ filter | Subset rows. |
| $ sort | Sort rows. |
| $ mutate | Create new columns. |
| $ aggregate | Aggregate rows. |
| $ join | Join tables. |
| $ reshape | Reshape table. |
| Helpers | Built-in convenience functions. |
Read in a table.
The input command requires a filename argument that points to the file you want to read. If you are reading from stdin, pass - as the filename.
The input file can be in one of several formats. The input file format is specified with the -f, --format option.
The default format is CSV, comma-separated values. You can set it explicitly by passing csv to the format option.
For tab-delimited files use the tsv format.
For all other delimited files use the dsv format. You specify the delimiter with the -d, --delim option. For example, a semicolon-delimited file named input.txt could be read like so: input -f dsv -d ';' input.txt.
Excel files can be read using the excel format. You'll need to specify the sheet you want to read with the -s, --sheet option.
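For example, reading a sheet named Sheet1 from a hypothetical workbook.xlsx might look like this:
# Read one sheet from an Excel workbook (sheet and file names are just examples)
input -f excel -s Sheet1 workbook.xlsx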
For JSON files, use the json format. A JSON file can be formatted several ways. The JSON layout can be set with the -j, --json-format option:
- records: list like [{column -> value}, ... , {column -> value}]
- split: dict like {index -> [index], columns -> [columns], data -> [values]}
- index: dict like {index -> {column -> value}}
- columns: dict like {column -> {index -> value}}
- values: just the values array
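For example, reading a records-style JSON file might look like this (assuming the json format is passed to -f like the other formats):
# Read a list of {column -> value} records
input -f json -j records data.json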
Geographic data can also be imported as a table. GeoJSON, TopoJSON and ESRI Shapefiles are all supported (tableshaper uses GeoPandas for processing geodata). The corresponding format names are geojson, topojson and shp.
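Reading a GeoJSON file as a table might look like this (the filename is hypothetical):
input -f geojson counties.geojson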
Other supported formats are feather, parquet, stata and sas.
When you import a table, data types for columns will be inferred automatically. To prevent this and read everything in as text, use the -r, --raw flag. You can then set the data types explicitly in a mutate command.
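For example, you could read everything as text and then parse one column yourself (the column name is hypothetical; parse_int is one of the helper functions described below):
# Read raw text, then convert the population column to integers
input -r input.csv mutate -r 'population = parse_int(population)'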
If you are reading a file without a header row, you can set column names with
the -c, --col-names
option. This takes a comma-separated list of column names.
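For example, naming the columns of a headerless CSV (the names are hypothetical):
input -c 'id, name, population' headerless.csv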
Multiple tables can be imported by calling the input command multiple times. This is useful when you want to join tables.
You can specify the name of the table being imported using the -n, --name option. This is useful when you are working with multiple tables and need to reference one of them. If you don't give a table a name, it will automatically be given one based on the filename. If a table comes from stdin, it will be given the name table.
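For example, reading two tables and naming the second one explicitly (filenames and the name are hypothetical):
# The first table is named after its file, the second is named pop
input countries.csv input -n pop population.csv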
Write out a table.
The output command requires a filename argument specifying where to write the file. If you want to output to stdout, pass - as the filename.
To output multiple tables, pass + as the filename. You must also provide a comma-separated list of table names to the -t, --tables option.
You can specify the directory you want files to be put in using the -d, --dir option.
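For example, writing two named tables into a directory (the table names and directory are hypothetical):
# Write the tables named countries and pop to the out/ directory
output + -t 'countries, pop' -d out/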
You can output a table in a variety of formats:
- csv or tsv for comma- or tab-delimited files
- json for records-style JSON files (see input)
- geojson or shp for GeoJSON and ESRI Shapefiles (for geodata)
- feather or parquet for efficient binary data formats
- markdown or html for display in a markdown file or on a webpage
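Assuming output accepts the same -f, --format option as input, writing a markdown table might look like this:
output -f markdown summary.md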
View table.
Display the table in a human-readable format, or print summary information about the table.
Calling view by itself will print the first and last 30 rows of the table. You can limit what's displayed to the top n rows with the -n, --top option.
view -n 5 # show the top 5 rows
Use the -i, --info flag to display a summary of the table. This includes the number of rows and columns as well as each column's name, its data type and the number of non-null values it has. It also displays the memory usage of the table.
Use the -s, --stats flag to display summary statistics on the columns in the table.
For numbers this will include the count, mean, standard deviation, minimum, maximum, 25th percentile, median and 75th percentile.
For strings and timestamps it will include the count, the number of unique values, the most common value and the number of times it occurs. Timestamps also include the first and last items.
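For example:
view -i # show dimensions, column types and memory usage
view -s # show summary statistics for each column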
Subset columns.
There are two ways to pick columns, the "selection" method and the "filter" method.
The "selection" method is the default method. With this method you give a
comma-separated list of column names or ranges of column names. A range is
specified by the starting and ending columns separated by a colon: start:end
.
You can exclude a column or a range by putting a tilde (~) before it.
# Assume we have a table with columns A, B, C, etc.
# Keep columns A and D through G
pick 'A, D:G'
# Drop columns C and F
pick '~C, ~F'
# Drop columns B through G, add column E back in
pick '~B:G, E'
The "filter" method is set with the -f, --filter
flag. You provide a
Python expression that is evaluated on each column name. The column
name is loaded into the namespace as name
. If the expression evaluates
to true then the column is kept.
# Keep columns that start with "population"
pick -f 'name.startswith("population")'
# Keep columns that are numeric
pick -f 'name.isnumeric()'
Rename columns.
There are two renaming methods. The default method is to provide a comma-separated list of column name assignments of the form: new = old. All other columns are retained.
# Rename GEOID to id and STATE_FIPS to fips.
rename 'id = GEOID, fips = STATE_FIPS'
The second method is to provide a Python expression that gets evaluated on each
column name. The column name is loaded into the namespace as name
. Whatever
the expression evaluates to is what the new name will be. This method is set
with the -m, --map
flag.
# Convert column names to lowercase
rename -m 'name.lower()'
# Convert column names to snakecase
rename -m 'name.replace(" ", "_").lower()'
Subset rows.
Rows are kept based on a logical expression (true/false) or by a range of row indices.
The default method is to keep rows based on a Python expression that evaluates to true or false. Each column of the table is put into the namespace as a pandas Series object.
# Keep all rows where the population column has values greater than 1000
filter 'population > 1000'
# Keep all rows where the state column has values equal to "55" or "56"
filter 'state.isin(["55", "56"])'
If you use the -r, --row
flag, you can perform the same type of filtering
on rows individually, instead of on pandas Series as a whole. This can be more
flexible, especially when dealing with strings.
filter -r 'state in ["55", "56"]'
filter -r 're.match("^(M|m)azda", name) is not None'
The second filtering method is to specify a range of row indexes of the
format: start:end
. This method is set with the -s, --slice
flag.
# Keep the first five rows
filter -s 1:5
# Keep rows 25 through 75
filter -s 25:75
Ranges can be open-ended. If no start index is provided, it starts from the first row. If no end index is provided, it ends at the last row of the table.
filter -s :5 # is equivalent to 1:5
filter -s 100: # 100th to the last row
You can start from the back of the table too. If the start or end index begins with a tilde (~), the index will refer to that many places from the last row of the table.
filter -s ~5: # last five rows
filter -s ~10:~5 # from (n - 10) to (n - 5)
You can provide multiple slices. Pass in a comma-separated list of slices and you'll get the rows back in that order. Warning: you can get duplicate rows this way.
filter -s '1:5, 10:15'
filter -s '1:5, ~5:' # first and last five rows
Sort rows.
Order is determined by values in a column. Sort on multiple columns by passing
in a comma-separated list of column names. Rows are sorted in ascending order,
by default. To sort in descending order, put :desc
after the column name.
sort 'mpg'
sort 'mpg:desc'
sort 'mpg, hp:desc'
Create new columns.
A new column is created by assigning a new variable in a Python expression. Mutations follow this format:
new_column = [python expression]
Columns with the same name will be overwritten.
The default behavior is to perform a vectorized transformation. Each column of the table is put in the namespace as a pandas Series.
mutate 'real_value = value * (price / 100)'
mutate 'touches_lake_mi = state.isin(["WI", "MI", "IL", "IN"])'
Multiple mutations can be made in one go by separating mutations with a semicolon (;). The order of these mutations is not guaranteed to be consistent with the order you provided them. Run separate mutate commands if the order matters.
mutate 'price = price / 100; pop = pop * 1000'
Grouped mutations are possible with the -g, --group-by
option. Pass a
comma-separated list of column names to group by multiple columns.
Grouped mutations are like aggregations except all original rows are preserved.
mutate -g state 'population_share = pop / pop.sum()'
Some operations, like string manipulation, can be awkward with pandas Series objects. In these cases you may want to perform row-based mutations.
Activate row-wise mutation with the -r, --row
flag. Columns in the row are
put in the namespace as individual values.
Grouped mutations are not possible with row-wise mutation.
mutate -r 'id = "%05d" % id'
mutate -r 'state = id[0:2]'
Aggregate rows.
Group rows based on values in one or more columns and aggregate these groups of rows into single values using methods like sum(), mean(), count(), max(), min().
Aggregations follow this format:
new_column = [python expression]
The -g, --group-by option takes a comma-separated list of columns to group by.
aggregate -g state 'population_sum = population.sum()'
aggregate -g country_id,station_id 'median_wind_speed = wind_speed.median()'
Join tables.
Perform SQL-style joins with the following flags:
- Left join: -l, --left (default)
- Right join: -r, --right
- Outer join: -o, --outer
- Inner join: -i, --inner
Pass the columns to join on to the -k, --key argument.
join -k id right.csv
join -r -k id right.csv
join -o -k 'state_id, county_id' right.csv
You can also bind rows or columns from two tables together with the
--bind-rows
and --bind-columns
flags.
join --bind-rows right.csv
join --bind-columns right.csv
When binding rows, any columns that exist in one table and not the other will
be filled with NaN
values for rows in the table without that column.
Reshape table.
There are two ways to reshape a table. The first is to go from wide to long. The gather method takes a collection of columns and converts them into two key-value columns.
This is the default method but it can be set explicitly with the -g, --gather flag. The names of the two key-value columns are set with the -k, --key and -v, --value arguments. The columns to collect are set with the -c, --columns argument, which takes a selection of columns (see $ pick for a description of selections).
reshape -k year -v population -c 1995:2013
The second method is to go from long to wide. The spread method (set with the -s, --spread flag) takes two key-value columns and spreads them out into multiple columns, where the key column is converted into column names and the rows are filled with the values in the value column. The key-value column names are passed to the -k, --key and -v, --value arguments.
reshape -s -k year -v population
Here are some convenience functions that you can use within commands like mutate
and filter
.
Manipulating text
- format_text(string): Template text using data from a table. For example, say you have a column with people's age, then format_text("{age} years old") would return "12 years old" for a person who has 12 in the age column.
- camelize(string): Converts string to CamelCase.
- dasherize(string): Replace underscores with dashes in the string.
- humanize(string): Capitalize the first word, turn underscores into spaces and strip a trailing "_id", if any.
- ordinalize(number): Turn a number into an ordinal string used to denote the position in an ordered sequence such as 1st, 2nd, 3rd, 4th.
- ordinal(number): Return the suffix that should be added to a number to denote the position in an ordered sequence such as 1st, 2nd, 3rd, 4th.
- parameterize(string): Replace special characters in a string so that it may be used as part of a "pretty" URL.
- pluralize(string): Return the plural form of a word.
- singularize(string): Return the singular form of a word, the reverse of pluralize().
- tableize(string): Create the name of a table like Rails does for models to table names. This method uses the pluralize() method on the last word in the string.
- titleize(string): Capitalize all the words and replace some characters in the string to create a nicer looking title.
- transliterate(string): Replace non-ASCII characters with an ASCII approximation. If no approximation exists, the non-ASCII character is ignored.
- underscore(string): Make an underscored, lowercase form from the expression in the string.
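These can be used inside row-wise mutations, for example (the column names name, year and title are hypothetical):
# Build a label column and a URL-friendly slug column
mutate -r 'label = format_text("{name} ({year})")'
mutate -r 'slug = parameterize(title)'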
Searching text
- extract_text(string, regex): Extract part of a string using regex. For example, extract_text(sentence, "(\w+) ") will extract the first word in a sentence.
- text_matches(string, regex): True if a string matches the regex.
- starts_with(string, starter): True if a string starts with the starter string.
- ends_with(string, ender): True if a string ends with the ender string.
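For example, in a row-wise filter (the column name is hypothetical):
# Keep rows whose name column starts with "Mazda"
filter -r 'starts_with(name, "Mazda")'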
Parsing data types
- parse_datetime(string, format_string): Convert a string into a date using a datetime format string, e.g., '%Y-%m-%d'.
- parse_int(string): Converts a string to an integer.
- parse_float(string): Converts a string to a float.
- parse_bool(string): Converts a string to True or False. Strings like "true", "1", "T", "Y" and "yes" evaluate to True. The operation is not case sensitive.
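For example, parsing a raw date column in a row-wise mutation (the column name is hypothetical):
mutate -r 'date = parse_datetime(date, "%Y-%m-%d")'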
Pull down this repo and move into the directory.
git clone https://github.com/armollica/TableShaper.git
cd tableshaper/
Create a virtual environment and activate it.
python -m venv venv
. venv/bin/activate
Install the package and its dependencies with the --editable
flag. That way
changes to the source code will automatically affect the CLI program.
python -m pip install --editable .
Tests are in the tests/ folder. To run them, call pytest at the command line from the root of this project.
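For example (this assumes pytest is installed in the virtual environment):
python -m pip install pytest
pytest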
Many thanks to the people who made the pandas and click packages. This tool relies on these immensely.
Also many thanks to the people behind these tools. You've saved me loads of time and headache.
- tidyverse: I stole many ideas from the dplyr and tidyr R packages, in particular. Love the micro domain-specific languages in these packages, each tailored for specific tasks.
- mapshaper: A command-line tool for editing geographic data (vector-based). I mimicked the command-line interface of this in many ways, especially the command chaining. The viewer is also great.
- csvkit: Great tool for processing tabular data. Does many of the same things tableshaper does. Also does many things this tool doesn't do.
- visidata: Tool for viewing and exploring tabular data at the command line.
- jq: A command-line JSON processor.
- ndjson-cli: A command-line tool for processing newline-delimited JSON files.