/rows

A common, beautiful interface to tabular data, no matter the format

Primary LanguagePythonGNU General Public License v3.0GPL-3.0

rows

Join the chat at https://gitter.im/turicas/rows License: GPLv3 Current version at PyPI Downloads per month on PyPI Supported Python Versions Software status Donate

No matter in which format your tabular data is: rows will import it, automatically detect types and give you high-level Python objects so you can start working with the data instead of trying to parse it. It is also locale-and-unicode aware. :)

Have you ever lost your precious time reading a CSV that had a different dialect? Or trying to learn a whole new library API to read a new tabular data format your customer just sent? You've got gray hair when trying to access some data and the only answer was UnicodeDecodeError? So, rows was custom made for you! :-)

Note: if you're using rows in some project please tell us! :-)

Showcase

Architecture

The library is composed by:

  • A common interface to tabular data (the Table class)
  • A set of plugins to populate Table objects (CSV, XLS, XLSX, HTML and XPath, Parquet, TXT, JSON, SQLite -- more coming soon!)
  • A set of common fields (such as BoolField, IntegerField) which know exactly how to serialize and deserialize data for each object type you'll get
  • A set of utilities (such as field type recognition) to help working with tabular data
  • A command-line interface so you can have easy access to the most used features: convert between formats, sum, join and sort tables.

Just import rows and relax.

Core Values

  • Simple, easy and flexible API
  • Code quality
  • Don't Repeat Yourself

Installation

Directly from PyPI:

pip install rows

You can also install directly from the GitHub repository to have the newest features (not pretty stable) by running:

pip install git+https://github.com/turicas/rows.git@develop

or:

git clone https://github.com/turicas/rows.git
cd rows
python setup.py install

The use of virtualenv is recommended.

You can create a development image using Docker:

cat Dockerfile | docker build -t turicas/rows:latest -

The plugins csv, dicts, json, json, sqlite and txt are built-in by default but if you want to use another one you need to explicitly install its dependencies, for example:

pip install rows[html]
pip install rows[xls]

You also need to install some dependencies to use the command-line interface. You can do it installing the cli extra requirement:

pip install rows[cli]

And - easily - you can install all the dependencies by using the all extra requirement:

pip install rows[all]

If you use Debian sid or testing you can install it directly from the main repository by running:

aptitude install python-rows  # Python library only
aptitude install rows  # Python library + CLI

And in Fedora:

dnf install python-row  # Python library + CLI

Basic Usage

You can create a Table object and populate it with some data programmatically:

from collections import OrderedDict
from rows import fields, Table

my_fields = OrderedDict([('name', fields.TextField),
                         ('age', fields.IntegerField),
                         ('can', fields.BoolField)])
table = Table(fields=my_fields)
table.append({'name': u'Álvaro Justen', 'age': 28, 'can': False})
table.append({'name': u'Another Guy', 'age': 42, 'can': True})

A common use case is to have a list of dicts -- you can also import it, and rows will automatically fill in the blanks (your dicts don't need to have the same keys) and convert data:

import rows

data = [{'name': u'Álvaro Justen', 'age': 28, 'can': False},
        {'name': u'Another Guy', 'age': 42, 'can': True},]
table = rows.import_from_dicts(data)

In this case, table.fields will be created automatically (rows will identify the field type for each dict key).

Then you can iterate over it:

def print_person(person):
    can = 'can' if person.can else "just can't"
    print(u'{} is {} years old and {}'.format(person.name, person.age, can))

for person in table:
    print_person(person)  # namedtuples are returned

You'll see:

Álvaro Justen is 28 years old and just can't.
Another Guy is 42 years old and can.

As you specified field types (my_fields) you don't need to insert data using the correct types. Actually you can insert strings and the library will automatically convert it for you:

table.append({'name': '...', 'age': '', 'can': 'false'})
print_person(table[-1])  # yes, you can index it!

And the output:

... is None years old and just can't

Importing Data

rows will help you importing data: its plugins will do the hard job of parsing each supported file format so you don't need to. They can help you exporting data also. For example, let's download a CSV from the Web and import it:

import requests
import rows
from io import BytesIO

url = 'http://unitedstates.sunlightfoundation.com/legislators/legislators.csv'
csv = requests.get(url).content  # Download CSV data
legislators = rows.import_from_csv(BytesIO(csv))  # already imported!

print('Hey, rows automatically identified the types:')
for field_name, field_type in legislators.fields.items():
    print('{} is {}'.format(field_name, field_type))

And you'll see something like this:

[...]
in_office is <class 'rows.fields.IntegerField'>
gender is <class 'rows.fields.TextField'>
[...]
birthdate is <class 'rows.fields.DateField'>

We can then work on this data:

women = sum(1 for row in legislators if row.in_office and row.gender == 'F')
men = sum(1 for row in legislators if row.in_office and row.gender == 'M')
print('Women vs Men (in office): {} vs {}'.format(women, men))

Then you'll see effects of our sexist society:

Women vs Men: 108 vs 432

Now, let's compare ages:

legislators.order_by('birthdate')
older, younger = legislators[-1], legislators[0]
print('{}, {} is older than {}, {}'.format(
        older.lastname, older.firstname, younger.lastname, younger.firstname))

The output:

Stefanik, Elise is older than Byrd, Robert

You can also get a whole column, like this:

>>> legislators[u'gender']
[u'M',
 u'M',
 u'M',
 u'M',
 u'M',
 u'M',
 u'M',
 u'M',
 u'M',
 u'M',
 u'M',
 u'M',
 u'M',
 u'M',
 u'F',
 u'M',
 ...]

And change the whole column (or add a new one):

>>> legislators[u'gender'] = [u'male' if gender == u'M' else u'female'
                              for gender in legislators[u'gender']]
>>> legislators[u'gender']
[u'male',
 u'male',
 u'male',
 u'male',
 u'male',
 u'male',
 u'male',
 u'male',
 u'male',
 u'male',
 u'male',
 u'male',
 u'female',
 u'male',
 ...]

Or delete it:

>>> u'gender' in legislators.field_names
True
>>> del legislators[u'gender']
>>> u'gender' in legislators.field_names
False
>>> legislators[0].gender
[...]
AttributeError: 'Row' object has no attribute 'gender'

Note that native Python objects are returned for each row inside a namedtuple! The library recognizes each field type and converts it automagically no matter which plugin you're using to import the data.

Common Parameters

Each plugin has its own parameters (like index in import_from_html and sheet_name in import_from_xls) but all plugins create a rows.Table object so they also have some common parameters you can pass to import_from_X. They are:

  • fields: an OrderedDict with field names and types (disable automatic detection of types).
  • skip_header: Ignore header row. Only used if fields is not None. Default: True.
  • import_fields: a list with field names to import (other fields will be ignored) -- fields will be imported in this order.
  • samples: number of sample rows to use on field type autodetect algorithm. Default: None (use all rows).

Exporting Data

If you have a Table object you can export it to all available plugins which have the "export" feature. Let's use the HTML plugin:

rows.export_to_html(legislators, 'legislators.html')

And you'll get:

$ head legislators.html
<table>

  <thead>
    <tr>
      <th> title </th>
      <th> firstname </th>
      <th> middlename </th>
      <th> lastname </th>
      <th> name_suffix </th>
      <th> nickname </th>

Now you have finished the quickstart guide. See the examples folder for more examples.

Available Plugins

The idea behing plugins is very simple: you write a little piece of code which extracts data from/to some specific format and the library will do the other tasks for you. So writing a plugin is as easy as reading from/writing to the file format you want. Currently we have the following plugins:

  • CSV: use rows.import_from_csv and rows.export_to_csv (dependencies are installed by default)
  • TXT: use rows.export_to_txt (no dependencies)
  • JSON: use rows.import_from_json and rows.export_to_json (no dependencies)
  • HTML: use rows.import_from_html and rows.export_to_html (dependencies must be installed with pip install rows[html])
  • XPath: use rows.import_from_xpath passing the following arguments: filename_or_fobj, rows_xpath and fields_xpath (dependencies must be installed with pip install rows[xpath]) -- see an example in examples/library/ecuador_radiodifusoras.py.
  • Parquet: use rows.import_from_parquet passing the filename (dependencies must be installed with pip install rows[parquet] and if the data is compressed using snappy you also need to install rows[parquet-snappy] and the libsnappy-dev system library) -- read this blog post for more details and one example
  • XLS: use rows.import_from_xls and rows.export_to_xls (dependencies must be installed with pip install rows[xls])
  • XLSX: use rows.import_from_xlsx and rows.export_to_xlsx (dependencies must be installed with pip install rows[xlsx])
  • SQLite: use rows.import_from_sqlite and rows.export_to_sqlite (no dependencies)
  • ODS: use rows.import_from_ods (dependencies must be installed with pip install rows[ods])

More plugins are coming (like PDF, DBF etc.) and we're going to re-design the plugin interface so you can create your own easily. Feel free to contribute. :-)

Common Parameters

Each plugin has its own parameters (like encoding in import_from_html and sheet_name in import_from_xls) but all plugins use the same mechanism to prepare a rows.Table before exporting, so they also have some common parameters you can pass to export_to_X. They are:

  • export_fields: a list with field names to export (other fields will be ignored) -- fields will be exported in this order.

Command-Line Interface

rows exposes a command-line interface with the common operations such as convert data between plugins, sum, sort and join Tables.

Run rows --help to see the available commands and take a look at rows/cli.py. We still need to improve the CLI docs, sorry.

Locale

Many fields inside rows.fields are locale-aware. If you have some data using Brazilian Portuguese number formatting, for example (, as decimal separators and . as thousands separator) you can configure this into the library and rows will automatically understand these numbers!

Let's see it working by extracting the population of cities in Rio de Janeiro state:

import locale
import requests
import rows
from io import BytesIO

url = 'http://cidades.ibge.gov.br/comparamun/compara.php?idtema=1&codv=v01&coduf=33'
html = requests.get(url).content
with rows.locale_context(name='pt_BR.UTF-8', category=locale.LC_NUMERIC):
    rio = rows.import_from_html(BytesIO(html))

total_population = sum(city.pessoas for city in rio)
# 'pessoas' is the fieldname related to the number of people in each city
print('Rio de Janeiro has {} inhabitants'.format(total_population))

The column pessoas will be imported as an IntegerField and the result is:

Rio de Janeiro has 15989929 inhabitants

Operations

Available operations: join, transform and serialize.

TODO. See rows/operations.py.

Developing

Create the virtualenv:

mkvirtualenv rows

Install all plugins' dependencies:

pip install --editable .[all]

Install development dependencies:

pip install -r requirements-development.txt

Run tests:

make test

or (if you don't have make):

tox

you can also run tox against an especific python version:

tox -e py27
tox -e py35

tox known issuses : runing tox with py27 eviron may raise InvocationError in non Linux environments. To avoid it you may rebuild tox environment in every run with: tox -e py27 -r

or if you want to run nosetests directly:

nosetests -dsv --with-yanc --with-coverage --cover-package rows tests/*.py

To create the man page you'll need to install txt2man. In Debian (and Debian-based distributions) you can install by running:

aptitude install txt2man

Then, you can generate the rows.1 file by running:

make man

Similar Projects

Related Projects

Known Issues

Semantic Versioning

rows uses semantic versioning. Note that it means we do not guarantee API backwards compatibility on 0.x.y versions.

License

This library is released under the GNU General Public License version 3.