CrowData is a tool to collaborate on the verification or release of data that would otherwise be hard or impossible to get via automatic tools. This is the software we used to create VozData.
In 2014, La Nacion in Argentina launched VozData, a website that crowdsourced senate spending data by asking people to transcribe information from 6500 scanned PDF documents from the senate. This is the code behind that website; it can be used with any document set and any data you need to extract from it.
VozData: collaborating to free data from PDFs: a really nice article from La Nacion about the process of creating VozData.
- Python 2.7.5
- We recommend using virtualenv; install it first if you do not have it.
- Create a virtual environment and activate it:
    virtualenv ~/.python-envs/crowdata
    . ~/.python-envs/crowdata/bin/activate
- Get the source code:
    git clone https://github.com/crowdata/crowdata.git crowdata
    cd crowdata
- Install dependencies:
    pip install -r requirements.txt
  (If you are using Ubuntu, you may need to install python-dev before the dependencies.)
- Create the PostgreSQL user and database:
    $ createuser -s -h localhost crow_user
    $ createdb -O crow_user -h localhost crowdata_development
- Create the PostgreSQL extensions for trigram matching and accent removal:
    $ psql -U crow_user
    crow_user=# \c crowdata_development
    crowdata_development=# CREATE EXTENSION pg_trgm;
    crowdata_development=# CREATE EXTENSION unaccent;
- We keep local settings outside Git. Copy local_settings.py.example to local_settings.py and edit the database settings there:
    DATABASES = {
        'default': {
            'ENGINE': 'django.db.backends.postgresql_psycopg2',  # Add 'postgresql_psycopg2', 'postgresql', 'mysql', 'sqlite3' or 'oracle'.
            'NAME': 'crowdata_development',  # Or path to database file if using sqlite3.
            'USER': 'crow_user',
            'PASSWORD': '',
            'HOST': '',
            'PORT': '',
        }
    }
- Initialize the database:
    python manage.py syncdb
    python manage.py migrate --all
- Start the development server:
    python manage.py runserver_plus
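As an alternative to hard-coding the database settings in local_settings.py, they can be read from the environment. The following is only a minimal sketch: the helper `build_databases` and the `CROWDATA_DB_*` variable names are hypothetical and not part of CrowData. The defaults mirror the values used in the steps above.

```python
import os


def build_databases(env=None):
    """Return a Django DATABASES dict, taking overrides from `env`.

    NOTE: the CROWDATA_DB_* names are illustrative assumptions,
    not settings that CrowData itself defines.
    """
    env = os.environ if env is None else env
    return {
        'default': {
            'ENGINE': 'django.db.backends.postgresql_psycopg2',
            'NAME': env.get('CROWDATA_DB_NAME', 'crowdata_development'),
            'USER': env.get('CROWDATA_DB_USER', 'crow_user'),
            'PASSWORD': env.get('CROWDATA_DB_PASSWORD', ''),
            'HOST': env.get('CROWDATA_DB_HOST', ''),
            'PORT': env.get('CROWDATA_DB_PORT', ''),
        }
    }


# In local_settings.py this would replace the literal DATABASES dict.
DATABASES = build_databases()
```

With no variables set, this yields the same settings as the example above, so the development setup keeps working unchanged.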
If you are going to use DocumentCloud to load and view the PDF documents, you will have to set the 'head html' field in the admin when creating the document set:
<script src="http://s3.documentcloud.org/viewer/loader.js"></script>
and the template function:
// JavaScript function to insert the document into the DOM.
// Receives the URL of the document as its only parameter.
// Must be called insertDocument.
// jQuery is available.
// The resulting element should be inserted into div#document-viewer-container.
function insertDocument(document_url) {
  // Strip the trailing ".html" so the viewer can load the ".js" resource.
  var url = document_url.match(/(.+)\.html$/)[1];
  DV.load(url + '.js', {
    container: 'div#document-viewer-container',
    width: 650,
    height: 835,
    sidebar: false
  });
}
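The template function expects each document URL to end in ".html" and swaps that suffix for ".js" before loading the viewer. As an illustration only, the same rewrite can be sketched in Python, for example to check URLs before loading them. The function name `viewer_js_url` and the example URL are hypothetical.

```python
import re


def viewer_js_url(document_url):
    """Mirror the URL rewrite done in insertDocument.

    Returns the URL with its trailing ".html" replaced by ".js",
    or None when the URL does not end in ".html" (the JavaScript
    version would raise an error in that case).
    """
    match = re.search(r'(.+)\.html$', document_url)
    if match is None:
        return None
    return match.group(1) + '.js'


# Hypothetical example URL, for illustration only.
print(viewer_js_url('https://www.documentcloud.org/documents/0000001-example.html'))
```

A URL that does not end in ".html" returns None here, which makes it easy to spot entries the template function could not handle.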
The admin page for a document set has an option 'Add Documents to this document set'. You can upload a CSV with the columns document_title and document_url. Each row creates a document in the document set with that title, linked to that URL.
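A CSV in that shape can be generated with a short script. This is a minimal sketch that assumes the upload expects a header row with exactly those two column names. It is written as a standalone Python 3 script, separate from CrowData's own Python 2.7 environment, and the file name, titles, and URLs are hypothetical.

```python
import csv


def write_documents_csv(path, documents):
    """Write (title, url) pairs to `path` with the two expected columns."""
    # newline='' lets the csv module control line endings (Python 3).
    with open(path, 'w', newline='') as handle:
        writer = csv.writer(handle)
        writer.writerow(['document_title', 'document_url'])
        for title, url in documents:
            writer.writerow([title, url])


if __name__ == '__main__':
    # Hypothetical documents, for illustration only.
    write_documents_csv('documents.csv', [
        ('Example document 1', 'https://www.documentcloud.org/documents/0000001-example-1.html'),
        ('Example document 2', 'https://www.documentcloud.org/documents/0000002-example-2.html'),
    ])
```

Using the csv module rather than joining strings by hand keeps titles that contain commas or quotes correctly escaped.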
CrowData's copyright is © 2013 Manuel Aristarán jazzido@jazzido.com. CrowData was developed with Open News and La Nacion Argentina.
CrowData is an open source project that was born when Manuel Aristarán was an Open News fellow at La Nacion in 2013. It was released as free software when Gabriela Rodriguez continued its development for VozData in 2014. Thanks to Cristian Bertelegni and La Nacion for contributing to the code.
It now relies on contributions from people and organizations. Please use it, comment on it, and submit improvements as pull requests on GitHub: http://github.com/crowdata/crowdata
- Fork the repo
- Clone your fork
- Create a branch for your changes
- Make a pull request through GitHub, and clearly describe your changes