Org-SQL

This is a SQL backend for Emacs Org-Mode. It scans through text files formatted in org-mode, parses them, and adds key information such as todo keywords, timestamps, and links to a relational database. For now only SQLite is supported.

Motivation and Goals

Despite the fact that Emacs is the (second?) greatest text editor of all time, it is not a good data analysis platform, and neither is Lisp a good data analysis language. This is the strong-suite of other languages such as Python and R which have powerful data manipulation packages (dplyr and pandas) as well as specialized presentation platforms (shiny and dash). The common thread between these data analysis tools is a tabular data storage system, for which SQL is the most common language.

Therefore, the primary goal of org-sql is to provide a link between the text-based universe of Emacs / Org-mode and the table-based universe of data analysis platforms.

A common use case for org-mode is a daily planner. Within this use case, some questions that can easily be answered with a SQL-backed approach:

How much time do I spend doing pre-planned work? (track how much time is spent clocking)
How well do I estimate how long tasks take? (compare effort and clocked time)
Which types of tasks do I concentrate on the most? (tag entries and sort based on effort and clocking)
How indecisive am I? (track how many times schedule or deadline timestamps are changed)
How much do I overplan? (track number of canceled tasks)
How much do I delegate (track properties indicating which people are working on tasks)
How many outstanding tasks/projects do I have? (count keywords and tags on headlines)

There are other uses for an Org-mode SQL database. If one has many org files scattered throughout their filesystem, a database is an easy way to aggregate key information such as links or timestamps. Or if one primary uses org-mode for taking notes, it could be a way to aggregate and analyze meeting minutes.

Of course, these could all be done directly in Org-mode with Lisp code (indeed there are already built-in functions for reporting aggregated effort and clock-time). But why do that when one could analyze all org files by making a descriptive dashboard with a relatively few lines of R or Python code?

Installation

Download the package from MELPA

M-x package-install RET org-sql RET

Alternatively, clone this repository into your config directory

git clone git@github.com:ndwarshuis/org-sql.git ~/config/path/org-sql/

Once obtained, add the package to load-path and require it

(add-to-list 'load-path "~/config/path/org-sql/")
(require 'org-sql)

One can also use use-package to automate this entire process

(use-package org-sql
  :ensure t
  ;; add config options here...
  :config)

Dependencies

Emacs-Lisp packages

Org-mode (tested with 9.x)
dash (tested with 2.x)

Installed programs

sqlite (tested with 3.x)

Configuration

General behavior

org-sql-sqlite-path: the path to the sqlite database to be created
org-sql-files: list of org files to insert into database
org-sql-pragma: pragma to use for new connections (useful for performance tuning)
org-sql-buffer: the name of the buffer for the SQLite connection
org-sql-debug: turn on SQL transaction debug output in the message buffer

Database storage

These options control what data gets stored in the database, and are useful to minimize the size of the database as well as the time it takes to update:

org-sql-ignored-properties: list of properties to ignore
org-sql-ignored-tags: list of tags to ignore
org-sql-ignored-link-types: list link types to ignore (eg mu4e, file)
org-sql-included-healine-planning-types: planning types (eg :closed, :scheduled) to include
org-sql-included-contents-timestamp-types: type of timestamps (eg active, inactive) to include
org-sql-included-logbook-types: types of logbook entries to include (eg note, reschedule, etc)
org-sql-use-tag-inheritance: add inherited tags to the database
org-sql-store-clocks: whether to include clocks (nil implies no clock notes are desired either)
org-sql-store-clock-notes: whether to include clock notes

Logbooks

Much of the extracted data from org-sql pertains to logbook entries, and there are a number of settings that effect how this data is generated in org files and how it may be parsed reliably.

Firstly, one needs to set the relevant org-mode variables in order to capture logging information. Please refer to the documentation in org-mode itself for their meaning:

org-log-done
org-log-reschedule
org-log-redeadline
org-log-note-clock-out
org-log-refile
org-log-repeat
org-todo-keywords (in this one can set which todo keywords changes are logged)

Obtaining the above information for the database assumes that org-log-note-headings is left at its default value. This limitation may be surpassed in the future.

Additionally, for best results it is recommended that all logbook entries be contained in their own drawer. This means that org-log-into-drawer should be set to LOGBOOK and org-clock-into-drawer should be set to t (which means clocks go into a drawer with hardcoded name LOGBOOK). Without these settings, org-sql needs to guess where the logbook entries are based on location and pattern matching, which is not totally reliable.

Usage

Initializing

Run org-sql-user-reset. This will create a new database and initialize it with the default schema. It will also delete an existing database before creating the new one if it exists in org-sql-sqlite-path.

Updating

Run org-sql-user-update. This will synchronize the database with all files as indicated in org-sql-files by first checking if the file is in the database and inserting it if not. If the file is already present, it will check the md5 to assess if updates are needed. This function will insert the entire content of any org file that is either new or changed.

Note that the database will take several seconds to minutes if inserting many files depending on the speed of your device (particularly IO) and the size/number of files. This operation will also block Emacs until complete.

Clearing all data

Run org-sql-user-clear-all. This will clear all data but leave the schema.

Database Layout

The database is arranged by files at the top level and by org-mode features moving down to child tables. Primary keys are foreign keys are marked with P and F in parens respecively. All dates are converted into unix time integers before entering into the database.

hierarchy

The databases are arranged as follows according to their foreign key contraints:

files
- headlines
  - tags
  - properties
  - clocking
  - logbook
    - state_changes
    - planning_changes

files

Stores one row per org file

column	type	description
file_path (P)	text	path to an org file
md5	text	md5 sum of the org file
size	integer	size of the file in bytes
time_modified	integer	unused
time_created	integer	unused
time_accessed	integer	unused

headlines

Stores one row for each headline in a given org file as well as any atomic metadata attached to that headline

column	type	description
file_path (PF)	text	path to file containing the headline
headline_offset (F)	integer	offset of the headline’s first character in the org file
tree_path (P)	text	outline tree path of the headline
headline_text	text	raw text of the headline
keyword	text	the TODO state keyword
effort	integer	the value of the Effort property in minutes
priority	char	character value of the priority
archived	bool	true if the headline has an archive tag
commented	bool	true if the headline has a comment keyword
contents	text	unused

links

Stores one row for each link in a given org file and headline

column	type	description
file_path (PF)	text	path to file containing the link
headline_offset (F)	integer	offset of the headline’s first character in the org file
link_offset (P)	integer	offset of the link’s first character in the org file
link_path	text	target of the link (url or similar)
link_text	text	display text of the link in the org buffer
link_type	text	type of the link (eg http, mu4e, file, etc)

properties

Stores one row for each property identified in an org file and a given headline. This is also where one could identify if a headline is archived as it will have properties such as ARCHIVE_TIME

column	type	description
file_path (PF)	text	path to file containing the property
headline_offset (F)	integer	offset of the headline’s first character in the org file
property_offset (P)	integer	offset of the property’s first character in the org file
key_text	text	the property key
val_text	text	the property value
inherited	boolean	unused

column	type	description
file_path (PF)	text	path to file containing the tag
headline_offset (PF)	integer	offset of the headline’s first character in the org file
tag (P)	text	the tag value
inherited (P)	boolean	1 if inherited, 0 if not

clocking

Stores one row for each clock entry identified in an org file and its given headline

column	type	description
file_path (PF)	text	path to file containing the clock
headline_offset (F)	integer	offset of the headline’s first character in the org file
clock_offset (P)	integer	offset of the clock’s first character in the org file
time_start	integer	timestamp for the start of the clock
time_end	integer	timestamp for the end of the clock
clock_note	text	the note text beneath the clock if available

logbook

Stores one row for each entry in the logbook underneath a headline (excluding clocks). Some entries may have additional information associated with them for planning and state changes as given in the child tables below.

column	type	description
file_path (PF)	text	path to file containing the entry
headline_offset (F)	integer	offset of the headline’s first character in the org file
entry_offset (P)	integer	offset of the entry’s first character in the org file
entry_type	text	type of the logbook entry (see `org-log-note=headines`)
time_logged	integere	timestamp for when the entry was taken
header	text	the first line of the note, usually standardized
note	note	the text underneath the entry header

note: the header should match org-log-note-headings unless it is a clock note

state_changes

Stores one row per logbook entry with state change information (as triggered by any keywords configured to log in org-todo-headings)

column	type	description
file_path (PF)	text	path to file containing the entry
entry_offset (PF)	integer	offset of the clock’s first character in the org file
state_old	text	former todo state keyword
state_new	text	updated todo state keyword

planning_changes

Stores one row per logbook entry with planning changes as triggered by setting org-log-reschedule and org-log-redeadline.

column	type	description
file_path (PF)	text	path to file containing the entry
entry_offset (PF)	integer	offset of the entry’s first character in the org file
timestamp_offset (F)	integer	offset of the timestamp for the former planning entry

timestamp

In the future, this will have one row for each timestamp under a headline. For now it just refers to timestamps in the planning block (eg CLOSED, SCHEDULED, or DEADLINE) and timestamps in logbook headings for planning changes. It does not include the timestamp for the time a log entry was taken (this is stored in the logging table above) nor does it store timestamps for clocks (stored in the clocking table). Eventually this table will include timestamps in logbook notes as well as headline contents.

column	type	description
file_path (PF)	text	path to file containing the entry
headline_offset (F)	integer	offset of the entry’s first character in the org file
timestamp_offet (P)	integer	offset of the timestamp for the former planning entry
raw_value	text	the string of the timestamp as it appears in org buffers
type	text	either inactive or active
planning_type	text	either closed, scheduled, deadline, or null
warning_type	text	the warning type (all or first)
warning_value	integer	value of warning shift
warning_unit	text	unit for warning (hour, day, week, month, year)
repeat_type	text	the repeater type (catch-up, restart, cumulate)
repeat_value	integer	value of repeater shift
repeat_unit	text	unit for repeater (hour, day, week, month, year)
time	integer	the time of the timestamp in unix time
time_end	integer	like time but for the end if this is a range
resolution	text	minute if timestamp specifies hours/minutes, else day
resolution_end	text	like resolution but for the end if this is a range

NOTES:

Inactive-range and active-range will not appear in the type column. Use time_end instead to determine if the timestamp is a ranged timestamp
There is no distinction in this table alone between planning timestamps that are part of the planning section (eg CLOSED: [whatever]) and those from logbook entries that pertain to planning changes (eg Removed deadline, was [whatever]). This distinction requires joining the planning_changes table, which will only reference logbook planning change timestamps.

Contributions

Pull requests welcome, especially those for other SQL implementations.

Acknowledgments

The idea for this is based on John Kitchin’s implementation, which uses emacsql as the SQL backend.

vonHabsi/org-sql