ckan/ideas

Activity Stream connected to ELK/Splunk for analysis

davidread opened this issue · 7 comments

Admins should be able to analyse key activity in CKAN. This means the creation, editing, or deletion of a dataset or organization, or user permissions being granted.

Example use cases:

  • generate stats on numbers of datasets being created / edited per week, broken down by user, organization, number of resources, resource hostname (lots of scope for this)
  • an admin behaves badly, so the system admin wants a full audit trail of their actions, including the full dataset changes.
  • report on which organization's datasets are being updated and which are not

Whilst the CKAN logs are mainly unstructured data, the Activity Stream is structured and a good starting point, but it is incomplete and not easily accessible in the database.

Whilst we can continue to show the basic Activity Stream in CKAN's web interface, let's take advantage of external data analysis software to allow more advanced exploration, searching, filtering and graphing, rather than trying to build it into CKAN.

I propose:

  • I complete the Activity Stream functionality so that every create/edit/delete of package/group/organization/user/member is logged in the Activity Stream. We can also record if an action is done on the web interface or via the API.
  • Activity Stream can be linked into data analysis software to be sliced and diced. The simplest way would be to add an option to dump the Activity Stream data to a JSON log file, which can be easily loaded into lots of analysis software. For example this could be shipped in real time to an ELK stack and explored and graphed by pointing and clicking. Or enterprises can do similar things with Splunk or Sumo Logic.

(An alternative to a JSON log file would be getting the analysis software to talk to Postgres directly using JDBC, and setting up queries to do lots of joins to get the full Activity Stream. This relies a lot more on the analysis software having this capability and setting up the queries in it. I think it would be better to do the join query work in CKAN, with the result in JSON log, which is really much more flexible, and easily shipped.)
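To make the "JSON log file" option concrete, here is a minimal sketch of the dump step. It assumes we already have activity dicts (e.g. as returned by CKAN's action API); the function name `dump_activities` is hypothetical, not an existing CKAN API.

```python
import json


def dump_activities(activities, fh):
    """Write activity dicts to fh as JSON lines (one object per line).

    This ndjson format is directly ingestible by Logstash/Filebeat,
    Splunk, Sumo Logic, etc.
    """
    for activity in activities:
        fh.write(json.dumps(activity, sort_keys=True) + "\n")
```

Shipping is then just a matter of pointing the log-collection agent at the file; no JDBC connection or hand-written join queries in the analysis tool are needed.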

Comments v. welcome!

I'm working with OpenGov to explore this, so in particular please chip in @jqnatividad @jhinds

@davidread can the analysis software you're thinking of using consume web services? The activity stream is indexed on timestamp and can be paged by timestamp, making retrieval through the API fairly fast. It would be safer for future compatibility than relying on the database structure.

@TkTech good thinking. I found an extension for something like that for the ELK stack, and maybe the other data analysis software can do it too.

However, as I understand it, these apps only facet by the top level keys, so I think there is still a job to flatten data (e.g. promote username from the user dictionary to the top level, and add a key which is a list of all the resource formats). I'm keen for this to be available to any log analysis software, so rather than do the transform in the log analysis software, it might be better as a bit of python code in between the Activity Stream API call and the log file, which is the universal data format for log analysis software. So that might as well be done as a bit of CKAN or CKAN extension.
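A rough sketch of that flattening step, as described above. The exact shape of an activity dict varies, so the nested field names here (`user`, `data.package`, etc.) are assumptions for illustration, and `flatten_activity` is a hypothetical helper, not part of CKAN.

```python
def flatten_activity(activity):
    """Promote nested values to top-level keys so log-analysis tools
    (which typically only facet on top-level keys) can use them."""
    flat = {
        "id": activity.get("id"),
        "activity_type": activity.get("activity_type"),
        "timestamp": activity.get("timestamp"),
    }
    # Promote the username out of the nested user dict.
    user = activity.get("user") or {}
    flat["user_name"] = user.get("name")
    # Promote the dataset name and collect every resource format as a
    # top-level list, so you can facet on e.g. "activities touching a
    # CSV resource".
    dataset = (activity.get("data") or {}).get("package") or {}
    flat["dataset_name"] = dataset.get("name")
    flat["resource_formats"] = [
        r.get("format") for r in dataset.get("resources", [])
    ]
    return flat
```

Doing this in a small bit of Python between the Activity Stream API call and the log file keeps the transform tool-agnostic, rather than repeating it in each analysis product's own pipeline language.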

I've got some skills and experience with ELK.
For fun I transformed our catalogue and loaded it into ELK - used create_date as the timestamp. I also transformed orgs ...I'm forgetting what specifically I did. I created a couple of timelines (Timelion) and some viz's.
@davidread just point me to an event stream and I'll get it into ELK and share what I did so people can build on it.
A handful of visualizations you'd want to see would help.

I've started a repo here: https://github.com/davidread/ckanext-analytics

It's got a simple script that exports Activity Stream as JSON lines, to play about with.

@dkelsey I'd be very happy to get your feedback - I'm not clear if JSON is the way to go with this or whether Kibana and co work better with a flat structure and we should work to export as CSV.
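If CSV does turn out to work better for Kibana and co, the export could reuse the same flat dicts. A sketch under that assumption (the function name and field list are illustrative; list-valued fields like resource formats are joined with `|` since CSV cells are scalar):

```python
import csv


def write_activities_csv(flat_activities, fh):
    """Write flat, one-row-per-activity dicts as CSV."""
    fieldnames = [
        "id", "activity_type", "timestamp", "user_name", "resource_formats",
    ]
    writer = csv.DictWriter(fh, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    for row in flat_activities:
        row = dict(row)
        # Flatten any list values into a pipe-separated string.
        if isinstance(row.get("resource_formats"), list):
            row["resource_formats"] = "|".join(row["resource_formats"])
        writer.writerow(row)
```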

Hi @davidread! What's the status of ckanext-analytics? Now that your Activity Stream work is further along, perhaps we can revisit this?

@jqnatividad ckanext-analytics is a proof of concept. I've not done anything with it since last summer. The activity stream is now a bit more robust (ckan/ckan#4626) and saves the full dataset dict (ckan/ckan#3972), so this would be a great time to revisit this work. tbh my clients aren't pushing on this at the moment, so very happy to pass the baton to you and see where it leads.

loleg commented

This is a great idea, and I am thinking it might also be related to #211 since data loading pipelines could also be used on CKAN's internal streams.