Build in a step that can automatically number rows sequentially between extraction from the source and insertion into Plenario
An alternative to #187 that may even be preferable, since it might be easier to implement and would be more robust -- not limited to just Socrata datasets. It would ensure that every dataset had a unique ID even if the underlying source did not.
The challenge is that when the source dataset is updated, rows may be inserted anywhere in the file, not just at the top or bottom. We would then need to do a full refresh of Plenario's copy of the dataset on every update, which is potentially once an hour or even more frequently (for weather/sensor data). Is there a way to guarantee that the unique ID we attach to a row stays attached to that same row? That is the main purpose of the unique ID requirement: we need a way to efficiently update only rows that have been added or altered.
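For illustration, a minimal sketch of the numbering step the title describes, assuming the extraction hands us a pandas DataFrame (an assumption for this sketch, not how Plenario's ETL is actually structured):

```python
# Sketch only: column and function names here are illustrative, not Plenario code.
import pandas as pd

def add_sequential_id(df: pd.DataFrame, id_column: str = "plenario_row_id") -> pd.DataFrame:
    """Attach a 1-based sequential ID to every extracted row."""
    df = df.copy()
    df[id_column] = range(1, len(df) + 1)
    return df

# The weakness described above: if the provider inserts a row in the middle of
# the file, every row after it gets a different number on the next run, so the
# ID no longer identifies "the same row" across refreshes.
```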
I think the comment was more directed to us. @levyj has done a bit of work on including "unique IDs"--by hashing all of the columns--in source data we publish.
Plenario no longer requires a unique ID column to ingest datasets. I did this using a method that @levyj suggested: hashing all records. Now when Plenario periodically refreshes, it compares hashes and deletes records currently in Plenario not present in the data provider's dataset, while inserting records in the provider's dataset not present in Plenario.
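To make the comparison concrete, here is a rough sketch of a hash-and-compare refresh using plain Python sets. Plenario's actual implementation does this work against its database; the data structures and function names below are simplified assumptions for illustration.

```python
import hashlib
import json

def record_hash(record: dict) -> str:
    """Hash every column of a record so the hash itself acts as the unique ID."""
    canonical = json.dumps(record, sort_keys=True, default=str)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()

def diff_refresh(current: dict, incoming_records: list) -> dict:
    """current: {hash: record} already stored; incoming_records: provider's latest dump."""
    incoming = {record_hash(r): r for r in incoming_records}
    to_delete = set(current) - set(incoming)   # rows in Plenario but gone upstream
    kept = {h: r for h, r in current.items() if h not in to_delete}
    to_insert = {h: r for h, r in incoming.items() if h not in current}  # new upstream rows
    kept.update(to_insert)
    return kept
```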
Great! Thanks.
Sad to know I get no credit for it. :(
I remember discussing the idea of hashing lat/lon + time as a unique ID, and it was dismissed as absurd -- as taking away a core user requirement. Thanks @levyj for saying the same thing, certainly in a better way.
IMO the downside of hashing is that it is computationally expensive for large datasets, so I would still keep a watch on this change. But I am glad to know that, for now, it is working as I originally thought it would.
Are the ETLs rerunning? Seems like a lot of datasets are offline right now.
It looks like a lot of datasets were unavailable this evening, but then came back online at ~6:30 EST. No idea yet why they were unavailable or why they returned.
Interesting to know that in addition to knowingly stealing the idea from Socrata, I unknowingly stole it from @Pinkalicious. Credit happily shared!
Does Plenario still make use of a unique ID if there is one? That might help with the computational expense issue. If we are already paying that cost (so far, never too high for our datasets), Plenario might be able to avoid paying it again.
@levyj Per this change, we don't make use of a unique ID at all. Fortunately, the computational expense hasn't been a worry for us so far either (updates take a few minutes at most, even for bigger datasets like Chicago Crimes).
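For reference, a hypothetical sketch of the optimization @levyj asks about (not something Plenario does, per the reply above): when the source already carries a trustworthy unique ID column, key rows on that column alone instead of hashing every field. The function and parameter names are made up for this sketch.

```python
import hashlib

def row_key(record, unique_id_column=None):
    """Cheaper key when a source-provided unique ID exists; otherwise hash all columns.

    Caveat: keying on the ID alone cannot detect rows whose ID is unchanged
    but whose other values were edited upstream.
    """
    if unique_id_column and record.get(unique_id_column) is not None:
        payload = str(record[unique_id_column])
    else:
        payload = "|".join(f"{k}={record[k]}" for k in sorted(record))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()
```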