Helioviewer-Project/api

Comparison of JP2 and other data holdings at different locations


wafels commented

There are four different JPIP servers running - GSFC, ROB, IAS and ESAC. Each of them serves the data stored at that location. It would be good to understand more completely which data are at which location. This would improve the ability of each Helioviewer location to fill in gaps in its data holdings.

The leading idea for this is from Bogdan's message here

My idea is to have insertion time as a column in the HV database and add the ability to query that remotely (HAPI?)

So the idea is something like this:
diagram of mirror algorithm
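
To make the insertion-time idea concrete, here is a minimal sketch of the schema change and the kind of query a mirror would run against it. The table and column names (data, ingestion_time, filename, filepath) are guesses for illustration, not the actual Helioviewer schema.

    # Hedged sketch: an ingestion-time column and the query a mirror would
    # run against it. Table/column names are hypothetical, not the real
    # Helioviewer schema.
    ADD_COLUMN_SQL = """
    ALTER TABLE data
        ADD COLUMN ingestion_time DATETIME NOT NULL
            DEFAULT CURRENT_TIMESTAMP;  -- set when the row is written (UTC per the proposal)
    """

    # What "give me everything you ingested since T" boils down to on the
    # server side (exposed remotely over HAPI or a similar endpoint):
    NEW_FILES_SQL = """
    SELECT filename, filepath, ingestion_time
    FROM data
    WHERE ingestion_time > %s            -- last remote ingestion time seen locally
    ORDER BY ingestion_time;
    """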

If insertion time is the best metric to use, there are some things to think about:

  • It only works one way. When B mirrors from A, the timestamps in B will be newer than those in A.
  • I'm not sure whether there could be issues when running in mirror mode and pulling from upstream sources at the same time.

@bogdanni @ebuchlin, any thoughts?

Hello, thanks for the proposal! We definitely need a better synchronization process.

A few clarifications first about your flowchart:

  • Is "query primary server for new files" about ingestion date being more recent than the last (remote) ingestion date of the files already received (by the local server)? (I guess yes)
  • In the different servers' databases, is ingestion date the ingestion date on the local server, or first ingestion date of this version of the file on any server?
  • In the box "is timestamp more recent?", is this about the remote file timestamp being more recent than the local file timestamp? (I guess yes) But then how is the remote file timestamp obtained by the local server?
  • What is happening to the local database after download, do we keep the same file ingestion mechanism as currently?

A proposal:

  • The "ingestion date" (everything in UTC) in the local database is the local ingestion date, which should be more or less equivalent to the local file timestamp (and same for the remote server, which is "local" from its own viewpoint)
  • The query to the remote server is by ingestion date range (usually, in routine operations: from the last locally processed remote ingestion date to now)
    • This query uses the remote server's HAPI interface
    • There could be other conditions in the query (observation date range, dataset...) because maybe we don't want to update everything at once, but then the last locally processed remote ingestion date cannot be stored to be used later anymore.
  • For each file in the result of this query, ordered by ingestion date, if (the file does not exist in the local database) or (the local ingestion date is older than the remote ingestion date) (a code sketch of this loop follows the list):
    • download file
    • ingest it (setting the ingestion date to the current date; replacing existing database row if there is one)
  • There could be a mechanism to check that the servers' times (clocks) are synchronized as expected
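
Roughly, in Python (a sketch only; remote.query(), local_db.lookup()/ingest() and download() are placeholders standing in for the existing Helioviewer machinery, not a real API):

    from datetime import datetime, timezone

    def sync_from_remote(remote, local_db, last_remote_ingestion):
        """One pass of the proposed mirror loop (sketch only).

        remote.query(), local_db.lookup()/ingest() and download() are
        placeholders for the existing Helioviewer ingestion machinery.
        """
        now = datetime.now(timezone.utc)
        # Records are assumed to come back ordered by remote ingestion date.
        for rec in remote.query(ingested_after=last_remote_ingestion,
                                ingested_before=now):
            local = local_db.lookup(rec.name)
            # Fetch if the file is unknown locally, or if the remote copy was
            # ingested more recently than the local one.
            if local is None or local.ingestion_time < rec.ingestion_time:
                path = download(rec.jp2_url)
                # Ingestion sets the *local* ingestion date to "now" and
                # replaces any existing database row for this file.
                local_db.ingest(path,
                                ingestion_time=datetime.now(timezone.utc))
            # Remember how far we got so the next run can resume from here.
            last_remote_ingestion = rec.ingestion_time
        return last_remote_ingestion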

Clarifications:

Is "query primary server for new files" about ingestion date being more recent than the last (remote) ingestion date of the files already received (by the local server)? (I guess yes)

Yes, new files means the ingestion date on the remote server is newer than the ingestion date on the local server. Since after the local server pulls the new file, its local ingestion date will be newer than the remote ingestion date, this method only works one way. So there must be a "primary" or "authoritative" server.

In the different servers' databases, is ingestion date the ingestion date on the local server, or first ingestion date of this version of the file on any server?

It is the ingestion date of the file on the local server.

In the box "is timestamp more recent?", is this about the remote file timestamp being more recent than the local file timestamp? (I guess yes) But then how is the remote file timestamp obtained by the local server?

Yes, it will be one of the parameters returned by the query.

What is happening to the local database after download, do we keep the same file ingestion mechanism as currently?

Yes, the existing ingestion method would stay the same. The part this proposal changes is how new files are selected. For this download method, instead of querying a web directory, it would query another Helioviewer server and filter the results following the proposal we're discussing here.

Proposal Comments:

The "ingestion date" (everything in UTC) in the local database is the local ingestion date, which should be more or less equivalent to the local file timestamp (and same for the remote server, which is "local" from its own viewpoint)

Agreed, UTC for all dates. And the date of interest is the time the file was added to the local database.

There could be other conditions in the query (observation date range, dataset...) because maybe we don't want to update everything at once, but then the last locally processed remote ingestion date cannot be stored to be used later anymore.

It sounds like the dates will need to be stored per-source. Since the database is already storing the latest ingestion dates, it could get the latest ingestion dates for the sources being updated, and choose the oldest time from that selection.
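
Something like the following could implement that selection; the table and column names (data, sourceId, ingestion_time) are again assumptions rather than the real schema.

    # Hedged sketch of the per-source bookkeeping described above
    # (table and column names are hypothetical).
    LATEST_PER_SOURCE_SQL = """
    SELECT sourceId, MAX(ingestion_time) AS last_ingestion
    FROM data
    WHERE sourceId IN (%s)      -- only the sources being updated in this run
    GROUP BY sourceId;
    """

    def query_start_time(rows):
        # Start the remote query from the *oldest* per-source date so that
        # no source in this batch is left behind.
        return min(last_ingestion for _source_id, last_ingestion in rows)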

Updated diagram:
Updated diagram for sync method

I'm away and have not been able to think about this schema.

Another thing to add is the computation of a checksum for each file ingested. The checksum would be stored as a column and could be retrieved over HAPI (a sketch of the computation follows the list). This allows the following data integrity checks:

  • identify local storage corruption
  • identify when a file on a remote server is different
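
If checksums are adopted, the ingestion step could compute one along these lines (SHA-256 is used here only as an example; the issue doesn't fix an algorithm):

    import hashlib

    def jp2_checksum(path, chunk_size=1 << 20):
        """Checksum of a JP2 file, computed in chunks to keep memory flat."""
        digest = hashlib.sha256()
        with open(path, "rb") as fh:
            for chunk in iter(lambda: fh.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()  # stored in the database and served over HAPI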

So the query could select remote files based on their (remote) ingestion time, but the comparison between remote files and local files could be on checksum only? (no need to compare ingestion times if the checksums are compared)

Checksum sounds good. That seems more reliable than just checking a timestamp.
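
With a checksum column on both sides, the per-file test in the sync loop sketched above could reduce to something like this (field names are assumptions):

    def needs_download(local, remote):
        # Still select candidate files by remote ingestion time, but decide
        # per file by checksum: fetch if we have no copy or the bytes differ.
        return local is None or local.checksum != remote.checksum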

I wrote up what we've discussed so far here. I'm sure I've made some assumptions, particularly about who's mirroring which sources.

Please review and feel free to edit.

For the HAPI server, how will the datasets be grouped? We could group by source ID, so that each HAPI dataset would be at the measurement level, i.e. AIA 94 is its own dataset, AIA 304 is its own dataset, etc.
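
For example, grouping by source ID might yield a HAPI catalog along these lines (the dataset IDs and titles are illustrative only, not an agreed naming scheme):

    # Illustrative /hapi/catalog response if each source ID (i.e. each
    # measurement) becomes its own HAPI dataset; ids/titles are made up.
    catalog = {
        "HAPI": "3.1",
        "status": {"code": 1200, "message": "OK"},
        "catalog": [
            {"id": "SDO_AIA_94", "title": "SDO AIA 94 JP2 holdings"},
            {"id": "SDO_AIA_304", "title": "SDO AIA 304 JP2 holdings"},
            # ... one entry per source ID ...
        ],
    }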

Looks good. A small comment (somewhat following this comment): the query on the primary server needs to select by ingestion time, but I don't see why returning ingestion time is necessary. In pseudo SQL: SELECT name, jp2_url WHERE ingestion_time > ... should be enough; probably no need for SELECT ingestion_time, name, jp2_url WHERE ingestion_time > ....

I don't have an opinion on dataset grouping.

Makes sense. Technically in HAPI there's no way to turn that off, though. Time is always returned even if it's not requested.
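
For reference, even a HAPI 3.x-style request that asks only for the file parameters comes back with the time column first; e.g. (the server URL, dataset name and parameter names below are hypothetical):

    import requests

    # Hypothetical request for just the file name and URL; per the HAPI
    # spec the time column is still returned as the first CSV column.
    resp = requests.get("https://helioviewer.example/hapi/data", params={
        "dataset": "SDO_AIA_304",
        "start": "2024-01-01T00:00:00Z",
        "stop": "2024-01-02T00:00:00Z",
        "parameters": "name,jp2_url",
    })
    print(resp.text)  # rows of time,name,jp2_url; time included regardless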