Migrate reading FxA last-active data from S3
Currently basket reads FxA last-active timestamps from an S3 bucket and inserts the data into SFMC for email journey purposes. The files are delivered to the S3 bucket by heka, a logging service we'd like to decommission, so we need to figure out where to migrate this data.
Currently we're sending our log data to Stackdriver Logging, which means it's trivial to send FxA log data to BigQuery or Google Cloud Storage. We could also use the various data transfer services to send the data back to S3 if we want to minimize changes to the basket code.
The code for this script lives in https://github.com/mozmeao/basket/blob/ef0e672ab502142d7f9639133849e63f3cbf2b90/basket/news/management/commands/process_fxa_data.py
The downloaded file is a CSV with no header line and two columns, i.e.:
fxa_uid,unix_timestamp_in_seconds
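For context, here's a minimal sketch of consuming a file in that format. It's not the actual basket code (see the management command linked above), just an illustration of the shape of the data; the function name is made up:

```python
import csv
from datetime import datetime, timezone

def iter_last_active(path):
    """Yield (fxa_uid, last_active_datetime) pairs from a header-less CSV
    whose rows look like: fxa_uid,unix_timestamp_in_seconds."""
    with open(path, newline="") as f:
        for fxa_uid, raw_ts in csv.reader(f):
            yield fxa_uid, datetime.fromtimestamp(int(raw_ts), tz=timezone.utc)
```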
I think the best middle ground is to keep the same file format and keep delivering files, but move them to GCS. Would you agree with that, @jbuck?
Yeah, I'd agree with that. I'll get an export to GCS working.
@jbuck hey, any updates on the export to GCS? I think Heka is now just hanging by this one thread
No updates! But let's cut this thread: I'm going to ask dataops about the best way to run this scheduled export and update this bug tomorrow with whatever the best solution is.
I basically need to run a query like:
```sql
SELECT
  jsonPayload.fields.uid AS uid,
  MIN(timestamp) AS timestamp
FROM
  `moz-fx-fxa-prod-0712.fxa_prod_logs.docker_fxa_auth_20200526`
WHERE
  jsonPayload.type = 'activityEvent'
  AND jsonPayload.fields.event = 'account.signed'
GROUP BY
  1
ORDER BY
  2 ASC;
```
once a day and then export the results to GCS
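For what it's worth, once the query results land in a table, the GCS export itself is a single extract job. A rough sketch with the BigQuery Python client follows; the staging table, bucket, and date are placeholders, and matching the existing unix-seconds column would presumably mean wrapping the timestamp in something like UNIX_SECONDS() in the query above:

```python
from google.cloud import bigquery

client = bigquery.Client(project="moz-fx-fxa-prod-0712")

# Export a (hypothetical) staging table holding the query results to GCS as a
# header-less CSV, the same shape basket already consumes from S3.
job_config = bigquery.ExtractJobConfig(destination_format="CSV", print_header=False)
client.extract_table(
    "moz-fx-fxa-prod-0712.tmp.fxa_last_active_20200526",              # placeholder staging table
    "gs://example-bucket/last-active-timestamp/20200526/data-*.csv",  # placeholder bucket
    job_config=job_config,
).result()
```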
@robotblake suggested that we use Airflow hosted by data-ops, and implement something along the lines of https://github.com/mozilla/telemetry-airflow/blob/master/dags/glam_subdags/extract.py or https://github.com/mozilla/telemetry-airflow/blob/master/dags/utils/amplitude.py
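Roughly what such a DAG could look like with the Google provider operators. This is only a sketch, not the DAG that ended up in telemetry-airflow; the schedule, staging dataset, and bucket layout are assumptions:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from airflow.providers.google.cloud.transfers.bigquery_to_gcs import BigQueryToGCSOperator

with DAG(
    dag_id="fxa_last_active_export",
    schedule_interval="0 2 * * *",  # once a day
    start_date=datetime(2020, 7, 1),
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=15)},
) as dag:

    # Materialize the daily query results into a (placeholder) staging table.
    run_query = BigQueryInsertJobOperator(
        task_id="materialize_last_active",
        configuration={
            "query": {
                "query": """
                    SELECT jsonPayload.fields.uid AS uid, MIN(timestamp) AS timestamp
                    FROM `moz-fx-fxa-prod-0712.fxa_prod_logs.docker_fxa_auth_{{ ds_nodash }}`
                    WHERE jsonPayload.type = 'activityEvent'
                      AND jsonPayload.fields.event = 'account.signed'
                    GROUP BY 1
                    ORDER BY 2 ASC
                """,
                "destinationTable": {
                    "projectId": "moz-fx-fxa-prod-0712",
                    "datasetId": "tmp",
                    "tableId": "fxa_last_active_{{ ds_nodash }}",
                },
                "writeDisposition": "WRITE_TRUNCATE",
                "useLegacySql": False,
            }
        },
    )

    # Export the staging table to GCS as header-less CSV files.
    export_to_gcs = BigQueryToGCSOperator(
        task_id="export_last_active_to_gcs",
        source_project_dataset_table="moz-fx-fxa-prod-0712.tmp.fxa_last_active_{{ ds_nodash }}",
        destination_cloud_storage_uris=[
            "gs://fxa-prod-basket/last-active-timestamp/{{ ds_nodash }}/data-*.csv"
        ],
        export_format="CSV",
        print_header=False,
    )

    run_query >> export_to_gcs
```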
I'm happy to help here. Looks like the query is well-defined, so all I would need to know is where in GCS we want to dump the contents.
I think we could put it... well, anywhere. I can create a GCS bucket in the FxA project and a service account for basket/@pmac to access, if that'd work? Or, if it'd be easier to create a bucket/service account in the Airflow account, I'm fine with that too.
Creating the bucket and service account in the FxA project sounds like a good plan, as long as it's easy for you to grant the Airflow service account write access.
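If it helps, the write grant is just one IAM binding on the bucket. A sketch with the GCS Python client, where the project, bucket name, and service-account email are all placeholders:

```python
from google.cloud import storage

client = storage.Client(project="fxa-project-id")   # placeholder project
bucket = client.bucket("example-fxa-basket-bucket")  # placeholder bucket

# Allow the (placeholder) Airflow service account to create/overwrite objects.
policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append({
    "role": "roles/storage.objectAdmin",
    "members": {"serviceAccount:airflow-export@example-project.iam.gserviceaccount.com"},
})
bucket.set_iam_policy(policy)
```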
@jbuck As discussed last Tuesday, I know you have next steps identified. Can you share a timeline here?
Draft DAG for doing the daily export from BQ to GCS: mozilla/telemetry-airflow#1099
The DAG is now running and a first day of data should be available under gs://fxa-prod-basket/last-active-timestamp/20200803/
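On the basket side, picking up a day's export from that prefix would be something like the following. Just a sketch; the object naming under the date prefix and the download location are assumptions:

```python
from google.cloud import storage

def download_last_active_files(day: str, dest_dir: str = "/tmp"):
    """Download all CSV parts for a given day (YYYYMMDD) from the export bucket."""
    client = storage.Client()
    bucket = client.bucket("fxa-prod-basket")
    paths = []
    for blob in bucket.list_blobs(prefix=f"last-active-timestamp/{day}/"):
        local_path = f"{dest_dir}/{blob.name.rsplit('/', 1)[-1]}"
        blob.download_to_filename(local_path)
        paths.append(local_path)
    return paths
```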
This is going to be handled by the BigQuery view of the log data that CDP has. I'm going to be removing the processing of this data from basket in #663
@pmac - Does this mean that we can tear down the DAG that's exporting this data to GCS? mozilla/telemetry-airflow#1099
I believe so. I'm not aware of anyone using that data at this point.