cloudyr/bigQueryR

download_url continues to display "job not done"

Closed this issue · 39 comments

job_extract <- bqr_extract_data("marketing-insights", "Rob", "my_tbl", "causey")

bqr_wait_for_job(bqr_get_job('marketing-insights', job_extract$jobReference$jobId))

bqr_extract_data("marketing-insights", "Rob", "my_tbl", "causey")
Error in bqr_grant_extract_access(job_extract, "causey@spotify.com") :
  Job not done

Also tried this...

download_url <- bqr_grant_extract_access(bqr_get_job('marketing-insights', job_extract$jobReference$jobId), "causey@spotify.com")
Request Status Code: 403
Error in checkGoogleAPIError(req) : 
  JSON fetch error: The owner of the resource is required to have OWNER access.

Even though I am signed in as causey@spotify.com via:

options(googleAuthR.scopes.selected = c('https://www.googleapis.com/auth/devstorage.full_control', 'https://www.googleapis.com/auth/cloud-platform'))
options(bigQueryR.scopes= c('https://www.googleapis.com/auth/devstorage.full_control', 'https://www.googleapis.com/auth/cloud-platform'))
library(bigQueryR)
bqr_auth(new_user=TRUE)

(screenshot)

I'm in the middle of moving tasks, so it's unstable at the moment; the idea is that you pass the job object, not job$name. I guess this is the GitHub version from our earlier conversation :)

ah, yes, forgot I was on your GitHub version! Haha.

Got it. So this is the job object:

bqr_get_job('marketing-insights', job_extract$jobReference$jobId)

So I should be running (on a stable version):

download_url <- bqr_grant_extract_access(bqr_get_job('marketing-insights', job_extract$jobReference$jobId), "causey@spotify.com")

FYI, I just reinstalled the CRAN version and started a fresh R session from scratch, then tried again (both iterations) and got the same errors...

You may want to install from the commit right after your fix

devtools::install_github("cloudyr/bigQueryR", ref='439ea30a9e087eca30d9b8b7bb0e3a8270b0b160')

Done! (Still got same errors, but at least I have the fix)
Didn't realize you could install from a specific commit with install_github. Handy.

OK cool, I'll look at it; I'm traveling at the moment, at Google Next :)

@causeyrob I've split up the issues into another Github issue, we can talk about the query not working here: #36

For the original job extract from GCS:

job_extract <- bqr_extract_data("marketing-insights", "Rob", "my_tbl", "causey")

bqr_wait_for_job(bqr_get_job('marketing-insights', job_extract$jobReference$jobId))

bqr_extract_data("marketing-insights", "Rob", "my_tbl", "causey")
Error in bqr_grant_extract_access(job_extract, "causey@spotify.com") :
  Job not done

I'm changing jobs to have an easier syntax, so you won't need to find the $jobId; you'll just pass the returned job object, e.g.

job_extract <- bqr_extract_data("marketing-insights", "Rob", "my_tbl", "causey")
bqr_wait_for_job(job_extract)

Regarding the issue:

Your syntax looks OK, but is the problem that the job reported it was done when it was not? You can check in bqr_list_jobs() or in the web interface whether your particular job was successful.
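
For example, something like this (a sketch; the status fields follow the standard BigQuery job resource, and job_extract is the object from the snippet above):

## list recent jobs in the project to see their states
bqr_list_jobs("marketing-insights")

## or fetch the specific job and inspect its status fields
job <- bqr_get_job("marketing-insights", job_extract$jobReference$jobId)
job$status$state        ## should be "DONE"
job$status$errorResult  ## NULL if the job succeeded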

It may be that the job didn't finish because you queried it too quickly, or that the extract failed. From the error you show, it sounds as if the Google Cloud Storage bucket needs to be owned by the same user as the BigQuery extractor (e.g. causey[at]spotify.com).

Is it possible you are trying to extract to a bucket you didn't create? Remember the owner of the file will be the user authenticated at the time; you can now check who is authenticated in the script via:

options(googleAuthR.verbose = 2)
googleAuthR::gar_token_info()

Thanks! Hope the conference has been fun.

Yes, two things. The bucket is definitely mine, but I believe a colleague may have created it for me; that's potential problem #1. The other is that I am using a service account. This is what I see when I run the gar_token_info() code you mention:

Type: service_account
ProjectID: marketing-insights
Client email: causey@marketing-insights.iam.gserviceaccount.com

So it seems like I should potentially try again with my own personal authentication rather than the service account (i.e. causey@spotify.com instead of causey@marketing-insights.iam.gserviceaccount.com). I could also create a brand-new bucket and play around with that instead, since it's one I would definitely own.

Any idea how to check bucket ownership? When I went into the bucket settings, it just displayed some odd-looking tokens, but the token next to "You" was different from the token next to "Owners", so perhaps therein lies my answer.

The conference was very awesome, back with lots of ideas :)

Yes, it sounds like ownership is the issue. Confusingly, it can also be the case that you create a bucket but don't own the objects within it. The safest approach is to make sure Google Cloud Storage and BigQuery are both authenticated as the same user. You need at least write access for the bucket objects, which, from memory, you can alter via the drop-down to the left of the folder when you browse the website.

Hmm... now I'm confused. I created everything in the bucket, but you're saying I might not own these objects regardless? I assume I have write access to the bucket since I created everything in it, but maybe I'm not understanding. The drop-down to the left of the folder when browsing the website is where I encountered those strange tokens.

It's all sorted if you make sure you authenticate bigQueryR and googleCloudStorageR with the same credentials (e.g. use the same service account), then make sure that service account's email has access to the bucket. Perhaps you need to add that email as a user on the bucket, with write access.
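
For illustration, a minimal sketch of the idea, assuming a service-account JSON file (the path is a placeholder) and the broad cloud-platform scope so one token covers both BigQuery and Cloud Storage; one way to authenticate is via googleAuthR::gar_auth_service():

library(googleAuthR)
library(bigQueryR)
library(googleCloudStorageR)

## one broad scope so the same token works for BigQuery and Cloud Storage
options(googleAuthR.scopes.selected = "https://www.googleapis.com/auth/cloud-platform")

## authenticate once with the service-account JSON; both packages share this token
gar_auth_service(json_file = "/path/to/service-account.json")

## confirm which identity is now in use
options(googleAuthR.verbose = 2)
gar_token_info()

Then add the email shown by gar_token_info() to the bucket's permissions with write access.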

Update here: the bucket permissions DID allow me to use the bqr_extract_data function successfully, so thank you very much! The bqr_grant_extract_access function, however, continues to error out when I follow your example. With some tweaks, I once got it to complete without an error, but without actually sending the email either.

As per http://code.markedmondson.me/bigQueryR/query.html

library(bigQueryR)

## Auth with a project that has at least BigQuery and Google Cloud Storage scope
bqr_auth()

## make a big query
job <- bqr_query_asynch("marketing-insights",
                        "Rob",
                        "SELECT * FROM test LIMIT 10",
                        destinationTableId = "test2")

## poll the job to check its status
## it's done when job$status$state == "DONE"
bqr_wait_for_job(bqr_get_job("marketing-insights", job$jobReference$jobId))

## once done, the query results are in the destination table "test2"

## Create the data extract from BigQuery to Cloud Storage
job_extract <- bqr_extract_data("marketing-insights",
                                "Rob",
                                "test2",
                                "causey",
                                filename = "test2.csv")

## poll the extract job to check its status
## it's done when job$status$state == "DONE"
bqr_wait_for_job(bqr_get_job("marketing-insights", job_extract$jobReference$jobId))

## to download via a URL rather than logging in via the Google Cloud Storage interface:
## use an email that is Google-account enabled
## Requires scopes:
##   https://www.googleapis.com/auth/devstorage.full_control
##   https://www.googleapis.com/auth/cloud-platform
## set via options("bigQueryR.scopes") and reauthenticate if needed

download_url <- bqr_grant_extract_access(job_extract, "causey@spotify.com")
Error in bqr_grant_extract_access(job_extract, "causey@spotify.com") :
  Job not done

Thanks, do you also have your working example code?

download_url <- bqr_grant_extract_access(bqr_get_job("marketing-insights", job_extract$jobReference$jobId), "causey@spotify.com")

OK, I think this is most likely because you are not authenticated with Google Cloud Storage as well. The CRAN release defaults to BigQuery-only authentication.

I have changed the default scope to https://www.googleapis.com/auth/cloud-platform, which is easiest when you are dealing with BigQuery and its interaction with Cloud Storage, and updated the examples.

An example that works in the new tests is:

  job_extract <- bqr_extract_data(tableId = "test3",
                                  cloudStorageBucket = gcs_get_global_bucket())

  job <- bqr_wait_for_job(job_extract)
  
  urls <- bqr_grant_extract_access(job, email = "my@email.com")

Hmm... as per https://github.com/cloudyr/bigQueryR, I tried installing via install.packages("bigQueryR", repos = c(getOption("repos"), "http://cloudyr.github.io/drat")) but got this error:
(screenshot)

I'm guessing that to get the latest changes I'll need to go straight to devtools::install_github("cloudyr/bigQueryR")?

I definitely authenticated with GCS the last time we spoke about it...
(screenshots)

I also used that scope...
(screenshot)

Yes, sorry, I should have said the GitHub version.

And I would now recommend authenticating via an environment variable, so you don't have to think about it too much again. This involves putting a .Renviron file in your home directory with these contents:

GCS_AUTH_FILE="location/auth.json"
BQ_AUTH_FILE="location/auth.json"

Since googleCloudStorageR uses this too, it makes sure both packages are authenticated with the same file. Then you just need to load the library and it will handle the auth automatically.
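
As a quick check (a sketch), after restarting R you can confirm the values are picked up before the packages load and authenticate:

## the paths set in ~/.Renviron should show up here
Sys.getenv(c("GCS_AUTH_FILE", "BQ_AUTH_FILE"))

## on load, each package authenticates from its environment variable
library(bigQueryR)
library(googleCloudStorageR)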

Hmm, using your new code off of GHE, I get this error...
(screenshot)

I tried doing this beforehand as a result, but the same error occurred.
(screenshot)

This is set now via a new function:

bq_global_project("your-project")

I will soon also have an option to put this in the environment file (this makes it more in line with the other cloudyr packages).

Oh its in already! So add this to the environ file too:

BQ_DEFAULT_PROJECT_ID="your-project"
BQ_DEFAULT_DATASET="your dataset"

:/ Same error.
(screenshot)

I now have .Renviron in my working directory, with the exact file path of my service account credentials, the default project ID, and the dataset.

GCS_AUTH_FILE="/Users/robcausey/Code/GitHub/causey.json"
BQ_AUTH_FILE="/Users/robcausey/Code/GitHub/causey.json"
BQ_DEFAULT_PROJECT_ID="marketing-insights"
BQ_DEFAULT_DATASET="Rob"

I also tried restarting beforehand, verifying the file paths, and adding those variables to my environment manually.

Just to make sure I'm on the correct version: it shows up as 0.2.0.90.

After updating, I also get this again, even though I'm clearly the owner (as shown in the earlier screenshot).
(screenshots)

(screenshot)

The project setting should work after you restart R, after which you can see it via Sys.getenv("BQ_DEFAULT_PROJECT_ID")

Then, when loading library(bigQueryR), it should say something like below:

(screenshot)

If it doesn't, set it via bq_global_project("project-name").

The job_extract "Job not done" error is peculiar; are you perhaps using an old job object? The bug there was that it reported DONE when the job had in fact errored (I assume from not being able to write to the cloud bucket).

Call bqr_grant_extract_access() directly on the job object, not via bqr_get_job(), as that should be unnecessary and is perhaps the problem.

So, all in all, make sure to run through extracting and waiting for the job:

## creates a job object
job_extract <- bqr_extract_data(tableId = "test3",
                                cloudStorageBucket = gcs_get_global_bucket())

## check the job
job_extract

## wait for the job until it's DONE
job <- bqr_wait_for_job(job_extract)

## use the completed job
urls <- bqr_grant_extract_access(job, email = "my@email.com")

If that's the same and it's still not working, then I need to fix something else.

I got the library(bigQueryR) loading to appear as shown (I was having an issue with .Renviron, but I've resolved it now).

However, I am getting errors from the bqr_extract_data function. gcs_get_global_bucket() returns NULL (even if I load googleAuthR directly, which also claims to authenticate via my JSON file). If I name the bucket directly, it gives an "Invalid credential" JSON fetch error, even though the JSON has worked for authentication otherwise.

(screenshot)

Retrying the code from before, which had worked up until the very end, now fails sooner than it did previously, so I think something is wrong.
bqr_query_asynch was working previously, but now fails with the JSON fetch credential error as well.
(screenshot)

OK, the offending line there is the scopes being set only to Cloud Storage (https://www.googleapis.com/auth/devstorage.full_control). The problem with Google APIs is that they give a similar error whether it's the wrong user, the wrong scope, etc., so it's difficult for the user to pin down.

If googleCloudStorageR is loaded it sets that scope automatically, but on the GitHub version of bigQueryR you don't need to load googleCloudStorageR, and it also sets the more general scope https://www.googleapis.com/auth/cloud-platform, which allows both BigQuery and Cloud Storage. See my screenshot above for an example of what it should look like.
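
If in doubt, a quick way to see what the current session will use before authenticating (a sketch):

## the scopes the next authentication will request
getOption("googleAuthR.scopes.selected")

## and which token/scopes are currently in use
googleAuthR::gar_token_info()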

So it's not necessarily a bug, but I'm going to work on making this easier for the user to diagnose, as authentication issues suck :)

I will also use this thread to help document it for the next release on CRAN (which will come as soon as these issues are sorted out), as a lot of new changes are in now. Thanks so much for the help! 🥇

I think this is solved by getting the scopes all cleared up: make sure you have the latest versions of googleAuthR and bigQueryR, start a new session, and the best scope to use is this:

options(googleAuthR.scopes.selected = "https://www.googleapis.com/auth/cloud-platform")
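
For reference, a condensed end-to-end sketch of the workflow discussed in this thread (the project, dataset, table, bucket, and email are placeholders; argument order follows the examples above):

options(googleAuthR.scopes.selected = "https://www.googleapis.com/auth/cloud-platform")

library(bigQueryR)
bqr_auth()   ## or authenticate with a service-account JSON via googleAuthR

## extract a table from BigQuery to a Cloud Storage bucket you can write to
job_extract <- bqr_extract_data("my-project", "my_dataset", "my_table", "my-bucket",
                                filename = "my_table.csv")

## wait until the extract job is DONE
job <- bqr_wait_for_job(job_extract)

## grant a Google-account email access and get a download URL
download_url <- bqr_grant_extract_access(job, email = "me@example.com")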