cloudyr/bigQueryR

download_url continues to display "job not done"

Closed this issue · 39 comments

job_extract <- bqr_extract_data("marketing-insights", "Rob", "my_tbl", "causey")

bqr_wait_for_job(bqr_get_job('marketing-insights', job_extract$jobReference$jobId))

bqr_extract_data("marketing-insights", "Rob", "my_tbl", "causey")
Error in bqr_grant_extract_access(job_extract, "causey@spotify.com") :
  Job not done

Also tried this...

download_url <- bqr_grant_extract_access(bqr_get_job('marketing-insights', job_extract$jobReference$jobId), "causey@spotify.com")
Request Status Code: 403
Error in checkGoogleAPIError(req) : 
  JSON fetch error: The owner of the resource is required to have OWNER access.

Even though I am signed in as causey@spotify.com via:

options(googleAuthR.scopes.selected = c('https://www.googleapis.com/auth/devstorage.full_control', 'https://www.googleapis.com/auth/cloud-platform'))
options(bigQueryR.scopes= c('https://www.googleapis.com/auth/devstorage.full_control', 'https://www.googleapis.com/auth/cloud-platform'))
library(bigQueryR)
bqr_auth(new_user=TRUE)

(screenshot)

I'm in the middle of moving tasks, so it's unstable at the moment; the idea is that you pass the job object, not job$name. I guess this is the GitHub version from our earlier conversation :)

ah, yes, forgot I was on your GitHub version! Haha.

Got it. So this is the job object:

bqr_get_job('marketing-insights', job_extract$jobReference$jobId)

So I should be running (on a stable version):

download_url <- bqr_grant_extract_access(bqr_get_job('marketing-insights', job_extract$jobReference$jobId), "causey@spotify.com")

FYI, I just reinstalled the CRAN version and started a fresh R session from scratch, then tried again (both iterations) and got the same errors...

You may want to install from the commit right after your fix

devtools::install_github("cloudyr/bigQueryR", ref='439ea30a9e087eca30d9b8b7bb0e3a8270b0b160')

Done! (Still got same errors, but at least I have the fix)
Didn't realize you could install from a specific commit with install_github. Handy.

OK cool, I'll look at it; I'm traveling at the moment, at Google Next :)

@causeyrob I've split up the issues into another Github issue, we can talk about the query not working here: #36

For the original job extract from GCS:

job_extract <- bqr_extract_data("marketing-insights", "Rob", "my_tbl", "causey")

bqr_wait_for_job(bqr_get_job('marketing-insights', job_extract$jobReference$jobId))

bqr_extract_data("marketing-insights", "Rob", "my_tbl", "causey")
Error in bqr_grant_extract_access(job_extract, "causey@spotify.com") :
  Job not done

I'm changing jobs to have an easier syntax, so you won't need to find the $jobId; you'll just pass the returned job object, e.g.

job_extract <- bqr_extract_data("marketing-insights", "Rob", "my_tbl", "causey")
bqr_wait_for_job(job_extract)

Regarding the issue:

Your syntax looks OK, but is the problem that the job reported it was done when it was not? You can check in bqr_list_jobs() or in the web interface whether your particular job was successful.
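
For example, something like this (a sketch; the status fields follow the standard BigQuery job resource, and job_extract is the object from the snippet above):

## list recent jobs in the project to see their states
bqr_list_jobs("marketing-insights")

## or fetch the specific job and inspect its status fields
job <- bqr_get_job("marketing-insights", job_extract$jobReference$jobId)
job$status$state        ## should be "DONE"
job$status$errorResult  ## NULL if the job succeeded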

It may be that the job didn't finish because you queried it too quickly, or that the extract failed. From the error you show, it sounds as if the Google Cloud Storage bucket needs to be owned by the same user as the BigQuery extractor (e.g. causey[at]spotify.com).

Is it possible you are trying to extract to a bucket you didn't create? Remember the owner of the file will be the user authenticated at the time; you can now check who is authenticated in the script via:

options(googleAuthR.verbose = 2)
googleAuthR::gar_token_info()

Thanks! Hope the conference has been fun.

Yes, two things. The bucket is definitely mine, but I believe a colleague may have created it for me; that's potential problem #1. The other is that I am using a service account. This is what I see when I run the gar_token_info() code you mention:

Type: service_account
ProjectID: marketing-insights
Client email: causey@marketing-insights.iam.gserviceaccount.com

So it seems like I should potentially try again with my own personal authentication rather than the service account (i.e. causey@spotify.com instead of causey@marketing-insights.iam.gserviceaccount.com). I could also create a brand-new bucket and play around with that instead, since it's one I would definitely own.

Any idea how to check bucket ownership? When I went into the bucket settings, it just displayed some odd-looking tokens, but the token next to "You" was different from the token next to "Owners", so perhaps therein lies my answer.

The conference was very awesome, back with lots of ideas :)

Yes, it sounds like ownership is the issue. Confusingly, it can also be the case that you create a bucket but don't own the objects within it. The safest approach is to make sure Google Cloud Storage and BigQuery are both authenticated as the same user. You need at least write access for the bucket objects, which, from memory, you can alter via the drop-down to the left of the folder when you browse the website.

Hmm... now I'm confused. I created everything in the bucket, but you're saying I might not own these objects regardless? I assume I have write access to the bucket since I created everything in it, but maybe I'm not understanding. The drop-down to the left of the folder when browsing the website is where I encountered those strange tokens.

It's all sorted if you make sure you authenticate bigQueryR and googleCloudStorageR with the same credentials (e.g. use the same service account), then make sure that service account's email has access to the bucket. Perhaps you need to add that email as a user on the bucket, with write access.
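
For illustration, a minimal sketch of the idea, assuming a service-account JSON file (the path is a placeholder) and the broad cloud-platform scope so one token covers both BigQuery and Cloud Storage; one way to authenticate is via googleAuthR::gar_auth_service():

library(googleAuthR)
library(bigQueryR)
library(googleCloudStorageR)

## one broad scope so the same token works for BigQuery and Cloud Storage
options(googleAuthR.scopes.selected = "https://www.googleapis.com/auth/cloud-platform")

## authenticate once with the service-account JSON; both packages share this token
gar_auth_service(json_file = "/path/to/service-account.json")

## confirm which identity is now in use
options(googleAuthR.verbose = 2)
gar_token_info()

Then add the email shown by gar_token_info() to the bucket's permissions with write access.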

Update here: the bucket permissions DID allow me to use the bqr_extract_data function successfully, so thank you very much! The bqr_grant_extract_access function, however, continues to error out when I follow your example. With some tweaks, I once got it to complete without an error, but without actually sending the email either.

As per http://code.markedmondson.me/bigQueryR/query.html

library(bigQueryR)

## Auth with a project that has at least BigQuery and Google Cloud Storage scope
bqr_auth()

## make a big query
job <- bqr_query_asynch("marketing-insights",
                        "Rob",
                        "SELECT * FROM test LIMIT 10",
                        destinationTableId = "test2")

## poll the job to check its status
## it's done when job$status$state == "DONE"
bqr_wait_for_job(bqr_get_job("marketing-insights", job$jobReference$jobId))

## once done, the query results are in the destination table "test2"

## Create the data extract from BigQuery to Cloud Storage
job_extract <- bqr_extract_data("marketing-insights",
                                "Rob",
                                "test2",
                                "causey",
                                filename = "test2.csv")

## poll the extract job to check its status
## it's done when job$status$state == "DONE"
bqr_wait_for_job(bqr_get_job("marketing-insights", job_extract$jobReference$jobId))

## to download via a URL rather than logging in via the Google Cloud Storage interface:
## use an email that is Google-account enabled
## Requires scopes:
##   https://www.googleapis.com/auth/devstorage.full_control
##   https://www.googleapis.com/auth/cloud-platform
## set via options("bigQueryR.scopes") and reauthenticate if needed

download_url <- bqr_grant_extract_access(job_extract, "causey@spotify.com")
Error in bqr_grant_extract_access(job_extract, "causey@spotify.com") :
  Job not done

Thanks, do you also have your working example code?

download_url <- bqr_grant_extract_access(bqr_get_job("marketing-insights", job_extract$jobReference$jobId), "causey@spotify.com")

OK, I think this is most likely because you are not authenticated with Google Cloud Storage as well. The CRAN release defaults to BigQuery-only authentication.

I have changed the default scope to https://www.googleapis.com/auth/cloud-platform, which is easiest when you are dealing with BigQuery and its interaction with Cloud Storage, and updated the examples.

An example that works in the new tests is:

  job_extract <- bqr_extract_data(tableId = "test3",
                                  cloudStorageBucket = gcs_get_global_bucket())

  job <- bqr_wait_for_job(job_extract)
  
  urls <- bqr_grant_extract_access(job, email = "my@email.com")

Hmm... as per https://github.com/cloudyr/bigQueryR, I tried installing via install.packages("bigQueryR", repos = c(getOption("repos"), "http://cloudyr.github.io/drat")) but got this error:
(screenshot)

I'm guessing that to get the latest changes I'll need to go straight to devtools::install_github("cloudyr/bigQueryR")?

I definitely authenticated with GCS the last time we spoke about it...
(screenshots)

I also used that scope...
(screenshot)

Yes, sorry, I should have said the GitHub version.

And I would now recommend authenticating via an environment variable, so you don't have to think about it too much again. This involves putting a .Renviron file in your home directory with these contents:

GCS_AUTH_FILE="location/auth.json"
BQ_AUTH_FILE="location/auth.json"

Since googleCloudStorageR uses this too, it makes sure both packages are authenticated with the same file. Then you just need to load the library and it will handle the auth automatically.
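
As a quick check (a sketch), after restarting R you can confirm the values are picked up before the packages load and authenticate:

## the paths set in ~/.Renviron should show up here
Sys.getenv(c("GCS_AUTH_FILE", "BQ_AUTH_FILE"))

## on load, each package authenticates from its environment variable
library(bigQueryR)
library(googleCloudStorageR)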

Hmm, using your new code off of GHE, I get this error...
(screenshot)

I tried doing this beforehand as a result, but the same error occurred.
(screenshot)

This is set now via a new function:

bq_global_project("your-project")

I will soon also have an option to put this in the environment file (this makes it more in line with the other cloudyr packages).

Oh its in already! So add this to the environ file too:

BQ_DEFAULT_PROJECT_ID="your-project"
BQ_DEFAULT_DATASET="your dataset"

:/ Same error.
(screenshot)

I now have .Renviron in my working directory, with the exact file path of my service account credentials, the default project ID, and the dataset.

GCS_AUTH_FILE="/Users/robcausey/Code/GitHub/causey.json"
BQ_AUTH_FILE="/Users/robcausey/Code/GitHub/causey.json"
BQ_DEFAULT_PROJECT_ID="marketing-insights"
BQ_DEFAULT_DATASET="Rob"

I also tried restarting beforehand, verifying the file paths, and adding those variables to my environment manually.

Just to make sure I'm on the correct version: it shows up as 0.2.0.90.

After updating, I also get this again, even though I'm clearly the owner (as shown in the earlier screenshot).
(screenshots)

(screenshot)

The project setting should work after you restart R, after which you can see it via Sys.getenv("BQ_DEFAULT_PROJECT_ID")

Then, when loading library(bigQueryR), it should say something like below:

(screenshot)

If it doesn't, set it via bq_global_project("project-name").

The job_extract "Job not done" error is peculiar; are you perhaps using an old job object? The bug there was that it reported DONE when the job had in fact errored (I assume from not being able to write to the cloud bucket).

Call bqr_grant_extract_access() directly on the job object, not via bqr_get_job(), as that should be unnecessary and is perhaps the problem.

So, all in all, make sure to run through extracting and waiting for the job:

## creates a job object
job_extract <- bqr_extract_data(tableId = "test3",
                                cloudStorageBucket = gcs_get_global_bucket())

## check the job
job_extract

## wait for the job until it's DONE
job <- bqr_wait_for_job(job_extract)

## use the completed job
urls <- bqr_grant_extract_access(job, email = "my@email.com")

If that's the same and it's still not working, then I need to fix something else.

I got the library(bigQueryR) loading to appear as shown (I was having an issue with .Renviron, but I've resolved it now).

However, I am getting errors from the bqr_extract_data function. gcs_get_global_bucket() returns NULL (even if I load googleAuthR directly, which also claims to authenticate via my JSON file). If I name the bucket directly, it gives an "Invalid credential" JSON fetch error, even though the JSON has worked for authentication otherwise.

(screenshot)

Retrying the code from before, which had worked up until the very end, now fails sooner than it did previously, so I think something is wrong.
bqr_query_asynch was working previously, but now fails with the JSON fetch credential error as well.
(screenshot)

OK, the offending line there is the scopes being set only to Cloud Storage (https://www.googleapis.com/auth/devstorage.full_control). The problem with Google APIs is that they give a similar error whether it's the wrong user, the wrong scope, etc., so it's difficult for the user to pin down.

If googleCloudStorageR is loaded it sets that scope automatically, but on the GitHub version of bigQueryR you don't need to load googleCloudStorageR, and it also sets the more general scope https://www.googleapis.com/auth/cloud-platform, which allows both BigQuery and Cloud Storage. See my screenshot above for an example of what it should look like.
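
If in doubt, a quick way to see what the current session will use before authenticating (a sketch):

## the scopes the next authentication will request
getOption("googleAuthR.scopes.selected")

## and which token/scopes are currently in use
googleAuthR::gar_token_info()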

So it's not necessarily a bug, but I'm going to work on making this easier for the user to diagnose, as authentication issues suck :)

I will also use this thread to help document it for the next release on CRAN (which will come as soon as these issues are sorted out), as a lot of new changes are in now. Thanks so much for the help! 🥇

I think this is solved by getting the scopes all cleared up: make sure you have the latest versions of googleAuthR and bigQueryR, start a new session, and the best scope to use is this:

options(googleAuthR.scopes.selected = "https://www.googleapis.com/auth/cloud-platform")
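
For reference, a condensed end-to-end sketch of the workflow discussed in this thread (the project, dataset, table, bucket, and email are placeholders; argument order follows the examples above):

options(googleAuthR.scopes.selected = "https://www.googleapis.com/auth/cloud-platform")

library(bigQueryR)
bqr_auth()   ## or authenticate with a service-account JSON via googleAuthR

## extract a table from BigQuery to a Cloud Storage bucket you can write to
job_extract <- bqr_extract_data("my-project", "my_dataset", "my_table", "my-bucket",
                                filename = "my_table.csv")

## wait until the extract job is DONE
job <- bqr_wait_for_job(job_extract)

## grant a Google-account email access and get a download URL
download_url <- bqr_grant_extract_access(job, email = "me@example.com")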