/tap-google-sheets

Primary LanguagePythonGNU Affero General Public License v3.0AGPL-3.0

tap-google-sheets

This is a Singer tap that produces JSON-formatted data following the Singer spec.

This tap:

  • Pulls raw data from the Google Sheets v4 API
  • Extracts the following endpoints:
  • Outputs the following metadata streams:
    • File Metadata: Name, audit/change info from Google Drive
    • Spreadsheet Metadata: Basic metadata about the Spreadsheet: Title, Locale, URL, etc.
    • Sheet Metadata: Title, URL, Area (max column and row), and Column Metadata
      • Column Metadata: Column Header Name, Data type, Format
    • Sheets Loaded: Sheet title, load date, number of rows
  • For each Sheet:
    • Outputs the schema for each resource (based on the column header and datatypes of row 2, the first row of data)
    • Outputs a record for all columns that have column headers, and for each row of data
    • Emits a Singer ACTIVATE_VERSION message after each sheet is complete. This forces hard deletes on the data downstream if fewer records are sent.
    • Primary Key for each row in a Sheet is the Row Number: __sdc_row
    • Each Row in a Sheet also includes Foreign Keys to the Spreadsheet Metadata, __sdc_spreadsheet_id, and Sheet Metadata, __sdc_sheet_id.

API Endpoints

file (GET)

metadata (GET)

  • Endpoint: https://sheets.googleapis.com/v4/spreadsheets/${spreadsheet_id}?includeGridData=true&ranges=1:2
  • This endpoint eturns spreadsheet metadata, sheet metadata, and value metadata (data type information)
  • Primary keys: Spreadsheet Id, Sheet Id, Column Index
  • Foreign keys: None
  • Replication strategy: Full (get and replace file metadata for spreadshee_id in config)
  • Process/Transformations:
    • Verify Sheets: Check sheets exist (compared to catalog) and check gridProperties (available area)
      • sheetId, title, index, gridProperties (rowCount, columnCount)
    • Verify Field Headers (1st row): Check field headers exist (compared to catalog), missing headers (columns to skip), column order/position, and column name uniqueness
    • Create/Verify Datatypes based on 2nd row value and cell metadata
      • First check:
        • effectiveValue: key
          • Valid types: numberValue, stringValue, boolValue
          • Invalid types: formulaValue, errorValue
      • Then check:
        • effectiveFormat.numberFormat.type
          • Valid types: UNEPECIFIED, TEXT, NUMBER, PERCENT, CURRENCY, DATE, TIME, DATE_TIME, SCIENTIFIC
          • Determine JSON schema column data type based on the value and the above cell metadata settings.
          • If DATE, DATE_TIME, or TIME, set JSON schema format accordingly

values (GET)

Authentication

The Google Sheets Setup & Authentication Google Doc provides instructions show how to configure the Google Cloud API credentials to enable Google Drive and Google Sheets APIs, configure Google Cloud to authorize/verify your domain ownership, generate an API key (client_id, client_secret), authenticate and generate a refresh_token, and prepare your tap config.json with the necessary parameters.

  • Enable Googe Drive APIs and Authorization Scope: https://www.googleapis.com/auth/drive.metadata.readonly
  • Enable Google Sheets API and Authorization Scope: https://www.googleapis.com/auth/spreadsheets.readonly
  • Tap config.json parameters:
    • client_id: identifies your application
    • client_secret: authenticates your application
    • refresh_token: generates an access token to authorize your session
    • spreadsheet_id: unique identifier for each spreadsheet in Google Drive
    • start_date: absolute minimum start date to check file modified
    • user_agent: tap-name and email address; identifies your application in the Remote API server logs

Quick Start

  1. Install

    Clone this repository, and then install using setup.py. We recommend using a virtualenv:

    > virtualenv -p python3 venv
    > source venv/bin/activate
    > python setup.py install
    OR
    > cd .../tap-google-sheets
    > pip install .
  2. Dependent libraries The following dependent libraries were installed.

    > pip install target-json
    > pip install target-stitch
    > pip install singer-tools
    > pip install singer-python
  3. Create your tap's config.json file. Include the client_id, client_secret, refresh_token, site_urls (website URL properties in a comma delimited list; do not include the domain-level property in the list), start_date (UTC format), and user_agent (tap name with the api user email address).

    {
        "client_id": "YOUR_CLIENT_ID",
        "client_secret": "YOUR_CLIENT_SECRET",
        "refresh_token": "YOUR_REFRESH_TOKEN",
        "spreadsheet_id": "YOUR_GOOGLE_SPREADSHEET_ID",
        "start_date": "2019-01-01T00:00:00Z",
        "user_agent": "tap-google-sheets <api_user_email@example.com>"
    }

    Optionally, also create a state.json file. currently_syncing is an optional attribute used for identifying the last object to be synced in case the job is interrupted mid-stream. The next run would begin where the last job left off. Only the performance_reports uses a bookmark. The date-time bookmark is stored in a nested structure based on the endpoint, site, and sub_type.

    {
        "currently_syncing": "file_metadata",
        "bookmarks": {
            "file_metadata": "2019-09-27T22:34:39.000000Z"
        }
    }
  4. Run the Tap in Discovery Mode This creates a catalog.json for selecting objects/fields to integrate:

    tap-google-sheets --config config.json --discover > catalog.json

    See the Singer docs on discovery mode here.

  5. Run the Tap in Sync Mode (with catalog) and write out to state file

    For Sync mode:

    > tap-google-sheets --config tap_config.json --catalog catalog.json > state.json
    > tail -1 state.json > state.json.tmp && mv state.json.tmp state.json

    To load to json files to verify outputs:

    > tap-google-sheets --config tap_config.json --catalog catalog.json | target-json > state.json
    > tail -1 state.json > state.json.tmp && mv state.json.tmp state.json

    To pseudo-load to Stitch Import API with dry run:

    > tap-google-sheets --config tap_config.json --catalog catalog.json | target-stitch --config target_config.json --dry-run > state.json
    > tail -1 state.json > state.json.tmp && mv state.json.tmp state.json
  6. Test the Tap

    While developing the Google Search Console tap, the following utilities were run in accordance with Singer.io best practices: Pylint to improve code quality:

    > pylint tap_google_sheets -d missing-docstring -d logging-format-interpolation -d too-many-locals -d too-many-arguments

    Pylint test resulted in the following score:

    Your code has been rated at 9.78/10

    To check the tap and verify working:

    > tap-google-sheets --config tap_config.json --catalog catalog.json | singer-check-tap > state.json
    > tail -1 state.json > state.json.tmp && mv state.json.tmp state.json

    Check tap resulted in the following:

    The output is valid.
    It contained 3881 messages for 13 streams.
    
        13 schema messages
      3841 record messages
        27 state messages
    
    Details by stream:
    +----------------------+---------+---------+
    | stream               | records | schemas |
    +----------------------+---------+---------+
    | file_metadata        | 1       | 1       |
    | spreadsheet_metadata | 1       | 1       |
    | Test-1               | 9       | 1       |
    | Test 2               | 2       | 1       |
    | SKU COGS             | 218     | 1       |
    | Item Master          | 216     | 1       |
    | Retail Price         | 273     | 1       |
    | Retail Price NEW     | 284     | 1       |
    | Forecast Scenarios   | 2681    | 1       |
    | Promo Type           | 91      | 1       |
    | Shipping Method      | 47      | 1       |
    | sheet_metadata       | 9       | 1       |
    | sheets_loaded        | 9       | 1       |
    +----------------------+---------+---------+

Copyright © 2019 Stitch