wjohnson/pyapacheatlas

Custom Attributes in bulk upload

Closed this issue · 11 comments

Hi,
We are using pyapache atlas bulk upload and have give more than 4 custom attaributes using custom[] in the excel .
the excel loads only 4 columns which is visible in user defined properties in detail page and ignores the rest .

Is there any limitation for bulk upload to load only certain no of columns from excel or it will be able to upload any number of columns
from excel using bulk upload (results = client.upload_entities(entities))

Also tried to add attributes using attributeDefs . The attributes are visible in atlas but bulk upload cannot load more than 5 additional attributes defines. is there any limitation for pyapacheatlas.readers import ExcelReader

daily_bulk_entities_upload.xlsx

I have attached the file i am trying to load. we have added new attributes like max_length,precision,Defaultval,Format_Requirements,XML_Tag,Read-only_Y/N . when i upload using the pyapache atlas bulk upload it only loads the values for 5 attributes and ignores the rest of the column values from excel.

using the code below
excel_config = ExcelConfiguration()
excel_reader = ExcelReader(excel_config)
entities = excel_reader.parse_bulk_entities(file_name)
results = client.upload_entities(entities)

@APOTUC1B thank you for sharing the excel file, I think there's a misunderstanding in how Apache Atlas / Purview accepts attributes.

Atlas has a type definition which defines the attributes you are allowed to record. Here is the (abbreviated) type definition for an rdbms_column which you are using. Notice how it does not include the attributes that you created such as Precision(numeric), DefaultVal, and Format_Requirements.

If you want to use those specific attributes, you'll need to either:

  1. Add them as custom attributes (not visible in the Purview UI at this time)
  2. Add them as business metadata (visible in the UI, not supported in the Excel upload at this time as per #222 )
  3. Create a custom rdbms_column type (e.g. custom_rdbms_column with a type definition that includes these attributes (see more here)

You are only seeing five attributes being recorded because those five attributes are on the type definition and the Atlas engine knows how to handle them. The rest are silently ignored by Apache Atlas / Purview.

I hope that helps!

{
      "category": "ENTITY",
      "name": "rdbms_column",
      "description": "a column in an rdbms table",
      "typeVersion": "1.4",
      "serviceType": "rdbms",
      "options": {
        "schemaAttributes": "[\"name\", \"description\", \"owner\", \"data_type\", \"comment\", \" isPrimaryKey\", \" isNullable\"]"
      },
      "lastModifiedTS": "3",
      "attributeDefs": [
        {
          "name": "data_type",
          "typeName": "string",
          "isOptional": true,
          "cardinality": "SINGLE"
        },
        {
          "name": "length",
          "typeName": "int",
          "isOptional": true,
          "cardinality": "SINGLE"
        },
        {
          "name": "default_value",
          "typeName": "string",
          "isOptional": true,
          "cardinality": "SINGLE"
        },
        {
          "name": "comment",
          "typeName": "string",
          "isOptional": true,
          "cardinality": "SINGLE"
        },
        {
          "name": "isNullable",
          "typeName": "boolean",
          "isOptional": true,
          "cardinality": "SINGLE"
        },
        {
          "name": "isPrimaryKey",
          "typeName": "boolean",
          "isOptional": true,
          "cardinality": "SINGLE"
        }
      ],
      "superTypes": [
        "DataSet"
      ],
      "subTypes": [],
    }

excel_config = ExcelConfiguration()
excel_reader = ExcelReader(excel_config)
entities = excel_reader.parse_bulk_entities(file_name)
print(entities)

i get all these uploaded from my local run print(entities)
'max_length': 70, 'Precision(numeric)': 50, 'DefaultVal': 40, 'Format_Requirements': 'test7', 'XML_Tag': 'test7', 'Read-only_Y/N': 'Y', 'Required_Y/N': 'Y'

when i call the same from airflow it only does for first 5 print(entities)
'max_length': 70, 'Precision(numeric)': 50, 'DefaultVal': 40, 'Format_Requirements': 'test7', 'XML_Tag': 'test7'

i get all these uploaded from my local run print(entities) 'max_length': 70, 'Precision(numeric)': 50, 'DefaultVal': 40, 'Format_Requirements': 'test7', 'XML_Tag': 'test7', 'Read-only_Y/N': 'Y', 'Required_Y/N': 'Y'

when i call the same from airflow it only does for first 5 print(entities) 'max_length': 70, 'Precision(numeric)': 50, 'DefaultVal': 40, 'Format_Requirements': 'test7', 'XML_Tag': 'test7'

So you're running this through an Airflow job? Can you confirm the airflow job is pointing to the same excel file?

Thank you for confirming it is working locally!

Can you confirm the airflow job is pointing to the same excel file?

Yes same file . I also ran the code on Airflow scheduler and it works fine and uploads all the custom attributes perfectly . only when i run from airflow dag it limits.

from local and airflow scheduler

excel_config = ExcelConfiguration()
excel_reader = ExcelReader(excel_config)
entities = excel_reader.parse_bulk_entities(file_name)
print(entities)

'entities': [{'typeName': 'rdbms_instance', 'guid': '-1001', 'attributes': {'rdbms_type': 'Sql Server', 'description': 'dataset', 'name': 'MSSQLSERVER', 'qualifiedName': 'MSSQLSERVER@das02.frst.corp:1433'}, 'classifications': [{'typeName': 'dataset', 'entityStatus': 'ACTIVE', 'propagate': False, 'removePropagationsOnEntityDelete': False, 'validityPeriods': [], 'attributes': {}}], 'relationshipAttributes': {}}, {'typeName': 'rdbms_db', 'guid': '-1002', 'attributes': {'name': 'dataset', 'qualifiedName': 'dataset@das02.frst.corp:1433'}, 'classifications': [{'typeName': 'dataset', 'entityStatus': 'ACTIVE', 'propagate': False, 'removePropagationsOnEntityDelete': False, 'validityPeriods': [], 'attributes': {}}], 'relationshipAttributes': {'instance': {'typeName': 'rdbms_instance', 'guid': '-1001', 'qualifiedName': 'MSSQLSERVER@das02.frst.corp:1433'}}}, {'typeName': 'rdbms_table', 'guid': '-1003', 'attributes': {'name': 'dailyvolumes', 'qualifiedName': 'dataset.dbo.dailyvolumes@das02.frst.corp:1433'}, 'classifications': [{'typeName': 'dataset', 'entityStatus': 'ACTIVE', 'propagate': False, 'removePropagationsOnEntityDelete': False, 'validityPeriods': [], 'attributes': {}}], 'relationshipAttributes': {'db': {'typeName': 'rdbms_db', 'guid': '-1002', 'qualifiedName': 'dataset@das02.frst.corp:1433'}}}, {'typeName': 'rdbms_column', 'guid': '-1004', 'attributes': {'type': 'int', 'description': 'Indicates charge off within 18 years', 'max_length': 70, 'Precision(numeric)': 50, 'DefaultVal': 40, 'Format_Requirements': 'test7', 'XML_Tag': 'test7', 'Read-only_Y/N': 'Y', 'Required_Y/N': 'Y', 'owner': 'new2', 'name': 'charged_off_24month_inda20', 'qualifiedName': 'dataset.dbo.dailyvolumes.charged_off_24month_inda20@das02.frst.corp:1433'}, 'classifications': [{'typeName': 'dataset', 'entityStatus': 'ACTIVE', 'propagate': False, 'removePropagationsOnEntityDelete': False, 'validityPeriods': [], 'attributes': {}}], 'customAttributes': {'Business_Calculation': "When dailyvolumes's external_status is in ('F','I','Z','B') and STATUS_REASON_CODE is not 88 and charge_off_date is with 12 years of process date and BALANCE_CURRENT >15 and charge_off_amount is not 0 then 1 else 0", 'Datasource': 'sqlserver', 'CategoricalElements': 'test2', 'Data': 'test3', 'col1': 'test7', 'Level': 22}, 'relationshipAttributes': {'table': {'typeName': 'rdbms_table', 'guid': '-1003', 'qualifiedName': 'dataset.dbo.dailyvolumes@das02.frst.corp:1433'}, 'db': {'typeName': 'rdbms_db', 'guid': '-1002', 'qualifiedName': 'dataset@das02.frst.corp:1433'}}}, {'typeName': 'rdbms_column', 'guid': '-1005', 'attributes': {'type': 'bit', 'description': 'Indicates if the account is charged off', 'max_length': 80, 'Precision(numeric)': 60, 'DefaultVal': 70, 'Format_Requirements': 'test', 'XML_Tag': 'test', 'Read-only_Y/N': 'N', 'Required_Y/N': 'N', 'owner': 'new7', 'name': 'chargeoff_inda20', 'qualifiedName': 'dataset.dbo.dailyvolumes.chargeoff_inda20@das02.frst.corp:1433'}, 'classifications': [{'typeName': 'dataset', 'entityStatus': 'ACTIVE', 'propagate': False, 'removePropagationsOnEntityDelete': False, 'validityPeriods': [], 'attributes': {}}], 'customAttributes': {'Business_Calculation': "When CDT_COFF_TODAY_COUNT from is 14 then 15 else 1 from dataset's dailyvolumes table", 'Datasource': 'oracle', 'CategoricalElements': 'test3', 'Data': 'test3', 'col1': 'test8', 'Level': 23}, 'relationshipAttributes': {'table': {'typeName': 'rdbms_table', 'guid': '-1003', 'qualifiedName': 'dataset.dbo.dailyvolumes@das02.frst.corp:1433'}, 'db': {'typeName': 'rdbms_db', 'guid': '-1002', 'qualifiedName': 'dataset@das02.frst.corp:1433'}}}]}

from airflow dag

excel_config = ExcelConfiguration()
excel_reader = ExcelReader(excel_config)
entities = excel_reader.parse_bulk_entities(file_name)
print(entities)
{'entities': [{'typeName': 'rdbms_instance', 'guid': '-1001', 'attributes': {'rdbms_type': 'Sql Server', 'description': 'dataset', 'name': 'MSSQLSERVER', 'qualifiedName': 'MSSQLSERVER@das02.frst.corp:1433'}, 'classifications': [{'typeName': 'dataset', 'entityStatus': 'ACTIVE', 'propagate': False, 'removePropagationsOnEntityDelete': False, 'validityPeriods': [], 'attributes': {}}], 'relationshipAttributes': {}}, {'typeName': 'rdbms_db', 'guid': '-1002', 'attributes': {'name': 'dataset', 'qualifiedName': 'dataset@das02.frst.corp:1433'}, 'classifications': [{'typeName': 'dataset', 'entityStatus': 'ACTIVE', 'propagate': False, 'removePropagationsOnEntityDelete': False, 'validityPeriods': [], 'attributes': {}}], 'relationshipAttributes': {'instance': {'typeName': 'rdbms_instance', 'guid': '-1001', 'qualifiedName': 'MSSQLSERVER@das02.frst.corp:1433'}}}, {'typeName': 'rdbms_table', 'guid': '-1003', 'attributes': {'name': 'dailyvolumes', 'qualifiedName': 'dataset.dbo.dailyvolumes@das02.frst.corp:1433'}, 'classifications': [{'typeName': 'dataset', 'entityStatus': 'ACTIVE', 'propagate': False, 'removePropagationsOnEntityDelete': False, 'validityPeriods': [], 'attributes': {}}], 'relationshipAttributes': {'db': {'typeName': 'rdbms_db', 'guid': '-1002', 'qualifiedName': 'dataset@das02.frst.corp:1433'}}}, {'typeName': 'rdbms_column', 'guid': '-1004', 'attributes': {'type': 'int', 'description': 'Indicates charge off within 15 years', 'max_length': 70, 'Precision(numeric)': 50, 'DefaultVal': 40, 'Format_Requirements': 'test7', 'XML_Tag': 'test7', 'name': 'charged_off_24month_inda80', 'qualifiedName': 'dataset.dbo.dailyvolumes.charged_off_24month_inda80@das02.frst.corp:1433'}, 'classifications': [{'typeName': 'dataset', 'entityStatus': 'ACTIVE', 'propagate': False, 'removePropagationsOnEntityDelete': False, 'validityPeriods': [], 'attributes': {}}], 'relationshipAttributes': {'table': {'typeName': 'rdbms_table', 'guid': '-1003', 'qualifiedName': 'dataset.dbo.dailyvolumes@das02.frst.corp:1433'}, 'db': {'typeName': 'rdbms_db', 'guid': '-1002', 'qualifiedName': 'dataset@das02.frst.corp:1433'}}}, {'typeName': 'rdbms_column', 'guid': '-1005', 'attributes': {'type': 'bit', 'description': 'Indicates if the account is charged off', 'max_length': 80, 'Precision(numeric)': 60, 'DefaultVal': 70, 'Format_Requirements': 'test', 'XML_Tag': 'test', 'name': 'chargeoff_inda80', 'qualifiedName': 'dataset.dbo.dailyvolumes.chargeoff_inda80@das02.frst.corp:1433'}, 'classifications': [{'typeName': 'dataset', 'entityStatus': 'ACTIVE', 'propagate': False, 'removePropagationsOnEntityDelete': False, 'validityPeriods': [], 'attributes': {}}], 'relationshipAttributes': {'table': {'typeName': 'rdbms_table', 'guid': '-1003', 'qualifiedName': 'dataset.dbo.dailyvolumes@das02.frst.corp:1433'}, 'db': {'typeName': 'rdbms_db', 'guid': '-1002', 'qualifiedName': 'dataset@das02.frst.corp:1433'}}}]}

Yikes! I'm not certain what is going on here.

The only things you can do is compare against the airflow dag environment and be sure everything matches locally?

  • Version of pyapacheatlas
  • Version of openpyxl (which reads the excel file)
  • Version of requests (shouldn't affect it but just double check)
  • The Excel file is the same (just triple checking 😅)

I'm noticing that the fields missing have forward slashes.

Could you try renaming the columns and not using a forward slash?

Could you try reordering the columns to see if it's picking up the first five consistently?

local

atlasclient 1.0.0
pyapacheatlas 0.13.1
openpyxl 3.0.9

Airflow
atlasclient 1.0.0
pyapacheatlas 0.13.1
openpyxl 3.0.9

Excel file is same

Aslo noticed that
results = client.upload_entities(entities)
fails if the excel file is having more than 5 added attributes and 5 custom attributes when the code is invoked from airflow but works well from local

found the issue please ignore