Design question
Hey @koltyakov, I'm starting to kick the tires on this one so I can open an initial version. A few SharePoint and gosip related questions. What would make sense for CloudQuery to sync? As far as I understand the SharePoint model and APIs, you need to specify what you want to select and which fields; is that correct? In that case I can make it a required configuration in addition to auth, so the user can configure what they want to sync/fetch.
Thanks!
Yevgeny
Hey @yevgenypats, thanks for kicking the tires!
The SharePoint API is relatively huge, with different entities and hierarchical structures. However, the most common artifacts that deserve syncing to reporting-friendly destinations are lists. Lists are the heart of customizations where business data is involved. A list is a sort of table (it's actually a bit more complex under the hood, but the most oversimplified analogy is a table). The structure of a list is defined in a custom manner. I'd say that in a sync scenario it would be a definition of which lists to sync and which fields in each list to touch. Maybe the tricky part would be the destination, as it's not a static structure but always different, based on each list's structure. Some core fields are common to all lists, such as ID, version, created by/at, modified by/at, and title; others are defined by a developer or a power user. Also, the data can be stored in different sites (workspaces).
So, assuming we start with a bare minimal yet already useful functionality:
- Define a site to fetch data from
- Define lists to pull from
- Each list can have an array of fields to provide explicitly, plus some core fields which are always included (it's very common that the set of fields is extended over time due to user needs)

*I'm omitting lots of extra details, such as content types, lookups, and storing the same entity types in different sites/lists, so as not to overwhelm the initial thinking. A rough sketch of such a configuration follows below.
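To make that minimal shape concrete, here's a hypothetical YAML sketch; all key names and values are purely illustrative, not an actual spec format:

```yaml
# hypothetical sync configuration; all keys and values are illustrative
site: https://contoso.sharepoint.com/sites/workspace
lists:
  - name: Lists/Projects
    fields: [Title, Status, OwnerId]
  - name: Lists/Tasks
    fields: [Title, DueDate, AssignedToId]
```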
Do you think it's feasible for the destination to be universal, or automatically structured with CloudQuery approaches, based on dynamic input configuration?
Thanks for the detailed guide! I think that makes sense, and yes, CloudQuery has a notion of dynamic tables and schemas that we can produce at runtime given a response from the server.
Can you point me to an example of what the API looks like for fetching from a defined set of lists, and how to parse the schema returned from those APIs?
Please check this sample https://github.com/koltyakov/sp-go-lists-sample
I'll DM you the creds file.
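For reference, the core of such a fetch with gosip boils down to something like this. This is a rough sketch, not the sample's exact code: the auth strategy, config path, and list URI are placeholders, and the `TypeAsString` property on the field info is an assumption on my side.

```go
package main

import (
	"fmt"
	"log"

	"github.com/koltyakov/gosip"
	"github.com/koltyakov/gosip/api"
	strategy "github.com/koltyakov/gosip/auth/saml"
)

func main() {
	// Auth from a creds file (path and auth strategy are placeholders)
	auth := &strategy.AuthCnfg{}
	if err := auth.ReadConfig("./config/private.json"); err != nil {
		log.Fatal(err)
	}
	sp := api.NewSP(&gosip.SPClient{AuthCnfg: auth})

	list := sp.Web().GetList("Lists/MyList") // placeholder list URI

	// Schema side: the list's field definitions
	fieldsResp, err := list.Fields().Get()
	if err != nil {
		log.Fatal(err)
	}
	for _, f := range fieldsResp.Data() {
		info := f.Data()
		fmt.Println(info.InternalName, info.TypeAsString) // TypeAsString assumed
	}

	// Data side: items, with an explicit projection
	itemsResp, err := list.Items().Select("Id,Title,Created").Top(100).Get()
	if err != nil {
		log.Fatal(err)
	}
	for _, item := range itemsResp.Data() {
		fmt.Println(item.ToMap())
	}
}
```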
Thank you! This was super helpful and the creds all worked!
I have the draft that I'm working on right now here - #2.
Right now I have a question: is there somewhere I can get the list of all possible types in the SharePoint model, so I can convert them to CloudQuery types and take advantage of the CQ type system? (We don't need to handle all types, as we can default to string, but it would still be very useful to convert as many types as possible to the right type so they can be reflected appropriately in any database.)
That's really cool @yevgenypats
I'm away from my computer today. I'll send the SP list field types and a payload sample tomorrow.
@yevgenypats please check the list of the most common field types in SharePoint lists below. Many could be normalized to the same types in the destination; SharePoint differentiates types due to formatting and historical aspects. Anyway, I'll catch up with such specifics when applying changes in the plugin according to the platform specifics.
SharePoint Field Types:
Field Data Type | Variation | Payload Sample
---|---|---
Text | Single | "FieldTextSingle"
Text | Multiline | "FieldTextMulti\nFieldTextMulti\nFieldTextMulti"
Number | Int | 123
Number | Float | 123.45
Currency | | 123.99
Yes/No (bool) | | true / false
Person or group | Single | 10 (an ID from the user information list)
Person or group | Multi | [10, 12] (IDs from the user information list)
Date time | | "2023-02-20T15:58:10Z" (ISO string)
Choice | Single | "Choice 1"
Choice | Multi | ["Choice 1", "Choice 2"]
Hyperlink | | { "Description": "GitHub", "Url": "https://github.com" }
Managed Metadata | Single | { "Label": "1", "TermGuid": "2e2b8c97-b1fc-4ad5-a17d-562b74cd9903", "WssId": 1 }
Managed Metadata | Multi | array of objects with the same structure as the single variant
Lookup | Single | 1 (ID of an item from a lookup list)
Lookup | Multi | [1, 2] (IDs of items from a lookup list)
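If it helps, here's a rough Go sketch of how that table could map onto CloudQuery column types. It assumes the field's `TypeAsString` value is used as the discriminator and that the CQ SDK exposes `schema.Type*` constants like these; both the import path and the constant names are assumptions and may differ:

```go
package sharepoint

import "github.com/cloudquery/plugin-sdk/schema" // import path assumed

// fieldToCQType maps a SharePoint field's TypeAsString to a CloudQuery
// column type, defaulting to string as discussed above. Sketch only.
func fieldToCQType(typeAsString string) schema.ValueType {
	switch typeAsString {
	case "Integer", "Counter":
		return schema.TypeInt
	case "Number", "Currency":
		return schema.TypeFloat
	case "Boolean":
		return schema.TypeBool
	case "DateTime":
		return schema.TypeTimestamp
	case "Text", "Note", "Choice":
		return schema.TypeString
	case "MultiChoice":
		return schema.TypeStringArray
	case "User", "Lookup":
		return schema.TypeInt // numeric ID pointing into another list
	case "UserMulti", "LookupMulti":
		return schema.TypeIntArray
	case "URL", "TaxonomyFieldType", "TaxonomyFieldTypeMulti":
		return schema.TypeJSON // structured payloads per the table above
	default:
		return schema.TypeString // safe fallback, as noted above
	}
}
```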
@koltyakov Hi, I've been trying to implement the data sync part... I have a question on how to get the `item.ToMap()` keys from the POV of `api.FieldInfo`. I'm creating `schema.Column`s for each table, using `FieldInfo.Title` for the column names (lower cased and normalized); is that the correct approach? Some titles clash, so I sometimes have to incrementally rename them (if there's already a `name`, the second column becomes `name_2`, the third `name_3`, and so on), but I suspect I might be on the wrong path... Should I use `InternalName`, or some other column? Ideally I would be OK with getting a value for each `field.ID` (so uniqueness is guaranteed), but `ToMap()` seems to return names...
Hi @disq, `InternalName` is the thing to use: internal names are unique within a list and immutable. Display names (titles) can be changed anytime in the UI and are not generally something to rely on.
@koltyakov when I get the fields, it returns ~80 columns. Some of them don't exist in the data, but other columns appear instead. So I think I'm still missing something?
Here's a couple of log lines from processing the list `ListA`:
```
12:08PM DBG item keys keys=["AttachmentFiles","Attachments","AuthorId","ComplianceAssetId","ContentType","ContentTypeId","Created","EditorId","Field1","Field2","Field3","Field4","FieldValuesAsHtml","FieldValuesAsText","FieldValuesForEdit","File","FileSystemObjectType","FirstUniqueAncestorSecurableObject","Folder","GUID","GetDlpPolicyTip","ID","Id","LikedByInformation","Modified","OData__UIVersionString","ParentList","Properties","RoleAssignments","ServerRedirectedEmbedUri","ServerRedirectedEmbedUrl","Title","Versions","__metadata"] module=sharepoint-src table=sharepoint_list_a
12:08PM WRN missing columns in result missing_columns=["AccessPolicy","AppAuthor","AppEditor","Author","BaseName","ContentVersion","Created_x0020_Date","DocIcon","Edit","Editor","EncodedAbsUrl","FSObjType","FileDirRef","FileLeafRef","FileRef","File_x0020_Type","FolderChildCount","HTML_x0020_File_x0020_Type","InstanceID","ItemChildCount","Last_x0020_Modified","LinkFilename","LinkFilename2","LinkFilenameNoMenu","LinkTitle","LinkTitle2","LinkTitleNoMenu","MetaInfo","NoExecute","Order","OriginatorId","ParentUniqueId","PermMask","PrincipalCount","ProgId","Restricted","SMLastModifiedDate","SMTotalFileCount","SMTotalFileStreamSize","SMTotalSize","ScopeId","SelectTitle","ServerUrl","SortBehavior","SyncClientId","UniqueId","WorkflowInstanceID","WorkflowVersion","_CommentCount","_CommentFlags","_ComplianceFlags","_ComplianceTag","_ComplianceTagUserId","_ComplianceTagWrittenTime","_CopySource","_EditMenuTableEnd","_EditMenuTableStart","_EditMenuTableStart2","_HasCopyDestinations","_IsCurrentVersion","_IsRecord","_Level","_ModerationComments","_ModerationStatus","_UIVersion","_UIVersionString","_VirusInfo","_VirusStatus","_VirusVendorID","owshiddenversion"] module=sharepoint-src table=sharepoint_list_a
12:08PM WRN extra columns found in result extra_columns=["AttachmentFiles","AuthorId","EditorId","FieldValuesAsHtml","FieldValuesAsText","FieldValuesForEdit","File","FileSystemObjectType","FirstUniqueAncestorSecurableObject","Folder","GetDlpPolicyTip","Id","LikedByInformation","OData__UIVersionString","ParentList","Properties","RoleAssignments","ServerRedirectedEmbedUri","ServerRedirectedEmbedUrl","Versions","__metadata"] module=sharepoint-src table=sharepoint_list_a
```
The first log line shows which keys exist in the result. There's `AttachmentFiles`, `AuthorId`, `EditorId`, `FieldValuesAsHtml`, etc. as extra data (according to the last log line; I'm guessing some of these need to be ignored), but there's also no data for the columns listed in the second log line (e.g. no `AccessPolicy` column in the data).
Hi @disq,
There are many service props in a list's object model (and not only in lists). They are not always needed for a data/app scenario.
In the API there are two OData modifiers which allow limiting or extending the props in the response: `$select` and `$expand`.
By default the API doesn't return some props, as getting them can be a heavy operation with extra joins on the backend; when they're needed, they are explicitly requested via select/expand (projection).
On the other hand, by default there can also be too many columns returned which are not needed for viewing purposes. So it's very common, and good practice, to explicitly provide only the needed subset of fields in `$select`, just like one does in SQL.
In an ETL scenario, I would say that only selected fields should be synced to a destination, plus a few mandatory ones (even if omitted in the input settings), such as Id, Created, Modified, and maybe also AuthorId, EditorId, version, and FSObjType.
I hope this sheds some light.
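To illustrate with standard SharePoint REST endpoints (the site, list, and field names are placeholders):

```
# explicit projection, analogous to a SQL column list
GET https://contoso.sharepoint.com/sites/ws/_api/web/lists/getByTitle('Tasks')/items?$select=Id,Title,AuthorId,Created,Modified

# drilling into a lookup requires $expand alongside $select
GET https://contoso.sharepoint.com/sites/ws/_api/web/lists/getByTitle('Tasks')/items?$select=Id,Title,Author/Title&$expand=Author
```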
If we assume that the default plugin behavior would be selecting everything (`$select=*`), then we would definitely need to ignore some props. (However, I'd rather have users provide only what's needed, with an empty choice defaulting to `Title` only, plus the service fields we always drag along.) I can provide criteria for what to ignore (a hardcoded list of props + field-attribute based) a bit later today.
@koltyakov thanks for the replies, I was able to come up with a 'SELECTive' approach. Unfortunately it had to come with a lot of options to tweak:
```go
// Spec is the plugin configuration (the wrapping type name is illustrative).
type Spec struct {
	// Lists to fetch; if empty, all lists will be fetched
	Lists []string `json:"lists"`
	// ListFields is a map of list name to the list of fields to fetch; if empty for a list, DefaultFields will be fetched
	ListFields map[string][]string `json:"list_fields"`
	// DefaultFields are the fields to fetch if not specified in ListFields
	DefaultFields []string `json:"default_fields"`
	// IgnoreFields are the fields to always ignore
	IgnoreFields []string `json:"ignore_fields"`
	// FieldOverrides is a map of field name to type, used to override the detected type. If the field does not exist in the definitions, it will be forcibly added.
	FieldOverrides map[string]string `json:"field_overrides"`
	// pkColumn is the primary key column name, defaults to "Id"
	pkColumn string
}
```
- I've set `DefaultFields` to `Id, Created, Modified, Title, AuthorId, EditorId, FSObjType`
- Set the default `IgnoreFields` value to `__metadata`
- Set `FieldOverrides` to the `Integer` type for these 4 columns which don't come up in the field listing: `AuthorId, EditorId, Id, FSObjType`, and forced `Id` to always be there and be the primary key.
@disq my suggestion for the initial design:
Ignore the following fields unless specifically provided (when it comes to checking the /list/fields metadata); a predicate sketch follows the list:
- Fields with an internal name starting with "_"
- Hidden fields (i.e. keep only `Hidden != true`)
- Computed fields (i.e. keep only `FieldTypeKind != 12`)
- A hardcoded list of fields:
  - `ComplianceAssetId`, `Attachments`, `AppAuthor`, `AppEditor`, `ItemChildCount`, `FolderChildCount` (will amend over time)
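A minimal Go sketch of that filter, assuming gosip's `api.FieldInfo` exposes `InternalName`, `Hidden`, and `FieldTypeKind` as discussed above (`FieldTypeKind == 12` being Computed in SharePoint):

```go
package sharepoint

import (
	"strings"

	"github.com/koltyakov/gosip/api"
)

// Hardcoded ignore list from the suggestion above; expected to grow over time.
var ignoredFields = map[string]struct{}{
	"ComplianceAssetId": {},
	"Attachments":       {},
	"AppAuthor":         {},
	"AppEditor":         {},
	"ItemChildCount":    {},
	"FolderChildCount":  {},
}

// shouldSync reports whether a list field should be synced by default.
func shouldSync(f *api.FieldInfo) bool {
	if strings.HasPrefix(f.InternalName, "_") {
		return false // service fields
	}
	if f.Hidden {
		return false // hidden fields
	}
	if f.FieldTypeKind == 12 {
		return false // computed fields
	}
	_, ignored := ignoredFields[f.InternalName]
	return !ignored
}
```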
UPD: We commented almost simultaneously. :)
Forgot to mention: the API will also return some extras which are not fields, e.g. `__metadata` or other nested entities. I'd suggest ignoring all of these for now as well.
@koltyakov Thanks for the update... is there a way to determine if a field needs `$expand` or `$select`? After widening the field criteria, it seems like it's trying to select `FieldPersonOrGroupSingle` and fails. Or should I use `expand`?
An expand is the way to drill down, e.g. `?$select=Author/Title,Author/EMail&$expand=Author`.
Field types such as `SP.FieldUser` and `SP.FieldLookup` are lookups for values stored in other lists.
While it's possible to drill down with joins, it's limited and encourages a non-normalized sync. In an ETL scenario I'd encourage folks to sync simpler, normalized data arrays, then join the data by IDs in SQL/reporting.
Lookup fields have the following specific behavior: while a field has a name, e.g. "Department", the field name in a response would be "DepartmentId" (an `Id` suffix). It contains the numeric ID of the lookup entity item(s).
Lookup fields also can't be selected without the `Id` suffix, or without expand and the prop drill-down `/` syntax (e.g. `?$select=Department/Title&$expand=Department`).
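In gosip terms, that drill-down would look something like this (a sketch: `sp` is an `*api.SP` client as in the earlier snippet, and the list/field names are placeholders):

```go
// Expanding a lookup field: "Department" is a hypothetical lookup field.
items, err := sp.Web().GetList("Lists/Projects").Items().
	Select("Id,Title,DepartmentId,Department/Title").
	Expand("Department").
	Get()
if err != nil {
	log.Fatal(err)
}
for _, item := range items.Data() {
	fmt.Println(item.ToMap()) // Department comes back as a nested object
}
```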
I've now added `Lookup`/`User`/`LookupMulti`/`UserMulti` support, removed some options (no more `DefaultFields` or `FieldOverrides`) and it seems to work.
@yevgenypats @disq the initial version you've provided is a great starting point. I managed to run it locally and sync some lists to PostgreSQL and SQLite destinations.
I will work on designing the plugin specification to make it scalable and logical for SharePoint.
I've also created some other issue items I will work on, just for visibility.
Amazing @koltyakov, thanks for taking the lead on that and helping with the SharePoint expertise! Feel free to ping us with any questions and/or if help is needed around the SDK or in general.
We will also update our website early next week, adding a link to this SharePoint plugin under our list of plugins, for visibility and search.
One more thing you will need is to tag a `v1.0.0` so goreleaser runs and creates a release. This way the CloudQuery CLI will be able to download it when you specify `name: koltyakov/cq-source-sharepoint` for a source plugin.
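For reference, a sketch of what that source spec could look like; the exact key names are my assumption of the CLI config format of this period and may differ between CloudQuery versions:

```yaml
kind: source
spec:
  # community plugin pulled via the GitHub releases registry; key names assumed
  name: sharepoint
  path: koltyakov/cq-source-sharepoint
  registry: github
  version: "v1.0.0"
  tables: ["*"]
  destinations: ["postgresql"]
  spec:
    # plugin-specific settings (auth, lists, fields) go here
```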
@yevgenypats @disq, I added base documentation and applied some mostly minor changes to make the plugin ready for the first version release.
I will research incremental scenarios later on, for post-v1.0.0, as even now it's quite OK.
But we can do better: there is a way of triggering only on changed data, for lists of any size, without fetching everything. Using the changes API, it would be possible to sync what's changed in seconds, even for large lists (millions of items). Yet, I'd need to wrap my head around how CloudQuery deals with incremental strategies.
Also, I hope that the amendments to the spec will allow scaling up to other API object model entities, which might be handy when exporting to fast storages with ETL.
One design decision which is probably not very common for CloudQuery sources, but which I truly believe is better for SharePoint lands (as the platform can throttle naughty clients and also has too many extra objects not needed for the export), is to only grab explicitly listed entities, and the same for the properties.
@disq your initial code was extremely useful, I really appreciate your efforts; sorry if I pivoted or removed something, all decisions were hard and driven only by the platform aspects of what might work better with SharePoint.
Technically, we should be ready to release the first version next week. Once again thank you guys! I have not seen such pace and passion for a long time in the industry. You are fabulous!
Thank you for the detailed update and the warm words!!
One thing that should actually be available is grabbing explicitly listed entities. I think if you specify the entities under `tables: ["entity1", "entity2"]` (under the main source spec), this should sync only those entities (it should even support simple globs like `entity*`). If it doesn't work, that's probably a bug on our end.
> grabbing explicitly listed entities

I mean I amended the plugin logic to only fetch what's listed, and to not try all the things when nothing is provided.
Btw, I pushed the v1.0.0 tag and checked that the `github` plugin registry works.
> update on our website early next week under our list of plugins a link to this sharepoint plugin

Please let me know the approach for documenting the plugin page: should I craft a separate README.md with spec notes? I documented not only how to use the plugin, but also a quick guide for CloudQuery + the plugin, in the main README.md, which might be a bit redundant for experienced CloudQuery users.
I think one README in this repo with the spec would be a good approach for now. We will link directly to this GitHub repo from our plugins list.