wchatx/direct-access-py

Requesting Example For Use-Case Scenario | Multi-Processing into SQL Server Database

Closed this issue · 3 comments

A use-case scenario I'm interested in would be some form of the multi-processing example where multiple API endpoints can be queried and then loaded into a SQL server database on a regular basis. An added bonus would be to utilize the Enverus Developer API Best Practices to incrementally update a database by leveraging the "UpdatedDate" and "DeletedDate" fields, but this isn't a critical feature at this point.

Currently, I am using the multi-processing example to download .CSV files from two different API endpoints (Rigs and Rig Analytics) on a daily basis, and I have written a separate Python script that uses the pyodbc package to truncate the target SQL Server tables and then load the data from each .CSV file into its respective table and columns. It's clunky, but it works. My concern is that if I begin to add more endpoints, the inefficiency of my approach will come back to haunt me.
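
A minimal sketch of that truncate-and-reload pattern with pyodbc (not the actual script; the connection string, table names, file paths, and columns are placeholders):

```python
import csv
import pyodbc

# Placeholder connection string; adjust driver/server/database as needed.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes;"
)
cursor = conn.cursor()

# Truncate each target table, then insert the CSV rows with executemany.
for table, path in [("dbo.Rigs", "rigs.csv"), ("dbo.RigsAnalytics", "rigs-analytics.csv")]:
    cursor.execute(f"TRUNCATE TABLE {table}")
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)  # assumes the first row holds the column names
        placeholders = ", ".join("?" for _ in header)
        insert_sql = f"INSERT INTO {table} ({', '.join(header)}) VALUES ({placeholders})"
        cursor.executemany(insert_sql, list(reader))
    conn.commit()
```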

Also, I want to mention that I am an amateur Python user, so I'm open to any approach that is most sensible and greatly appreciate what is being provided with the direct-access package. Please let me know if I can provide any further information. Thank you very much!

@desouzapalooza I apologize this issue has sat idle for as long as it has. I intend to add a basic example of keeping tables in mssql in sync with the Developer API but wanted to check in and see how your existing process is going.

Given how quickly mssql can bulk load CSVs, I think your current approach is probably quite efficient (I'm not sure if you're actually bulk loading or using pyodbc to loop through the rows and insert in batches, though).
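
For reference, a minimal sketch of letting SQL Server do the bulk load itself via BULK INSERT (table name, file path, and connection details are placeholders; FORMAT = 'CSV' requires SQL Server 2017+):

```python
import pyodbc

# Placeholder connection details; the CSV path must be readable by the
# SQL Server service account, since the server (not Python) reads the file.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes;",
    autocommit=True,
)
cursor = conn.cursor()

# FORMAT = 'CSV' handles quoted fields and requires SQL Server 2017 or later.
cursor.execute("""
    BULK INSERT dbo.Rigs
    FROM 'C:\\data\\rigs.csv'
    WITH (FORMAT = 'CSV', FIRSTROW = 2, TABLOCK);
""")
```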

A potential complication of an incremental approach is the need for an upsert pattern, which I don't think mssql supports natively. You can use a MERGE statement, but I've read it can be inefficient. This is what I intend to add as an example, and I'll benchmark it when I do.
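
Roughly, the upsert would look something like the following sketch: load new/changed rows into a staging table, then MERGE into the target keyed on an ID column. The staging table, key, and column names here are placeholders, and the MERGE is unbenchmarked.

```python
import pyodbc

# Placeholder connection details.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes;"
)
cursor = conn.cursor()

# Assumes dbo.Rigs_staging has already been loaded with the new/updated rows.
cursor.execute("""
    MERGE dbo.Rigs AS target
    USING dbo.Rigs_staging AS source
        ON target.RigID = source.RigID
    WHEN MATCHED THEN
        UPDATE SET target.RigName = source.RigName,
                   target.UpdatedDate = source.UpdatedDate
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (RigID, RigName, UpdatedDate)
        VALUES (source.RigID, source.RigName, source.UpdatedDate);
""")
conn.commit()
```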

I'll keep this issue open until the example has been added.

@wchatx No need to apologize! I look at it as a bonus that you responded.

Since I submitted my inquiry, I researched this approach and learned that bulk loading a JSON file into a SQL Server database was recommended over bulk loading CSVs. Please note that I am a beginner Python user and not a programmer, so I have been teaching myself how to leverage Python for automating data feeds. I am always open to suggestions since I am constantly learning.

The Python script I have developed over the past 3 months uses the Multiprocessing module within the DirectAccessV2 package to output a JSON file for each Enverus API endpoint. Then the script executes a stored procedure within SQL Server that loads the JSON files into their respective database tables using the OPENROWSET bulk load feature. When working with the full "Rigs-Analytics" data set, this refresh would take around 30 minutes with a CSV file whereas the JSON/OPENROWSET approach only took 1-2 minutes. Granted, I could have been doing something wrong with the CSV (looping through rows with pyodbc), but it was a no-brainer to move forward with JSON since it worked so well.
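
For anyone following along, the OPENROWSET/OPENJSON pattern looks roughly like this sketch (not the actual stored procedure; the file path, table, and column definitions are placeholders, and the real field names come from the endpoint):

```python
import pyodbc

# Placeholder connection details; the JSON path must be readable by SQL Server.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes;"
)
cursor = conn.cursor()

cursor.execute("TRUNCATE TABLE dbo.RigsAnalytics")

# Read the JSON file as one blob, shred it with OPENJSON, and insert the rows.
cursor.execute("""
    INSERT INTO dbo.RigsAnalytics (RigID, RigName, UpdatedDate)
    SELECT RigID, RigName, UpdatedDate
    FROM OPENROWSET(BULK 'C:\\data\\rigs-analytics.json', SINGLE_CLOB) AS j
    CROSS APPLY OPENJSON(j.BulkColumn)
    WITH (
        RigID       INT           '$.RigID',
        RigName     NVARCHAR(255) '$.RigName',
        UpdatedDate DATETIME2     '$.UpdatedDate'
    );
""")
conn.commit()
```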

At this time, I am refreshing the full data set for every API endpoint on each run (i.e. download JSON files, truncate database tables, load JSON files into database), which works fine for endpoints that don't have large data sets. Now that I want to start working with API endpoints that do have large data sets, I need to streamline my script so that it utilizes the CreatedDate, DeletedDate, and UpdatedDate fields within the Enverus API to perform data refreshes more efficiently, as recommended in the Enverus Developer API documentation.
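
An incremental pull might look something like the sketch below; the updateddate filter expression is based on a reading of the Developer API docs and should be verified, and the credentials, dataset name, and saved watermark are placeholders.

```python
from directaccess import DirectAccessV2

# Persisted from the previous successful refresh (placeholder value).
last_refresh = "2020-06-01T00:00:00"

# Client usage follows the direct-access-py README; credentials are placeholders.
d2 = DirectAccessV2(
    api_key="YOUR_API_KEY",
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
)

# Request only rows updated since the last refresh; verify the exact filter
# syntax (e.g. ge(...)) against the Enverus Developer API documentation.
updated_rows = [
    row for row in d2.query("rigs", pagesize=10000, updateddate=f"ge({last_refresh})")
]

# updated_rows can then be written to JSON and MERGEd into the target table,
# while rows with a non-null DeletedDate are removed from it.
```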

I guarantee a more experienced Python user like yourself could find an inefficiency in my script or a better approach, so I'm all ears if you have any suggestions. Let me know if I can provide any further information on this topic. Thanks!

@wchatx A simple request is to port multi_processing.py to output JSON instead of CSV. I typically find issues with CSV data (commas within a string) that impede a clean import into Excel via Power Query. I know enough coding to fire up an editor, run a Python script, and make minor changes, but importing a new library and figuring out the new functions is not my forte. I imagine it wouldn't take you more than a couple of minutes. Thanks.
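
Something along these lines might be all that's needed, though this is only a sketch of the idea rather than the actual multi_processing.py example (credentials and dataset names are placeholders):

```python
import json
from multiprocessing import Pool

from directaccess import DirectAccessV2


def write_dataset_to_json(dataset):
    # Client usage follows the direct-access-py README; credentials are placeholders.
    d2 = DirectAccessV2(
        api_key="YOUR_API_KEY",
        client_id="YOUR_CLIENT_ID",
        client_secret="YOUR_CLIENT_SECRET",
    )
    # Write the full result set for this dataset to <dataset>.json as a JSON array.
    with open(f"{dataset}.json", "w") as f:
        json.dump(list(d2.query(dataset, pagesize=10000)), f, default=str)


if __name__ == "__main__":
    datasets = ["rigs", "rigs-analytics"]  # placeholder dataset names
    with Pool(processes=len(datasets)) as pool:
        pool.map(write_dataset_to_json, datasets)
```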