microsoft/data-formulator

Create new data loaders to different resources

Chenglong-MS opened this issue · 14 comments

Hi devs and users,

We have recently extended Data Formulator with the ability to connect directly to external data sources via the ExternalDataLoader class. You can extend this Python class to load data from an external data source. The frontend automatically discovers the available loader classes and provides the ability to complete user queries (for loading data views).

(demo video: external_data_loader.mov)

Instructions for extending the data loader, along with example implementations for MySQL and Azure Data Explorer, are provided here: https://github.com/microsoft/data-formulator/tree/main/py-src/data_formulator/data_loader.
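For a rough idea of what extending the loader class involves, here is a minimal sketch. The method names (`list_tables`, `ingest_data`) and the base-class shape are illustrative assumptions, not the actual interface; see the linked directory for the real one.

```python
# Illustrative sketch only: the real ExternalDataLoader interface lives in
# py-src/data_formulator/data_loader, and its method names may differ.
from abc import ABC, abstractmethod
from typing import Any


class ExternalDataLoader(ABC):
    """Hypothetical base class: connect to a source and ingest tables."""

    @abstractmethod
    def list_tables(self) -> list[str]: ...

    @abstractmethod
    def ingest_data(self, table: str) -> list[dict[str, Any]]: ...


class InMemoryLoader(ExternalDataLoader):
    """Toy loader backed by a dict, standing in for a real database."""

    def __init__(self, tables: dict[str, list[dict[str, Any]]]):
        self.tables = tables

    def list_tables(self) -> list[str]:
        return sorted(self.tables)

    def ingest_data(self, table: str) -> list[dict[str, Any]]:
        return self.tables[table]


loader = InMemoryLoader({"sales": [{"region": "EU", "total": 10}]})
print(loader.list_tables())
```

A real loader would replace the dict with a database connection and map query results into the table format Data Formulator expects.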

We'd like to use this issue to collect the data sources you would like to see supported, and hopefully some devs will be able to add more data loaders for popular sources (Google BigQuery and Amazon S3, for example).

Hi, we tried the new release to connect to a MySQL database and it worked. One thing we noticed is that getting real-time data from the database is challenging. Do you have any input on this?

@karthikadevaraj interesting use case. A direct way to support this would be adding a button/function that refreshes the dataset and automatically updates views from the different sources. This would work but may not be very efficient.

A potentially better way (though it requires a bit more dev work) is to run queries directly against the external data source (as opposed to using duckdb as the intermediary). This needs a new abstract class design for the external data loader, and changes to some of the querying logic in table-routes.py.
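A tiny sketch of the pass-through idea, with sqlite3 standing in for the remote engine (the class name and `run_query` method are hypothetical, not part of the current codebase): because no local copy is kept, every call sees the source's current state.

```python
import sqlite3  # stands in here for a remote database engine


class DirectQueryLoader:
    """Hypothetical pass-through loader: runs SQL on the source itself,
    so each view render sees live data instead of a duckdb snapshot."""

    def __init__(self, conn):
        self.conn = conn

    def run_query(self, sql: str) -> list[tuple]:
        # No ingestion step: the source executes the query at request time.
        return self.conn.execute(sql).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t(x INT)")
conn.execute("INSERT INTO t VALUES (1), (2)")
loader = DirectQueryLoader(conn)
print(loader.run_query("SELECT SUM(x) FROM t"))
```

The trade-off is that chart interactions now generate load on the external source, and the SQL dialect has to match each backend rather than duckdb's.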

Hi @Chenglong-MS , thanks for sharing the inputs.

It would be great to have it scrape data from papers directly: all tables, lists, graphs, etc. That would save us a lot of time!

It would be a good idea to add this parameterization to the SQL Server connection in order to support more modern database engines, for example: TrustServerCertificate=yes;Connection Timeout=30;Encrypt=no;
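To illustrate, here is one way such parameters could be folded into an ODBC-style connection string. The driver name, server, and credentials are placeholders; the actual connect call is left commented out since it requires pyodbc and a live server.

```python
# Sketch: building a SQL Server (pyodbc-style) connection string that
# includes the extra parameters suggested above. All values are placeholders.
params = {
    "DRIVER": "{ODBC Driver 18 for SQL Server}",
    "SERVER": "myserver.example.com",
    "DATABASE": "mydb",
    "UID": "user",
    "PWD": "secret",
    "TrustServerCertificate": "yes",
    "Connection Timeout": "30",
    "Encrypt": "no",
}
conn_str = ";".join(f"{k}={v}" for k, v in params.items())
print(conn_str)

# import pyodbc
# conn = pyodbc.connect(conn_str)  # uncomment with pyodbc and a real server
```

Exposing these as optional fields in the loader's parameter list would let users tune them per connection instead of hard-coding them.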

Please support:

  • loading of local parquet files
  • loading of cloud parquet files from Azure Blob Storage (via container name, SAS token, and account name)
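A sketch of both requests: composing a SAS-token blob URL for the cloud case, with the duckdb reads left commented since they need the duckdb package (and, for the cloud read, its httpfs support) plus real credentials. Account, container, and token values are placeholders.

```python
# Placeholders: substitute your own account, container, blob, and SAS token.
account, container, blob = "myaccount", "mydata", "sales.parquet"
sas_token = "sv=2024-01-01&sig=PLACEHOLDER"
blob_url = f"https://{account}.blob.core.windows.net/{container}/{blob}?{sas_token}"
print(blob_url)

# import duckdb
# duckdb.sql("SELECT * FROM 'local_file.parquet'")         # local parquet
# duckdb.sql(f"SELECT * FROM read_parquet('{blob_url}')")  # cloud parquet
```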

@Chenglong-MS I have added an initial PR. This should work well for S3 connections.

Thanks @slackroo for adding the S3 data loader. I have also added an Azure Blob reader (for parquet, json, and csv files). Code agents have recently made adding a new data loader quite easy!

Just added a PostgreSQL loader in pull request #163.

It would be great to make this work with data in a Power BI dataset, generating DAX queries to do the analysis.