vertica-python-webhdfs: A Python repository from alippai

The idea is based on the https://github.com/vertica/spark-connector library.

While this already can give you a huge speedup (depending on the number of nodes and query speed), using a similar solution with supporting Apache Arrow format would be a much better solution. Obviously using it's advanced features like REE and dictionary (as Vertica may have compatible internal concepts) would make the implementation the fastest and most ergonomic.

How it works:

Start a fake webhdfs server in the same thread using Starlette
Issue EXPORT TO PARQUET(directory='hdfs://fake_webhdfs:8000/somedata') AS SELECT ...
Collect the parquet binary files in the webserver in-memory
Convert the collect parquet tables into pyarrow tables

Limitations:

It's pure python, receiving IO performance may not be optimal
Using Parquet vs Arrow still has some overhead
Vertica has to connect to the python script, for this it needs IP or hostname
It's not encrypted
Resultset needs to fit into memory, no streaming yet
actually it needs to keep both the compressed and uncompressed versions in the memory for a moment

+----------------+                      +-----------+                   +------------+ 
|                |                      |           |                   |            | 
| Python Script  |                      |  Vertica  |                   |  Vertica   | 
| (Runs WebHDFS) |                      |   Node    |                   | Data Nodes | 
|                |                      |           |                   |            | 
+------+---------+                      +----+------+                   +----+-------+ 
       |                                     |                               |        
       |---- Run EXPORT TO PARQET Query ---->|                               |
       |                                     |--- Sends the execution jobs ->|         
       |                                     |                               |
       |<--------- Vertica data nodes connect -------------------------------|        
       |<--------- Parquet --------------------------------------------------|        
       |<--------- Parquet --------------------------------------------------|        
       |<--------- Parquet --------------------------------------------------|        
       |<--------- Parquet --------------------------------------------------|        
       |<--------- Parquet --------------------------------------------------|        
       |<--------- Parquet --------------------------------------------------|
       |                                     |
       |                                     |
       |<----- EXPORT TO PARQET returns -----|

Inefficiencies discovered:

alippai/vertica-python-webhdfs