Copy data between HDFS clusters blazingly fast
- need the ability to copy files in HDFS across network boundaries
- in enterprises, network firewalls and partitions sometimes prevent copying directly between clusters with a tool like DistCp
- cluster-fastcopy facilitates data copies between HDFS clusters in these scenarios
- a common use case is copying production data down into non-production dev/test clusters to enable testing
Copy 10 × 128 MB files in 2.2 seconds
Sample response from POST /copy
{
  "from": "/tmp/bench10x128/",
  "to": "/tmp/out/",
  "written": 1342177280,
  "filesRequested": 10,
  "filesCopied": 10,
  "copyFailures": [],
  "throughputMbps": 4859.414809991191,
  "elapsedSecs": 2.209611375
}
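The figures in the sample response appear to be consistent with one another: `written` is 10 × 128 MiB in bytes, and `throughputMbps` matches megabits per second (10^6 bits) computed from `written` and `elapsedSecs`. The sketch below reproduces that arithmetic. The function name `throughputMbps` is hypothetical, and the formula is inferred from the numbers rather than taken from the service's source.

```go
package main

// throughputMbps converts a byte count and an elapsed time in seconds into
// megabits per second, taking 1 megabit = 1,000,000 bits.
// Assumption: this formula is inferred from the sample response fields
// "written", "elapsedSecs", and "throughputMbps"; it is not confirmed
// against the service's code.
func throughputMbps(bytesWritten int64, elapsedSecs float64) float64 {
	bits := float64(bytesWritten) * 8
	return bits / elapsedSecs / 1e6
}
```

Calling `throughputMbps(1342177280, 2.209611375)` gives roughly 4859.41, in line with the sample response.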
Copy files in 'from' into 'to' on 'targetURL'
curl --request POST \
--url 'http://localhost:8080/copy?from=%2Ftmp%2Fbench32x128%2F&to=%2Ftmp%2Fout%2F&targetURL=http%3A%2F%2Flocalhost%3A8080%2Fupload'
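The query string in the curl command is percent-encoded by hand. As a minimal sketch, the same URL could be assembled in Go with `net/url`, which takes care of the escaping. The helper name `buildCopyURL` is hypothetical; only the `/copy` path and the `from`, `to`, and `targetURL` parameters come from the example above.

```go
package main

import "net/url"

// buildCopyURL returns the /copy request URL for the service at serviceBase.
// The parameter names (from, to, targetURL) are taken from the curl example;
// the helper itself is illustrative and not part of cluster-fastcopy.
func buildCopyURL(serviceBase, from, to, targetURL string) (string, error) {
	u, err := url.Parse(serviceBase)
	if err != nil {
		return "", err
	}
	u.Path = "/copy"

	q := url.Values{}
	q.Set("from", from)
	q.Set("to", to)
	q.Set("targetURL", targetURL)
	// Encode percent-escapes the values and sorts the parameters by key,
	// so the order may differ from the hand-written curl URL.
	u.RawQuery = q.Encode()
	return u.String(), nil
}
```

The resulting string can be passed to `http.Post` with an empty body. Parameter order in a query string should not matter to a typical HTTP handler, though that is an assumption about this service.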
Upload byte stream "hello, world!" into the 'to' directory as 'fileName'
curl --request POST \
--url 'http://localhost:8080/upload?to=%2Ftmp%2Fin%2F&fileName=hello.txt' \
--header 'Content-Type: application/octet-stream' \
--data 'hello, world!'
- receive a request to copy data from cluster1 to cluster2
- stream data from cluster1 into HDFS on cluster2 by sending a byte stream to a microservice residing in cluster2's network partition
- make heavy use of goroutines so that files are copied concurrently and as fast as possible
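The steps above can be sketched as a small worker pool: a fixed number of goroutines pull file paths from a channel and copy each one, tallying successes and failures in the spirit of the `filesCopied` and `copyFailures` response fields. This is one plausible shape under stated assumptions, not the project's actual implementation. The names `copyAll`, `copyResult`, and `copyOne` are hypothetical, and `copyOne` stands in for "read the file from cluster1 and POST its bytes to the /upload endpoint in cluster2's network".

```go
package main

import "sync"

// copyResult loosely mirrors the filesCopied / copyFailures response fields.
type copyResult struct {
	FilesCopied  int
	CopyFailures []string // paths whose copy returned an error
}

// copyAll runs copyOne for every path using a fixed number of worker goroutines.
// copyOne is a hypothetical callback that streams one file to the target cluster.
func copyAll(paths []string, workers int, copyOne func(path string) error) copyResult {
	if workers < 1 {
		workers = 1
	}
	jobs := make(chan string)
	var (
		mu  sync.Mutex // guards res, which is shared by all workers
		res copyResult
		wg  sync.WaitGroup
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range jobs {
				err := copyOne(p)
				mu.Lock()
				if err != nil {
					res.CopyFailures = append(res.CopyFailures, p)
				} else {
					res.FilesCopied++
				}
				mu.Unlock()
			}
		}()
	}
	// Feed every path to the workers, then signal that no more work is coming.
	for _, p := range paths {
		jobs <- p
	}
	close(jobs)
	wg.Wait()
	return res
}
```

Bounding the number of workers keeps the count of simultaneous streams predictable; spawning one goroutine per file would also work for small batches but gives less control over load on either cluster.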