BridgeGC is an efficient cross-level garbage collector for big data frameworks such as Flink and Spark. Specifically, BridgeGC is built on OpenJDK 17 HotSpot to reduce the GC time spent on long-lived data objects generated by these frameworks. As shown in Figure 1, BridgeGC uses a limited number of manual annotations at the creation and release points of data objects to profile their life cycles. It then leverages this life-cycle information to allocate annotated data objects efficiently in the JVM heap and reclaim them without unnecessary marking and copying overhead.
Download our source code and follow the instructions to build the OpenJDK JVM; the result is a JVM with BridgeGC.
We provide two simple annotations, @DataObj and System.Reclaim(), that can be used by framework developers to annotate the creation and release points of data objects. Specifically, the annotation @DataObj is used along with the new keyword to denote the creation of data objects. The annotation System.Reclaim() is used to denote the release of a batch of data objects.
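To make the two annotation points concrete, here is a minimal sketch of a creation point and a release point. It is not taken from Spark, Flink, or Cassandra: `SegmentPool` is a hypothetical component, and because a stock JDK does not ship `@DataObj` or `System.Reclaim()`, the sketch declares a stand-in type-use annotation and a stub `BridgeGCStub.reclaim()` so that it compiles anywhere. On a BridgeGC build, the real annotation and `System.Reclaim()` would presumably take their place.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in: declared here only so the sketch compiles on a stock JDK.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE_USE)
@interface DataObj {}

// Hypothetical stub for the System.Reclaim() call described in the text.
final class BridgeGCStub {
    static int reclaimCalls = 0;

    static void reclaim() {
        reclaimCalls++; // a BridgeGC JVM would notify the collector here
    }
}

// Hypothetical framework component that holds long-lived data buffers.
final class SegmentPool {
    private final List<byte[]> segments = new ArrayList<>();

    void allocate(int count, int sizeBytes) {
        for (int i = 0; i < count; i++) {
            // Creation point: the annotation is written together with `new`.
            segments.add(new @DataObj byte[sizeBytes]);
        }
    }

    int size() {
        return segments.size();
    }

    void releaseAll() {
        segments.clear();       // drop the references to the whole batch
        BridgeGCStub.reclaim(); // release point for the batch
    }
}

class AnnotationDemo {
    public static void main(String[] args) {
        SegmentPool pool = new SegmentPool();
        pool.allocate(4, 1024);
        System.out.println("segments held: " + pool.size());
        pool.releaseAll();
        System.out.println("segments held after release: " + pool.size());
    }
}
```

The point of the sketch is the placement: one annotation per allocation site, and a single reclaim call once an entire batch of data objects is no longer needed.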
We briefly show how we apply the annotations in Spark, Flink, and Cassandra as follows; more details can be found here.
After adding the annotations, compile the framework. Before running the framework, add -XX:+UseBridgeGC to the JVM parameters of the executor/server to enable BridgeGC.
We design three components in BridgeGC to efficiently profile, allocate, and reclaim annotated data objects.
The profiler identifies data objects and tracks their life cycles through annotations. It processes the @DataObj and System.Reclaim() annotations at runtime to inform the garbage collector of when data objects are allocated and when they become reclaimable.
The allocator stores data objects and normal objects separately, in data pages and normal pages, and balances space between the two kinds of pages through dynamic page allocation. To distinguish data objects readily at the GC level, the allocator labels them using colored pointers.
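The labeling idea can be illustrated with plain bit operations on a 64-bit value. This is only a sketch of pointer coloring in general: the bit layout below (44 address bits, with bit 44 as the data-object label) is an assumption made for illustration and is not BridgeGC's actual layout, and `ColoredPointer` is a hypothetical helper.

```java
// Sketch of pointer coloring: spare high bits of a 64-bit pointer carry metadata.
final class ColoredPointer {
    // Assumption for illustration: the low 44 bits hold the object address.
    static final int ADDRESS_BITS = 44;
    static final long ADDRESS_MASK = (1L << ADDRESS_BITS) - 1;
    // Assumption for illustration: bit 44 labels a pointer to a data object.
    static final long DATA_OBJ_BIT = 1L << ADDRESS_BITS;

    // Set the label bit, leaving the address bits untouched.
    static long labelAsData(long pointer) {
        return pointer | DATA_OBJ_BIT;
    }

    // The GC can test the label without dereferencing the pointer.
    static boolean isData(long pointer) {
        return (pointer & DATA_OBJ_BIT) != 0;
    }

    // Strip all metadata bits to recover the plain address.
    static long address(long pointer) {
        return pointer & ADDRESS_MASK;
    }
}

class ColoredPointerDemo {
    public static void main(String[] args) {
        long raw = 0x12345678L;
        long labeled = ColoredPointer.labelAsData(raw);
        System.out.println("labeled as data: " + ColoredPointer.isData(labeled));
        System.out.println("address preserved: " + (ColoredPointer.address(labeled) == raw));
    }
}
```

Because the label travels with the pointer itself, a check such as `isData` needs no lookup in the object header or in page metadata, which is what makes the distinction cheap during marking.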
The collector skips marking labeled data objects and excludes data pages from reclamation in GC cycles where data objects are known to be live. It reclaims data pages only after some annotated data objects have been released at the framework level.
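The following toy model illustrates the skipping behavior described above; it is a simplified sketch, not BridgeGC's implementation. `ToyObject` and `ToyCollector` are hypothetical names, data objects are assumed to be leaves that hold no references, and a single boolean `dataReleased` stands in for the release information reported by the profiler.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy heap object; isData stands in for the label carried by the pointer.
final class ToyObject {
    final String name;
    final boolean isData;
    final List<ToyObject> refs = new ArrayList<>();

    ToyObject(String name, boolean isData) {
        this.name = name;
        this.isData = isData;
    }
}

final class ToyCollector {
    int markedCount = 0; // number of objects marked in the most recent cycle

    // Runs one cycle and returns the objects that survive it.
    Set<ToyObject> collect(List<ToyObject> heap, List<ToyObject> roots, boolean dataReleased) {
        Set<ToyObject> marked = new HashSet<>();
        Deque<ToyObject> work = new ArrayDeque<>(roots);
        while (!work.isEmpty()) {
            ToyObject o = work.pop();
            // No release reported: data objects are known live, so skip marking them.
            if (o.isData && !dataReleased) {
                continue;
            }
            if (!marked.add(o)) {
                continue; // already visited
            }
            work.addAll(o.refs);
        }
        markedCount = marked.size();

        Set<ToyObject> survivors = new HashSet<>(marked);
        if (!dataReleased) {
            // Data pages are excluded from reclamation in this cycle.
            for (ToyObject o : heap) {
                if (o.isData) {
                    survivors.add(o);
                }
            }
        }
        return survivors;
    }
}

class ToyCollectorDemo {
    public static void main(String[] args) {
        ToyObject root = new ToyObject("root", false);
        ToyObject data = new ToyObject("data", true);
        root.refs.add(data);
        ToyCollector gc = new ToyCollector();
        gc.collect(List.of(root, data), List.of(root), false);
        System.out.println("marked without release: " + gc.markedCount);
        gc.collect(List.of(root, data), List.of(root), true);
        System.out.println("marked after release: " + gc.markedCount);
    }
}
```

In the model, a cycle with no reported release marks only normal objects and keeps every data object, while a cycle after a release marks data objects like any others so that unreachable ones can be reclaimed.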
We apply and evaluate BridgeGC with Flink 1.9.3, Spark 3.3.0, and Cassandra 4.0.6. We compare BridgeGC with all available garbage collectors in OpenJDK 17, including ZGC, G1, Shenandoah, and Parallel.
For Flink, we select five memory-intensive batch machine learning applications from the Flink examples as the driving workload: Linear Regression (LR), KMeans (KM), PageRank (PR), Connected Components (CC), and WebLogAnalysis (WA). For Spark, we choose five representative ML applications from the popular big data benchmark HiBench: Linear Regression (LR), Support Vector Machine (SVM), Gaussian Mixture Model (GMM), PageRank (PR), and KMeans (KM). For Cassandra, we choose two workloads from the popular NoSQL database benchmark YCSB: a Write-Intensive (WI) workload and a Read-Intensive (RI) workload. Details of how we configure the applications and frameworks can be found here.
Figure 6: The total concurrent marking and copying GC time that BridgeGC and ZGC spend when running applications with different heap sizes.
As shown in Figure 6, BridgeGC reduces concurrent GC time by 42%-82% compared to the baseline ZGC. BridgeGC achieves lower GC time because it consumes 7%-13% less memory than ZGC and triggers 2%-52% fewer GC cycles. In addition, BridgeGC spends 31%-46% less marking time per GC cycle.
Figure 7: Execution time of applications under baseline ZGC and execution time of other collectors normalized to ZGC.
In terms of application execution time, BridgeGC outperforms the other evaluated collectors for all workloads and configurations, as shown in Figure 7. Compared to the baseline ZGC, BridgeGC achieves a 3%-29% speedup. BridgeGC also reduces execution time by up to 26% compared to G1, the default collector in OpenJDK. As shown in Figure 8, BridgeGC also outperforms all evaluated collectors in latency metrics. BridgeGC improves application performance mainly through fewer GC cycles and lower GC overhead.