Easy Batch vs Spring Batch - Performance comparison

fmbenhassine opened this issue 4 years ago · 0 comments

In a previous post, I compared Easy Batch and Spring Batch in terms of features. I came to the conclusion that, without a doubt, Spring Batch provides a richer feature set and allows you to do much more than Easy Batch does.

In this post, I will compare Easy Batch and Spring Batch in terms of performance. This post is meant to be constructive! I developed an alternative to Spring Batch, so I was curious about how it would behave at runtime. Regardless of the result, the goal is to understand why one framework performs better or worse than the other, not merely to show that it does. The benchmark measures the time needed to read the following customer data file, customers_in.csv:

id,firstName,lastName,birthDate,email,phone,street,zipCode,city,country
41837,Due,Pearson,2015-08-31,Liza.Diaz@yopmail.org,0102030405,Fifth Avenue,12345,NewYork,China
454205,Liza,Dickson,2015-06-23,Duke.Pearson@hotmail.com,0102030405,Oxford Street,12345,Paris,Germany
852684,Brad,Hinton,2015-08-31,Tommy.Dickson@hotmail.edu,0504030201,Fifth Avenue,54321,Rome,Italy

and write each record in uppercase to customers_out.csv. In another world, I would write something like:

$> cat customers_in.csv | tr 'a-z' 'A-Z' > customers_out.csv

but let's stay in the Java world... 😄 The following domain object will be used to marshal/unmarshal data:

import java.util.Date;

public class Customer {

    private int id;
    private String firstName;
    private String lastName;
    private Date birthDate;
    private String email;
    private String phone;
    private String street;
    private String zipCode;
    private String city;
    private String country;

    // Getters and setters omitted

}

I will use the Easy Random library to generate several files of different sizes for the benchmark: 100.000, 1.000.000 and 10.000.000 customers. The Easy Batch and Spring Batch applications are configured much like the Hello World application of the previous post; only the domain object has changed, from Tweet to Customer. Here is the main class to launch the Easy Batch job:

public class EasyBatchBenchLauncher {

    public static void main(String[] args) throws Exception {
        File datasource = new File("customers_in.csv");
        File datasink = new File("customers_out.csv");
        String[] fields = {"id", "firstName", "lastName", "birthDate",
                           "email", "phone", "street", "zipCode", "city", "country"};

        Job job = new JobBuilder()
                .reader(new FlatFileRecordReader(datasource))
                .mapper(new DelimitedRecordMapper<>(Customer.class, fields))
                .processor(new CustomerProcessor())
                .marshaller(new DelimitedRecordMarshaller<>(Customer.class, fields))
                .writer(new FileRecordWriter(datasink))
                .batchSize(10)
                .build();

        JobExecutor jobExecutor = new JobExecutor();
        jobExecutor.execute(job);
        jobExecutor.shutdown();
    }

}

And here is the main class to launch the Spring Batch job:

public class SpringBatchBenchLauncher {

    public static void main(String[] args) throws Exception {
        ApplicationContext context = new ClassPathXmlApplicationContext("customer-job.xml");
        JobLauncher jobLauncher = (JobLauncher) context.getBean("jobLauncher");
        Job job = (Job) context.getBean("customerJob");
        jobLauncher.run(job, new JobParameters());
    }

}
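The input files mentioned above were generated with Easy Random, and that code is not shown here. As a rough stand-in, the following sketch writes a file with the same ten columns using only `java.util.Random`. The class name `CustomerFileGenerator`, the value pools and the fixed seed are all assumptions made for illustration, not what was actually used for the benchmark.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Random;

// Hypothetical generator: a plain java.util.Random substitute for Easy Random.
public class CustomerFileGenerator {

    static final String HEADER =
            "id,firstName,lastName,birthDate,email,phone,street,zipCode,city,country";

    // Illustrative value pools, loosely based on the sample rows of customers_in.csv.
    static final String[] FIRST_NAMES = {"Liza", "Brad", "Tommy", "Duke"};
    static final String[] LAST_NAMES = {"Pearson", "Dickson", "Hinton", "Diaz"};
    static final String[] DOMAINS = {"yopmail.org", "hotmail.com", "hotmail.edu"};
    static final String[] PHONES = {"0102030405", "0504030201"};
    static final String[] STREETS = {"Fifth Avenue", "Oxford Street"};
    static final String[] ZIP_CODES = {"12345", "54321"};
    static final String[] CITIES = {"NewYork", "Paris", "Rome"};
    static final String[] COUNTRIES = {"China", "Germany", "Italy"};

    static String pick(Random random, String[] values) {
        return values[random.nextInt(values.length)];
    }

    // Builds one comma-separated record with ten fields.
    static String randomLine(Random random) {
        String firstName = pick(random, FIRST_NAMES);
        String lastName = pick(random, LAST_NAMES);
        int id = 1 + random.nextInt(999_999);
        String birthDate = String.format("2015-%02d-%02d",
                1 + random.nextInt(12), 1 + random.nextInt(28));
        String email = firstName + "." + lastName + "@" + pick(random, DOMAINS);
        return id + "," + firstName + "," + lastName + "," + birthDate + "," + email + ","
                + pick(random, PHONES) + "," + pick(random, STREETS) + ","
                + pick(random, ZIP_CODES) + "," + pick(random, CITIES) + ","
                + pick(random, COUNTRIES);
    }

    // Writes the header followed by `count` random records to `target`.
    static void generate(Path target, int count, long seed) throws IOException {
        Random random = new Random(seed);
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            writer.write(HEADER);
            writer.newLine();
            for (int i = 0; i < count; i++) {
                writer.write(randomLine(random));
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        int count = args.length > 0 ? Integer.parseInt(args[0]) : 100_000;
        generate(Paths.get("customers_in.csv"), count, 42L);
    }
}
```

Passing 1000000 or 10000000 as the first argument would produce the larger files. A fixed seed keeps the generated file identical across runs, which makes repeated measurements comparable.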

Results

The benchmark results were obtained as the average of 5 executions on the following hardware/software configuration:

Hardware:

Laptop: MacBook Pro (Retina, 15-inch, Late 2013)
CPU: 2 GHz Intel Core i7
RAM: 8 GB 1600 MHz DDR3
DISK: 251 GB SSD Flash Storage

Software:

OS: Mac OS X Yosemite 10.10.3
Java: version 1.7.0_67 HotSpot(TM) 64-Bit Server VM

The commit-interval (CI) is an important parameter for the performance of Spring Batch, just like the batch-size (BS) parameter for Easy Batch. I used three values for these parameters: 10, 100 and 1000. The following table summarizes the number of input records, the file size and the processing time in seconds for each framework:

Number of records (file size)	Easy Batch BS = 10 (s)	Easy Batch BS = 100 (s)	Easy Batch BS = 1000 (s)	Spring Batch CI = 10 (s)	Spring Batch CI = 100 (s)	Spring Batch CI = 1000 (s)
100.000 (9.4 MB)	1	1	1	10	6	5
1.000.000 (94 MB)	8	7	7	74	40	38
10.000.000 (983 MB)	82	76	73	773	424	388
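To make the table easier to compare at a glance, the timings can be turned into speed ratios and throughput figures. The sketch below does only that arithmetic on the numbers copied from the table; the class name `BenchmarkRatios` and its helper methods are illustrative.

```java
public class BenchmarkRatios {

    // Rows: 100.000, 1.000.000 and 10.000.000 records.
    static final int[] RECORDS = {100_000, 1_000_000, 10_000_000};
    // Columns: batch size / commit interval of 10, 100 and 1000.
    static final int[] SIZES = {10, 100, 1000};

    // Processing times in seconds, copied from the table.
    static final int[][] EASY_BATCH = {{1, 1, 1}, {8, 7, 7}, {82, 76, 73}};
    static final int[][] SPRING_BATCH = {{10, 6, 5}, {74, 40, 38}, {773, 424, 388}};

    // How many times longer Spring Batch took than Easy Batch for one table cell.
    static double ratio(int row, int col) {
        return (double) SPRING_BATCH[row][col] / EASY_BATCH[row][col];
    }

    // Records processed per second, rounded to the nearest integer.
    static long throughput(int records, int seconds) {
        return Math.round((double) records / seconds);
    }

    public static void main(String[] args) {
        for (int row = 0; row < RECORDS.length; row++) {
            for (int col = 0; col < SIZES.length; col++) {
                System.out.printf(
                        "%,d records, size %d: ratio %.1fx, Easy Batch %,d rec/s, Spring Batch %,d rec/s%n",
                        RECORDS[row], SIZES[col], ratio(row, col),
                        throughput(RECORDS[row], EASY_BATCH[row][col]),
                        throughput(RECORDS[row], SPRING_BATCH[row][col]));
            }
        }
    }
}
```

For the largest file this gives a ratio of roughly 9.4 at size 10 and roughly 5.3 at size 1000. The ratios for the smallest file should be read with care, since the Easy Batch timings there are rounded to a single second.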

It's always better to see results in charts, so here they are:

[Chart not available]

The difference is more pronounced for very large data sets:

[Chart not available]

Please note that this is a macro benchmark, not a micro benchmark at the nanosecond level (where I would have used JMH or a similar tool). The goal is to get a rough idea of the overall execution time of both applications.
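The measurement code itself is not shown, so here is a minimal sketch of how such a macro benchmark could be timed: run the whole job several times and average the wall-clock time. The class `MacroBenchmark`, the method `averageSeconds` and the placeholder task are assumptions for illustration, not the harness actually used.

```java
import java.util.concurrent.Callable;

public class MacroBenchmark {

    // Runs the task `runs` times and returns the average wall-clock time in seconds.
    static double averageSeconds(Callable<?> task, int runs) throws Exception {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        long totalNanos = 0;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.call();
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / 1e9 / runs;
    }

    public static void main(String[] args) throws Exception {
        // Placeholder task; in a real run this would launch the batch job.
        Callable<Void> job = new Callable<Void>() {
            @Override
            public Void call() throws Exception {
                Thread.sleep(50);
                return null;
            }
        };
        System.out.printf("Average over 5 runs: %.3f s%n", averageSeconds(job, 5));
    }
}
```

Unlike JMH, this does no warm-up and does not isolate JVM effects such as JIT compilation, which is acceptable only because the goal is a rough figure for the whole run.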

Conclusion

Easy Batch is faster than Spring Batch in this case (but might be slower in another case). Now the question is: why is there such a difference? From my understanding of Spring Batch mechanics, I guess the interaction with the job repository (even an in-memory one) is the main reason. Persisting the job/step execution state at each commit interval has a considerable performance overhead, but it enables job restarts in case of failure. It's always a matter of trade-offs.
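To get a feel for how often that state would be persisted, the sketch below counts the checkpoints in a simplified chunk-oriented loop. It is not Spring Batch's implementation: `ChunkCheckpointSketch` and `countCheckpoints` are hypothetical names, and the counter merely stands in for a job repository update.

```java
public class ChunkCheckpointSketch {

    // Simulates chunk-oriented processing and counts one checkpoint per chunk.
    static long countCheckpoints(long totalItems, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        long processed = 0;
        long checkpoints = 0;
        while (processed < totalItems) {
            long chunk = Math.min(chunkSize, totalItems - processed);
            processed += chunk; // stand-in for reading, processing and writing one chunk
            checkpoints++;      // stand-in for persisting the execution state
        }
        return checkpoints;
    }

    public static void main(String[] args) {
        long totalItems = 10_000_000L;
        for (int chunkSize : new int[] {10, 100, 1000}) {
            System.out.printf("commit-interval %d: %,d checkpoints%n",
                    chunkSize, countCheckpoints(totalItems, chunkSize));
        }
    }
}
```

For 10.000.000 records, a commit-interval of 10 means 1,000,000 checkpoints, while 1000 means only 10,000. That is consistent with the larger intervals running faster in the table, although the timings alone do not prove that the job repository is the cause.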