
Easy Batch vs Spring Batch - Feature comparison


Easy Batch and Spring Batch fundamentally try to solve the same problem: how to efficiently process large amounts of data in batch mode. But they are conceptually different at several levels:

Job structure:

A job in Spring Batch is a collection of steps. A step can be a single task or chunk-oriented. In Easy Batch, there is no concept of step. A job in Easy Batch is similar to a Spring Batch job with a single chunk-oriented step (and using an in-memory job repository).

Job definition:

Spring Batch provides a DSL to define the execution flow of steps within a job. In Easy Batch, there is no such DSL. Creating a workflow of jobs is left to an external workflow engine like Easy Flows.

Job execution:

  • A Spring Batch job can have multiple job instances (identified by their identifying job parameters). Each job instance may in turn have multiple executions. In Easy Batch, there are no such concepts of job instances or job executions. Jobs are Callable objects that can be executed with a JobExecutor or an ExecutorService (as sketched below).

  • In Spring Batch, the execution state of jobs is persisted by default in a map-based job repository (which can also be backed by an in-memory database). This is not the case for Easy Batch: by default, jobs are executed without persisting their state in any store (though it is possible to do so using listeners).
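For example, here is a minimal sketch of running an Easy Batch job with a plain ExecutorService (this assumes the Job type shown later in this post implements Callable<JobReport>, and that the job variable has been built with the JobBuilder as shown further down):

ExecutorService executorService = Executors.newSingleThreadExecutor();
Future<JobReport> reportFuture = executorService.submit(job); // Job is a Callable<JobReport>
JobReport report = reportFuture.get();                        // wait for the job to complete
executorService.shutdown();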

This quick comparison should give you an overview of the conceptual differences. There is nothing wrong with either framework; they just have different design choices and defaults:

  • Spring Batch is designed for large scale jobs. Restarting such jobs from scratch is not efficient. Hence, persisting the job state by default to restart it where it left off in case of failure makes perfect sense.

  • Easy Batch is targeted at small and simple ETL jobs. These jobs are in most cases idempotent (or at least can be designed to be idempotent). Such jobs can be restarted from scratch if they fail without any problem. The design choice of not persisting the job state by default makes sense in these cases.

Based on these conceptual differences, comparing Easy Batch and Spring Batch head to head would be unfair. The only comparison that makes sense is the use case where an Easy Batch job is compared to a Spring Batch job with a single chunk-oriented step and an in-memory job repository. And this is what I'm going to use in this post.

The goal of this post is to compare both frameworks in terms of features with a practical example. Since I am the author of Easy Batch, you may think the comparison will be biased. It will not be: I am going to be objective, and I always try to be constructive. I am a big fan of the Spring Framework and all related projects. My goal is not to say that Easy Batch is better than Spring Batch or vice versa; the goal is to say in which situations it is better to use one framework over the other. If you want the short answer to which framework is better, here it is: Spring Batch is better! (See my opinion in the conclusion of this post).

The use case will be reading some tweets from a flat file and printing them out in uppercase to the standard output. The data source is the following tweets.csv file:

id,user,message
1,foo,Spring Batch rocks! #SpringBatch
2,bar,Easy Batch rocks too! and it's easier :wink: #EasyBatch

Records will be mapped to the following domain object:

public class Tweet {
    private int id;
    private String user;
    private String message;
    // Getters and setters omitted
}
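Both implementations below end up printing the tweet to the console, so Tweet presumably also overrides toString(); a minimal version (omitted above along with the getters and setters) could look like this:

    @Override
    public String toString() {
        return "Tweet{id=" + id + ", user='" + user + "', message='" + message + "'}";
    }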

Easy Batch implementation:

First, let's create a processor to transform tweets to uppercase:

public class TweetProcessor implements RecordProcessor<Record<Tweet>, Record<Tweet>> {
    
    @Override
    public Record<Tweet> processRecord(Record<Tweet> record) {
        Tweet tweet = record.getPayload();
        tweet.setMessage(tweet.getMessage().toUpperCase());
        return new GenericRecord<>(record.getHeader(), tweet);
    }

}

Then, configure a job and run the application:

public class EasyBatchHelloWorldLauncher {

    public static void main(String[] args) throws Exception {
        Job job = new JobBuilder()
                .reader(new FlatFileRecordReader("tweets.csv"))
                .filter(new HeaderRecordFilter())
                .mapper(new DelimitedRecordMapper<>(Tweet.class, "id", "user", "message"))
                .processor(new TweetProcessor())
                .writer(new StandardOutputRecordWriter())
                .build();

        JobExecutor jobExecutor = new JobExecutor();
        jobExecutor.execute(job);
        jobExecutor.shutdown();
    }

}

Spring Batch implementation:

First, let's create a processor to transform tweets to uppercase:

public class TweetProcessor implements ItemProcessor<Tweet, Tweet> {

    @Override
    public Tweet process(Tweet tweet) throws Exception {
        tweet.setMessage(tweet.getMessage().toUpperCase());
        return tweet;
    }

}

Then, create a writer:

public class TweetWriter implements ItemWriter<Tweet> {

    @Override
    public void write(List<? extends Tweet> items) throws Exception {
        for (Tweet tweet : items) {
            System.out.println(tweet);
        }
    }

}

And finally, configure the application:

<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:batch="http://www.springframework.org/schema/batch"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/batch
        http://www.springframework.org/schema/batch/spring-batch.xsd
        http://www.springframework.org/schema/beans
        http://www.springframework.org/schema/beans/spring-beans.xsd ">

    <bean id="transactionManager"
        class="org.springframework.batch.support.transaction.ResourcelessTransactionManager"/>

   <bean id="jobRepository" class="org.springframework.batch.core.repository.support.MapJobRepositoryFactoryBean">
        <property name="transactionManager" ref="transactionManager"/>
    </bean>

    <bean id="jobLauncher" class="org.springframework.batch.core.launch.support.SimpleJobLauncher">
        <property name="jobRepository" ref="jobRepository"/>
    </bean>

    <bean id="tweet" class="common.Tweet" scope="prototype"/>

    <bean id="tweetReader" class="org.springframework.batch.item.file.FlatFileItemReader">
        <property name="resource" value="classpath:tweets.csv"/>
        <property name="linesToSkip" value="1"/>
        <property name="lineMapper">
            <bean class="org.springframework.batch.item.file.mapping.DefaultLineMapper">
                <property name="lineTokenizer">
                    <bean class="org.springframework.batch.item.file.transform.DelimitedLineTokenizer">
                        <property name="names" value="id,user,message"/>
                    </bean>
                </property>
                <property name="fieldSetMapper">
                    <bean class="org.springframework.batch.item.file.mapping.BeanWrapperFieldSetMapper">
                        <property name="prototypeBeanName" value="tweet"/>
                    </bean>
                </property>
            </bean>
        </property>
    </bean>

    <bean id="tweetProcessor" class="springbatch.TweetProcessor"/>

    <bean id="tweetWriter" class="springbatch.TweetWriter"/>

    <batch:job id="helloWorldJob">
        <batch:step id="step1">
            <batch:tasklet>
                <batch:chunk reader="tweetReader" writer="tweetWriter" processor="tweetProcessor"
                 commit-interval="10"/>
            </batch:tasklet>
        </batch:step>
    </batch:job>

</beans>

Here is the class to launch the application with Spring Batch:

public class SpringBatchHelloWorldLauncher {

    public static void main(String[] args) throws Exception {
        ApplicationContext context = new ClassPathXmlApplicationContext("job-hello-world.xml");
        JobLauncher jobLauncher = (JobLauncher) context.getBean("jobLauncher");
        Job job = (Job) context.getBean("helloWorldJob");
        jobLauncher.run(job, new JobParameters());
    }

}
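As a side note (not part of the original launcher), jobLauncher.run returns a JobExecution that can be inspected, for example to check the exit status:

JobExecution jobExecution = jobLauncher.run(job, new JobParameters());
System.out.println("Exit status: " + jobExecution.getExitStatus());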

Comparison:

As you can see, Spring Batch still requires you to configure some technical plumbing you might not really need, which is not the case for Easy Batch. And this is exactly what many people complain about. Here are some examples:

"Spring Batch application grows pretty quick and involves configuring a lot of stuff that, at the outset, it just doesn't seem like you should need to configure. A "job repository" to track the status and history of job executions, which itself requires a data source - just to get started? Wow, that's a bit heavy handed"

"I got a little overwhelmed by the complexity and amount of configuration needed for even a simple example"

Jeff Zapotoczny

"What should we think of the Spring Batch solution? Complex. Obviously, it looks more complicated than the simple approaches. This is typical of a framework: the learning curve is steeper."

Arnaud Cogoluègnes

"I recently evaluated Spring Batch, and quickly rejected it once I realized that it added nothing to my project aside from bloat and overhead"

rtperson

"il faut configurer le composant qui permet de lancer un batch, le « jobLauncher ». Simple, mais on voit que l’on a besoin d’un « jobRepository » qui permet de suivre et de reprendre l’avancement des tâches. On voit que l’on a besoin d’un transaction manager. Cette propriété est obligatoire, ce qui est à mon sens dommage pour les cas simples comme le nôtre où nous n’utilisons pas les transactions."

Julien Jakubowski

"Spring Batch or How Not to Design an API.. Why do I Need a Transaction Manager? Why do I Need a Job Repository?"

William Shields

Most of these posts are quite recent; a couple of them may seem outdated, but the points still hold for the latest version of Spring Batch (v3.0.3 as of writing this post).

EDIT 02/10/2017: Most people complain about the complexity of configuring Spring Batch jobs (I am not one of them, read more about this later in this post). This complexity is largely a non-issue today thanks to the amazing Spring Boot project and the @EnableBatchProcessing annotation.
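For reference, here is a minimal sketch of what a Java-based configuration of the same job could look like (imports are omitted like in the other snippets; the configuration class name is illustrative and a Spring Batch 3.x API is assumed). With no DataSource defined, @EnableBatchProcessing should fall back to an in-memory (map-based) job repository, so the jobRepository and transactionManager beans from the XML above do not have to be declared by hand:

@Configuration
@EnableBatchProcessing
public class HelloWorldJobConfiguration {

    @Bean
    public FlatFileItemReader<Tweet> tweetReader() {
        FlatFileItemReader<Tweet> reader = new FlatFileItemReader<>();
        reader.setResource(new ClassPathResource("tweets.csv"));
        reader.setLinesToSkip(1);
        DelimitedLineTokenizer tokenizer = new DelimitedLineTokenizer();
        tokenizer.setNames(new String[] {"id", "user", "message"});
        BeanWrapperFieldSetMapper<Tweet> fieldSetMapper = new BeanWrapperFieldSetMapper<>();
        fieldSetMapper.setTargetType(Tweet.class);
        DefaultLineMapper<Tweet> lineMapper = new DefaultLineMapper<>();
        lineMapper.setLineTokenizer(tokenizer);
        lineMapper.setFieldSetMapper(fieldSetMapper);
        reader.setLineMapper(lineMapper);
        return reader;
    }

    @Bean
    public Job helloWorldJob(JobBuilderFactory jobs, StepBuilderFactory steps) {
        Step step = steps.get("step1")
                .<Tweet, Tweet>chunk(10)
                .reader(tweetReader())
                .processor(new TweetProcessor())
                .writer(new TweetWriter())
                .build();
        return jobs.get("helloWorldJob").start(step).build();
    }

}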

These reactions from the community can be summarized in 3 points:

  • Steep learning curve
  • Complex configuration
  • Mandatory components that you have to configure but probably don't need

Personally, a steep learning curve is not a problem if it is worth it (and it is for Spring Batch!). Complex configuration is also a point that I can accept. But my concern is being forced to configure components I might not need:

  1. If my application does not require transactions, why do I need to configure a transaction manager?
  2. If my application does not need retry on failure or job history, why do I need to configure a Job Repository (even in memory)?
  3. If my application does not write anything, why do I need to specify a writer?
  4. If my application does not need chunk processing, why do I need to specify a commit-interval?

There is certainly a good reason for each of these components and I have tried to answer these questions according to my understanding of the framework's internals:

If my application does not require transactions, why do I need to configure a transaction manager?
If my application does not need retry on failure or job history, why do I need to configure a Job Repository (even in memory)?

Spring Batch persists the state of the job in a database to be able to restart it where it left off in case of failure. To persist the state of the job, a transaction manager and a job repository are required. But those could have been made optional by default in case there is no requirement to retry the job on failure.

If my application does not write anything, why do I need to specify a writer?

What I am referring to is, for example, a batch job that counts the number of invalid records in a flat file. In this case, we don't write data anywhere (unless you consider assigning a value to a variable some kind of writing). Of course, one can use a NoOpItemWriter, but again, this is configuring a component we don't need.
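Such a no-op writer is trivial to write; Spring Batch does not ship a class with this exact name, so the following is just an illustrative sketch:

public class NoOpItemWriter<T> implements ItemWriter<T> {

    @Override
    public void write(List<? extends T> items) {
        // intentionally do nothing
    }

}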

If my application does not need chunk processing, why do I need to specify a commit-interval?

Coming from the Unix world, I am used to tools that are record oriented (sed, awk and friends). In some situations, we don't really need chunk processing. If I take the same example of a batch job that counts the number of invalid records in a flat file, providing a commit-interval makes no sense. But it makes perfect sense in other situations! If the job is to persist data to a database, it is wise to use chunk processing and specify a reasonable commit-interval for performance reasons (can you imagine committing a transaction for each record?). So like the others, the chunk processing model could have been made optional by default (probably by providing a record-oriented step implementation in addition to the chunk-oriented one and giving the choice to the user).

To summarize, there is nothing wrong with these components, but I differ with the choice of defaults. Spring Batch is well suited for use cases where you really need advanced features like retry on failure, remote chunking, flows, etc. When such advanced features are not needed, in-house solutions are usually created from scratch (I have seen a lot of them). And this is where Easy Batch comes into play, as a middle ground between Spring Batch and the "Do It Yourself" approach:

[Figure: where Easy Batch sits between Spring Batch and "Do It Yourself" solutions]

Easy Batch is probably easier to learn, configure and use, but this does not make it suitable for all use cases (which was not the goal in the first place). Here is a side by side comparison of features between both frameworks:

| Feature | Spring Batch | Easy Batch |
| --- | --- | --- |
| Learning curve | Steep | Small |
| POJO based development | Yes | Yes |
| Parallel processing | Yes | Yes |
| Asynchronous processing | Yes | Yes |
| Real time monitoring | Yes | Yes |
| Job configuration | Java, XML, Annotations | Java |
| Transaction management | Declarative, Programmatic | Declarative, Programmatic |
| Chunk processing | Yes | Yes |
| Chunk scanning | Yes | Yes |
| Fault tolerance features | Yes | Yes |
| Job meta-data persistence | Yes | No |
| Remote job administration | Yes | No |
| Remote partitioning | Yes | No |
| Remote chunking | Yes | No |
| Implements JSR 352 | Yes | No |

There is no doubt that Spring Batch is ahead of Easy Batch in terms of features. But this comes at a cost: a complex configuration and a steep learning curve. As always, it is a trade-off: you can't have your cake and eat it too 😄 The goal of Easy Batch is to keep the framework small and easy, but at the same time extensible and flexible, with smart defaults that cover the majority of use cases.

Conclusion:

I hope this post gives you some insights into both frameworks to help you choose which one to use in which situation. But in the end, the choice should be pragmatic: choose the right tool for the right job! If your application requires advanced features like retry on failure, remoting or flows, then go for Spring Batch (or another implementation of JSR 352). If you don't need all this advanced stuff, then Easy Batch can be very handy to simplify your batch application development (a real world example can be found here).

Let me conclude with my honest opinion about both frameworks: Spring Batch is better! Easy Batch is easier. Spring Batch is made by smart people working full time on the framework. Easy Batch, on the other hand, is made by an open source hacker working on the framework during nights and weekends (with the help of some great contributors). It is not in the same league. Easy Batch will always be the little brother of Spring Batch 😄