j-easy/easy-batch

Univocity RecordMapper performance issues

seseso opened this issue · 3 comments

Because the Univocity parser is instantiated for each record, performance is very poor: parsing 1 million lines took me more than 3 minutes, versus less than 2 seconds with a "shared" instance of the parser.

To overcome this, I would suggest moving the creation of the parser out of the processRecord method.
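The suggested refactoring can be sketched with a minimal, JDK-only stand-in (ExpensiveParser, PerRecordMapper and SharedParserMapper are hypothetical names for illustration, not the univocity or Easy Batch API): the costly construction moves out of processRecord and into a field that is initialized once and reused for every record.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a parser that is expensive to construct.
class ExpensiveParser {
    ExpensiveParser() {
        // imagine heavy setup here: buffers, settings validation, etc.
    }

    List<String> parse(String line) {
        return Arrays.asList(line.split(","));
    }
}

// Before: a new parser is created for every record (slow).
class PerRecordMapper {
    List<String> processRecord(String payload) {
        return new ExpensiveParser().parse(payload); // re-created on every call
    }
}

// After: one parser instance is shared across all records (fast).
class SharedParserMapper {
    private final ExpensiveParser parser = new ExpensiveParser(); // created once

    List<String> processRecord(String payload) {
        return parser.parse(payload);
    }
}
```

Both mappers produce the same output; only the construction cost per record changes.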

Good catch! Can you share the change you made, or contribute it via a PR?

What I ended up doing was to pass a com.univocity.parsers.common.AbstractParser to my mapper's constructor and use it in the processRecord method. I would love to create a PR for it, but my code is specific to my requirements: I don't use POJOs but java.util.Map instead. Here's an example of my mapper:

import com.univocity.parsers.common.AbstractParser;
import java.util.LinkedHashMap;
import java.util.Map;
import org.jeasy.batch.core.mapper.RecordMapper;
import org.jeasy.batch.core.record.GenericRecord;
import org.jeasy.batch.core.record.Record;
import org.jeasy.batch.core.record.StringRecord;

public class UnivocityRecordMapper implements RecordMapper<StringRecord, Record<Map<String, String>>> {

    private final AbstractParser<?> parser;

    public UnivocityRecordMapper(AbstractParser<?> parser) {
        // the parser is created once and shared across all records
        this.parser = parser;
    }

    @Override
    public Record<Map<String, String>> processRecord(StringRecord record) throws Exception {
        Map<String, String> rValue = new LinkedHashMap<>();
        String payload = record.getPayload();
        com.univocity.parsers.common.record.Record rec = parser.parseRecord(payload);
        rec.fillFieldMap(rValue);
        return new GenericRecord<>(record.getHeader(), rValue);
    }
}

After some benchmarks, it seems the performance issue is not related to the fact that the parser is not being shared: I tried extracting it from the processRecord method and performance is the same. Here is the modified version:

import com.univocity.parsers.common.AbstractParser;
import com.univocity.parsers.common.CommonParserSettings;
import com.univocity.parsers.common.processor.BeanListProcessor;
import org.jeasy.batch.core.mapper.RecordMapper;
import org.jeasy.batch.core.record.GenericRecord;
import org.jeasy.batch.core.record.Record;

import java.io.StringReader;

/**
 * A record mapper that uses <a href="http://www.univocity.com/">uniVocity parsers</a> to map delimited records to
 * domain objects.
 *
 * @param <S> The settings type that is used to configure the parser.
 * @author Anthony Bruno (anthony.bruno196@gmail.com)
 */
abstract class AbstractUnivocityRecordMapper<T, S extends CommonParserSettings<?>> implements RecordMapper<String, T> {

    S settings;
    private AbstractParser<S> parser;
    private BeanListProcessor<T> beanListProcessor;

    /**
     * Creates a new mapper that uses <a href="http://www.univocity.com/">uniVocity parsers</a> to map delimited records
     * to domain objects.
     *
     * @param recordClass the target type
     * @param settings    the settings used to configure the parser
     */
    AbstractUnivocityRecordMapper(Class<T> recordClass, S settings) {
        this.settings = settings;
        this.beanListProcessor = new BeanListProcessor<>(recordClass, 1);
        this.settings.setProcessor(beanListProcessor);
        this.parser = getParser();
    }


    @Override
    public Record<T> processRecord(Record<String> record) {
        String payload = record.getPayload();
        parser.parse(new StringReader(payload));

        T result = beanListProcessor.getBeans().get(0);
        return new GenericRecord<>(record.getHeader(), result);
    }

    protected abstract AbstractParser<S> getParser();

}

I even extracted the BeanListProcessor from that method, but that didn't help either. (Even though this did not improve performance, I still introduced the change as I think it's a good refactoring, see 926396b.)

The real issue is that your version converts the parsed record to a Map (i.e. no reflection is used), while the current version converts the record to a Java bean (with reflection behind the scenes).
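The difference in per-record work can be illustrated with the JDK alone. The sketch below uses hypothetical names (TestBean, MappingStyles), not the univocity internals, and is only a toy: it does not reproduce the full gap measured below, which also includes the per-call StringReader parse and bean-list accumulation. It simply contrasts setting one field reflectively, roughly what a bean processor does, with putting the value straight into a map.

```java
import java.lang.reflect.Field;
import java.util.LinkedHashMap;
import java.util.Map;

class TestBean {
    public String name;
}

class MappingStyles {

    // Reflective path: roughly what a bean processor does per field per record.
    static TestBean viaReflection(String value) {
        try {
            TestBean bean = new TestBean();
            Field field = TestBean.class.getField("name"); // lookup + access check on each call
            field.set(bean, value);
            return bean;
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }

    // Direct path: no reflection, just a map put (like Record.fillFieldMap).
    static Map<String, String> viaMap(String value) {
        Map<String, String> map = new LinkedHashMap<>();
        map.put("name", value);
        return map;
    }

    public static void main(String[] args) {
        int n = 1_000_000;

        long start = System.nanoTime();
        for (int i = 0; i < n; i++) viaReflection("foo");
        System.out.println("reflection: " + (System.nanoTime() - start) / 1_000_000 + "ms");

        start = System.nanoTime();
        for (int i = 0; i < n; i++) viaMap("foo");
        System.out.println("map:        " + (System.nanoTime() - start) / 1_000_000 + "ms");
    }
}
```

Caching the Field object outside the loop would narrow the gap considerably, which is one reason real bean processors are more sophisticated than this sketch.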

I tried to take Easy Batch out of the picture and compare these two ways of converting data using univocity's APIs only:

@Test
public void testUnivocityPerformance() {

    // /!\ This is quick and dirty, should be a JMH-based throughput micro-bench but I didn't have time to do it :-(

    String payload = "foo,bar,15,true";
    
    // parse and convert to map (1021ms)
    CsvParser parser = new CsvParser(new CsvParserSettings());
    Map<String, String> rValue = new LinkedHashMap<>();
    long startTime = System.currentTimeMillis();
    for (int i = 0; i < 1000000; i++) {
        com.univocity.parsers.common.record.Record record = parser.parseRecord(payload);
        record.fillFieldMap(rValue);
    }
    System.out.println((System.currentTimeMillis() - startTime) + "ms");

    // parse and convert to bean (527917ms)
    CsvParserSettings settings = new CsvParserSettings();
    BeanListProcessor<TestBean> beanListProcessor = new BeanListProcessor<>(TestBean.class);
    settings.setProcessor(beanListProcessor);
    parser = new CsvParser(settings);
    startTime = System.currentTimeMillis();
    TestBean testBean;
    for (int i = 0; i < 1000000; i++) {
        parser.parse(new StringReader(payload));
        testBean = beanListProcessor.getBeans().get(0);
    }
    System.out.println((System.currentTimeMillis() - startTime) + "ms");
}

TestBean is the same as here. The difference is huge: 1s vs 8min.

I would love to create a PR for it but my code is specific to my requirements where I don't use POJOs but java.util.Map instead

Indeed, that's not compatible with the current code and we could not merge it as is. Thank you for raising the issue anyway!