Univocity RecordMapper performance issues
seseso opened this issue · 3 comments
Due to the instantiation of the Univocity parser for each record, the performance is very very poor (for 1 million lines, it took me more than 3 minutes versus less than 2 seconds with a "shared" instance of the parser)
To overcome this, I would suggest to move out the creation of the parser from the processRecord method.
Good catch! Can you share the change you made or contribute it with PR?
What I ended up doing was to pass an com.univocity.parsers.common.AbstractParser to my mapper constructor and use it on the processRecord method. I would love to create a PR for it but my code is specific to my requirements where I don't use POJOs but java.util.Map instead. Here's an example of my mapper:
import com.univocity.parsers.common.AbstractParser;
import java.util.LinkedHashMap;
import java.util.Map;
import org.jeasy.batch.core.mapper.RecordMapper;
import org.jeasy.batch.core.record.GenericRecord;
import org.jeasy.batch.core.record.Record;
import org.jeasy.batch.core.record.StringRecord;
public class UnivocityRecordMapper<T> implements RecordMapper<StringRecord, Record<Map<String, String>>> {
private final AbstractParser parser;
public UnivocityRecordMapper(AbstractParser parser) {
this.parser = parser;
}
@Override
public Record<Map<String, String>> processRecord(StringRecord record) throws Exception {
Map<String, String> rValue = new LinkedHashMap<>();
String payload = record.getPayload();
com.univocity.parsers.common.record.Record rec = parser.parseRecord(payload);
rec.fillFieldMap(rValue);
return new GenericRecord<>(record.getHeader(), rValue);
}
}
After some benchmarks, It seems the performance issue is not related to the fact that the parser is not being shared. I tried to extract it from the processRecord
method and the performance is the same. Here is the modified version:
import com.univocity.parsers.common.AbstractParser;
import com.univocity.parsers.common.CommonParserSettings;
import com.univocity.parsers.common.processor.BeanListProcessor;
import org.jeasy.batch.core.mapper.RecordMapper;
import org.jeasy.batch.core.record.GenericRecord;
import org.jeasy.batch.core.record.Record;
import java.io.StringReader;
/**
* A record mapper that uses <a href="http://www.univocity.com/">uniVocity parsers</a> to map delimited records to
* domain objects.
*
* @param <S> The settings type that is used to configure the parser.
* @author Anthony Bruno (anthony.bruno196@gmail.com)
*/
abstract class AbstractUnivocityRecordMapper<T, S extends CommonParserSettings<?>> implements RecordMapper<String, T> {
S settings;
private AbstractParser<S> parser;
private BeanListProcessor<T> beanListProcessor;
/**
* Creates a new mapper that uses <a href="http://www.univocity.com/">uniVocity parsers</a> to map delimited records
* to domain objects.
*
* @param recordClass the target type
* @param settings the settings that is is used to configure the parser
*/
AbstractUnivocityRecordMapper(Class<T> recordClass, S settings) {
this.settings = settings;
this.beanListProcessor = new BeanListProcessor<>(recordClass, 1);
this.settings.setProcessor(beanListProcessor);
this.parser = getParser();
}
@Override
public Record<T> processRecord(Record<String> record) {
String payload = record.getPayload();
parser.parse(new StringReader(payload));
T result = beanListProcessor.getBeans().get(0);
return new GenericRecord<>(record.getHeader(), result);
}
protected abstract AbstractParser<S> getParser();
}
I even extracted the BeanListProcessor
from that method but that didn't help neither (Even though this did not improve the performance, I still introduced the change as I think it's a good refactoring, see 926396b).
The real issue is coming from the fact that in your version, you are converting the parsed record to a Map
(ie no reflection is used), while in the current version the record is converted to a java bean (with reflection behind the scene).
I tried to exclude Easy Batch from the picture and compare these two ways of converting data with univocity's APIs only:
@Test
public void testUnivocityPerformance() {
// /!\ This is quick and dirty, should be a JMH-based throughput micro-bench but I didn't have time to do it :-(
String payload = "foo,bar,15,true";
// parse and convert to map (1021ms)
CsvParser parser = new CsvParser(new CsvParserSettings());
Map<String, String> rValue = new LinkedHashMap<>();
long startTime = System.currentTimeMillis();
for (int i = 0; i < 1000000; i++) {
com.univocity.parsers.common.record.Record record = parser.parseRecord(payload);
record.fillFieldMap(rValue);
}
System.out.println((System.currentTimeMillis() - startTime) + "ms");
// parse and convert to bean (527917ms)
CsvParserSettings settings = new CsvParserSettings();
BeanListProcessor<TestBean> beanListProcessor = new BeanListProcessor<>(TestBean.class);
settings.setProcessor(beanListProcessor);
parser = new CsvParser(settings);
startTime = System.currentTimeMillis();
TestBean testBean;
for (int i = 0; i < 1000000; i++) {
parser.parse(new StringReader(payload));
testBean = beanListProcessor.getBeans().get(0);
}
System.out.println((System.currentTimeMillis() - startTime) + "ms");
}
TestBean
is the same as in here. The difference is huge: 1s vs 8min ..
I would love to create a PR for it but my code is specific to my requirements where I don't use POJOs but java.util.Map instead
Indeed, that's not compatible with the current code and we could not merge it as is. Thank you for raising the issue anyway!