osiegmar/FastCSV

GC overhead limit exceeded because of temporary objects

extstmtrifork opened this issue · 4 comments

Hi,

I am trying to read from a CSV file containing a bit more than 2 million rows, then do a simple mapping to something I can use to finally insert into a database. However, I am getting the error "GC overhead limit exceeded", as a lot of temporary objects are created.

I read the other issue regarding temporary objects, but as far as I could understand, it concerns writing to a CSV file, whereas I am getting this error while reading from a CSV file.

@extstmtrifork can you provide some sample code?

So basically I have two files which share a header called "PersonID".
I read the first file and insert the data into a HashMap (see code below).
Then I read the second file, where I use the HashMap to look up another header, "CivilRegistrationNumber", based on "PersonID".
There are 14 headers (columns) in the second CSV file, all of them strings.
I then use all the information to insert into a database.

```java
public Map<String, String> readingFileAtOnce(File file) throws IOException, InterruptedException {
    Map<String, String> personMap = new HashMap<>();
    CsvReader csvReader = new CsvReader();
    csvReader.setContainsHeader(true);
    csvReader.setTextDelimiter('\'');
    csvReader.setSkipEmptyRows(true);
    CsvParser csvParser = csvReader.parse(file, StandardCharsets.UTF_8);
    CsvRow row;
    boolean headersValidated = false;
    while ((row = csvParser.nextRow()) != null) {
        if (Thread.currentThread().isInterrupted())
            throw new InterruptedException();

        if (!headersValidated) {
            dataValidator.validateHeadersExists(csvParser.getHeader(), Arrays.asList("PersonID", "CivilRegistrationNumber"));
            headersValidated = true;
        }

        try {
            dataValidator.validatePersonData(row.getField("PersonID"), row.getField("CivilRegistrationNumber"));
            personMap.put(row.getField("PersonID"), row.getField("CivilRegistrationNumber"));
        } catch (IllegalStateException e) {
            error++;
            log.error("...");
        } catch (IllegalArgumentException e) {
            error++;
            log.error("...");
        }
    }
    return personMap;
}
```

This sounds like a JVM tuning issue ... How much heap memory are you allocating to the JVM?
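
Not stated in the thread, but for reference: the quickest sanity check is to print the maximum heap the JVM is allowed to grow to and compare it against the roughly 2 million map entries. A minimal sketch (the class name is mine):

```java
public class HeapCheck {

    public static void main(String[] args) {
        // Upper bound the JVM will grow the heap to (controlled by -Xmx, e.g. java -Xmx4g ...)
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.printf("Max heap: %.1f MiB%n", maxBytes / (1024.0 * 1024.0));
    }
}
```

If that turns out to be a small default, raising `-Xmx` is the usual first step before looking at the parser.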

IMO this design doesn't scale well.

You would be better off ...

  1. Sort both files by PersonID.
  2. Read a record from file 1.
  3. Read a record from file 2.
  4. Merge the records and write them to file 3 (see the sketch below).
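
A rough, untested sketch of steps 2-4, assuming both files are already sorted by PersonID in the same (string) order and using the FastCSV 1.x reader API from the code above. The class name, the file arguments and the plain `BufferedWriter` output are mine for illustration; proper quoting of the output is omitted:

```java
import de.siegmar.fastcsv.reader.CsvParser;
import de.siegmar.fastcsv.reader.CsvReader;
import de.siegmar.fastcsv.reader.CsvRow;

import java.io.BufferedWriter;
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

public class MergeJoin {

    // Merge two PersonID-sorted CSV files without holding either of them in memory.
    // Only the two joined columns are written; the remaining columns of the
    // second file would be appended the same way.
    public static void merge(File persons, File details, File out) throws IOException {
        CsvReader csvReader = new CsvReader();
        csvReader.setContainsHeader(true);
        csvReader.setTextDelimiter('\'');
        csvReader.setSkipEmptyRows(true);

        try (CsvParser p1 = csvReader.parse(persons, StandardCharsets.UTF_8);
             CsvParser p2 = csvReader.parse(details, StandardCharsets.UTF_8);
             BufferedWriter w = Files.newBufferedWriter(out.toPath(), StandardCharsets.UTF_8)) {

            CsvRow r1 = p1.nextRow();
            CsvRow r2 = p2.nextRow();

            while (r1 != null && r2 != null) {
                int cmp = r1.getField("PersonID").compareTo(r2.getField("PersonID"));
                if (cmp < 0) {
                    r1 = p1.nextRow();   // person without detail rows: skip
                } else if (cmp > 0) {
                    r2 = p2.nextRow();   // detail row without a matching person: skip (or log)
                } else {
                    // matching PersonID: emit the merged record, then advance file 2
                    // because it may contain several rows per person
                    w.write(r2.getField("PersonID") + "," + r1.getField("CivilRegistrationNumber"));
                    w.newLine();
                    r2 = p2.nextRow();
                }
            }
        }
    }
}
```

This keeps memory usage roughly constant regardless of file size; the remaining cost is the one-off sort of the two inputs.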

Is the CSV file properly formatted? I know of situations where missing (closing) text delimiters result in huge column data.
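
For illustration only (invented rows, using `'` as the text delimiter like the code above): if an opening quote is never closed, the parser keeps reading across row boundaries until it finds the next quote, so a single field can swallow many rows and grow huge.

```
PersonID,CivilRegistrationNumber
'1001','0101701234'
'1002,0202801234
'1003','0303901234'
```

Here the field that starts at `'1002` only ends at the next quote, and every quote after that pairs up differently for the rest of the file.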