safing/mmdbmeld

Memory usage keeps climbing while the mmdbmeld command is running

nonetallt opened this issue · 6 comments

As per the post title.

While I'm trying to merge multiple relatively large database files (around 2GB combined), the process eventually dies with:

runtime out of memory: cannot allocate X-byte block (Y in use)
fatal error: out of memory

My laptop has around 6GB of free RAM for Go to work with, but the process seems to die with that error before getting there (premature OOM). Any chance of improving GC behavior or otherwise reducing the application's memory footprint? I don't think this is an issue with the writer library itself, since I couldn't find reports of people running into memory problems even with large row counts.

Edit: I also tested on a 16GB virtual machine; it ran out of memory after some 30 million rows. The memory efficiency makes this tool pretty much unusable for working with any medium-sized files.

Well, this tool pretty much reads the files line by line, converts each line according to the config, and then feeds it immediately into the writer library.
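For context, here is roughly what that pattern looks like with the maxmind mmdbwriter library (a minimal sketch, not the actual mmdbmeld code; the file name, field indices, and options are placeholders). The point is that the writer keeps the entire search tree in memory and only serializes it when WriteTo is called, so every inserted row stays resident until the very end:

// Sketch only: "read a CSV row, convert it, insert it immediately".
// Assumes github.com/maxmind/mmdbwriter; not the real mmdbmeld code.
package main

import (
	"encoding/csv"
	"io"
	"log"
	"net"
	"os"

	"github.com/maxmind/mmdbwriter"
	"github.com/maxmind/mmdbwriter/mmdbtype"
)

func main() {
	tree, err := mmdbwriter.New(mmdbwriter.Options{
		DatabaseType: "merged",
		IPVersion:    6, // an IPv6 mmdb can also hold IPv4
		RecordSize:   32,
	})
	if err != nil {
		log.Fatal(err)
	}

	f, err := os.Open("company.csv") // placeholder input
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	r := csv.NewReader(f)
	for {
		row, err := r.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		// Convert one "from,to,..." row and insert it right away.
		// Nothing is written to disk here: the tree grows in memory
		// with every inserted range.
		err = tree.InsertRange(
			net.ParseIP(row[0]), net.ParseIP(row[1]),
			mmdbtype.Map{
				"company": mmdbtype.Map{"name": mmdbtype.String(row[3])},
			},
		)
		if err != nil {
			log.Fatal(err)
		}
	}

	// Only now is the whole in-memory tree serialized to disk.
	out, err := os.Create("database.mmdb")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()
	if _, err := tree.WriteTo(out); err != nil {
		log.Fatal(err)
	}
}

So the CSVs themselves are never buffered; the footprint comes from the tree that the writer has to hold in memory until the final write.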

I would definitely expect it to need a couple of times the memory of the input size, though I would have guessed that 16GB should have been enough for 2GB of input.

Can you split the data into multiple databases? E.g. one for IPv4, one for IPv6.

Maybe there is an issue with the data conversion. Can you share your config?

Can you split the data into multiple databases? E.g. one for IPv4, one for IPv6.

Isn't the entire point of the library to be able to merge databases? Why would you want to split data that you want to query in a single lookup into multiple different files?

Maybe there is an issue with the data conversion. Can you share your config?

That could be the case, but I'm not sure if that would account for what looks like a memory leak.

---
databases:
- name: database
  mmdb:
    ipVersion: 6
    recordSize: 34
  types:
    from: string
    to: string
    net: string
    asn.asn: string
    asn.domain: string
    asn.name: string
    asn.type: string
    asn.country: string
    country.name: string
    country.iso_3166_2: string
    continent.name: string
    continent.code: string
    city.name: string
    region.name: string
    latitude: float32
    longitude: float32
    timezone: string
    company.name: string
    company.domain: string
    company.type: string
    company.asn: string
    company.as_name: string
    company.as_domain: string
    company.as_type: string
    is_proxy: bool
    is_hosting: bool
    is_robot: bool
    is_tor: bool
    is_vpn: bool
  inputs:
  - file: company.csv
    fields:
    - from
    - to
    - '-'
    - company.name
    - company.domain
    - company.type
    - company.asn
    - company.as_name
    - company.as_domain
    - company.as_type
    - '-'
  - file: asn.csv
    fields:
    - from
    - to
    - '-'
    - asn.asn
    - asn.domain
    - asn.name
    - asn.type
    - '-'
  - file: location.csv
    fields:
    - from
    - to
    - '-'
    - city.name
    - region.name
    - country.iso_3166_2
    - latitude
    - longitude
    - '-'
    - timezone
  - file: privacy.transformed.csv
    fields:
    - from
    - to
    - is_proxy
    - is_hosting
    - is_robot
    - is_tor
    - is_vpn
  output: database.mmdb
  optimize:
    floatDecimals: 4
    forceIPVersion: false
    maxPrefix: 0
...

I would definitely expect it to need a couple of times the memory of the input size, though I would have guessed that 16GB should have been enough for 2GB of input.

It was way worse than even that. I booted up a machine with 200GB of RAM just to see what would happen. To produce an output file of around 2.5GB (from original data of ~2GB), it ate up 77GB of RAM. The total row count was somewhere around ~165 million.

Could this have something to do with using IPv6 to store IPv4 addresses?

Based on this line in the example config, I thought that IPv6 should support both versions:

ipVersion: 6 # Note: IPv6 mmdb can also hold IPv4.

Edit: I also just noticed that I defined types for from, to, and net. I guess you aren't supposed to do that, since they weren't included in the example config's types?

Okay, 165 million rows is quite a size. You can try taking your defined fields, putting them in a Go struct, getting its size, and multiplying that by 165M. That would be the absolute minimum, so add at least 32B (just 4 pointers) per entry for pointers and database-building metadata. This is quite some data.
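For reference, a quick sketch of that calculation (the struct below just mirrors the typed fields from your config; its layout is my assumption, not how the writer actually stores values, and the string contents come on top of the headers):

// Rough lower-bound estimate: the size of one row's fields as a plain Go
// struct, multiplied by the row count. Each string header alone is 16 bytes
// on 64-bit; the actual string data is extra.
package main

import (
	"fmt"
	"unsafe"
)

type record struct {
	ASN, ASNDomain, ASNName, ASNType, ASNCountry string
	CountryName, CountryISO                      string
	ContinentName, ContinentCode                 string
	CityName, RegionName, Timezone               string
	CompanyName, CompanyDomain, CompanyType      string
	CompanyASN, CompanyASName, CompanyASDomain   string
	CompanyASType                                string
	Latitude, Longitude                          float32
	IsProxy, IsHosting, IsRobot, IsTor, IsVPN    bool
}

func main() {
	const rows = 165_000_000
	per := unsafe.Sizeof(record{}) // 19 string headers + 2 float32 + 5 bool, padded
	fmt.Printf("per-row struct size: %d B\n", per)
	fmt.Printf("lower bound for %d rows: %.1f GiB (plus string contents, pointers and tree metadata)\n",
		rows, float64(per)*float64(rows)/(1<<30))
}

With this assumed layout that is roughly 320 B per row, i.e. around 50 GiB before any string contents or per-entry overhead, which is at least in the same ballpark as the 77GB you saw.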

Yes, you should not define "from", "to" and "net" in the types.
But it should not make a difference.
I also just noticed that the .csv parser does not support "net" ranges, but you aren't using that anyway.

If you split up the databases into IPv4 and IPv6, you can save space and make the internal pointers smaller. This has quite some impact on database size. I don't know if that impacts memory usage during building, though.

I also noticed you are using a recordSize of 34, which is not a supported size, to my knowledge.

If you split up the databases into IPv4 and IPv6, you can save space and make the internal pointers smaller. This has quite some impact on database size. I don't know if that impacts memory usage during building, though.

I don't really care too much about the DB file size, to be honest; only the memory usage has been unreasonable.

I also noticed you are using a recordSize of 34, which is not a supported size, to my knowledge.

My bad, I had an earlier copy of the config at hand; I knew I had increased the size to the largest supported one after getting a writer error with a smaller size. The real size was 32. I don't think it would write anything in the first place if you were using an unsupported record size.

Okay, good. I also think it would fail with a wrong size.

I am also surprised by the memory usage, but I fear there is nothing I can do about this. You could try and see what the performance impact is with a huge swap file on an SSD; you could enable it just for the build.
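If you want to poke at the GC question from the original post: the main generic knobs I know of are Go's GOGC and GOMEMLIMIT environment variables, which apply to any Go binary, including mmdbmeld (or their runtime/debug equivalents, which would need a patch to the tool). A minimal sketch with example values; these trade CPU for memory and cannot shrink data the writer actually keeps alive:

// Sketch only: standard Go runtime knobs, equivalent to running the tool
// with GOGC=25 GOMEMLIMIT=5GiB in the environment. The values are examples.
package main

import "runtime/debug"

func main() {
	// Collect once the heap grows 25% past the live set,
	// instead of the default 100%.
	debug.SetGCPercent(25)

	// Soft memory limit (Go 1.19+), here 5 GiB.
	debug.SetMemoryLimit(5 << 30)

	// ... build the database as usual ...
}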

Especially if you plan to update this file regularly, maybe a live database (e.g. Postgres) is a better idea in this case.