.each method always returns empty Enumerable for GeoIPCity.dat database
mtowers opened this issue · 10 comments
Using a recently downloaded GeoIPCity.dat legacy database, calling the each method always returns an Enumerable with zero elements. However, if you use the GeoLiteCity.dat database, a non-empty Enumerable is returned.
#!/usr/bin/env ruby
require 'geoip' # v 1.6.1
@maxmind = GeoIP.new("./GeoLiteCity.dat")
puts "Cities found in GeoLiteCity.dat: #{@maxmind.each.count}"
@maxmind = GeoIP.new("./GeoIPCity.dat")
puts "Cities found in GeoIPCity.dat: #{@maxmind.each.count}"
Output:
Cities found in GeoLiteCity.dat: 12098
Cities found in GeoIPCity.dat: 0
FWIW, the .each_by_ip method does return a populated Enumerable.
Sorry I haven't had any time to investigate yet. Have you looked in the code? There's not much there. PR welcome.
I'm still poking on it but I've not been able to isolate the exact problem.
I'm still not sure why, but the empty Enumerable is returned because the first call to read_city passes an invalid file offset to atomic_read. The offset being passed is greater than the length of the file. This condition manifests with a copy of GeoIPCity.dat downloaded from MaxMind on 8/3/16. It does not manifest with my copy of GeoLiteCity.dat, downloaded 8/9/16.
record = atomic_read(FULL_RECORD_LENGTH, offset+index_size)
Here's what I'm seeing in my debugger:
762: def read_city(offset, hostname = '', ip = '') #:nodoc:
763: require 'pry-byebug'
764: binding.pry
=> 765: return nil if offset == 0
766: record = atomic_read(FULL_RECORD_LENGTH, offset+index_size)
767: return unless (record && record.size == FULL_RECORD_LENGTH)
768:
769: # The country code is the first byte:
770: code = record[0]
[1] pry(#<GeoIP>)> offset
=> 10286227
[2] pry(#<GeoIP>)> index_size
=> 61717356
[3] pry(#<GeoIP>)> offset+index_size
=> 72003583
[4] pry(#<GeoIP>)> @file.size
=> 66865539
One more clue. Changing line 764 in read_city from:
record = atomic_read(FULL_RECORD_LENGTH, offset+index_size)
to:
record = atomic_read(FULL_RECORD_LENGTH, @record_length*2*offset)
will allow the code to traverse some number of city records, although some fields are not populated correctly.
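For what it's worth, a bounds check along these lines (purely hypothetical, not in the gem; check_record_offset is an illustrative name, and the 50-byte record length is an assumption standing in for FULL_RECORD_LENGTH) would make this failure loud instead of silently yielding nothing:

```ruby
# Hypothetical sanity check (not part of the gem): verify that the
# computed seek position lies inside the file before calling atomic_read.
def check_record_offset(file_size, offset, index_size, record_length)
  pos = offset + index_size
  if pos + record_length > file_size
    raise RangeError,
          "record at #{pos}..#{pos + record_length} is past EOF at #{file_size}"
  end
  pos
end

# Plugging in the values observed in the debugger session above:
begin
  check_record_offset(66_865_539, 10_286_227, 61_717_356, 50)
rescue RangeError => e
  puts e.message   # 72003583 is well past the 66865539-byte file
end
```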
@cjheath Without a spec of the binary format of the database, I'm not sure how to proceed with diagnosing this problem further. I contacted MaxMind and asked if a spec was available, and apparently there is not. They referred me to this project as a reference, but I've not been able to discover anything new from looking at it. https://github.com/appliedsec/pygeoip
Any ideas?
I originally wrote the geoip gem by following https://github.com/maxmind/geoip-api-c.
I assume they will have updated it to work with whatever they've done to the file format,
so that would be my first port of call. I won't be able to do it this week or next, however.
each_by_ip does a recursive descent of the index tree, which starts at the start of the file and ends at @database_segments[0] - see the index_length method. The index tree is a binary search tree with nodes containing two offset values, each of @record_length (3 or 4) bytes.
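As a rough sketch of that descent, using a synthetic in-memory index and illustrative names (the gem reads nodes from @file via atomic_read rather than from a string, and @database_segments[0] plays the role of num_nodes here):

```ruby
RECORD_LENGTH = 3  # legacy databases use 3- or 4-byte node offsets

# Pack one index node: two little-endian offsets of RECORD_LENGTH bytes each.
def pack_node(left, right)
  [left, right].map { |v| [v].pack('V')[0, RECORD_LENGTH] }.join
end

# Recursive descent: values below num_nodes point at further index
# nodes; values at or above it are leaves (pointers into the data section).
def leaves(index, node, num_nodes, acc = [])
  base = node * RECORD_LENGTH * 2
  2.times do |side|
    bytes = index[base + side * RECORD_LENGTH, RECORD_LENGTH]
    value = (bytes + "\x00").unpack('V').first  # zero-pad to 32 bits
    if value < num_nodes
      leaves(index, value, num_nodes, acc)
    else
      acc << value
    end
  end
  acc
end

# A three-node tree: root 0 -> nodes 1 and 2, each holding two leaves.
index = pack_node(1, 2) + pack_node(100, 101) + pack_node(102, 103)
p leaves(index, 0, 3)  # => [100, 101, 102, 103]
```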
each on the other hand walks through the city records which is presumed to start one byte directly after the index tree.
The end of the file has a number of 4-byte values ending with a 3-byte value. One of these four-byte values contains '\xFF\xFF\xFF' and a byte containing the database type. We search back from the end of the file to find this type number, which initialises the file format configuration variables. Note that this is not documented; it's just what I copied from the early C library code.
Given this, it seems most likely that the fix will come in detect_database_type!. See if you can tell what's going on from what I described here.
Possibly one or other of the four-byte configuration values in the tail of the file contains alternate instructions about how to locate the city records, and they've started making use of that?
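A minimal sketch of that tail scan, on synthetic data. This is not the gem's code: the real detect_database_type! also sets up @record_length and the segment sizes from what it finds, and the 20-position scan limit here is an assumption modelled on the C library's behaviour.

```ruby
# Scan backwards from EOF for the "\xFF\xFF\xFF" marker; the byte that
# follows it is the database type. If no marker is found, the database
# would be treated as the default (country) format.
MARKER = "\xFF\xFF\xFF".b
SCAN_LIMIT = 20  # how far back to look (assumed limit, per the C library)

def detect_type(data)
  (0...SCAN_LIMIT).each do |back|
    pos = data.size - back - 4   # room for the marker plus the type byte
    next if pos < 0
    return data.getbyte(pos + 3) if data[pos, 3] == MARKER
  end
  nil
end

# Synthetic tail: some data, the marker, type byte 2, then trailing bytes.
tail = ("x" * 10).b + MARKER + 2.chr + "\x0A\x00\x00".b
p detect_type(tail)  # => 2
```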
Finally circling back to this...
It also appears that the each method is not fully iterating over all city records in the GeoLiteCity.dat database (gzip). The each method prints progress every 1000 records. When I call it, the last output I see is 25000: 3625134, which implies fewer than 26,000 records were iterated over. However, the CSV city list (zip) contains 811,675 records.