Table of Contents
Dependencies
scala
is used for this project.
- Compiling and running
sbt-1.x
is used for compiling and managing the library dependencies.java-1.8
is used for running the jar file.
- Library dependencies (
sbt
will handle the libraries. No need to download separately)scalatest
is used for unit testing.slf4j
andlog4j
backend are used for logging.
Run instructions
The script run.sh
will first try to compile the project and make an assembly jar file with dependencies using sbt
, and then run the jar file using java
. In case the compiling is not successful, it will fall back to use the jar file that was pre-compiled.
Approach
The input are read and processed line by line. Two open addressing (closed hashing) hash tables are maintained during the processing. One (DonorRepository
) stores the earliest year of donation for each donor. The other (Bank
) stores the donations (Account
) from repeated donors for each recipient in each zipcode and year. Importantly, the donations of one Account
are stored in a black-red tree with an augmented count field which stores the size of the subtree. With this balanced binary search tree, it takes
-
DonationAnalytics
is the entry point of the program. It reads and process transactions line by line.If a transaction is valid, it will be used to update theDonorRepository
. If the donor has donated before, the transaction will be used to update theBank
. Output are available in real time (well, there is still some delay caused by file IO bufferring). -
Transaction
is the class to hold the information we need from an input line. The constructor takes an input line, splits it with '|', and uses regular expressions to extract the necessary information. Invalid input lines will be ignored. Valid transactions will be used to update theDonorRepository
. If the donor is considered as repeated, the transaction will be used to update theBank
. -
DonorRepository
is the class to hold the earliest donation year for each donor. It has anupdate
method, which takes an transaction and update the earliest donation year if possible. This method will return true if the donor has donated in any prior year, and false otherwise. -
Bank
is the class to hold the accounts for each recipient in each area (zipcode) and year. The key of the hash table iss"$recipient|$zipcode|$year"
, the value is anAccount
which consists of a total amount and anOrderStatisticTree
. Theupdate
method takes an transaction and update the account for the key by adjusting the total amount and inserting the new amount to theOrderStatisticTree
. Thelookup
method takes the information for the key and a percent number (between 0 and 1). The percent will be convert to$k=size * percent$ and then the $k_{th}$smallest element in the tree will be returned. -
OrderStatisticTree
is the class to hold the donations for eachAccount
in theBank
. It is a red-black tree augmented with a count field that stores the size of the subtree, so that the time of the query for the$k_{th}$ element becomes$O(log(n))$ .
Performance and scalability
Space complexity
Transactions are processed line by line, so they only take a very small constant space in memory. It is the two hash tables that will take large amount of space, especially when the input file size is large.
-
Under reasonable assumputions
We have
$n$ input lines and most of the input are valid. The lines come earlier have an earlier transaction date. Everyone donate at most once a year. Under these assumptions, a record is either a new donor or a repeated donor. Note that every new donor will result in a new entry in theDonorRepository
hash, and every repeated donor will result in a new entry in aOrderStatisticTree
of theBank
hash. The total space complexity is$O(n)$ . -
Violation of the assumptions
If some of the assumptions are violated, it will only result in less space usage. If some input line are not valid transactions, they will be ignored. If a late transaction came earlier than the first transaction, we wouldn't know it should be repeated donation, so it will be absent in the
Bank
. If a donor donate multiple times in a year, only one record will be stored in theDonorRepository
.
In summary, the space complexity is
Time complexity
Because of using hash tables, I will focus on the average time complexity. The average time required to update the DonorRepository
is Bank
is
Testing with FEC datasets
All the individual donation record files from 1979 to 2018 are downloaded and combined chronologically. There are over 51,000,000 records in the 8.1 GB combined file. The testing is performed on a Mac (CPU Intel i5, 8G memory, fusion disk), with java maximum heap size set to 6g. It took about 5.5 minutes to process the whole dataset (~155k records/s, ~24.5 MB/s). There are about 12.0 million donors, 5.1 million accounts and 18.5 million repeated donations. Below is a histogram of running time (one +
for one second) for every 2 million records. It shows that the running time scales very well even at later stage when the trees became larger. The two peaks are likely due to garbege collection.
1- 2000000: +++++++++++
2000001- 4000000: +++++++++++++
4000001- 6000000: +++++++++++
6000001- 8000000: +++++++++++++
8000001-10000000: ++++++++++++
10000001-12000000: +++++++++++
12000001-14000000: +++++++++++++++
14000001-16000000: +++++++++++++
16000001-18000000: ++++++++++++
18000001-20000000: +++++++++++++++
20000001-22000000: ++++++++++++++++++++++++++++++
22000001-24000000: +++++++++++++
24000001-26000000: ++++++++++
26000001-28000000: ++++++++++
28000001-30000000: ++++++++++++++++
30000001-32000000: ++++++++++
32000001-34000000: ++++++++++
34000001-36000000: ++++++++++
36000001-38000000: +++++++++++
38000001-40000000: +++++++++++
40000001-42000000: ++++++++++
42000001-44000000: +++++++++++++++++++++
44000001-46000000: +++++++++++
46000001-48000000: ++++++++++++
48000001-50000000: ++++++++++++