tensorflow/neural-structured-learning

Out of Memory issue when building large graph

victorconan opened this issue · 6 comments

I have 800k instances with 200-dimensional embeddings. I am trying to build the graph using nsl.tools.build_graph with a similarity threshold of 0.95. My driver type is r4.4xlarge. I keep getting an OOM error. Does anyone know how to estimate how much memory I need?

Hi, @victorconan, thanks for your interest and bug report!

The memory required by the graph builder is a function not only of the input data, but also the resulting graph. A couple of questions for you:

  • How big is the TFRecord file from which you're reading your examples?
  • When you call build_graph, what values (if any) are you supplying for the lsh_splits and lsh_rounds flags?
  • Do you see any output written to your terminal? The program writes an INFO line every 1 million edges it creates.
  • Are you able to run the Unix top program in another shell window while running the graph builder to determine the program's virtual and real memory usage?

Note that build_graph exists for backward compatibility and has been deprecated. Please switch to using build_graph_from_config in the same package instead.
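For reference, here is a minimal sketch of what the switch might look like. The file paths and config values are placeholders, and the GraphBuilderConfig field names are taken from the build_graph_from_config docs, so please double-check them against your installed NSL version:

```python
import neural_structured_learning as nsl

# Placeholder input/output paths; embedding_files accepts a list of TFRecord files.
embedding_files = ['/path/to/embeddings-00000.tfr', '/path/to/embeddings-00001.tfr']

graph_config = nsl.configs.GraphBuilderConfig(
    id_feature_name='id',                # feature holding each example's node ID
    embedding_feature_name='embedding',  # feature holding the embedding vector
    similarity_threshold=0.95,
    lsh_splits=10,
    lsh_rounds=2)

# Writes the resulting edges to a TSV file of (source, target, weight) lines.
nsl.tools.build_graph_from_config(embedding_files, '/path/to/graph.tsv', graph_config)
```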

Thanks!

  • How big is the TFRecord file from which you're reading your examples?
    The TFRecord files are only 714MB
  • When you call build_graph, what values (if any) are you supplying for the lsh_splits and lsh_rounds flags?
    I used lsh_splits = 32 and lsh_rounds = 20. I am a little confused by the statement in the documentation that "We have found that a good rule of thumb is to set lsh_splits >= ceiling(log_2(num_instances / 1000)), so the expected LSH bucket size will be at most 1000." That seems to suggest the maximum lsh_splits should be 10?
  • Do you see any output written to your terminal? The program writes an INFO line every 1 million edges it creates.
    I am using Databricks, so I only saw this in the Log4j output:
Uptime(secs): 31200.0 total, 600.0 interval
Cumulative writes: 75K writes, 75K keys, 75K commit groups, 1.0 writes per commit group, ingest: 0.00 GB, 0.00 MB/s
Cumulative WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 GB, 0.00 MB/s
Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent
Interval writes: 291 writes, 291 keys, 291 commit groups, 1.0 writes per commit group, ingest: 0.01 MB, 0.00 MB/s
Interval WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 MB, 0.00 MB/s
Interval stall: 00:00:0.000 H:M:S, 0.0 percent

** Compaction Stats [default] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sum      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0
 Int      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0
  • Are you able to run the Unix top program in another shell window while running the graph builder to determine the program's virtual and real memory usage?
    I can see from the Databricks Ganglia plot that:

| Memory | Min   | Avg    | Max    |
|--------|-------|--------|--------|
| Used   | 36.4G | 87.3G  | 140.5G |
| Total  | 341G  | 805.8G | 1.2T   |

Note that build_graph exists for backward compatibility and has been deprecated. Please switch to using build_graph_from_config in the same package instead.
Thanks, I will switch to build_graph_from_config.

Thanks!

Hi, @victorconan.

Unfortunately, I'm unfamiliar with the runtime environment you're using, so I can't really offer much help. Our graph builder is currently limited to running on a single machine and must store all node features and the resulting graph edges in memory (at least when using the lsh_splits and lsh_rounds configuration parameters). We are considering providing a more scalable graph builder in the future, but we have not yet undertaken that effort. On my workstation at work, I've successfully run it on a set of 50K nodes, as described in the build_graph_from_config API docs. 800K is quite a bit larger than that.
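To give a rough sense of scale, here is a back-of-the-envelope sketch assuming float32 embeddings; actual usage will be higher because of Python object overhead, and the edge set is usually what dominates:

```python
# Back-of-the-envelope floor on memory for the node features alone.
num_nodes = 800_000
embedding_dim = 200
bytes_per_float = 4  # assuming float32 embeddings

feature_bytes = num_nodes * embedding_dim * bytes_per_float
print(f'{feature_bytes / 2**30:.2f} GiB just for the raw embedding values')  # ~0.60 GiB

# The edge set is the variable (and usually dominant) part: each kept edge adds a
# (source_id, target_id) entry, so memory grows linearly with the number of pairs
# whose similarity exceeds the threshold.
```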

I am a little confused by the statement in the documentation that "We have found that a good rule of thumb is to set lsh_splits >= ceiling(log_2(num_instances / 1000)), so the expected LSH bucket size will be at most 1000." That seems to suggest the maximum lsh_splits should be 10?

That formula places a lower bound on lsh_splits, not an upper bound. If your nodes tend to be grouped in clusters in the embedding space, you may need a much larger value than that lower bound (as you're currently doing).
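Concretely, for your 800K instances the rule of thumb works out to a lower bound of 10 (a quick check, nothing NSL-specific):

```python
import math

num_instances = 800_000
lower_bound = math.ceil(math.log2(num_instances / 1000))  # ceil(log2(800))
print(lower_bound)  # 10 -- so lsh_splits = 32 satisfies the rule; 10 is a floor, not a cap.
```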

There are a couple of things I can think of that you might experiment with:

  • Try an even larger value for the similarity threshold to reduce the number of edges in the resulting graph.
  • How long are the strings you're using for your node IDs? To represent the graph, we use a Python set() of 2-tuples, where each 2-tuple contains the source ID and target ID of an edge. So if you're using really long strings for the node IDs, that could consume a lot of memory, I suppose.
  • Try splitting your input file into multiple smaller files, and running the graph builder on progressively larger input sets to see where you first start hitting the OOM error. (Note that the embedding_files argument is a list of files.)
  • Shift all embeddings by their mean. By this I mean preprocessing your inputs as follows: (1) compute the mean embedding across all of your inputs, then (2) subtract this mean from every embedding. Since we use cosine similarity as the similarity function, this can help if all of your embeddings are bunched together relative to the origin, for example, if they're all located in the positive orthant (i.e., all embedding values are positive). A minimal sketch of this preprocessing is shown after this list.
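Here is that mean-centering step as a small NumPy sketch on an in-memory array. The helper name is just illustrative; you would apply this to your embeddings before writing them to the TFRecord files:

```python
import numpy as np

def center_embeddings(embeddings):
    """Subtracts the mean embedding so the vectors spread around the origin.

    This can increase the angular separation between nearby points, which matters
    because the graph builder compares embeddings with cosine similarity.
    """
    embeddings = np.asarray(embeddings, dtype=np.float32)
    mean = embeddings.mean(axis=0, keepdims=True)
    return embeddings - mean

# Example: vectors bunched in the positive orthant become much more spread out
# in angle once the mean is removed.
bunched = np.array([[1.0, 0.9], [0.9, 1.0], [1.0, 1.0]])
centered = center_embeddings(bunched)
```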

Please reply back on this bug, and I'll try to help further if I can. Thank you.

Hi @aheydon-google ,

Thanks for the reply!

  • I have tried using a similarity threshold of 0.99, and it has been running for a very long time (3 days so far). I will try increasing lsh_splits to 128 and reducing lsh_rounds to 1 and see how long that takes. I do notice that so far the TSV file is around 11 GB (previously, with threshold 0.95, it was about 157 GB).
  • My node IDs are 32-character strings.
  • Yes, my embedding files are already split into a bunch of TFRecords. I have 2988 files in total, and each of them is about 240 KB.
  • Okay, I will try subtracting the mean and see if it helps.

Thanks!

Thanks for the update! If you're using a threshold of 0.99 and the graph builder is running for 3 days, that's a problem. What that tells me is that at least one of your LSH buckets is quite large. That needs to be better understood.

One thing that might help is getting access to the log messages that the graph builder writes. I'm not sure why those aren't currently being written for you. Are you invoking build_graph as a program as described in the nsl.tools Overview? If not, I think it would be good if you could do that, since I believe it should enable INFO-level logging.
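If running it as a standalone program is awkward on Databricks, one thing you could try is raising the log verbosity in the notebook before calling the builder. This assumes the graph builder emits its progress messages through absl logging, which I haven't verified in that environment:

```python
from absl import logging

# Surface INFO-level progress messages (one per ~1M edges created) when calling
# the graph builder from a notebook instead of as a standalone program.
logging.set_verbosity(logging.INFO)

# ...then call nsl.tools.build_graph_from_config(...) as before.
```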

Please let us know how it goes. Thanks!

Closing this issue for now. Please feel free to re-open if you have further questions.