ntamas/plfit

Numerosity-based input?

vigna opened this issue · 3 comments

vigna commented

It is very common, when analysing distributions from, say, a web graph, to have the distribution in the form
1 143245
2 395599
etc., that is, for each possible value, the associated numerosity (count). The way we're using plfit now is simply to generate a sample by printing "1" 143245 times, "2" 395599 times, and so on, but this is slow, memory-hungry and definitely suboptimal. How difficult would it be to have plfit use a sample specified as above directly?

ntamas commented

Well, it's not impossible, but it definitely needs a bit of work. Right now all the functions in plfit.c work with an array of doubles. (This is because the primary use case for plfit so far has been fitting power laws to degree distributions in igraph, where you already have a vector containing the degree of each node in the graph.) Switching to the proposed representation (which is basically a run-length encoded version of the sorted sample vector) would necessitate rewriting at least plfit_i_logsum_discrete() and plfit_i_logsum_less_than_discrete(), along with all the functions that call these two, and the functions related to resampling (for estimating the p-value). Unfortunately I don't think I'll have time for this in the near future, but I can help with code review if someone is willing to send a PR.
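To illustrate the idea (this is a hypothetical Python sketch, not plfit's actual C code): if a log-sum over the sample is needed, a run-length encoded representation can weight each distinct value by its count instead of iterating over every observation. The function name `logsum_rle` and the assumption that the log-sum ranges over values >= xmin are illustrative only.

```python
from math import log

def logsum_rle(pairs, xmin):
    """Sum of log(x) over the sample, where the sample is given as
    (value, count) pairs instead of one entry per observation.
    Only values >= xmin contribute, mirroring a tail fit."""
    return sum(count * log(value) for value, count in pairs if value >= xmin)

# Equivalent to expanding the run-length encoding first:
pairs = [(1, 3), (2, 2), (5, 1)]
expanded = [v for v, c in pairs for _ in range(c)]
assert abs(logsum_rle(pairs, 1) - sum(log(x) for x in expanded)) < 1e-12
```

The point is that the work becomes proportional to the number of *distinct* values rather than the sample size, which is exactly what makes the run-length encoded representation attractive for billions of observations.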

Alternatively, if you only care about the space consumed on disk and don't mind the run-length encoded representation being "expanded" into a full vector in memory, then it's enough to rewrite the first part of process_file in main.c that reads the input file, or to write a small Python script that reads the run-length encoded representation from a file, prints the "expanded" representation to stdout, and pipes this into the stdin of plfit.
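The script approach mentioned above could look something like this minimal sketch. It assumes the "value count" per-line layout from the example at the top of the issue; the file name is whatever you pass as the first argument. Note that the script itself streams, so it avoids materializing the expanded sample on disk or in the script's memory, but plfit will still hold the full expanded vector in memory.

```python
#!/usr/bin/env python3
"""Expand a run-length encoded sample ("value count" per line)
to one value per line on stdout, for piping into plfit."""
import sys

def expand(lines):
    """Yield each value as many times as its count says."""
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        value, count = fields[0], int(fields[1])
        for _ in range(count):
            yield value

if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as fh:
        sys.stdout.writelines(v + "\n" for v in expand(fh))
```

Usage would be along the lines of piping the script's output into plfit's stdin, as described above.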

vigna commented

Yes, the latter is what we are doing now. Only, on a graph with 12 billion nodes, so it's a bit slow and memory consuming 😂.

ntamas commented

Yeah, I can imagine. Sorry about that :( Unfortunately I'm not doing research any more, so it's hard to find spare time for these side projects; that's why I can't promise that I'll implement this in the near future. It would definitely be a nice addition.