mattpodolak/pmaw

slower than psaw for me

gobbedy opened this issue · 19 comments

Hello,

Thanks for v0.1.0!

I just tried it, and for me it was slower than psaw. Below is a standalone script that compares the completion times of both. Result: pmaw took 5m16s and psaw took 2m42s.

Am I doing something wrong?

from psaw import PushshiftAPI as psawAPI
from pmaw import PushshiftAPI as pmawAPI
import os
import time


pmaw_api = pmawAPI(num_workers=os.cpu_count()*5)
psaw_api = psawAPI()


start = time.time()
# pmaw returns the submissions directly
test = pmaw_api.search_submissions(after=1612372114,
                              before=1612501714,
                              subreddit='wallstreetbets',
                              filter=['title', 'link_flair_text', 'selftext', 'score'])
end = time.time()
print("pmaw took " + str(end - start) + " seconds.")

start = time.time()
# psaw returns a generator, so exhaust it to time the full retrieval
test_gen = psaw_api.search_submissions(after=1612372114,
                              before=1612501714,
                              subreddit='wallstreetbets',
                              filter=['title', 'link_flair_text', 'selftext', 'score'])
test = list(test_gen)
end = time.time()
print("psaw took " + str(end - start) + " seconds.")

I'm running your code right now and will post an update shortly. My first guess is that with too many threads there is a degradation in performance due to contention.

On a side note, if you're using v0.1.0, can you update to v0.1.1? I released a fix today for an error in the time slicing; however, this only affects data integrity, not performance.

@gobbedy I have an update: on my computer your code creates 40 workers, so we initialize your search with 40 time slices for the specified period. A lot of these time slices do not contain any submissions, which unnecessarily extends the completion time.

I would recommend using fewer threads, or increasing the time window. Let me know if you have improved results and I'll close this ticket.
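For example, a minimal sketch of initializing with fewer workers (10 is just an illustrative value):

from pmaw import PushshiftAPI

# fewer workers means fewer, larger initial time slices for the same window
pmaw_api = PushshiftAPI(num_workers=10)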

Indeed, on my computer my code yields 60 workers. I just ran it (pmaw only) with a few other num_workers settings:
20: 4m22
10: 3m44
5: 3m47
4: 4m08
3: 2m33
2: 2m23

So with 2 or 3 workers it's faster than psaw, which is nice!

Increasing the time window wouldn't work for my application (I need to get information for specific time windows) -- though you bring up an insightful point: I believe there are zero results for some time windows right now because Pushshift is semi-down while switching its servers over to AWS. This may make pmaw more efficient (relative to psaw) once Pushshift's database gaps are filled in.

Nonetheless, I notice in your code that, using metadata, you're able to find out exactly how many results are available for a given set of parameters. Using this metadata approach, would it be possible to slice the requests so as to minimize their number? E.g. suppose the total number of available results is precisely 1500 and max_results_per_request is 100; it should (at least in theory) be possible to slice it into exactly 15 requests. Is that impossible in practice?
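To spell out the arithmetic I have in mind (a hypothetical sketch; total_results stands in for whatever the metadata query reports):

import math

total_results = 1500           # example value from the metadata query
max_results_per_request = 100
min_requests = math.ceil(total_results / max_results_per_request)
print(min_requests)            # 15 requests, at least in theory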

And lastly, I notice that max_results_per_request is 100, while in the psaw code it seems to be 1000. Is it just faster at 100?

Thanks!

Unfortunately, since we don't know the distribution of posts or comments in a time period, we can't confidently slice into an exact number of requests. If we assume a uniform distribution and slice based on this, some slices may have more than 100 posts/comments while others may still have 0.
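A toy illustration of the problem (synthetic, clustered timestamps, not real Pushshift data):

import random

random.seed(0)
# 1500 synthetic post timestamps clustered near the start of a unit-length window
timestamps = [random.triangular(0, 1, 0.1) for _ in range(1500)]
counts = [0] * 15
for t in timestamps:
    counts[min(int(t * 15), 14)] += 1
# uniform slicing assumes 100 per slice; the actual counts are very uneven
print(counts)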

For max_results_per_request, I changed this to 100 as, from my testing, it appears that Pushshift has changed the maximum results for a query from 1000 to 100.

I'll think about this some more and will add to this ticket if I come up with any potential solutions for your use-case.

Can you share the checkpoint results which are printed when you run PMAW? I ran your code and this is how it performed for the time window specified (returning around 20,000 submissions):

PMAW (10 threads) - 463s
PSAW - 547s

Note, I'm currently pulling a massive dataset from Pushshift in the background, which has caused more requests to fail. Experimenting with 20+ threads appears to result in a much higher number of rejected requests due to rate-limiting by Pushshift.

I just ran it with 60 workers for the sake of this reply, let me know if you want a different num_workers. Note also that I'm still at v0.1.0 for now. Below is the output I got:

Setting limit to 22746
Checkpoint:: Success Rate: 98.67% - Requests: 300 - Batches: 5
Total:: Success Rate: 98.50% - Requests: 334 - Batches: 6
pmaw took 343.424528837204 seconds.
psaw took 300.7546238899231 seconds.

To be honest, I'm impressed that the success rate was 98.67% for that many concurrent calls. Perhaps you should try increasing the rate_limit to 100+ (if you're using the default rate-averaging); otherwise, I'd recommend reducing the base_backoff.

A second suggestion would be to set the batch_size to a multiple of 60. Right now there are a lot of idle workers waiting for the batch to finish; with only 10 workers this can be insignificant, but with such a high success rate at 60 threads we are missing out on a potential time reduction.

I would love to see if it becomes performant at a rate_limit of 120 and a batch_size of 180.
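Something like this, if you want to try it (the parameter names are the ones discussed above; the values are just my suggestion, not defaults):

from pmaw import PushshiftAPI

pmaw_api = PushshiftAPI(num_workers=60, rate_limit=120, batch_size=180)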

In the meantime I was poring over your code and I have a possible idea for a performance improvement.

My understanding of the code is not complete, so forgive me if any of the following is wrong.

I notice that after each request, you split each time slice in two. So if you start with, say, 4 workers, after each "generation" of tasks you end up with 4 additional slices (after 5 generations you'd have 4 + 4*5 = 24 time slices).

The more slices you have, the more uneven the number of results per slice, I'd imagine, and the greater the chance of wasted work.

Could you simply not split the time slice after each iteration?

i.e., replace this code

                        # the number of time slices depends on the remaining results
                        if remaining > self.max_results_per_request:
                            num = 2
                        elif remaining > 0:
                            num = 1
                        else:
                            num = 0

with the following?

                        # the number of time slices depends on the remaining results
                        if remaining > self.max_results_per_request:
                            num = 1
                        else:
                            num = 0

Also, this is a nitpick, but since the data is ordered (I think?), could you skip the above for loop and just do this?

before = float(data[0]['created_utc'])

(or data[-1], not sure)

No idea if I'm way out in left field or not.

PS. I started writing this before your last comment; will read it now.

I just tried my own suggestion and it took 288s (still 60 workers). Not sure if the improvement is just a coincidence; I sometimes noticed large swings in completion time for no apparent reason.

I'm not coding my application only for myself, so I'm not sure how dependable the high success rate is. I tried setting rate_limit to 120 and batch_size to 180 (after undoing my change), and it took 508s, with the following output:

Total:: Success Rate: 100.00% - Requests: 1 - Batches: 1
24440 total results available for the selected parameters
Setting limit to 24440
Total:: Success Rate: 37.79% - Requests: 1024 - Batches: 17
Total:: Success Rate: 37.82% - Requests: 1026 - Batches: 18
pmaw took 508.28102231025696 seconds.

PS. Have now updated to v0.1.1

Thanks for the suggestions!

With regards to the time slices, the purpose of this is two-fold. First, we try to minimize the number of potential idle threads. If we have 4 workers, our first generation will have 4 time slices, and we search each time slice to find whether there are results available. Since some time slices may return all the available items for that slice, the number of slices in the queue will decrease and the next generation will have an idle worker. Essentially, time slicing with a multiple > 1 allows work to be shared amongst threads without any explicit communication.

With time_slice = 2, assuming 1 slice in each generation returns <= 100:
Gen 0 - 4 active workers - 0 in queue
-- add 6 slices to queue
Gen 1 - 4 active workers - 2 in queue
-- add 6 slices to queue
Gen 2 - 4 active workers - 4 in queue

With time_slice = 1, assuming 1 slice in each generation returns <= 100:
Gen 0 - 4 active workers - 0 in queue
-- add 3 slices to queue
Gen 1 - 3 active workers - 0 in queue
-- add 2 slices to queue
Gen 2 - 2 active workers - 0 in queue

Now you are probably wondering: why don't we just create more initial time slices to address this? The current implementation approaches this as a breadth-first search problem: since we don't know which time slices will contain items, we start big and narrow down through iterative slicing with a multiple of 2.
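Here's a toy model of the two schedules above (an illustration only, not pmaw's actual scheduler): each generation, exactly one active slice returns all of its results, and every other active slice splits into split_factor child slices.

def simulate(split_factor, num_workers=4, generations=3):
    active, queue = num_workers, 0
    for gen in range(generations):
        print(f"Gen {gen} - {active} active workers - {queue} in queue")
        # one slice finishes outright; the rest each spawn split_factor children
        waiting = queue + (active - 1) * split_factor
        active = min(num_workers, waiting)
        queue = waiting - active

simulate(split_factor=2)  # queue grows: 0, 2, 4 -- no idle workers
simulate(split_factor=1)  # queue stays empty; active workers drop: 4, 3, 2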

I think I can replace the for loop with that code, thanks. I'll have to add some additional code to ensure that the results are always sorted by created_utc in descending order, and do some tests, as I'm not sure how reliably Pushshift sorts the results.
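A minimal sketch of that guard (assuming data is the list of items returned for a request):

# sort defensively, in case Pushshift's ordering isn't reliable
data = sorted(data, key=lambda item: item['created_utc'], reverse=True)
before = float(data[-1]['created_utc'])  # oldest item bounds the next request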

Just tried my change again and it took 289s. Seems likely it's an improvement.

Will have a look at your comment tomorrow! I'm taking the rest of the evening off.

I realized I made a change in v0.1.1 which could be increasing the amount of wasted work:

  if remaining > self.max_results_per_request:
      num = 2

The condition should be remaining > self.max_results_per_request*2, which will reduce the number of requests that return fewer than 100 items. I'll release a patch to fix this.
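I.e., the patched check would look like this:

  if remaining > self.max_results_per_request * 2:
      num = 2
  elif remaining > 0:
      num = 1
  else:
      num = 0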

Since some time slices may return all the available items for that slice the number of slices in the queue will decrease and the next generation will have an idle worker.

In that case, perhaps you could check whether there are idle threads, and only set num = 2 if there are? Or, if you want to get fancy, you could even check whether any slices have remaining < max_results_per_request, such that there will be an idle thread in the next generation?

Anyway, I'll get out of your hair lol. I feel like I'm trying to micro-manage at this point.

Regarding the for loop, it may give you some confidence to know that psaw effectively does it that way. Technically they have a for loop, but they just set the new "before" to the last created_utc seen in the loop.
See the line

self.payload['before'] = thing.created_utc

here: https://github.com/dmarx/psaw/blob/master/psaw/PushshiftAPI.py

One last idea, which I don't think would yield a practical improvement but should at least be an improvement in theory: using requests.Session().

According to the doc (https://requests.readthedocs.io/en/master/user/advanced/):

So if you’re making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase

The reason I don't think it's a practical improvement is that the bottleneck is Pushshift's rate limiting, so fancy low-level HTTP optimization shouldn't yield much speedup, if any.
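For reference, a minimal sketch of what connection reuse looks like (the endpoint is Pushshift's public search URL; the windows are just example values):

import requests

session = requests.Session()  # reuses the underlying TCP connection per host
windows = [(1612372114, 1612436914), (1612436914, 1612501714)]  # example slices
for after, before in windows:
    resp = session.get("https://api.pushshift.io/reddit/search/submission",
                       params={"subreddit": "wallstreetbets",
                               "after": after, "before": before})
    resp.raise_for_status()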

check if any slices have remaining < max_results_per_request, such that there will be an idle thread in the next generation

Great idea, will use this to track future idle threads and slice accordingly.

I might do some benchmarking with requests.Session() and see if there is an improvement. Right now I'm also trying to improve the memory management, since I think there is a risk of a memory error when pulling millions of comments; I'm open to any suggestions regarding this.

I'll update this ticket and close it once I've tested the new slicing logic.

Closing this ticket: as of v0.1.2, pmaw with default settings completes 12.7% faster than psaw for the provided parameters.

pmaw_api = pmawAPI()
test = pmaw_api.search_submissions(after=1612372114,
                              before=1612501714,
                              subreddit='wallstreetbets',
                              filter=['title', 'link_flair_text', 'selftext', 'score'])

With regards to the slicing logic, we can't determine whether there will be an idle thread until after all requests in a batch have finished, as there will only be an idle worker if the req_list is shorter than the batch_size. The req_list grows as requests fail or are completed, so this would require moving the time slicing to run after all requests have been completed. A disadvantage of moving the time slicing code is that we would lose the performance gains of running it asynchronously as requests complete.

For search windows with the potential for a lot of gaps (relatively small windows or a high number of workers), performance can be maintained by initializing the PushshiftAPI with a larger batch_size (a multiple of num_workers), which will mitigate the risk of idle threads. It's also worth noting that there are diminishing returns w.r.t. performance, and eventually a negative impact, when increasing num_workers beyond 10. In the future this could be mitigated by rotating / randomly selected proxies from a pool, but I'm probably not going to add this, as it could negatively impact the Pushshift server.
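For example (illustrative values; the point is batch_size being a multiple of num_workers):

from pmaw import PushshiftAPI

pmaw_api = PushshiftAPI(num_workers=10, batch_size=30)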

If you have any problems with v0.1.2 or any feedback/suggestions please open a new ticket.

@mattpodolak awesome. Btw I'd like to share something with you by e-mail if possible?

Yeah sure, send me an email at mpodola2@gmail.com.

Hi! Thank you for this great work. I'm trying to use pmaw to collect submissions and comments on Reddit. However, there seem to be some issues with search_submissions (search_comments works fine). I ran the exact same lines of code that you provided as an example but didn't get any posts. Any idea why this would happen?