Avoiding blacklist
eiffel31 opened this issue · 3 comments
In ui.py, the download runs over 10 parallel connections at maximum speed.
with Pool(processes=10) as pool:
    for i, (component, extra) in enumerate(pool.imap_unordered(fetchLcscData, componentsToFetch)):
        lcsc = component['lcsc']
        print(f" {lcsc} fetched. {((i+1) / len(missing) * 100):.2f} %")
        component["extra"] = extra
        component["extraTimestamp"] = int(time.time())
        lib.addComponent(component)
This may have been seen as aggressive downloading and may have had a visible impact on their server performance.
I would suggest a much lighter download:
- sequential access
- a 200 ms delay between consecutive fetches
Sure, it will take much longer, but we are in no rush for a daily update.
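To make the suggestion concrete, here is a minimal sketch of a sequential, throttled loop. It is not taken from ui.py: `fetch_sequentially`, `delay_s` and the injectable `sleep` parameter are hypothetical names used only for illustration, and `fetch` stands in for any per-item function such as `fetchLcscData`.

```python
import time

def fetch_sequentially(items, fetch, delay_s=0.2, sleep=time.sleep):
    """Yield fetch(item) for each item in order, one at a time.

    Hypothetical helper: waits delay_s seconds between consecutive
    fetches (not before the first one). `sleep` is injectable so the
    pacing can be checked without actually waiting.
    """
    for i, item in enumerate(items):
        if i > 0:
            sleep(delay_s)  # pause between two fetches
        yield fetch(item)

if __name__ == "__main__":
    # Stand-in fetch function; the real one would perform the HTTP request.
    for result in fetch_sequentially(["C1", "C2", "C3"], lambda code: code.lower()):
        print(result)
```

Assuming `fetchLcscData` keeps returning a `(component, extra)` pair as in the snippet above, the loop body in ui.py could stay unchanged and only the `pool.imap_unordered(...)` call would be swapped for something like this generator.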
Note that the error message in the log changed from "Forbidden" to "Too many requests".
I think their first protection was to blacklist any address making heavy requests (=> Forbidden). Some days later they improved their protection by adding either a parallel connection limit (configured at 1?) or a requests-per-second limit (probably harder to implement) (=> Too many requests).
So gentle downloading may work fine.
These are just guesses...
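If the server really does apply a rate limit, a client could also slow down whenever it sees HTTP 429 (the status code behind "Too many requests"). The sketch below only illustrates that idea under the guesses above: `fetch_with_backoff` and its parameters are hypothetical, and `fetch` is assumed to be a callable returning a `(status_code, body)` pair rather than anything from the existing code.

```python
import time

def fetch_with_backoff(fetch, max_attempts=5, base_delay_s=1.0, sleep=time.sleep):
    """Call fetch() and retry with exponential backoff while it returns 429.

    Hypothetical helper: `fetch` must return (status_code, body).
    The wait doubles after every 429; any other status is returned
    immediately. After max_attempts the last response is returned.
    """
    delay = base_delay_s
    status, body = None, None
    for attempt in range(max_attempts):
        status, body = fetch()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            sleep(delay)  # back off before the next attempt
            delay *= 2
    return status, body
```

A 403 "Forbidden" response is deliberately not retried here, since retrying against a blacklist would only add load.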
- The difference between "Forbidden" and "Too many requests" only depends on whether we send a user-agent identifying as a web browser.
- Downloading sequentially changes nothing: the IP range is blocked.
- Scraping from another location works, but we risk getting blocked again.
I am in contact with JLC PCB and LCSC; however, it will take some time.