Bottleneck at crawlera_proxy_pool_sup when shrinking pools.
obstacles
Fix session-based logic.
Improve the speed of the supervisor.
Evaluate effect.
Simulated environment.
Simulate crawlers with different behaviors (seen in prod).
Which tools to use to develop crawlers:
Use python-crawlers like in prod?
Use crawlera-tester (crawlers farm)?
Stable. (uses a few pools, each with a large enough and stable number of proxies)
Impulsed. (uses a few pools with long delays between periods of activity)
Broadcrawl. (produces a lot of one-off pools with ~61 proxies each)
Per Willian's suggestion: instead of creating crawlers, we can redirect part of the traffic from the "prod env" to this "sim env".
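The three crawler behaviors above (stable, impulsed, broadcrawl) could be sketched as workload generators for the simulated environment. This is only an illustration: `PoolRequest`, the function names, and every default except the ~61 proxies per broadcrawl pool are assumptions, not values taken from prod.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolRequest:
    """One simulated pool allocation (hypothetical structure)."""
    start: float     # seconds since the start of the simulation
    duration: float  # seconds the pool stays in use
    proxies: int     # number of proxies requested for the pool


def stable(rng, horizon, pools=3, proxies=200):
    # A few long-lived pools with a steady, large proxy count (sizes assumed).
    return [PoolRequest(0.0, horizon, proxies) for _ in range(pools)]


def impulsed(rng, horizon, pools=3, proxies=200, burst=60.0, gap=600.0):
    # A few pools that are active in short bursts separated by long idle gaps.
    out = []
    for _ in range(pools):
        t = rng.uniform(0.0, gap)  # desynchronise the pools
        while t < horizon:
            out.append(PoolRequest(t, min(burst, horizon - t), proxies))
            t += burst + gap
    return out


def broadcrawl(rng, horizon, rate=1.0, proxies=61, lifetime=30.0):
    # Many one-off pools of ~61 proxies; arrivals modelled as a Poisson
    # process with `rate` pools per second (the arrival model is an assumption).
    out = []
    t = rng.expovariate(rate)
    while t < horizon:
        out.append(PoolRequest(t, lifetime, proxies))
        t += rng.expovariate(rate)
    return out


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    horizon = 3600.0
    workload = sorted(
        stable(rng, horizon) + impulsed(rng, horizon) + broadcrawl(rng, horizon),
        key=lambda r: r.start,
    )
    print(len(workload), "pool requests generated")
```

Mixing the three generators in different proportions gives a way to compare supervisor behavior before and after the shrink changes under the same seeded workload.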
Netlocs
Simulate netlocs. (mock_netloc? httpbin?)
Should mocked netlocs ban crawlers in a similar (statistical) fashion as in prod? Does ignoring bans distort 'diversity' much?
Don't simulate netlocs.
Use a separate set of proxies (distinct from those used in prod).
OUTSOURCE!!!
Production environment. (we prefer to avoid that)
Identify relevant metrics and expectations.
Allocated but unused proxies.
Expected to decrease.
Total number of pools.
Expected to decrease.
Diversity of proxies used in requests.
Expected to increase.
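A minimal sketch of how the three metrics above might be computed from a snapshot. The notes do not define "diversity"; here it is assumed to be the effective number of proxies (the exponential of the Shannon entropy of per-proxy request counts). The function name `pool_metrics` and the input shapes are hypothetical.

```python
from collections import Counter
from math import exp, log


def pool_metrics(allocated, requests):
    """Compute the three metrics from one snapshot.

    allocated: dict mapping pool id -> set of proxy ids allocated to that pool
    requests:  iterable of (pool id, proxy id) pairs, one per request served
    """
    requests = list(requests)

    # Which proxies each pool actually used.
    used = {}
    for pool, proxy in requests:
        used.setdefault(pool, set()).add(proxy)

    # Allocated but unused: held by a pool, never used by that pool.
    unused = sum(
        len(proxies - used.get(pool, set()))
        for pool, proxies in allocated.items()
    )

    # Diversity (assumed definition): exp(Shannon entropy) of request counts.
    # 1.0 if a single proxy serves everything, N if N proxies share evenly.
    counts = Counter(proxy for _, proxy in requests)
    total = sum(counts.values())
    if total:
        entropy = -sum(c / total * log(c / total) for c in counts.values())
        diversity = exp(entropy)
    else:
        diversity = 0.0  # no requests observed

    return {
        "unused_proxies": unused,
        "total_pools": len(allocated),
        "diversity": diversity,
    }


if __name__ == "__main__":
    allocated = {"a": {1, 2, 3}, "b": {4, 5}}
    requests = [("a", 1), ("a", 2), ("b", 4), ("b", 4)]
    print(pool_metrics(allocated, requests))
```

Comparing these numbers for the same workload with and without active shrink would show whether unused proxies and pool count go down while diversity goes up, as expected.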
CORE TEAM RESPONSIBILITY (Slav)
actual goals
Revive the active-shrink strategy.