
feat: configurable ProviderSearchDelay in bitswap #9053

Merged: 4 commits merged into ipfs:master on Oct 13, 2022

Conversation

@iand (Contributor) commented Jun 22, 2022

Fixes #8807

@iand requested a review from aschmahmann June 22, 2022 16:13
core/node/bitswap.go: review thread (outdated, resolved)
@Jorropo (Contributor) commented Jun 22, 2022

Also to be clear I pulled 20ms out of my ass, I have no idea what it should be.

20ms sounds short, because I'd guess most IPFS nodes take longer than 20ms to respond to a Bitswap request (overhead of opening a new stream, ...).

But personally I'm fine with 20ms.

@iand (Contributor, author) commented Jun 22, 2022

Also to be clear I pulled 20ms out of my ass, I have no idea what it should be.

Yep. The original issue suggests removing it entirely, i.e. setting it to 0, but a small step up to 20ms seems sensible. It can always be adjusted in the future.

@aschmahmann (Contributor) left a review:

Thanks for pushing on this. I left some comments asking why the given number was chosen, as well as whether the evidence here indicates we should be exposing this as an Internal.Bitswap option.

@@ -21,6 +22,7 @@ const (
DefaultTaskWorkerCount = 8
DefaultEngineTaskWorkerCount = 8
DefaultMaxOutstandingBytesPerPeer = 1 << 20
DefaultProviderSearchDelay = 20 * time.Millisecond
A Contributor commented:

My understanding from #8807 is that there's some analysis being done to decide on the magic number here.

In particular is there any analysis on the amount of added network traffic as a result of reducing the number here and should we expose this number in the config file?

If I'm reading the graph in #8807 (comment) correctly, it looks like almost no peers respond in under 100ms and some large fraction (80%?) respond by 1 second.

Dropping to 0ms, 20ms, or anything that's before Bitswap responses come back means we're sending out more requests than we do today. Understanding how many extra requests we're making vs how much time we're saving seems useful. This could also be helpful in assisting users who want to tweak the values between keeping the network chatter lower and waiting longer on Bitswap responses (e.g. gateways vs IPFS Desktop kinds of settings).

A Member commented:

There will certainly be more chatter, but if the values at the end of #8807 (comment) are at all valid, it's going to be a very small fraction. According to that comment, we're talking about roughly 20k out of 12 million requests (~0.1%) where Bitswap comes back positive. Even if that value is off by a factor of 10, we're still talking about 1% of cases where we make two requests when one would be enough. But we're also saving a minimum of one RTT (cases where we receive DONT_HAVE) and a maximum of 1s (the timeout) in 99% of cases.

Overall, my view is that if we start in parallel (or almost in parallel) we don't have much to lose. We can also spin up a node (or a few nodes, not connected to each other) and do the measurement study in the real network when this is merged.

The node(s) would (a rough sketch follows this list):

  • start connections in the network, as they would normally.
  • listen for CIDs requested through Bitswap.
  • ask for the same CIDs at regular intervals thereafter, excluding the peers from which they received the request (presumably the peer that asked for a CID will eventually get it, so we want to avoid counting false positives).
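
Purely as an illustration of that loop, a minimal Go sketch might look like the following. Nothing here is real kubo or go-bitswap code; observedWant and the sendWant callback are hypothetical stand-ins for however the measurement node observes incoming requests and re-sends wants.

```go
// Package bitswapprobe is a hypothetical sketch of the measurement loop
// described above; it is not real kubo or go-bitswap code.
package bitswapprobe

import "time"

// observedWant records a CID seen in an incoming Bitswap request and the
// peer that asked for it.
type observedWant struct {
	cid      string
	fromPeer string
}

// probe re-asks for an observed CID at regular intervals, excluding the peer
// that originally requested it (that peer will presumably fetch the block
// itself eventually, so counting it would produce a false positive). It
// reports how long it took until some other peer answered positively, or
// false if nobody did within the given number of attempts.
func probe(w observedWant, sendWant func(cid, excludePeer string) bool,
	interval time.Duration, attempts int) (time.Duration, bool) {
	for i := 1; i <= attempts; i++ {
		time.Sleep(interval)
		if sendWant(w.cid, w.fromPeer) {
			return time.Duration(i) * interval, true
		}
	}
	return 0, false
}
```

Running many of these probes concurrently over the CIDs a node observes would give a rough distribution of how quickly content becomes discoverable via Bitswap alone.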

@guseggert (Contributor) commented Jul 25, 2022:

Has anyone corroborated that 0.1% number? It seems like a pretty critical measurement. The degree to which we make unnecessary requests is the degree to which we'd further increase the load on DHT servers, which AFAIU is already so high that many folks run in client-only mode, so I'm really loath to further increase DHT server load.

A Member commented:

@mrd0ll4r this relates to your previous comment here: #8807 (comment) - any easy way you can instrument this measurement (success/failure of Bitswap discovery) from the infrastructure you already have in place? :)

@mrd0ll4r (Contributor) commented Jul 27, 2022:

Hmm, we are working on a paper where we'll be able to do this, but I don't think that's going to happen in the near future. The methodology will be the same as I used for the comment, i.e., manually sending stuff on Bitswap and collecting responses. So no, I can't get this from the passive measurements we're doing at the moment, sorry...

EDIT: also, importantly, we explicitly do not use the usual request mechanism, which uses this timeout. We send stuff on Bitswap directly, without ever triggering a DHT search. So I won't be able to test the impact of this.

About the 20k positive responses/12 million requests number: Keep in mind that

  1. I had to read those off the graph, they're not accurately from the dataset,
  2. I can't ascertain the popularity of the content I tested with in a larger context, i.e., I don't know if other content is spread much more widely, generating more positive responses (or the opposite), and
  3. Those 20k requests were the positive ones, but we did receive a response to almost all requests we sent out. This is because we set the send_dont_have flag, in order to also receive negative responses. This, critically, allows us to not rely on timeouts. (I had a number about what portion of peers respect send_dont_have, but I don't have it at hand. I believe it was the vast majority.)
I suppose something like that would be possible here, too (although that's another issue): set send_dont_have for all requests, and start searching the DHT as soon as some portion of our neighbors has responded without any positive response (a rough sketch of this idea follows this comment).

In general, I agree with 20ms being too low as a timeout, because it's unlikely that any meaningful number of responses will have been received. It effectively behaves as 0ms, i.e., do DHT and Bitswap discovery in parallel. If that's the goal, setting it to zero seems less confusing.
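
To make the send_dont_have idea above concrete, here is a hypothetical sketch (not an existing Bitswap API): a small trigger that counts DONT_HAVE responses and starts the provider (DHT) search once a configurable fraction of the asked peers has answered negatively, rather than after a fixed delay. All names and the threshold policy are assumptions.

```go
// Package dhttrigger is a hypothetical sketch only; nothing here corresponds
// to real go-bitswap types or APIs.
package dhttrigger

import "sync"

// searchTrigger starts a provider (DHT) search once enough of the peers that
// were sent a want (with send_dont_have set) have answered DONT_HAVE.
type searchTrigger struct {
	mu        sync.Mutex
	asked     int     // number of peers the want was sent to
	threshold float64 // e.g. 0.5: half of the asked peers said DONT_HAVE
	dontHaves int     // negative responses received so far
	done      bool
	search    func() // kicks off the provider search for the block
}

// onDontHave records one negative response and fires the search once the
// configured fraction of asked peers has responded without a HAVE.
func (t *searchTrigger) onDontHave() {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.dontHaves++
	if !t.done && float64(t.dontHaves) >= t.threshold*float64(t.asked) {
		t.done = true
		go t.search()
	}
}

// onHave marks the block as found over Bitswap, so an early provider search
// is no longer needed.
func (t *searchTrigger) onHave() {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.done = true
}
```

The fraction, and whether a single HAVE should cancel the search, are policy choices; the point is only that negative responses could drive the decision instead of a wall-clock timeout.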

A Contributor commented:

I can't ascertain the popularity of the content I tested with in a larger context, i.e., I don't know if other content is spread much more widely, generating more positive responses (or the opposite).

Totally. Some data processing required here, but I expect that ipfs.io gets a lot of requests for data stored with popular pinning services, some of which they're peered with (more than 0.1%).

This knob here deals with resource consumption vs performance, with the added caveat that the resource consumption is both on the DHT server nodes and on the client node itself.

Let's say we take the view that we're not concerned about additional DHT server load (idk that that's a good idea, but let's say). How does a user figure out what they should set the magic number to? Have we tried running this in places like ipfs.io to see what seems optimal at least for that deployment?

While we're figuring this out it seems pretty reasonable to expose a knob in the Bitswap.Internal config, although it should come with some guidance around tuning it.


As an aside, as we introduce additional routing systems (e.g. Reframe delegated routers, which are already supported) we'll likely need to figure out other ways to handle this. For example, maybe it makes sense to wait 500ms before a DHT request but 0s for a delegated routing request to cid.contact?

There's probably a more ideal dynamic system here than hard coding in some numbers, but if there are real performance benefits to be gained we can hard code some numbers and encourage some follow-up work on more dynamic/adaptable systems here.
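
As a purely illustrative sketch of that aside (no such structure exists in the codebase), per-router delays could be as simple as a table consulted before each provider search; the router names and values below are made-up examples.

```go
// Hypothetical illustration only: no such table exists in kubo.
package routingdelays

import "time"

// providerSearchDelayByRouter gives each routing system its own delay before
// a provider search is started for a block Bitswap has not yet found.
var providerSearchDelayByRouter = map[string]time.Duration{
	"dht":               500 * time.Millisecond, // give Bitswap a head start
	"delegated-reframe": 0,                      // e.g. cid.contact: query immediately
}
```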

@iand (author) commented:

Added ProviderSearchDelay to Internal.Bitswap config but left the default at the low value of 20ms. We could restore that to 1s and allow experiments to test the effect of changing it before enshrining a new default in code.
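
For anyone who wants to try it, the option would then be set in the kubo config file. Assuming the key path matches this PR (Internal.Bitswap.ProviderSearchDelay, serialized as a duration string), the relevant fragment might look like:

```json
{
  "Internal": {
    "Bitswap": {
      "ProviderSearchDelay": "20ms"
    }
  }
}
```

Leaving the field unset keeps whatever default is compiled in.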

A Member commented:

Would that ^ help get results from more existing nodes @guillaumemichel?

As a note, we are starting the Bitswap Discovery Success Rate study this week and expect to have first results in the next couple of weeks. Work is being tracked here: https://pl-strflt.notion.site/Effectiveness-of-Bitswap-Discovery-Process-aab515ebc1774bee8193da415bfb98f4 - any input more than welcome.

A Contributor commented:

For now, can we expose this config but leave the default alone, at least until we review @guillaumemichel's results?

@iand (author) commented:

Yes. Will change to 1s

@BigLep (Contributor) commented Sep 1, 2022

@guseggert : are you able to clarify what next steps are needed here?

I think if we aren't touching the defaults but we make this configurable we should be fine to ship.

We can come back later to adjust the default after more of the experiments have run.

@yiannisbot (Member) commented:

Quick update from ProbeLab side: we've started looking into this in more detail and we expect some first results to come out this week. @guillaumemichel to provide a more specific update on current progress and experiments targeted. All the action is happening here if anyone wants to contribute: https://pl-strflt.notion.site/Effectiveness-of-Bitswap-Discovery-Process-aab515ebc1774bee8193da415bfb98f4

I would personally definitely support proceeding with a config setting and a recommendation until we have more stable results.

@guseggert (Contributor) commented:

@guseggert : are you able to clarify what next steps are needed here?

I think if we aren't touching the defaults but we make this configurable we should be fine to ship.

We can come back later to adjust the default after more of the experiments have run.

Yes this sounds good to me too, let's just expose the values in config and leave the defaults alone, to give us some breathing room to understand the network impact of adjusting them.

@iand (author) commented Sep 1, 2022

Will also be testing this in Thunderdome to compare the following values: 0ms, 20ms, 50ms, 100ms, 200ms, 500ms, 750ms, 1000ms

probe-lab/thunderdome#29

@BigLep (Contributor) commented Sep 22, 2022

@Jorropo will take on getting this merged (but we'll drop the change to the default).

@BigLep requested a review from Jorropo September 22, 2022 16:16
@@ -21,6 +22,7 @@ const (
DefaultTaskWorkerCount = 8
DefaultEngineTaskWorkerCount = 8
DefaultMaxOutstandingBytesPerPeer = 1 << 20
DefaultProviderSearchDelay = 20 * time.Millisecond
@Jorropo (Contributor) commented Sep 22, 2022:

Can you revert this to 1 second, please? Let's try having this as an option before changing defaults.

@BigLep mentioned this pull request Sep 22, 2022
@iand (author) commented Sep 26, 2022

@Jorropo changed default back to 1s

@Jorropo (Contributor) commented Sep 26, 2022

Thx, I will review this after the code freeze.

@Jorropo self-requested a review September 26, 2022 16:07
@BigLep (Contributor) commented Sep 26, 2022

@Jorropo: we can talk during the 2022-09-27 triage, but if this is exposing a configurable parameter and not changing any defaults, I can see a case for getting this in as part of 0.16 (even though it wasn't in the RC).

@BigLep (Contributor) commented Sep 27, 2022

2022-09-27 conversation: this will land in 0.17

@BigLep removed the request for review from guseggert October 4, 2022 16:39
@Jorropo (Contributor) left a review:

LGTM 🎉

@BigLep (Contributor) commented Oct 12, 2022

@Jorropo : are you going to merge? I didn't think we were waiting on anyone else.

@Jorropo merged commit 066a0b9 into ipfs:master Oct 13, 2022
@lidel changed the title from "feat: remove provider delay interval in bitswap" to "feat: configurable ProviderSearchDelay in bitswap" Jan 4, 2023
Successfully merging this pull request may close these issues: "Optimizing (or remove) the provider delay interval for DHT lookups used by Bitswap" (#8807)

7 participants