
Feat: pin add --max-depth (arbitrary depth recursive pins) #5142

Open

wants to merge 3 commits into base: master
Conversation

hsanjuan
Contributor

This implements #5133, introducing an option ("--max-depth") to limit how deep
we fetch and store the DAG associated with a recursive pin. This feature
is motivated by the need to fetch and pin partial DAGs in order to do
DAG sharding with IPFS Cluster.

This means that, when pinning something with --max-depth, the DAG will be
fetched only to that depth and no further.

To achieve this, the PR introduces new recursive pin types: "recursive1"
means the given CID is pinned along with its direct children (maxDepth=1).

"recursive2" means the given CID is pinned along with its direct children
and its grandchildren (maxDepth=2).

And so on...

This required introducing "maxDepth" limits in all the functions walking down
DAGs (in the merkledag, pin, core/commands, core/coreapi, and exchange/reprovide
modules).

maxDepth == -1 effectively acts as no limit, and all these functions behave as
they did before.
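To illustrate these semantics, here is a minimal sketch of a depth-limited walk (the names and types are simplified stand-ins for this illustration, not the PR's actual code; deduplication of already-seen nodes is left out):

```go
// link is a stand-in for an IPLD link (*cid.Cid in go-ipfs).
type link struct{ cid string }

// getLinksFn returns the child links of a node, in the spirit of
// merkledag's GetLinks.
type getLinksFn func(c string) ([]link, error)

// walk visits c, then recurses while the depth budget allows it.
// maxDepth == 0 visits only c itself; maxDepth == -1 means no limit,
// which matches the pre-existing fully recursive behaviour.
func walk(c string, maxDepth int, getLinks getLinksFn, visit func(string)) error {
	visit(c)
	if maxDepth == 0 {
		return nil // budget exhausted: do not fetch or store children
	}
	next := maxDepth
	if next > 0 {
		next-- // only decrement real limits, so -1 stays -1
	}
	links, err := getLinks(c)
	if err != nil {
		return err
	}
	for _, l := range links {
		if err := walk(l.cid, next, getLinks, visit); err != nil {
			return err
		}
	}
	return nil
}
```

With maxDepth=1 this walks the root and its direct children ("recursive1"); with maxDepth=2 it also walks the grandchildren ("recursive2").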

To facilitate this, a new CID set type has been added: thirdparty/recpinset.
This set carries the MaxDepth associated with every CID. This makes it possible
to skip already-explored branches, just like the original cid.Set does. It also
stores the recursive pinset (replacing cid.Set). recpinset should eventually be
moved out to a different repo.
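A rough sketch of what such a set might look like (hypothetical and simplified to string keys; the real recpinset works with *cid.Cid, as the diff further down shows):

```go
// Set remembers, for each CID, the deepest maxDepth it has been explored
// with, so a branch is only re-walked when a new visit would go deeper.
type Set struct {
	set map[string]int // CID -> recorded maxDepth
}

func New() *Set { return &Set{set: make(map[string]int)} }

// IsDeeper reports whether depth budget a explores further than b,
// where -1 (unlimited) is deeper than any bounded value.
func IsDeeper(a, b int) bool {
	if a < 0 {
		return b >= 0
	}
	return b >= 0 && a > b
}

// Visit records c with maxDepth and returns true when the caller should
// (re-)explore it: either c is new, or the new walk goes deeper.
func (s *Set) Visit(c string, maxDepth int) bool {
	cur, ok := s.set[c]
	if !ok || IsDeeper(maxDepth, cur) {
		s.set[c] = maxDepth
		return true
	}
	return false
}
```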

TODO: tests
TODO: refs -r with --max-depth

License: MIT
Signed-off-by: Hector Sanjuan code@hector.link

@hsanjuan hsanjuan requested a review from Kubuxu as a code owner June 20, 2018 13:51
@ghost ghost assigned hsanjuan Jun 20, 2018
@hsanjuan hsanjuan changed the title Feat: Arbitrary-depth recursive pin levels. Feat: pin add --max-depth (arbitrary depth recursive pins) Jun 20, 2018
@kevina
Contributor
kevina commented Jun 20, 2018

Please note that we also have the notion of a "best effort" pin, used to pin anything off the files root. It works by keeping the GC from removing anything under it, but the GC won't fail if one of the children cannot be found. I am thinking we should just make that pin type explicit.

@hsanjuan does that type of pin not meet your needs?

@whyrusleeping what do you think?

@kevina
Contributor
kevina commented Jun 20, 2018

Also, if we go through with this, I want to avoid the special case when the recursion is 0 by defining a direct pin as a recursive pin with a depth of 0.

@hsanjuan
Contributor Author

@hsanjuan does that type of pin not meet your needs?

I don't think it does; honestly, I don't fully understand it. I need go-ipfs to NOT fetch the whole subtree.

Also, if we go through with this, I want to avoid the special case when the recursion is 0 by defining a direct pin as a recursive pin with a depth of 0.

That's doable, happy to do it.

InternalPins() is a pinset composed of:

- Recursive pin CIDs
- Direct pin CIDs
- The empty node CID
- A root CID pointing to all of the above (and any of the sub-buckets that may have been created)

It is only set during Flush/Load operations for the pinner.

Thus, recursively exploring internal pins in order to decide which CIDs are safe
from GC only re-explores the recursive DAGs and should not be necessary.

Note that, previously, the cid.Set would correctly prune any already-explored
branches, so this had no pernicious effects. But now it does.

License: MIT
Signed-off-by: Hector Sanjuan <code@hector.link>
@kevina
Contributor
kevina commented Jun 21, 2018

I don't think it does; honestly, I don't fully understand it. I need go-ipfs to NOT fetch the whole subtree.

Creating a best-effort pin won't fetch anything. It will simply prevent any subtrees already fetched from being garbage collected.

@Stebalien
Member
Stebalien commented Jun 21, 2018

edit: (not a cancel comment button...)

@Stebalien Stebalien closed this Jun 21, 2018
@ghost ghost removed the status/in-progress In progress label Jun 21, 2018
@Stebalien Stebalien reopened this Jun 21, 2018
@ghost ghost assigned Stebalien Jun 21, 2018
@ghost ghost added the status/in-progress In progress label Jun 21, 2018
@hsanjuan
Contributor Author

@kevina I'm thinking that removing direct pins as a fully separate set might have some performance impact (in order to keep the current APIs), as direct pins would need to be extracted from the recursive set by filtering them out. Listing direct pins may thus become a slower operation if the recursive set is very big (something that doesn't happen now). Is this acceptable?

@hsanjuan
Contributor Author

Creating a best-effort pin won't fetch anything. It will simply prevent any subtrees already fetched from being garbage collected.

I need to fetch partial subtrees too.

@kevina
Contributor
kevina commented Jun 22, 2018

I need to fetch partial subtrees too.

I am thinking this should be implemented directly, without complicating the pinner or the GC. You can then create a best-effort pin to keep them from getting GC'ed.

However, I am not really against this idea if it will be useful in other contexts. @Stebalien @whyrusleeping what do you think?

@kevina
Contributor
kevina commented Jun 22, 2018

@kevina I'm thinking that removing direct pins as a fully separate set might have some performance impact (in order to keep the current APIs), as direct pins would need to be extracted from the recursive set by filtering them out. Listing direct pins may thus become a slower operation if the recursive set is very big (something that doesn't happen now). Is this acceptable?

I would think so. But others may disagree.

@hsanjuan
Contributor Author
hsanjuan commented Jul 2, 2018

@Kubuxu @whyrusleeping can I get some attention on this?

@hsanjuan
Contributor Author

@Stebalien maybe you can help with this?

Member
@Stebalien Stebalien left a comment


No.

  • It's a hack.
  • It's a ~800 line hack.
  • It's a hack that touches a bunch of critical code.
  • It adds a package to thirdparty.

If we're going to do this, we should do it right.

@Stebalien
Member

Let's go back to the issue and discuss ways to do this right. While this technically addresses the immediate need, it won't solve the long-term issue of needing more flexible ways to specify pins and it abuses an enum to store a number. Really, we need a more flexible pinset that allows for complex pin policies.

@hsanjuan
Contributor Author

@Stebalien I am a bit disappointed it took 1 month to get a full frontal rejection with little alternative proposal. I understand that the "right" way to do this:

  • Is a larger change
  • Touches more critical sections
  • Is all unclear at this point
  • Would throw away all the hacks in the current pinning system (mine, and previously existing) anyway

My request is clearly spec'd and implementable within the current state of things. So instead of deferring to a larger abstract change, I would like a list of concrete steps that we can take to move forward, possibly in parallel with the long total-revamp-of-the-pin-system discussion.

For example, this change already provides a bunch of areas to discuss that would also clarify parts of the larger discussion:

  • There are a bunch of functions depth-first-traversing the DAG. It does not seem crazy to update them with a depth limit as I did. The fact that there are 4 or 5 of these means more lines of code; the bulk of the change is this. Are we against that? This can be PRed separately.
  • There is the commands part, with flags defining an API and expected behaviour that we can agree on or not. What would be the response format for depth-limited pins? An extra key in the response object? Or are we OK with type "recursive%d"?
  • The package in thirdparty should be moved out
  • There is a "enum abuse". Or extending the recursive types to account for arbitrary max-depths. This is a hack. What would be the alternative? It needs to keep the API compatible. I did not come up with a way of doing it in less amounts of code, or touching less critical sections, but yes it's a hack. We can add a new single MaxDepth pin type, but we'll have to make sure we carry not just the type but metadata associated to it (the actual depth) all around.
  • Finally, the last part of the change affects how the new pins are stored on disk. Again, I cannot think how to do it with less code or while touching fewer critical paths. Since most --max-depth values will be 1 or 2, it made sense to do it like this.

Despite the criticism, the change does not get much in the way of the current pinning system (it does not change the logic, the API, or the storage format) or the standard pinning path.

There is another way of approaching this too: I only need max-depth 1 and 2 (or maybe even just 1). I can support only those values, introducing 2 specific pin types and potentially reducing the parts of the "hack" needed to support arbitrary pin depths.

Is there a way of doing this now that you'd consider workable?

@Stebalien
Member

I am a bit disappointed it took 1 month to get a full frontal rejection with little alternative proposal.

It's a large patch, touches a bunch of critical code, hasn't been flagged as a priority, and the design wasn't discussed at all before implementing. I don't even look at patches like this until I have some time to actually think about them.

little alternative proposal

I don't know enough about the current pin system to give one off the top of my head.

Despite the criticism, the change does not get much in the way of the current pinning system (it does not change the logic, the API, or the storage format) or the standard pinning path.

This patch adds a cluster-specific, weird addition to pinning by hacking it into the existing pin system with no discussion. Whatever feature we end up adding to support depth-limited pins, we'll have to maintain it.

The primary issue here is the lack of discussion and/or context. You should probably read:


There are a bunch of functions depth-first-traversing the DAG. It does not seem crazy to update them with a depth limit as I did. The fact that there are 4 or 5 of these means more lines of code; the bulk of the change is this. Are we against that? This can be PRed separately.

It should be possible to achieve the same thing by passing the depth to the visit function (we can create a new EnumerateChildrenWithDepth). EnumerateChildrenWithDepth shouldn't have to care about the max depth (the visit function should just return false when it hits the max depth).
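Sketched out, that suggestion could look roughly like this (a hypothetical signature reusing the simplified stand-in types from the sketch in the description; not an actual go-ipfs API):

```go
// enumerateChildrenWithDepth walks the DAG from root, handing the current
// depth to visit, and descends into a node's children only when visit
// returns true. The walker itself knows nothing about max depths or pinning.
func enumerateChildrenWithDepth(getLinks getLinksFn, root string,
	visit func(c string, depth int) bool) error {
	var walk func(c string, depth int) error
	walk = func(c string, depth int) error {
		if !visit(c, depth) {
			return nil // visit decided not to descend below c
		}
		links, err := getLinks(c)
		if err != nil {
			return err
		}
		for _, l := range links {
			if err := walk(l.cid, depth+1); err != nil {
				return err
			}
		}
		return nil
	}
	return walk(root, 0)
}
```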

Whatever we do, those functions shouldn't talk about pinning or use types like RecPin. That'll just confuse readers.

There is the commands part, with flags defining an API and expected behaviour that we can agree on or not. What would be the response format for depth-limited pins? An extra key in the response object? Or are we OK with type "recursive%d"?

Personally, I'd give them the type "partial" and then add a "max-depth" field, or something like that.

The package in thirdparty should be moved out.

Yes.

There is a "enum abuse". Or extending the recursive types to account for arbitrary max-depths. This is a hack. What would be the alternative? It needs to keep the API compatible. I did not come up with a way of doing it in less amounts of code, or touching less critical sections, but yes it's a hack. We can add a new single MaxDepth pin type, but we'll have to make sure we carry not just the type but metadata associated to it (the actual depth) all around.

We (and cluster) will need more complex pins anyways.

Finally, the last part of the change affects how the new pins are stored on disk. Again, I cannot think how to do it with less code or while touching fewer critical paths. Since most --max-depth values will be 1 or 2, it made sense to do it like this.

The values will usually be 1 or 2 for cluster.

@kevina
Contributor
kevina commented Jul 24, 2018

@hsanjuan @Stebalien I attempted to start a discussion about alternative ways to solve this problem, but it seems I was ignored.

In particular it would likely be better to enhance our best-effort pins and then fetch the needed subtrees separately.

@Stebalien
Member

In particular it would likely be better to enhance our best-effort pins and then fetch the needed subtrees separately.

We just need to make sure that fills the need. If we do that, we won't end up removing any accidentally downloaded nodes.

Let's move the discussion back to the issue.

```go
func (s *Set) Visit(c *cid.Cid, maxDepth int) bool {
	curMaxDepth, ok := s.set[string(c.Bytes())]

	if !ok || IsDeeper(maxDepth, curMaxDepth) {
```
Member
@whyrusleeping whyrusleeping Jul 25, 2018


  [ A ]
  |     \
[ B ]  [ C ]
 |        |
[ C ]  [ D ]

So what happens if we visit C in the first (left) tree? It seems like we would call visit on the second one and pass 1,2 to IsDeeper, which would return false and cause us to never visit D.

Contributor Author
@hsanjuan hsanjuan Jul 25, 2018


Sorry, but that's not correct.

Assuming you are traversing this graph with maxDepth=2 at the beginning:

  • The C on the first branch would be visited with maxDepth=0
  • The C on the second branch would be visited with maxDepth=1

1 > 0, thus Visit will return true, and the functions will keep traversing the path, because the previously visited C had a lower depth limit than the new one.

In this context, maxDepth means the maximum depth of the tree below this CID. Thus we always keep exploring if IsDeeper(). It does not mean the item's depth, as you seemed to assume.
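Hand-tracing whyrusleeping's tree against a Visit of this shape (using the hypothetical string-keyed Set sketched in the description; a depth-first walk from A with an initial maxDepth of 2):

```go
s := New()
s.Visit("A", 2) // true: A is new; its children are explored with maxDepth=1
s.Visit("B", 1) // true: B is new; its child is explored with maxDepth=0
s.Visit("C", 0) // true: C is new, recorded with maxDepth=0 (no children fetched)
s.Visit("C", 1) // true again: IsDeeper(1, 0), so C's subtree is re-explored
s.Visit("D", 0) // true: D is new; D does get visited after all
```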

Member


Ah, these variables could definitely use some better naming then, and maybe a comment. Maybe maxDepth -> curHeight?

Member


It is actually maxDepth. That is, it's the "max depth" to which we are planning to explore this CID. When we increase the maxDepth, we explore the path again.

@whyrusleeping
Member

@hsanjuan let's do the depth-limited recursion stuff in a separate PR, and discuss the pinning improvements that @kevina and @Stebalien are mentioning.

Sorry for the wait, but let's sketch this out a bit more before moving forward.

```go
// Thus, setting depth to two will walk the root, the children, and the
// children of the children.
// Setting depth to a negative number will walk the full tree.
func EnumerateChildrenMaxDepth(ctx context.Context, getLinks GetLinks, root *cid.Cid, maxDepth int, visit func(*cid.Cid, int) bool) error {
```
Member


A more general way to do this (and the async version) would be to:

  1. Make the visit function take a current depth. I'd like to pass a full path (as a []string) but that might get expensive.
  2. Have the visit function determine if we should go deeper.

Contributor Author


Have the visit function determine if we should go deeper.

But this is what it does. You can either remember how far the item is from the limit (maxDepth, as it does now), or you can remember how deep the item is and what the absolute depth limit is. The way it's done now requires remembering only one thing.

@hsanjuan
Contributor Author

@Stebalien

This patch adds a cluster-specific, weird addition to pinning by hacking it into the existing pin system with no discussion. Whatever feature we end up adding to support depth-limited pins, we'll have to maintain it.

This is the discussion. This PR is here to kickstart one: how to support this feature in the current pin system. I was told (offline) that I could attempt to do this in the current pin system until the whole MFS thing is figured out. Now you have a bunch of concrete stuff to criticize and propose improvements on, a list of all the pieces that are potentially touched, and a minimal approach to it (hacky, yes, but minimal). Sorry, but I thought actually understanding the current pin system and proposing how to change it was a better approach than nicely asking someone else to write my feature.

That said, I'll gladly maintain it too, whatever the outcome is. Sorry, I don't like the stress on "we'll have to maintain it". It leaves me with very mixed feelings about what you're implying, and it makes me a little sad.

It should be possible to achieve the same thing by passing the depth to the visit function (we can create a new EnumerateChildrenWithDepth). EnumerateChildrenWithDepth shouldn't have to care about the max depth (the visit function should just return false when it hits the max depth).

I think you cannot shortcut branches from the search by knowing only the current depth. You need to know how deep you explored the last time you visited the CID (I tried that way first). At least, I could not figure out how to do it with just the depth while also not re-exploring branches.

Whatever we do, those functions shouldn't talk about pinning or use types like RecPin. That'll just confuse readers.

Happy to rename.

We (and cluster) will need more complex pins anyways.

The values will usually be 1 or 2 for cluster.

That's fair to say, but since cluster is requesting this feature, and it is very limited in scope compared to "complex pins", it makes sense for it to work well for the cluster use case. It can always be improved in the future as more use cases and needs appear. Perhaps adding a single "only children"/maxDepth=1 pin type would have been a better approach. (As a side note, it would be interesting to hear use cases where people pin large numbers of pins, each with a different depth limit; but I agree this should eventually work well.)

@whyrusleeping

@hsanjuan let's do the depth-limited recursion stuff in a separate PR

Sure, no problem.

@Stebalien
Member

This is the discussion.

This PR was couched as a finished product: no WIP marker, no "proposal", no open design questions, no "should we do this", etc., and the remaining TODOs are "write more tests" and "implement refs -r --max-depth". That's why I reacted the way I did. A fair amount of your time was wasted writing this patch, and some (although significantly less) of my time was wasted reviewing it.

That said, I'll gladly maintain it too, whatever the outcome is. Sorry, I don't like the stress on "we'll have to maintain it". It leaves me with very mixed feelings about what you're implying, and it makes me a little sad.

By "we", I meant all of us. My point was that this is a group effort. And no, you won't (and shouldn't) handle every bug report possibly related to this PR.


It should be possible to achieve the same thing by passing the depth to the visit function (we can create a new EnumerateChildrenWithDepth). EnumerateChildrenWithDepth shouldn't have to care about the max depth (the visit function should just return false when it hits the max depth).

I think you cannot shortcut branches from the search by knowing only the current depth. You need to know how deep you explored the last time you visited the CID (I tried that way first). At least, I could not figure out how to do it with just the depth while also not re-exploring branches.

Correct. As this patch currently does, the visit function would have to remember how deep it was when it last explored a CID.

However, that means that you can:

  1. Only track the current depth in the EnumerateChildrenWithDepth function.
  2. Decide whether or not to go deeper in the visit function, based on the current depth, the max depth (local to the visit function), and the lowest depth at which we've explored the current CID (visit-function state). See the sketch below.
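A sketch of such a visit function (hypothetical, with string CIDs for brevity), which keeps the depth policy and the re-exploration shortcut entirely out of the walker:

```go
// depthLimitedVisit builds a visit callback for a walker like the
// enumerateChildrenWithDepth sketched earlier. It remembers, per CID, the
// lowest (closest-to-root) depth at which the CID was explored, so a
// branch is only re-walked when revisiting it buys extra depth below it.
func depthLimitedVisit(maxDepth int) func(c string, depth int) bool {
	seen := make(map[string]int)
	return func(c string, depth int) bool {
		if prev, ok := seen[c]; ok && prev <= depth {
			return false // already explored with at least this much budget left
		}
		seen[c] = depth
		if maxDepth >= 0 && depth >= maxDepth {
			return false // keep c itself, but do not descend any further
		}
		return true
	}
}
```

On whyrusleeping's example tree with maxDepth=2, C is first reached at depth 2 (no descent), then again at depth 1; 1 is lower than the recorded 2, so C is re-walked and D is reached.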

@whyrusleeping
Member

@hsanjuan The way I see it, the main sticking point around 'maintaining' this is whether or not you're okay with breaking it when we come up with something better.

Upgrading pinning to something better will require a migration anyway, so that's not too big of an issue. However, what we have to make sure of is: are we confident the future solution will support exactly this behavior? One potentially problematic thing I see is that if we switch to IPLD selectors, and that makes more sense for cluster to use, then we might get stuck maintaining a feature we really don't need anymore, because someone else might have started using it.

Let's see how this PR looks without all the depth limited traversal logic, and revisit the design then.

@hsanjuan
Contributor Author

Decide whether or not to go deeper in the visit function, based on the current depth, the max depth (local to the visit function), and the lowest depth at which we've explored the current CID (visit-function state).

Yes, ok that works too and I see the advantage now. Thanks for taking the time to explain.

@hsanjuan
Contributor Author

Are we confident the future solution will support exactly this behavior?

As long as there's an equivalent way of doing things, I'm happy with breaking changes (though this sets a bad, or at least weird, precedent, even if it's good for me). Anyway, as you said, let's do the things we agree on first and revisit the discussion.

@achingbrain
Member

Is this going to be revisited in the future?

From what I understand, the ability to pin subsections of graphs is blocking cluster from shipping their feature of splitting an unreasonably large DAG across multiple IPFS nodes - this would very much help package manager maintainers increase the availability of their registries if stored on IPFS.

@lanzafame
Contributor

@achingbrain hopefully, as this is still blocking the sharding functionality in IPFS Cluster.

@momack2 momack2 added this to In Progress in ipfs/go-ipfs May 9, 2019
@sashahilton00

Is there any news/progress on this issue over the past few months? Whilst I don't want to be 'that guy' who asks others to implement features, ipfs-cluster is currently pretty useless for clusters over a few TB with large files inside, due to the requirement to replicate entire files.

Whilst likely not possible for a while, due to the need for me to familiarise myself with the technicalities of IPFS under the hood and with the codebase, if there are no plans to revisit this in the near future I will try to start learning about IPFS and thinking about some of the problems in previous comments, with a view to hopefully writing this feature in X months' time. That said, there seem to be a few mentions that the entire pin system is due for a refactor - is this happening anytime soon? If so, it makes thinking about an implementation for this feature harder.
