Reconcile blocked by crossplane.io/external-create-pending annotation #3037
Comments
Same issue on crossplane version 1.7.0 and provider-aws 0.26.0. Running on a local kind cluster. |
If the annotation is removed from managed resource, it doesn't seem to have any effect. What is causing this annotation to be created? |
We got the issue in the following scenario:
|
Crossplane (or more specifically, a Crossplane provider) adds the

In an ideal world we wouldn't need to do this - we'd just create the external resource and return from the reconcile loop. Next reconcile we'd look up the external resource by its identifier and determine whether the create failed or succeeded. This is actually how Crossplane originally worked.

The problem is that a surprising (and kind of depressing) number of cloud APIs don't use deterministic identifiers. For example, when you call the AWS API to create a VPC, the API returns a payload that tells you what the newly created VPC's ID is. If we don't successfully record that ID (e.g. because the provider was restarted mid-create) we have no way to tell on the next reconcile whether we successfully created the VPC or not. That's basically what this annotation does: it ensures that if creation was ambiguous we stop and ask a human to intervene rather than potentially creating more infrastructure than you actually asked for (e.g. creating a second VPC and leaking the original one).

In some cases it's possible to work around this by looking things up by labels or other identifying properties instead, but this requires a lot of case-by-case code to be added to managed resource controllers, and even then it isn't possible in all cases (e.g. for resources without deterministic identifiers that also don't support tagging/labels).

I think the two things we can do to improve this are:
|
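The rule described in the comment above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the crossplane-runtime implementation: `createIncomplete` and `parseTime` are hypothetical helpers, and the only things taken from this thread are the three annotation keys and their RFC 3339 timestamp format. The assumption is that creation counts as ambiguous when a pending timestamp exists and is newer than both the succeeded and the failed timestamps.

```go
package main

import "time"

// Annotation keys as they appear on managed resources in this thread.
const (
	keyPending   = "crossplane.io/external-create-pending"
	keySucceeded = "crossplane.io/external-create-succeeded"
	keyFailed    = "crossplane.io/external-create-failed"
)

// parseTime returns the zero time when the annotation is absent or malformed.
func parseTime(annotations map[string]string, key string) time.Time {
	t, err := time.Parse(time.RFC3339, annotations[key])
	if err != nil {
		return time.Time{}
	}
	return t
}

// createIncomplete (hypothetical) reports whether a create was started
// but its outcome was never recorded, i.e. the pending timestamp is
// newer than the latest of the succeeded and failed timestamps.
func createIncomplete(annotations map[string]string) bool {
	pending := parseTime(annotations, keyPending)
	if pending.IsZero() {
		// Creation was never attempted, so it cannot be incomplete.
		return false
	}
	latest := parseTime(annotations, keySucceeded)
	if failed := parseTime(annotations, keyFailed); failed.After(latest) {
		latest = failed
	}
	return pending.After(latest)
}
```

Under this sketch, a resource that carries only the pending annotation is treated as ambiguous, while one whose failed or succeeded timestamp is at least as recent as the pending one is not.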
Ah right kinda makes sense @negz. One thing though: in our case that restart was just a regular deployment, so the probability of this happening again is pretty high (especially at our scale), would it be possible to reduce that risk by like putting a preStop hook or something to make sure that the provider doesn't get killed mid-creation? Or doing a graceful shutdown on sigterm? Unless maybe that's already the case? Like of course it will happen that the pod gets killed for exceptional reasons (OOM, node going down...), but ideally it would not create this situation as part of a regular deploy is what I'd be hoping for. |
I don't think we currently have any logic to handle graceful shutdowns in providers, but yes I agree that would be a good idea. |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
My fault for posting before full due diligence! I've confirmed these resources have had this error since long before our cluster upgrade this week, and the timing seems to line up with their initial creation, so these findings are still consistent with @negz's working theory that this bug is related to loss of state/external identifier during the creation phase if the provider runtime is killed before recording. |
This issue is also discussed here: crossplane/crossplane-runtime#340

@negz - would it make sense to move the current check after the Observe function call, and if Observe returns successfully (Create is not needed) then the annotation could be updated to avoid this error? That would handle the case where the resource was created and/or already exists and the external-create-pending annotation is "out of sync" for whatever reason. |
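The ordering proposed in this comment could look something like the sketch below. It illustrates the suggestion only and does not describe how crossplane-runtime behaves; `nextStep`, the `step` type, and the `observe` callback are all hypothetical names.

```go
package main

// step names what a reconcile would do next in this sketch.
type step string

const (
	stepProceed step = "proceed" // external resource exists; continue as usual
	stepBlock   step = "block"   // creation outcome unknown; wait for a human
	stepCreate  step = "create"  // safe to call Create
)

// nextStep (hypothetical) observes first and consults the
// create-incomplete state only when the resource is not found.
func nextStep(createIncomplete bool, observe func() (exists bool, err error)) (step, error) {
	exists, err := observe()
	if err != nil {
		return "", err
	}
	if exists {
		// The resource is there, so a stale pending annotation can be ignored.
		return stepProceed, nil
	}
	if createIncomplete {
		// Not found, but an earlier create may have succeeded under an
		// identifier that was never recorded.
		return stepBlock, nil
	}
	return stepCreate, nil
}
```

Note that this only helps when Observe can actually find the resource. For APIs with non-deterministic identifiers, as described earlier in the thread, a lost identifier means Observe would report the resource as missing and the reconcile would still block.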
I hit the same problem, which blocks creation of AWS resources. crossplane:v1.9.1, provider-aws-controller:v0.31.0 |
Crossplane does not currently have enough maintainers to address every issue and pull request. This issue has been automatically marked as |
/fresh |
I also hit this issue. But on my end, we have both
I am trying to understand how this could happen on my end. If the provider pod was restarted during the reconciliation loop, how could the Also it's interesting that |
Thanks for sharing that you are still seeing this @ZhiminXiang - I think this is worth re-opening and including in our Developer Experience epic, at least to look more deeply into this 🤔 |
I have a similar problem.
I fixed this problem by adding 2 lines in
@@ -216,6 +217,8 @@ func (c *external) Create(ctx context.Context, mg resource.Managed) (managed.Ext
 	meta.SetExternalName(cr, result.ID)
+	meta.SetExternalCreateSucceeded(cr, time.Now())
+	_ = c.kube.Update(ctx, cr)
 	return managed.ExternalCreation{}, nil
 }
This is just from a quick review of the code. |
Also had a case where:
In my case I had The solution was to remove the |
Crossplane does not currently have enough maintainers to address every issue and pull request. This issue has been automatically marked as |
Flash forward almost 1.5 years from my creation of this issue and I'm hitting what I think is a different variation of this problem. Currently running
apiVersion: cloudfront.aws.crossplane.io/v1alpha1
kind: Distribution
metadata:
  annotations:
    crossplane.io/composition-resource-name: Distribution
    crossplane.io/external-create-failed: "2023-12-18T21:48:06Z"
    crossplane.io/external-create-pending: "2023-12-18T21:48:06Z"
    crossplane.io/external-create-succeeded: "2023-12-18T21:34:56Z"
...
Events:
  Type     Reason                        Age                 From                                             Message
  ----     ------                        ----                ----                                             -------
  Normal   CreatedExternalResource       14m                 managed/distribution.cloudfront.aws.crossplane.io  Successfully requested creation of external resource
  Warning  CannotUpdateManagedResource   14m                 managed/distribution.cloudfront.aws.crossplane.io  Operation cannot be fulfilled on distributions.cloudfront.aws.crossplane.io "web-www-dev-usw-pc2ks-259cf": the object has been modified; please apply your changes to the latest version and try again
  Normal   PendingExternalResource       14m (x4 over 14m)   managed/distribution.cloudfront.aws.crossplane.io  Waiting for external resource existence to be confirmed
  Warning  CannotCreateExternalResource  22s (x14 over 13m)  managed/distribution.cloudfront.aws.crossplane.io  cannot create Distribution in AWS: DistributionAlreadyExists: The caller reference that you are using to create a distribution is associated with another distribution. Already exists: EW7C2ZJPWS6YK
You'll notice a few interesting things:
Interesting to see this is somewhat similar to what @ZhiminXiang observed back in May even though I am using an entirely different provider (contrib/provider-aws).

Finally, to rule out the original suspected cause of this GitHub issue, I have confirmed there have been no controller Pod restarts and all Pods are older than the initial creation of my resource above:
Let me know if I can provide anything further or if anyone has any additional ideas. |
These annotations were introduced in crossplane/crossplane-runtime#283. Per crossplane/crossplane#3037 folks find these annotations hard to reason about. That's understandable, because they're doing a lot of subtle things. This section ended up super long, but I think this is an area where folks really need to understand what's happening in order to make good decisions when Crossplane refuses to proceed.
Signed-off-by: Nic Cope <nicc@rk0n.org>
I've opened crossplane/docs#688 to document what these annotations are doing. I recommend anyone facing this issue have a read through. There's a bunch of subtle behaviour these annotations are helping with, including:
Of particular note is that having Consider the case where the provider successfully creates a resource, then you delete that resource via the cloud console. The provider will attempt to recreate the resource, and thus set a |
Crossplane does not currently have enough maintainers to address every issue and pull request. This issue has been automatically marked as |
What happened?
Managed resource stopped reconciling with error event:
This looks like a duplicate of #2843, but to be clear I am seeing this on an entirely different provider, aws vs. gcp in that linked issue. I think it would be good to understand the root cause of this state rather than accepting the removal of the crossplane.io/external-create-pending annotation as the accepted solution, as that is much harder to scale in a large production deployment IMO.
How can we reproduce it?
I don't have clear steps for reproduction and am opening this issue in hopes others come forward with more details and/or maintainers have some ideas on what/where to try/look for reproducing.
What environment did it happen in?
Crossplane version: 1.6.4 (but likely the error state started before upgrading to this version)
kubectl version: v1.22.6-eks-7d68063
uname -a: 5.4.181-99.354.amzn2