Reconcile blocked by crossplane.io/external-create-pending annotation #3037

Closed
iAnomaly opened this issue Apr 8, 2022 · 21 comments
Labels: bug, docs, stale

@iAnomaly
iAnomaly commented Apr 8, 2022

What happened?

Managed resource stopped reconciling with error event:

Events:
  Type     Reason                           Age                 From                                 Message
  ----     ------                           ----                ----                                 -------
  Warning  CannotInitializeManagedResource  29m (x19 over 19h)  managed/queue.sqs.aws.crossplane.io  cannot determine creation result - remove the crossplane.io/external-create-pending annotation if it is safe to proceed

This looks like a duplicate of #2843, but to be clear, I am seeing it on an entirely different provider (AWS, versus GCP in the linked issue). I think it would be good to understand the root cause of this state rather than treating removal of the crossplane.io/external-create-pending annotation as the accepted solution, since that is much harder to scale in a large production deployment, IMO.

How can we reproduce it?

I don't have clear reproduction steps. I am opening this issue in the hope that others come forward with more details, or that maintainers have ideas on where to look to reproduce it.

What environment did it happen in?

Crossplane version: 1.6.4 (but likely the error state started before upgrading to this version)

  • Cloud provider or hardware configuration: AWS
  • Kubernetes version (use kubectl version): v1.22.6-eks-7d68063
  • Kubernetes distribution (e.g. Tectonic, GKE, OpenShift): EKS
  • OS (e.g. from /etc/os-release): Amazon Linux2
  • Kernel (e.g. uname -a): 5.4.181-99.354.amzn2
@iAnomaly iAnomaly added the bug Something isn't working label Apr 8, 2022
@garreeoke

Same issue on crossplane version 1.7.0 and provider-aws 0.26.0. Running on a local kind cluster.

@garreeoke

Removing the annotation from the managed resource doesn't seem to have any effect. What causes this annotation to be created?

@Jell
Jell commented Apr 20, 2022

We got the issue in the following scenario:

  1. try to create a resource that was failing to be created due to some IAM configuration issues
  2. restart the provider deployment
  3. get the error

@negz
Member
negz commented Apr 26, 2022

Crossplane (or more specifically, a Crossplane provider) adds the crossplane.io/external-create-pending annotation to managed resources right before it attempts to create their corresponding external resource (i.e. make a create call to some cloud API). It then adds another annotation, either crossplane.io/external-create-succeeded or crossplane.io/external-create-failed, once the creation is observed to have succeeded or failed. Unfortunately this all has to happen in a single reconcile loop iteration. If the reconcile is interrupted before it writes the succeeded/failed annotation, the managed resource ends up in this state. That's likely what happened for @Jell - the provider deployment was restarted while it was in the middle of trying to create some external resources.
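
The sequence described above can be sketched in a few lines of Go. This is a simplified model, not the crossplane-runtime code: `attemptCreate`, its parameters, and the `interrupt` flag are hypothetical names used only to show where the window for a stuck resource opens. Only the three annotation keys come from this thread.

```go
package main

import (
	"errors"
	"fmt"
)

// Annotation keys named in this thread.
const (
	annPending   = "crossplane.io/external-create-pending"
	annSucceeded = "crossplane.io/external-create-succeeded"
	annFailed    = "crossplane.io/external-create-failed"
)

// errInterrupted simulates the provider being killed mid-reconcile.
var errInterrupted = errors.New("provider interrupted before recording the create result")

// attemptCreate is a hypothetical model of one create attempt inside a single
// reconcile. The annotations map stands in for the managed resource's
// metadata, create stands in for the cloud API call, and interrupt simulates
// a restart between the API call and the write of the result annotation.
func attemptCreate(annotations map[string]string, create func() error, timestamp string, interrupt bool) error {
	// Step 1: record that a create is about to happen.
	annotations[annPending] = timestamp

	// Step 2: call the cloud API.
	err := create()

	// A restart at this point loses the result: only the pending annotation
	// has been written.
	if interrupt {
		return errInterrupted
	}

	// Step 3: record the outcome.
	if err != nil {
		annotations[annFailed] = timestamp
		return err
	}
	annotations[annSucceeded] = timestamp
	return nil
}

func main() {
	ok := func() error { return nil }

	clean := map[string]string{}
	_ = attemptCreate(clean, ok, "2022-04-26T12:00:00Z", false)
	fmt.Println("clean run:", clean)

	stuck := map[string]string{}
	_ = attemptCreate(stuck, ok, "2022-04-26T12:00:00Z", true)
	fmt.Println("interrupted run:", stuck)
}
```

In the interrupted run the map holds only the pending annotation, which is the state reported at the top of this issue.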

In an ideal world we wouldn't need to do this - we'd just create the external resource and return from the reconcile loop. Next reconcile we'd look up the external resource by its identifier and determine whether the create failed or succeeded. This is actually how Crossplane originally worked. The problem is that a surprising (and kind of depressing) number of cloud APIs don't use deterministic identifiers. For example, when you call the AWS API to create a VPC, the API returns a payload that tells you what the newly created VPC's ID is. If we don't successfully record that ID (e.g. because the provider was restarted mid-create) we have no way to tell on the next reconcile whether we successfully created the VPC or not. That's basically what this annotation does - it ensures that if the result of a creation was ambiguous, we stop and ask a human to intervene rather than potentially creating more infrastructure than you actually asked for (e.g. creating a second VPC and leaking the original one).

In some cases it's possible to work around this by looking things up by labels or other identifying properties instead, but this requires a lot of case-by-case code to be added to managed resource controllers and even then isn't possible in all cases (e.g. for resources without deterministic identifiers that also don't support tagging/labels).
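
As a rough illustration of the tag-based workaround, here is a minimal Go sketch. Everything in it is assumed for the example: the `externalResource` type, the `example.org/managed-resource` tag key, and the `lookupByOwnerTag` helper are hypothetical and do not correspond to any provider's real code or any cloud SDK.

```go
package main

import (
	"errors"
	"fmt"
)

// externalResource is a hypothetical, simplified view of a cloud resource.
type externalResource struct {
	ID   string
	Tags map[string]string
}

// ownerTag is an assumed tag key that a controller would set at create time.
const ownerTag = "example.org/managed-resource"

var errAmbiguous = errors.New("more than one external resource carries the owner tag")

// lookupByOwnerTag returns the ID of the single resource tagged with owner.
// found is false when no resource matches. An error is returned when several
// resources match, because the controller cannot tell which one it created.
func lookupByOwnerTag(all []externalResource, owner string) (id string, found bool, err error) {
	for _, r := range all {
		if r.Tags[ownerTag] != owner {
			continue
		}
		if found {
			return "", false, errAmbiguous
		}
		id, found = r.ID, true
	}
	return id, found, nil
}

func main() {
	listed := []externalResource{
		{ID: "vpc-111", Tags: map[string]string{ownerTag: "my-vpc"}},
		{ID: "vpc-222"},
	}

	id, found, err := lookupByOwnerTag(listed, "my-vpc")
	fmt.Println(id, found, err)
}
```

A lookup like this only helps when the API supports tags and can list resources by them, which is the limitation noted above.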

I think the two things we can do to improve this are:

  1. Document what I wrote here.
  2. Allow resources that do have deterministic identifiers to opt-out of this behavior.

@negz negz added the docs label Apr 26, 2022
@Jell
Jell commented Apr 27, 2022

Ah right, that kind of makes sense @negz. One thing though: in our case that restart was just a regular deployment, so the probability of this happening again is pretty high (especially at our scale). Would it be possible to reduce that risk with something like a preStop hook, to make sure that the provider doesn't get killed mid-creation? Or by doing a graceful shutdown on SIGTERM? Unless maybe that's already the case?

Of course the pod will sometimes get killed for exceptional reasons (OOM, node going down...), but ideally a regular deploy would not create this situation.

@negz
Member
negz commented Apr 29, 2022

I don't think we currently have any logic to handle graceful shutdowns in providers, but yes I agree that would be a good idea.
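
For illustration, a graceful shutdown along the lines discussed here could look roughly like the Go sketch below. It is not taken from any provider: the `createGate` type and its methods are hypothetical, and the only real APIs used are `signal.NotifyContext`, `sync.WaitGroup`, and friends from the standard library. The idea is to stop admitting new create calls on SIGTERM and wait a bounded time for in-flight ones to record their result.

```go
package main

import (
	"context"
	"fmt"
	"os/signal"
	"sync"
	"syscall"
	"time"
)

// createGate is a hypothetical helper that tracks in-flight create calls.
type createGate struct {
	mu       sync.Mutex
	closed   bool
	inflight sync.WaitGroup
}

// Begin registers a create call. It returns false once shutdown has started,
// in which case the caller should skip the create and requeue.
func (g *createGate) Begin() bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.closed {
		return false
	}
	g.inflight.Add(1)
	return true
}

// End marks a create call (including the write of its result) as finished.
func (g *createGate) End() { g.inflight.Done() }

// Drain refuses new creates and waits up to timeout for in-flight ones.
// It returns true if every in-flight create finished in time.
func (g *createGate) Drain(timeout time.Duration) bool {
	g.mu.Lock()
	g.closed = true
	g.mu.Unlock()

	done := make(chan struct{})
	go func() {
		g.inflight.Wait()
		close(done)
	}()

	select {
	case <-done:
		return true
	case <-time.After(timeout):
		return false
	}
}

func main() {
	// The context is cancelled when the process receives SIGTERM.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM)
	defer stop()
	// The timeout exists only so that this demo exits without a real signal.
	ctx, cancel := context.WithTimeout(ctx, 50*time.Millisecond)
	defer cancel()

	g := &createGate{}

	// Simulate one create call that is still running when shutdown begins.
	if g.Begin() {
		go func() {
			defer g.End()
			time.Sleep(100 * time.Millisecond) // stand-in for the cloud API call
		}()
	}

	<-ctx.Done()
	fmt.Println("drained cleanly:", g.Drain(5*time.Second))
	fmt.Println("new create allowed:", g.Begin())
}
```

The drain timeout would need to fit inside the Pod's termination grace period, and this does nothing for the exceptional cases (OOM, node loss) mentioned above.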

@negz negz changed the title from "CannotInitializeManagedResource: cannot determine creation result - remove the crossplane.io/external-create-pending annotation if it is safe to proceed" to "Reconcile blocked by crossplane.io/external-create-pending annotation" Apr 29, 2022
@stale
stale bot commented Aug 13, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix This will not be worked on label Aug 13, 2022
@negz negz added stale and removed wontfix This will not be worked on labels Aug 15, 2022
@stale stale bot removed the stale label Aug 15, 2022
@iAnomaly
Author
iAnomaly commented Aug 23, 2022

@negz I have confirmed we are seeing this not just at creation!

I have two provider-aws queues.sqs.aws.crossplane.io Managed Resources that are 209d old. We recently upgraded our Kubernetes cluster, which resulted not only in a newer Kubernetes API version (1.22 -> 1.23) but also in the replacement of all nodes, which would have terminated and recreated both the crossplane and provider-aws Pods. The two resources have been failing to reconcile since the cluster upgrade with the event message: cannot determine creation result - remove the crossplane.io/external-create-pending annotation if it is safe to proceed

Perhaps the bug can trigger during provider termination during ANY reconcile loop, not just creation?

My fault for posting before doing full due diligence! I've confirmed these resources have had this error since long before our cluster upgrade this week, and the timing seems to line up with their initial creation. So these findings are still consistent with @negz's working theory that this bug is related to loss of state (the external identifier) during the creation phase, if the provider runtime is killed before recording it.

@bobh66
Contributor
bobh66 commented Aug 23, 2022

This issue is also discussed here: crossplane/crossplane-runtime#340

@negz - would it make sense to move the current check after the Observe function call? If Observe returns successfully (Create is not needed), the annotation could be updated to avoid this error. That would handle the case where the resource was created and/or already exists and the external-create-pending annotation is "out of sync" for whatever reason.
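
To make the proposed reordering concrete, here is a small Go sketch that compares the two orderings as pure decision functions. It is a model of the suggestion in this comment, not the crossplane-runtime reconciler: `checkBeforeObserve`, `checkAfterObserve`, and the `step` values are hypothetical names.

```go
package main

import "fmt"

// step is a hypothetical label for what the reconciler does next.
type step string

const (
	stepBlock        step = "block and ask a human to remove the annotation"
	stepMarkResolved step = "resource exists: clear the pending state and continue"
	stepContinue     step = "resource exists: continue"
	stepCreate       step = "create the external resource"
)

// checkBeforeObserve models the ordering reported in this thread: an
// incomplete create blocks the reconcile before Observe is consulted.
func checkBeforeObserve(createIncomplete, exists bool) step {
	if createIncomplete {
		return stepBlock
	}
	if !exists {
		return stepCreate
	}
	return stepContinue
}

// checkAfterObserve models the proposal: consult Observe first, and only
// block when the create is incomplete and the resource cannot be found.
func checkAfterObserve(createIncomplete, exists bool) step {
	if exists {
		if createIncomplete {
			return stepMarkResolved
		}
		return stepContinue
	}
	if createIncomplete {
		return stepBlock
	}
	return stepCreate
}

func main() {
	for _, incomplete := range []bool{false, true} {
		for _, exists := range []bool{false, true} {
			fmt.Printf("incomplete=%v exists=%v\n  before: %s\n  after:  %s\n",
				incomplete, exists,
				checkBeforeObserve(incomplete, exists),
				checkAfterObserve(incomplete, exists))
		}
	}
}
```

The two functions differ only when the create is incomplete and Observe finds the resource. The case where Observe finds nothing stays blocked in both, since a missing resource does not prove that the earlier create failed.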

@wangyi198682

I hit the same problem, which blocks AWS resource creation. Versions: crossplane:v1.9.1, provider-aws-controller:v0.31.0

@github-actions

Crossplane does not currently have enough maintainers to address every issue and pull request. This issue has been automatically marked as stale because it has had no activity in the last 90 days. It will be closed in 7 days if no further activity occurs. Leaving a comment starting with /fresh will mark this issue as not stale.

@github-actions github-actions bot added the stale label Jan 24, 2023
@github-actions github-actions bot closed this as not planned Jan 31, 2023
@elohmrow

/fresh

@github-actions github-actions bot removed the stale label Mar 23, 2023
@ZhiminXiang

I also hit this issue. But on my end, we have both crossplane.io/external-create-pending and crossplane.io/external-create-succeeded set.

crossplane.io/external-create-pending: "2023-05-16T08:00:07Z"
crossplane.io/external-create-succeeded: "2023-05-16T07:59:24Z"

I am trying to understand how this could happen on my end. If the provider pod was restarted during the reconciliation loop, how could external-create-succeeded be set?

Also it's interesting that crossplane.io/external-create-pending was set after crossplane.io/external-create-succeeded

@jbw976
Member
jbw976 commented May 23, 2023

Thanks for sharing that you are still seeing this @ZhiminXiang - I think this is worth re-opening and including in our Developer Experience epic, at least to look more deeply into this 🤔

@jbw976 jbw976 reopened this May 23, 2023
@ZhiminXiang

Thanks @jbw976 for taking a look.
The issue we saw is partially related to this one. It may also involve the GCP provider.
So I created a separate issue #4099 which includes more details about what was happening in our case.

@datastream

I have a similar problem.
Example:

metadata:
  annotations:
    crossplane.io/external-create-pending: "2023-06-08T15:39:01Z"
    crossplane.io/external-name: example
    kubectl.kubernetes.io/last-applied-configuration: |

I fixed this problem by adding two lines to func (c *external) Create(ctx context.Context, mg resource.Managed) (managed.ExternalCreation, error).
My patch looks like this:

@@ -216,6 +217,8 @@ func (c *external) Create(ctx context.Context, mg resource.Managed) (managed.Ext
        meta.SetExternalName(cr, result.ID)
+       meta.SetExternalCreateSucceeded(cr, time.Now())
+       _ = c.kube.Update(ctx, cr)
        return managed.ExternalCreation{}, nil
 }

From a quick review of the code:
https://github.com/crossplane/crossplane-runtime/blob/025f5287ee482937a7f57a22c8d9d852071e90fe/pkg/reconciler/managed/reconciler.go#LL951C1-L957C4
https://github.com/crossplane/crossplane-runtime/blob/025f5287ee482937a7f57a22c8d9d852071e90fe/pkg/reconciler/managed/reconciler.go#L1005-L1006
Should we also call r.client.Update(ctx, managed) after meta.SetExternalCreateSucceeded(managed, time.Now())?

@arturkasperek

I also had a case where:

    crossplane.io/external-create-pending: "2023-07-14T10:33:55Z"
    crossplane.io/external-create-succeeded: "2023-07-14T10:33:59Z"

In my case I had managementPolicy: Observe

The solution was to remove the crossplane.io/external-create-pending annotation

@github-actions

Crossplane does not currently have enough maintainers to address every issue and pull request. This issue has been automatically marked as stale because it has had no activity in the last 90 days. It will be closed in 14 days if no further activity occurs. Leaving a comment starting with /fresh will mark this issue as not stale.

@github-actions github-actions bot added the stale label Oct 30, 2023
@github-actions github-actions bot closed this as not planned Nov 13, 2023
@iAnomaly
Author
iAnomaly commented Dec 18, 2023

Fast forward almost 1.5 years from my creation of this issue and I'm hitting what I think is a different variation of this problem. I am currently running crossplane:v1.14.4 and crossplane-contrib/provider-aws:v0.46.0 (previously crossplane-contrib/provider-aws:v0.45.2), and while testing creation of Distribution.cloudfront.aws.crossplane.io/v1alpha1, I am experiencing this behavior:

apiVersion: cloudfront.aws.crossplane.io/v1alpha1
kind: Distribution
metadata:
  annotations:
    crossplane.io/composition-resource-name: Distribution
    crossplane.io/external-create-failed: "2023-12-18T21:48:06Z"
    crossplane.io/external-create-pending: "2023-12-18T21:48:06Z"
    crossplane.io/external-create-succeeded: "2023-12-18T21:34:56Z"
...
Events:
  Type     Reason                        Age                 From                                               Message
  ----     ------                        ----                ----                                               -------
  Normal   CreatedExternalResource       14m                 managed/distribution.cloudfront.aws.crossplane.io  Successfully requested creation of external resource
  Warning  CannotUpdateManagedResource   14m                 managed/distribution.cloudfront.aws.crossplane.io  Operation cannot be fulfilled on distributions.cloudfront.aws.crossplane.io "web-www-dev-usw-pc2ks-259cf": the object has been modified; please apply your changes to the latest version and try again
  Normal   PendingExternalResource       14m (x4 over 14m)   managed/distribution.cloudfront.aws.crossplane.io  Waiting for external resource existence to be confirmed
  Warning  CannotCreateExternalResource  22s (x14 over 13m)  managed/distribution.cloudfront.aws.crossplane.io  cannot create Distribution in AWS: DistributionAlreadyExists: The caller reference that you are using to create a distribution is associated with another distribution. Already exists: EW7C2ZJPWS6YK

You'll notice a few interesting things:

  1. The pending, succeeded, and failed annotations are all present simultaneously.
  2. The Events show the controller attempting to create a new external resource even while it is also waiting for the external resource's existence to be confirmed (PendingExternalResource).
  3. The CannotCreateExternalResource error is expected, in the sense that Crossplane is trying to recreate the same resource using the same caller reference, which is the canonical identifier for this specific provider API and type, but the resource already exists from the first CreatedExternalResource 14m earlier.

It is interesting that this is somewhat similar to what @ZhiminXiang observed back in May, even though I am using an entirely different provider (contrib/provider-aws).

Finally, to rule out the original suspected cause of this GitHub issue, I have confirmed there have been no controller Pod restarts and all Pods are older than the initial creation of my resource above:

crossplane-794cbcb9c8-hn6xd                            1/1     Running   0          6d2h
crossplane-rbac-manager-7d6875d4b8-wzz8f               1/1     Running   0          6d2h
function-auto-ready-ad9454a37aa7-f467cb448-l8rkn       1/1     Running   0          5d21h
function-go-templating-f34ec030415a-858c56b6cb-h56wz   1/1     Running   0          5d21h
provider-aws-b4eafc5192c9-7c89c954c4-sxnjx             1/1     Running   0          31m

Let me know if I can provide anything further or if anyone has any additional ideas.

@iAnomaly iAnomaly reopened this Dec 18, 2023
@github-actions github-actions bot removed the stale label Dec 19, 2023
negz added a commit to negz/docs that referenced this issue Jan 30, 2024
These annotations were introduced in crossplane/crossplane-runtime#283.

Per crossplane/crossplane#3037 folks find
these annotations hard to reason about. That's understandable, because
they're doing a lot of subtle things.

This section ended up super long, but I think this is an area where
folks really need to understand what's happening in order to make good
decisions when Crossplane refuses to proceed.

Signed-off-by: Nic Cope <nicc@rk0n.org>
@negz
Member
negz commented Jan 30, 2024

I've opened crossplane/docs#688 to document what these annotations are doing.

I recommend anyone facing this issue have a read through. There's a bunch of subtle behaviour these annotations are helping with, including:

  • Leaking resources due to being unable to store non-deterministic, cloud-generated external names.
  • Leaking resources due to eventually consistent cloud APIs reporting that resources don't exist when they do.
  • Leaking resources due to stale cache reads of the MR itself from the controller-runtime cache.

Of particular note is that having external-create-pending and external-create-succeeded (etc) annotations set at the same time is normal. The annotation values are timestamps - i.e. the most recent time creation succeeded, the most recent time creation was pending, etc.

Consider the case where the provider successfully creates a resource, then you delete that resource via the cloud console. The provider will attempt to recreate the resource, and thus set a create-pending annotation that is newer than create-succeeded.
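
The timestamp comparison described above can be sketched as follows. This is a standalone approximation written for this thread, not the library's implementation, and `createIncomplete` and `parseTime` are hypothetical names. It treats a create as incomplete when the pending timestamp is newer than both the succeeded and the failed timestamps.

```go
package main

import (
	"fmt"
	"time"
)

// Annotation keys named in this thread.
const (
	annPending   = "crossplane.io/external-create-pending"
	annSucceeded = "crossplane.io/external-create-succeeded"
	annFailed    = "crossplane.io/external-create-failed"
)

// parseTime returns the zero time for a missing or malformed annotation.
func parseTime(value string) time.Time {
	t, err := time.Parse(time.RFC3339, value)
	if err != nil {
		return time.Time{}
	}
	return t
}

// createIncomplete reports whether the most recent create attempt has no
// recorded result, i.e. pending is newer than both succeeded and failed.
func createIncomplete(annotations map[string]string) bool {
	pending := parseTime(annotations[annPending])
	if pending.IsZero() {
		return false // no create was ever attempted
	}
	succeeded := parseTime(annotations[annSucceeded])
	failed := parseTime(annotations[annFailed])
	return pending.After(succeeded) && pending.After(failed)
}

func main() {
	// Values reported earlier in this thread: pending newer than succeeded.
	newerPending := map[string]string{
		annPending:   "2023-05-16T08:00:07Z",
		annSucceeded: "2023-05-16T07:59:24Z",
	}
	fmt.Println("pending newer than succeeded:", createIncomplete(newerPending))

	// Values reported earlier in this thread: succeeded newer than pending.
	newerSucceeded := map[string]string{
		annPending:   "2023-07-14T10:33:55Z",
		annSucceeded: "2023-07-14T10:33:59Z",
	}
	fmt.Println("succeeded newer than pending:", createIncomplete(newerSucceeded))
}
```

Under this rule, having both annotations present is not in itself a problem. What matters is which timestamp is the most recent.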

@github-actions

Crossplane does not currently have enough maintainers to address every issue and pull request. This issue has been automatically marked as stale because it has had no activity in the last 90 days. It will be closed in 14 days if no further activity occurs. Leaving a comment starting with /fresh will mark this issue as not stale.

@github-actions github-actions bot added the stale label Apr 30, 2024
@github-actions github-actions bot closed this as not planned May 14, 2024