High level metrics #4620

Open
1 of 3 tasks
sttts opened this issue Sep 14, 2023 · 5 comments
Labels: enhancement, observability, roadmap, user experience

Comments

@sttts
Contributor
sttts commented Sep 14, 2023

What problem are you facing?

Today users rely mostly on controller-runtime metrics for monitoring a control plane. Higher level metrics are missing.

How could Crossplane help solve your problem?

Here are some metrics that would be useful:

  • Time-to-Change/Provision [ mean / percentile ] (= "Time to Readiness") – The initial change. It can be the result of detecting a non-Crossplane change to the system, or of something initiated by the control plane or by git/webhook.
  • Time-to-Detect [ mean / percentile ] – This is when Crossplane first becomes aware that something has changed and that it now needs to reconcile.
  • Time-to-Reconcile [ mean / percentile ] – This is when we have successfully reconciled an event (and includes cloud provider time to provision).
  • Exceeding-Time-to-Reconcile [ mean / percentile ] – The time by which a reconcile exceeds the default reconcile interval, counting everything faster as zero. This number should be low-ish. It correlates with the queue length, but is more understandable.
  • Reconciliation Rate [ rate ] – How many resources are reconciled per unit of time.
  • Failure Rate [ rate ] – How many reconciles fail with cloud API errors.
  • Cloud provider throttling – How many reconciles are throttled, and for how long.
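
To make the "[ mean / percentile ]" and "[ rate ]" annotations in the list above concrete, here is a minimal sketch of how a set of duration samples could be summarized. It is not Crossplane code: the function names, the choice of the nearest-rank percentile method, and the sample values are all assumptions made for illustration.

```python
import math
from statistics import fmean


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples):
    """Mean and two percentiles for a list of durations in seconds."""
    return {
        "mean": fmean(samples),
        "p50": percentile(samples, 50),
        "p99": percentile(samples, 99),
    }


def rate(event_count, window_seconds):
    """Events per second over an observation window."""
    return event_count / window_seconds


# Hypothetical time-to-readiness samples, in seconds.
durations = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(summarize(durations))  # {'mean': 5.5, 'p50': 5, 'p99': 10}
print(rate(120, 60))         # 2.0 reconciles per second
```

In a real control plane these values would more likely come from a metrics library's histograms and counters than from raw sample lists; the sketch only shows what the numbers mean.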

Related Issues and PRs

  1. enhancement metrics observability
    ezgidemirel
  2. ulucinar
  3. bug metrics observability
@sttts sttts added the enhancement New feature or request label Sep 14, 2023
@pedjak pedjak self-assigned this Sep 18, 2023
@jeanduplessis jeanduplessis added this to the v1.15 milestone Sep 27, 2023
@jbw976 jbw976 added the roadmap Issues that have priority and are included in the roadmap, or are candidates to add to the roadmap label Nov 2, 2023
@jbw976 jbw976 modified the milestones: v1.15, v1.16 Nov 2, 2023
github-actions bot commented Feb 1, 2024

Crossplane does not currently have enough maintainers to address every issue and pull request. This issue has been automatically marked as stale because it has had no activity in the last 90 days. It will be closed in 14 days if no further activity occurs. Leaving a comment starting with /fresh will mark this issue as not stale.

@github-actions github-actions bot added the stale label Feb 1, 2024
@jbw976
Member
jbw976 commented Feb 1, 2024

/fresh this is a roadmap item we are still interested in

@lsviben
Contributor
lsviben commented Feb 28, 2024

Hey @sttts, could you help clarify some of the metrics you mention so we are on the same page:

Time-to-Change/Provision [ mean / percentile ] (= "Time to Readiness") – The initial change. It can be the result of detecting a non-Crossplane change to the system, or of something initiated by the control plane or by git/webhook.

I'm not sure what exactly qualifies as a "change to the system". Is this something like the time from when a Kubernetes resource (MR/Claim) is created or changed until it's reconciled/updated/provisioned externally and Ready? The change can come from a GitOps workflow or from XP creating/updating an MR based on a composite.

Time-to-Detect[ mean / percentile ] – This is when Crossplane first becomes aware that something has changed, and it now needs to reconcile.

Is this one about Crossplane detecting that something has changed on the external resource, and the time it takes to detect and reconcile that change? Is it the one you created a PR for?

Time-to-Reconcile[ mean / percentile ] – This is when we have successfully reconciled an event (and includes cloud provider time to provision).

I understand this one as the time from CreationTimestamp until the resource is Synced: True, Ready: True. Is that correct?
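
As a sketch of that reading only (the thread does not confirm it), the duration could be derived from a resource's own metadata and status. The code below assumes the usual Kubernetes object layout, with `metadata.creationTimestamp` and a `status.conditions` list whose entries carry `type`, `status`, and `lastTransitionTime`. The function names and the sample resource are hypothetical.

```python
from datetime import datetime, timezone


def parse_ts(ts):
    """Parse a Kubernetes-style RFC 3339 UTC timestamp such as 2024-02-28T10:00:00Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def time_to_ready(resource):
    """Seconds from creation until the Ready condition last became True,
    or None if the resource is not currently Ready."""
    created = parse_ts(resource["metadata"]["creationTimestamp"])
    for cond in resource.get("status", {}).get("conditions", []):
        if cond.get("type") == "Ready" and cond.get("status") == "True":
            ready_at = parse_ts(cond["lastTransitionTime"])
            return (ready_at - created).total_seconds()
    return None


# Hypothetical managed resource, reduced to the fields used above.
mr = {
    "metadata": {"creationTimestamp": "2024-02-28T10:00:00Z"},
    "status": {
        "conditions": [
            {"type": "Synced", "status": "True", "lastTransitionTime": "2024-02-28T10:00:05Z"},
            {"type": "Ready", "status": "True", "lastTransitionTime": "2024-02-28T10:02:30Z"},
        ]
    },
}
print(time_to_ready(mr))  # 150.0
```

One limitation of this approach: `lastTransitionTime` records the most recent transition, so a resource that became Ready, lost readiness, and recovered would report a longer duration than its first time to readiness.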

Exceeding-Time-to-Reconcile [ mean / percentile ] – The time by which a reconcile exceeds the default reconcile interval, counting everything faster as zero. This number should be low-ish. It correlates with the queue length, but is more understandable.

This one is the most confusing to me. What is the default reconcile interval? Is it the poll interval? And then we measure how much longer than that interval it took to reconcile? So if there is a queue, it could be longer.
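
One possible reading of the definition, sketched under the assumption that "default reconcile interval" does mean the poll interval (the question above is not answered in the thread): each sample is the amount by which a reconcile exceeds the interval, clamped at zero. The 60-second interval and the sample durations are hypothetical.

```python
from statistics import fmean


def exceeding_time(duration_seconds, interval_seconds):
    """Time by which a reconcile exceeds the interval; faster reconciles count as zero."""
    return max(0.0, duration_seconds - interval_seconds)


POLL_INTERVAL = 60.0  # hypothetical interval, in seconds

# Hypothetical time-to-reconcile samples, in seconds.
durations = [12.0, 45.0, 60.0, 75.0, 180.0]
excess = [exceeding_time(d, POLL_INTERVAL) for d in durations]

print(excess)         # [0.0, 0.0, 0.0, 15.0, 120.0]
print(fmean(excess))  # 27.0
```

Under this reading the metric stays at zero while the controller keeps up, and grows only when reconciles start waiting longer than one interval, which is why it would track queue length.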

@jbw976
Member
jbw976 commented May 15, 2024

@ezgidemirel is there more to do in this epic before we can close it out? If there is more, would you be able to update the tasklist so it captures all the planned work still to be done? 🙇‍♂️

@ezgidemirel
Member

@jbw976 updated the list.

@jbw976 jbw976 removed this from the v1.16 milestone May 21, 2024
Projects
Status: In Progress
6 participants