[go: nahoru, domu]

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some optimization for process stats collection #2645

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

dqminh
Copy link
Contributor
@dqminh dqminh commented Aug 19, 2020
  • Use Readdirnames instead of ioutil.Readdir to collect fds
  • Add a flag to disable socket Count collection, as Readlink is a lot of work for each fd

@k8s-ci-robot
Copy link
Collaborator

Hi @dqminh. Thanks for your PR.

I'm waiting for a google member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dashpole dashpole self-assigned this Aug 20, 2020
@dashpole
Copy link
Collaborator

/ok-to-test

@dqminh
Copy link
Contributor Author
dqminh commented Sep 7, 2020

@dashpole have you had time to look into this one ?

@@ -43,6 +43,8 @@ var (
whitelistedUlimits = [...]string{"max_open_files"}
referencedResetInterval = flag.Uint64("referenced_reset_interval", 0,
"Reset interval for referenced bytes (container_referenced_bytes metric), number of measurement cycles after which referenced bytes are cleared, if set to 0 referenced bytes are never cleared (default: 0)")
disableSocketCountStats = flag.Bool("disable_socket_count_stats", false,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is currently covered under the "process" stats category in --disable_metrics. Does that not work for your usecase?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late reply !

No, because we want to have process stats, we just don't need the socket counts ( partly because it's too costly to collect )

@dashpole
Copy link
Collaborator
dashpole commented Nov 4, 2020

Sorry for the slow review. How expensive are socket stats? If we wanted to do this, we would need to add a new category for it in --disable_metrics, rather than a new flag. But we would only do that if socket stats are very expensive.

@dqminh
Copy link
Contributor Author
dqminh commented Nov 5, 2020

@dashpole in the commit, i put in some numbers from when i test it

With args:

/cadvisor \
  -disable_metrics=sched,hugetlb,referenced_memory,cpu_topology,tcp,udp,advtcp \
  -housekeeping_interval=1s -profiling=true

and using profile?seconds=10, we have

  • before: processStatsFromProcs takes 25.93% with os.Readlink takes 16.67%.
  • after: processStatsFromProcs takes 11.82%

We have been running this for a while on our machines, here's a graph

Screenshot from 2020-11-05 10-43-31

The first significant bump is when we update cadvisor to https://github.com/google/cadvisor/commits/6f30891d89cc5b9cf93634048eb9f77c748a1437 (without these patches) from 0.30,
the reduction is with these patches. As you see they are still consuming more than before, but i would accept that is from extra metrics in existing categories that we are collecting. Would be ideal to be able to further remove unused metrics, but it's an acceptable cost for us at the moment.

@dims dims requested a review from bobbypage November 5, 2020 11:54
@dims
Copy link
Collaborator
dims commented Nov 5, 2020

@bobbypage this looks legit to me :)

@dqminh
Copy link
Contributor Author
dqminh commented Jan 25, 2021

@dashpole ping again. Is there anything we can do here ?

@dashpole
Copy link
Collaborator

This should be a new category of disable metrics, instead of a new flag.

}
if strings.HasPrefix(linkName, "socket") {
socketCount++
if !noSocketCount {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps add something like this instead?

if noSocketCount
     return fdCount, socketCount, nil
}

@eero-t
Copy link
Contributor
eero-t commented Jul 14, 2021

GitHub action comments between the code state that there are undefined names being used.

How expensive are socket stats? If we wanted to do this, we would need to add a new category for it in --disable_metrics, rather than a new flag.

I concur that this change is too insignificant to warrant its own option. It should be part of -disable_metrics.

I think socket (count) item being dependent on process stats item is confusing to the users of that option though, and it should be made independent of process stats.

(If it's not made independent, then this PR documentation updates are incomplete, all places in docs mentioning process or socket stats should be updated to note that there's a dependency between their CLI options.)

Also, when there's a disable option for some item, outputting of the disabled data should also be skipped (e.g. in Prometheus output). Currently that too is missing from this PR.

eero-t added a commit to eero-t/cadvisor that referenced this pull request Aug 20, 2021
Because collecting that data is reported to have noticeable impact on
cAdvisor load:
	google#2645

Note: this option depends on process stats being enabled, as it's a
subset of them.
@eero-t
Copy link
Contributor
eero-t commented Aug 20, 2021

Here's (only build tested) code for adding "sock_count" to metrics one can use with -disable_metrics/-enable_metrics: https://github.com/eero-t/cadvisor/commits/sock-count-option

I'm not sure how its dependency on process stats option should be documented.

eero-t added a commit to eero-t/cadvisor that referenced this pull request Aug 23, 2021
Because collecting that data is reported to have noticeable impact on
cAdvisor load:
	google#2645

Added "sock_count" option depends on "process" stats being enabled, as
it's a subset of them.
eero-t added a commit to eero-t/cadvisor that referenced this pull request Aug 26, 2021
Because collecting that data is reported to have noticeable impact on
cAdvisor load:
	google#2645

Added "sock_count" option depends on "process" stats being enabled, as
it's a subset of them.
@eero-t
Copy link
Contributor
eero-t commented Aug 26, 2021

Here's (only build tested) code for adding "sock_count" to metrics one can use with -disable_metrics/-enable_metrics: https://github.com/eero-t/cadvisor/commits/sock-count-option

I'm not sure how its dependency on process stats option should be documented.

I've rebased above branch/commit to latest upstream (adding missing items to Prometheus doc), so that I could add "sock_count" option to relevant places in docs. If separating sock count is still useful, I could make a new PR of that to replace this...?

eero-t added a commit to eero-t/cadvisor that referenced this pull request Aug 30, 2021
Because collecting that data is reported to have noticeable impact on
cAdvisor load:
	google#2645

Added "sock_count" option depends on "process" stats being enabled, as
it's a subset of them.
eero-t added a commit to eero-t/cadvisor that referenced this pull request Sep 16, 2021
Because collecting that data is reported to have noticeable impact on
cAdvisor load:
	google#2645

Added "sock_count" option depends on "process" stats being enabled, as
it's a subset of them.
eero-t added a commit to eero-t/cadvisor that referenced this pull request Sep 17, 2021
Because collecting that data is reported to have noticeable impact on
cAdvisor load:
	google#2645

Added "sock_count" option depends on "process" stats being enabled, as
it's a subset of them.
eero-t added a commit to eero-t/cadvisor that referenced this pull request Oct 12, 2021
Because collecting that data is reported to have noticeable impact on
cAdvisor load:
	google#2645

Added "sock_count" option depends on "process" stats being enabled, as
it's a subset of them.
`ioutil.Readdir` does extra Lstat on the files and also sorts the files, but we
have no need for that, so switch to using only Readdirnames.

With args:

```
/cadvisor \
  -disable_metrics=sched,hugetlb,referenced_memory,cpu_topology,tcp,udp,advtcp \
  -housekeeping_interval=1s -profiling=true
```

and using `profile?seconds=10`:

- before: `processStatsFromProcs` takes 38.46%
- after: `processStatsFromProcs` takes 25.93%
@dqminh dqminh force-pushed the dqminh/optimize-process-stats branch from b70237f to ec94a2f Compare October 15, 2021 16:06
@google-cla google-cla bot added the cla: yes label Oct 15, 2021
Socket count costs a lot of CPU cycles to collect as we need to perform readlink
to determine if fd is a socket. In many applications/use cases
such as load balancers etc, just fd count is enough for many monitoring purposes,
so we add a option to disable socket count stats collection here with
`process_socket_count` metric.

Signed-off-by: Daniel Dao <dqminh89@gmail.com>
@dqminh dqminh force-pushed the dqminh/optimize-process-stats branch from ec94a2f to eeb9b63 Compare October 15, 2021 16:08
}
if strings.HasPrefix(linkName, "socket") {
socketCount++
if includeSocketCount {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For less indentation / more readability, this can be

if !includeSocketCount {
	return
}
...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i kinda prefer the current way since it cleanly scopes the socket counting logic inside the block, but this is ok too if we prefer it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other checks are early-exit ones, so it would be more consistent within that function, and the loop is already scoping the counting logic. I do not feel strongly about this though as the loop is short enough to be readable even with extra indent, and there are some other functions in this same file that are not using early-exit.

eero-t added a commit to eero-t/cadvisor that referenced this pull request Nov 22, 2021
Because collecting that data is reported to have noticeable impact on
cAdvisor load:
	google#2645

Added "sock_count" option depends on "process" stats being enabled, as
it's a subset of them.
@dashpole dashpole removed their assignment Feb 10, 2022
@dqminh
Copy link
Contributor Author
dqminh commented Mar 22, 2022

Ping again for review since it has been some time and i kinda forgot about this, sorry !

@pacoxu
Copy link
Contributor
pacoxu commented Mar 28, 2022

This may help for fixing kubernetes/kubernetes#99183. Is this PR still in progress?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants