Random invalid session and inconsistent service accounts #19510
Comments
@pschichtel Do you only have one KES server for production?
@jiuker production has 3, backup has 1
Do the 3 KES servers have the same keys for production? @pschichtel
They are all connected to the same vault (with a dedicated V2 KV engine for minio), so I'd assume so. How can I check?
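For reference, one way to compare what the KES servers actually serve is to list the keys on each instance and diff the output. This is only a sketch, assuming the standard KES CLI and its client-certificate environment variables; the hostnames and file names are placeholders.

```sh
# List the keys each production KES instance serves and compare the results.
export KES_CLIENT_CERT=client.crt KES_CLIENT_KEY=client.key
KES_SERVER=https://kes-0.example.org:7373 kes key ls
KES_SERVER=https://kes-1.example.org:7373 kes key ls
KES_SERVER=https://kes-2.example.org:7373 kes key ls
```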
Could you check the keys?
Not sure what you mean
Check for overlapping value assignments between two clients
Sorry for being confused!
By key, do you mean a KES key or an access key/secret access key? There is no "site-replicator-0" KES key, so I assume access key. What do you mean by "value" then?
What do you mean by "value assignments"? And what clients? I just checked with mcli again (
I only found one strange case: when I open two minio login pages in one browser at the same time, one of them says that
@jiuker I have the occasional case where the page after login stays blank. I think your case sounds like a race condition on the shared cookies/localStorage/sessionStorage between the browser tabs, which are not shared between browsers.
Yeah. It will return back to the login page for
@jiuker I don't think it is limited to a specific page; I've seen it happen on several different pages.
I'm not so sure anymore, because I get errors with mcli too, and that doesn't go through the console, right?
We can't reproduce any of the issues reported here.
How can I properly clear replication settings from both sites? Then I could test the production cluster without site replication and see if that helps.
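For anyone landing here: a rough sketch of what clearing the site-replication setup from the CLI can look like. The aliases match the ones used elsewhere in this thread; check `mcli admin replicate --help` in your mc/mcli version for the exact subcommands and flags.

```sh
# Show the currently registered sites.
mcli admin replicate info production
# Remove the backup peer from the site-replication setup
# (syntax may differ slightly between mc/mcli releases).
mcli admin replicate rm production backup --force
```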
I just noticed that even the replication rules on buckets are completely inconsistent from refresh to refresh. @poornas thanks, I'll try that next week.
Bucket versioning is also affected. It seems like everything somehow related to site replication is completely inconsistent between the nodes of the production cluster. It also seems to have gotten worse since I checked last week.
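A quick way to cross-check what the console shows against what the server reports to the CLI; the alias and bucket name below are placeholders.

```sh
# Versioning state of a bucket as reported to mcli.
mcli version info production/somebucket
# Replication rules configured on that bucket.
mcli replicate ls production/somebucket
```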
@poornas I removed the backup site from replication and it's all fine now. Should the
Remove
Are you saying it should automatically disappear after removing site replication? Because it hasn't so far, neither in the production site nor in the backup site. Neither site has any other replication rules. So I'll delete the service accounts to have a clean state.
Yeah. It should disappear. If not, you can try deleting it; I couldn't reproduce your case.
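For completeness, deleting the leftover account by hand would look roughly like this; a sketch, where `site-replicator-0` is the service-account access key mentioned above.

```sh
# Remove the stale site-replication service account from the root user.
mcli admin user svcacct rm production site-replicator-0
```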
I removed the accounts. I'll upgrade both instances to the latest version now and then set up replication again in the evening.
I remember @harshavardhana saying something about this in a past issue: the /v1/service-accounts endpoint is rather slow (400-900 ms "wait" time in the browser). Given that this is a small cluster (5 nodes), only 3 service accounts exist, and my connection is basically local, this feels noticeably slow in the UI. This is still the case even after disabling replication. Is the timing within a normal range, or would this be worth investigating? I originally thought this was caused by the replication problem, but apparently it isn't.
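To take the browser out of the picture, the endpoint can be timed directly. This is only a sketch; the cookie name is an assumption about how the console stores its session, and the URL matches the redirect URL used later in this thread.

```sh
# Measure time-to-first-byte and total time for the service-accounts endpoint.
curl -s -o /dev/null \
  -w 'ttfb=%{time_starttransfer}s total=%{time_total}s\n' \
  -H "Cookie: token=$CONSOLE_SESSION" \
  "https://console.minio.example.org/api/v1/service-accounts"
```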
I'd argue the operator would better stop and enter an error state when a correct upgrade is not possible. How can we find the cause of the

```yaml
tenant:
image:
repository: quay.io/minio/minio
tag: 'RELEASE.2024-06-11T03-13-30Z'
name: tenant
configuration:
name: credentials
pools:
- name: main
servers: 5
volumesPerServer: 1
storageClassName: ''
size: 123123123
labels:
velero.io/exclude-from-backup: "true"
metrics:
enabled: true
certificate:
requestAutoCert: true
env:
- name: MINIO_OPERATOR_TLS_ENABLE
value: "off"
- name: MINIO_DOMAIN
value: "minio.example.org"
- name: MINIO_BROWSER_REDIRECT_URL
value: "https://console.minio.example.org"
- name: MINIO_SERVER_URL
value: "https://minio.example.org"
log:
disabled: true
prometheus:
disabled: true
prometheusOperator: true
```
Also, there is no NetworkPolicy and no outbound firewall in general, so there is no reason why the operator shouldn't be able to download and distribute the minio binaries.
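One way to rule out connectivity issues from inside the cluster is a throwaway pod; the namespace and image below are assumptions, and dl.min.io is MinIO's public download server, used here just as a reachability target.

```sh
# Confirm a pod in the cluster can reach MinIO's download server over HTTPS.
kubectl -n minio-operator run nettest --rm -it --restart=Never \
  --image=curlimages/curl -- \
  curl -sI https://dl.min.io/server/minio/release/linux-amd64/
```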
@harshavardhana today's minio release seems to have a few changes relevant to this issue (especially the binary validation). Is it worth testing with this version, or would the operator likely still fail to update the binary?
Not really, that seems to be a different problem. I will investigate it later in the coming week.
OK, great. Tag me if you need something tested.
I think k0s is causing these problems. What is your container runtime?
k0s' default: containerd/runc, nothing fancy. The affected clusters are on k0s 1.29.x and 1.30.x.
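For the record, the runtime each node reports can be read straight from the Kubernetes API:

```sh
# The CONTAINER-RUNTIME column shows e.g. containerd://1.7.x for k0s nodes.
kubectl get nodes -o wide
```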
This is the k0sctl config I use to deploy the single-node home lab:

```yaml
apiVersion: k0sctl.k0sproject.io/v1beta1
kind: Cluster
metadata:
name: k0s-cluster
spec:
hosts:
- ssh:
address: ...
user: ...
port: ...
keyPath: ...
role: single
privateInterface: ...
privateAddress: &ip ...
# This currently won't work: https://github.com/k0sproject/k0sctl/issues/476
# installFlags: ['--kubelet-extra-args="--cpu-manager-policy=static"']
installFlags: ['--profile=enable-swap']
k0s:
version: v1.30.2+k0s.0
dynamicConfig: false
config:
apiVersion: k0s.k0sproject.io/v1beta1
kind: ClusterConfig
metadata:
name: k0s
spec:
api:
address: *ip
externalAddress: ...
k0sApiPort: 9443
port: 6443
sans:
- ...
- *ip
- 127.0.0.1
extensions:
helm:
charts: null
repositories: null
storage: {}
installConfig:
users:
etcdUser: etcd
kineUser: kube-apiserver
konnectivityUser: konnectivity-server
kubeAPIserverUser: kube-apiserver
kubeSchedulerUser: kube-scheduler
network:
calico: null
dualStack: {}
kubeProxy:
mode: iptables
kuberouter:
autoMTU: true
mtu: 0
# removing these will break networking
peerRouterASNs: ""
peerRouterIPs: ""
podCIDR: 10.244.0.0/16
provider: kuberouter
serviceCIDR: 10.96.0.0/12
storage:
etcd:
externalCluster: null
peerAddress: *ip
type: etcd
workerProfiles:
- name: enable-swap
values:
memorySwap:
swapBehavior: LimitedSwap
featureGates:
NodeSwap: true
```

Most of the values are defaults.
Can you also check by exec'ing into the operator pod at
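A sketch of the exec, assuming the default operator namespace and deployment name, and that the operator image ships a shell:

```sh
# Open a shell inside the operator pod to inspect what it downloaded/extracted.
kubectl -n minio-operator exec -it deploy/minio-operator -- sh
```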
Oh, looks like this is not kept back after the fetch().
These changes are to investigate the issue minio/minio#19510:
> Tar file extraction failed for file index: 2, with: EOF
Hopefully with this change we can do some deeper investigation.
@harshavardhana if you have a rough guide on how to deploy that change as a custom container, I'd happily throw that on my home lab.
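In case it helps others, a rough sketch of packaging a MinIO commit as a custom image; the commit placeholder, base image, registry, and tag are all assumptions, and the tenant then has to be pointed at the pushed image via `tenant.image.repository`/`tag` in the values shown above.

```sh
# Build the server binary from the commit under test and wrap it in an image.
git clone https://github.com/minio/minio && cd minio
git checkout <commit-with-the-extra-logging>   # placeholder commit
CGO_ENABLED=0 go build -o minio .
cat > Dockerfile.custom <<'EOF'
FROM ubuntu:22.04
COPY minio /usr/bin/minio
ENTRYPOINT ["/usr/bin/minio"]
EOF
docker build -f Dockerfile.custom -t registry.example.org/minio:custom .
docker push registry.example.org/minio:custom
```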
There are situations where the tar extraction might fail, we need to investigate why this happens. Via minio/minio#19510:
> Tar file extraction failed for file index: 2, with: EOF
What is your current operator version?
5.0.15
I guess we are waiting for the operator 6 release here, right?
@pschichtel yes
However, I would suggest upgrading to the latest release to fix a few things; it now waits properly. Operator 6.0 would handle the container behavior.
I upgraded the operator to 5.0.16 and upgraded minio to RELEASE.2024-07-04T14-25-45Z. 5.0.16 indeed doesn't carry your changes, and the same issue with the same log line still occurs, but after the upgrade the cluster still seems fine so far. So your changes seem to have helped with the symptoms.
Yeah, they will definitely help; however, v6.0.0 would not cause the same problem as v5.0.16.
So this issue can be closed?
Yeah, we can close this, and I promise to verify that operator 6 indeed performs the upgrade correctly or I'll file a new issue with the additional information from the logs. Would I file the new issue here or in the operator repo?
In the operator repo please. @pschichtel thanks for your patience on this over the last couple of months.
This is a follow-up to #19217 and #19201
After my vacation I verified the state of the minio installation again, following up on the previous issues.
Expected Behavior
Once logged in, I'd expect not to randomly receive "invalid session" warnings or to get randomly logged out when navigating to certain pages (e.g. the Site Replication config page).
I would also expect to see the same service accounts on my root user every time I refresh the Access Keys page (or when directly accessing /api/v1/service-accounts).
Current Behavior
I randomly get invalid session responses ("The Access Key Id you provided does not exist in our records.") from the backend, and on some pages that leads to a redirect to the login page.
I also get a different list of service accounts every time I refresh; sometimes it doesn't even include the site-replicator-0 account, which would explain why I'm still seeing #19217. Actually, in my tests now, refreshing /api/v1/service-accounts a bunch of times, I rarely get all 4 service accounts.
The backup site still occasionally logs this as in #19217:
Steps to Reproduce (for bugs)
I'm still not sure how I arrived at this state; I assume by enabling site replication.
I've checked that KES is working on both the production and the backup site. At this point I'm not even able to disable site replication on the production site, because I constantly get logged out of (redirected away from) the page.
The single-node backup instance does not observe this behavior. There, I never get invalid session responses, I always get the same 4 service accounts on the root user (including site-replicator-0), and I can also access the Site Replication page.
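For what it's worth, a crude way to observe the inconsistent listings outside the browser is to hash repeated responses; the cookie name is an assumption about the console's session handling, and the URL matches the redirect URL from the tenant config.

```sh
# If the responses differ between requests, the hashes will differ too.
for i in $(seq 1 20); do
  curl -s -H "Cookie: token=$CONSOLE_SESSION" \
    "https://console.minio.example.org/api/v1/service-accounts" | md5sum
done
```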
Context
It makes using the minio console difficult. I assume replication from backup to production would not reliably work (or would be a lot slower), but that's not currently something I need to do.
Interestingly, mcli admin user svcacct list production admin always returns the complete list of service accounts for my root user, although not always in the same order, but that doesn't matter. S3 clients in general don't seem to be affected, at least not functionally.
To elaborate on the setup:
2 sites:
The keys between the KES deployments are identical (replicated files from the production site can be decrypted on the backup site). The production KES setup is responsive and can successfully access the vault (I created and deleted a test key to confirm).
Your Environment
minio --version: RELEASE.2024-04-06T05-26-02Z
uname -a: Linux 6.1.0-17-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.69-1 (2023-12-30) x86_64 GNU/Linux