Random invalid session and inconsistent service accounts #19510

Open

pschichtel opened this issue Apr 15, 2024 · 83 comments

Labels: community, fixed in latest release (this issue is already fixed and upgrade is recommended)

Comments

@pschichtel
pschichtel commented Apr 15, 2024

This is a follow up to #19217 and #19201

After my vacation I verified the state of the MinIO installation again, following up on the previous issues.

Expected Behavior

Once logged in I'd expect not to randomly receive "invalid session" warnings or to get randomly logged out when navigating to certain pages (e.g. the Site Replication config page).

I would also expect to see the same service accounts on my root user every time I refresh the Access Keys page (or when directly accessing /api/v1/service-accounts).

Current Behavior

I randomly get invalid session responses ("The Access Key Id you provided does not exist in our records.") from the backend, and on some pages that leads to a redirect to the login page.

I also get a different list of service accounts every time I refresh; sometimes it doesn't even include the site-replicator-0 account, which would explain why I'm still seeing #19217. In fact, refreshing /api/v1/service-accounts a bunch of times just now, I rarely get all 4 service accounts.

The backup site still occasionally logs this as in #19217:

minio-1  | API: SRPeerBucketOps(bucket=154a22a1-8dca-4e64-98d8-687376a04d32)
minio-1  | Time: 16:58:03 UTC 04/15/2024
minio-1  | DeploymentID: bc54da3b-88f4-4a0d-a9d4-2365bf5a0d80
minio-1  | RequestID: 17C68294B9A6D50A
minio-1  | RemoteHost: 
minio-1  | Host: 
minio-1  | UserAgent: MinIO (linux; amd64) madmin-go/2.0.0
minio-1  | Error: Site replication error(s): 
minio-1  | 'ConfigureReplication' on site Production (bf3123cf-9753-4ea4-a46f-535599899c4c): failed(Backup->Production: Bucket target creation error: Remote service endpoint offline, target bucket: 154a22a1-8dca-4e64-98d8-687376a04d32 or remote service credentials: site-replicator-0 invalid 
minio-1  | 	The Access Key Id you provided does not exist in our records.) (*errors.errorString)
minio-1  |        4: internal/logger/logger.go:259:logger.LogIf()
minio-1  |        3: cmd/logging.go:30:cmd.adminLogIf()
minio-1  |        2: cmd/admin-handlers-site-replication.go:142:cmd.adminAPIHandlers.SRPeerBucketOps()
minio-1  |        1: net/http/server.go:2136:http.HandlerFunc.ServeHTTP()

Steps to Reproduce (for bugs)

I'm still not sure how I arrived at this state; I assume it was by enabling site replication.

I've checked that KES is working on both the production and the backup site. At this point I'm not even able to disable site replication on the production site, because I get constantly logged out (redirected to login page) from the page.

The single-node backup instance does not exhibit this behavior. There, I never get invalid session responses, I always get the same 4 service accounts on the root user (including site-replicator-0), and I can also access the Site Replication page.

Context

It makes using the MinIO console difficult. I assume replication from backup to production would not work reliably (or would be a lot slower), but that's not something I currently need to do.

Interestingly mcli admin user svcacct list production admin always returns the complete list of service accounts for my root user, although not always in the same order, but that doesn't matter. S3 clients in general don't seem to be affected, at least not functionally.

To elaborate on the setup:

2 sites:

  1. site (production): 5 nodes, each with 1 disk, deployed via minio-operator to k8s, kes configured against a vault running in the same k8s
  2. site (backup): 1 node with 1 disk, deployed via docker-compose, kes configured with filesystem, containing the necessary keys from vault (to decouple the backup site from the k8s).

The keys between the KES deployments are identical (replicated files from the production site can be decrypted on the backup site). The production KES setup is responsive and can successfully access the vault (I created and deleted a test key to confirm).

Your Environment

  • Version used (minio --version): RELEASE.2024-04-06T05-26-02Z
  • Server setup and configuration: deployed by operator (5.0.14), replicating to a single-node setup on the same version deployed with docker-compose.
  • Operating System and version (uname -a): Linux 6.1.0-17-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.69-1 (2023-12-30) x86_64 GNU/Linux
@jiuker
Contributor
jiuker commented Apr 16, 2024

@pschichtel Do you only have one KES server for production?

@pschichtel
Author

@jiuker production has 3, backup has 1

@jiuker
Contributor
jiuker commented Apr 16, 2024

@jiuker production has 3, backup has 1

Do the 3 KES servers have the same keys for production? @pschichtel

@pschichtel
Author

They are all connected to the same vault (with a dedicated V2 KV engine for minio), so I'd assume so. How can I check?
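
For reference, one way I could compare what each KES instance serves might be to list its keys with the kes CLI; a rough sketch, with placeholder endpoints and client identity (the exact commands would need to be verified against my kes version):

# point the kes CLI at each KES instance in turn and compare the key listings;
# endpoints and the client certificate/key below are placeholders
export KES_CLIENT_CERT=client.crt
export KES_CLIENT_KEY=client.key

for server in kes-0:7373 kes-1:7373 kes-2:7373; do
  echo "== $server =="
  KES_SERVER="https://$server" kes key ls
done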

@jiuker
Contributor
jiuker commented Apr 16, 2024

Could you check whether the value of the key site-replicator-0 keeps changing? @pschichtel

@pschichtel
Author

Not sure what you mean

@jiuker
Contributor
jiuker commented Apr 16, 2024

Not sure what you mean

Check for overlapping value assignments between two clients

@pschichtel
Author

Sorry for being confused!

Could you check whether the value of the key site-replicator-0 keeps changing?

By key, do you mean a KES key or an access key/secret access key? There is no "site-replicator-0" KES key, so I assume access key. What do you mean by "value" then?

Check for overlapping value assignments between two clients

What do you mean by "value assignments"? And what clients?

I just checked with mcli again (mcli admin user svcacct info production site-replicator-0) and every now and then I get mcli: <ERROR> Unable to get information of the specified service account. The specified service account is not found (Specified service account does not exist)., so I guess there must be an instance that doesn't have the account. Weirdly, I didn't notice that yesterday. If it doesn't fail with said error, it consistently returns this:

AccessKey: site-replicator-0
ParentUser: root-user
Status: on
Name: 
Description: 
Policy: implied
Expiration: no-expiry
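
To get a feel for how often the lookup fails, a quick probe could just loop the command above; a minimal sketch:

# repeat the same lookup and tally how often the account is reported missing
for i in $(seq 1 20); do
  if mcli admin user svcacct info production site-replicator-0 >/dev/null 2>&1; then
    echo ok
  else
    echo missing
  fi
done | sort | uniq -c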

@jiuker
Contributor
jiuker commented Apr 16, 2024

I only found one strange case: when I open two MinIO login pages in one browser at the same time, one of them returns {"detailedMessage":"Access Denied.","message":"invalid session"}, but with two separate browsers I can access it normally. I don't know if that's similar to your case? @pschichtel

@pschichtel
Author
pschichtel commented Apr 16, 2024

@jiuker I have the occasional case where the page after login stays blank, because /api/v1/session returns invalid session / The Access Key Id you provided does not exist in our records.. I don't need two tabs/browsers for that.

I think your case sounds like a race condition on the cookies/localStorage/sessionStorage shared between browser tabs, which are not shared between browsers.

@jiuker
Contributor
jiuker commented Apr 16, 2024

@jiuker I have the occasional case where the page after login stays blank, because /api/v1/session returns invalid session / The Access Key Id you provided does not exist in our records.. I don't need two tabs/browsers for that.

I think your case sounds like a race condition on the cookies/localStorage/sessionStorage shared between browser tabs, which are not shared between browsers.

Yeah. It returns to the login page when /api/v1/session returns {"detailedMessage":"Access Denied.","message":"invalid session"} on refreshing the page. It looks like your case. So I guess that could be a console issue. What's the page you were viewing when that happened? @pschichtel

@pschichtel
Author

@jiuker I don't think it is limited to a specific page, I've seen it happen on several different pages.

So I guess that could be a console issue.

I'm not so sure anymore, because I get errors with mcli too, and that doesn't go through the console, right?

@harshavardhana
Member

We can't reproduce any of the issues reported here.

@pschichtel
Author
pschichtel commented Apr 26, 2024

How can I properly clear replication settings from both sites? Then I could test the production cluster without site replication and see if that helps.

@poornas
Contributor
poornas commented Apr 26, 2024
 $ mc admin replicate remove sitea siteb  --force
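
As a follow-up check afterwards (a sketch, assuming the same sitea/siteb aliases), you could confirm that neither site still reports replication peers:

# confirm that site replication is gone on both sides
mc admin replicate info sitea
mc admin replicate info siteb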

@pschichtel
Author

I just noticed that even the replication rules on buckets are completely inconsistent from refresh to refresh.

@poornas thanks, I'll try that next week.

@pschichtel
Author

Bucket versioning is also affected. It seems like everything somehow related to site replication is completely inconsistent between the nodes of the production cluster. It also seems to have gotten worse since I checked last week.

@pschichtel
Author

@poornas I removed the backup site from replication and it's all fine now. Should the site-replicator-0 account disappear? Or should I clean that account up before re-enabling replication?

@jiuker
Contributor
jiuker commented Apr 29, 2024

Remove site replication and the site-replicator-0 account disappears. You cannot remove that account; it's an internal account.

@pschichtel
Author

Are you saying it should automatically disappear after removing site-replication? Because it hasn't so far, neither in the production site nor in the backup site. Both sites don't have any other replication rules.

So I'll delete the service accounts to have a clean state.
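
A sketch of the cleanup I have in mind, assuming production and backup are the mcli aliases for the two sites (I'd double-check the exact svcacct subcommand for my mcli version first):

# remove the leftover internal replication account on both sites
mcli admin user svcacct rm production site-replicator-0
mcli admin user svcacct rm backup site-replicator-0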

@jiuker
Contributor
jiuker commented Apr 29, 2024

Are you saying it should automatically disappear after removing site-replication? Because it hasn't so far, neither in the production site nor in the backup site. Both sites don't have any other replication rules.

So I'll delete the service accounts to have a clean state.

Yeah. It should disappear. If not, you can try deleting it; I couldn't reproduce your case.

@pschichtel
Author

I removed the accounts. I'll upgrade both instances to the latest release now and then set up replication again in the evening.

@pschichtel
Author

I remember @harshavardhana saying something about this in a past issue: the /v1/service-accounts endpoint is rather slow (400-900 ms "wait" time in the browser). Given that this is a small cluster (5 nodes), only 3 service accounts exist, and my connection is basically local, this feels noticeably slow in the UI. This is still the case even after disabling replication. Is the timing within a normal range, or would this be worth investigating? I originally thought it was caused by the replication problem, but apparently it isn't.
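
For reference, a rough way to time the endpoint outside the UI would be a sketch like this; the console URL and the session cookie are placeholders (the cookie copied from the browser dev tools):

# time the console endpoint directly, a few times in a row
for i in $(seq 1 10); do
  curl -s -o /dev/null -w '%{time_total}\n' \
    -H 'Cookie: <session-cookie-from-devtools>' \
    https://console.minio.example.org/api/v1/service-accounts
done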

@pschichtel
Author

I'd argue the operator should rather stop and enter an error state when a correct upgrade is not possible.

How can we find the cause of the "Tar file extraction failed for file index: 2, with: EOF" error? As I said, it happens in completely independent setups with different networks and different host systems (different hardware, kernel, and distro); however, both are on k0s and both use a similarly configured tenant created through the Helm chart. The values:

tenant:
  image:
    repository: quay.io/minio/minio
    tag: 'RELEASE.2024-06-11T03-13-30Z'
  name: tenant
  configuration:
    name: credentials
  pools:
  - name: main
    servers: 5
    volumesPerServer: 1
    storageClassName: ''
    size: 123123123
    labels:
      velero.io/exclude-from-backup: "true"
  metrics:
    enabled: true
  certificate:
    requestAutoCert: true
  env:
  - name: MINIO_OPERATOR_TLS_ENABLE
    value: "off"
  - name: MINIO_DOMAIN
    value: "minio.example.org"
  - name: MINIO_BROWSER_REDIRECT_URL
    value: "https://console.minio.example.org"
  - name: MINIO_SERVER_URL
    value: "https://minio.example.org"
  log:
    disabled: true
  prometheus:
    disabled: true
  prometheusOperator: true

@pschichtel
Author
pschichtel commented Jun 12, 2024

Also, there is no NetworkPolicy and no outbound firewall in general, so there is no reason why the operator shouldn't be able to download and distribute the MinIO binaries.
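
For what it's worth, a basic reachability check from inside the operator pod could look like this sketch (wget may not even be present in the operator image):

# check that the image registry is reachable from the operator pod
kubectl -n minio-operator exec deploy/minio-operator -- \
  sh -c 'wget -q -O /dev/null https://quay.io && echo reachable || echo unreachable'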

@pschichtel
Author

@harshavardhana today's MinIO release seems to have a few changes relevant to this issue (especially the binary validation). Is it worth testing with this version, or would the operator likely still fail to update the binary?

@harshavardhana
Member

@harshavardhana today's MinIO release seems to have a few changes relevant to this issue (especially the binary validation). Is it worth testing with this version, or would the operator likely still fail to update the binary?

Not really, that seems to be a different problem. I will investigate it later in the coming week.

@pschichtel
Author

OK, great. Tag me if you need something tested.

@harshavardhana
Member

I think k0s is causing these problems. What is your container runtime?

@pschichtel
Author

k0s' default: containerd/runc, nothing fancy. The affected clusters are on k0s 1.29.x and 1.30.x

@pschichtel
Author

This is the k0sctl config I use to deploy the single-node home lab:

apiVersion: k0sctl.k0sproject.io/v1beta1
kind: Cluster
metadata:
  name: k0s-cluster
spec:
  hosts:
  - ssh:
      address: ...
      user: ...
      port: ...
      keyPath: ...
    role: single
    privateInterface: ...
    privateAddress: &ip ...
    # This currently won't work: https://github.com/k0sproject/k0sctl/issues/476
    # installFlags: ['--kubelet-extra-args="--cpu-manager-policy=static"']
    installFlags: ['--profile=enable-swap']
  k0s:
    version: v1.30.2+k0s.0
    dynamicConfig: false
    config:
      apiVersion: k0s.k0sproject.io/v1beta1
      kind: ClusterConfig
      metadata:
        name: k0s
      spec:
        api:
          address: *ip
          externalAddress: ...
          k0sApiPort: 9443
          port: 6443
          sans:
          - ...
          - *ip
          - 127.0.0.1
        extensions:
          helm:
            charts: null
            repositories: null
          storage: {}
        installConfig:
          users:
            etcdUser: etcd
            kineUser: kube-apiserver
            konnectivityUser: konnectivity-server
            kubeAPIserverUser: kube-apiserver
            kubeSchedulerUser: kube-scheduler
        network:
          calico: null
          dualStack: {}
          kubeProxy:
            mode: iptables
          kuberouter:
            autoMTU: true
            mtu: 0
            # removing these will break networking
            peerRouterASNs: ""
            peerRouterIPs: ""
          podCIDR: 10.244.0.0/16
          provider: kuberouter
          serviceCIDR: 10.96.0.0/12
        storage:
          etcd:
            externalCluster: null
            peerAddress: *ip
          type: etcd
        workerProfiles:
        - name: enable-swap
          values:
            memorySwap:
              swapBehavior: LimitedSwap
            featureGates:
              NodeSwap: true

Most of the values are defaults.

@harshavardhana
Member

Can you also check by exec'ing into the operator pod and then running du -sh /tmp/webhook/v1/update/* and ls -ltr /tmp/webhook/v1/update/?

@pschichtel
Author
$ kubectl -n minio-operator exec -it pod/minio-operator-7fcd88996b-mcnpn -- bash
bash-5.1$ du -sh /tmp/webhook/v1/update/*
du: cannot access '/tmp/webhook/v1/update/*': No such file or directory
bash-5.1$ ls -ltr /tmp/webhook/v1/update/
ls: cannot access '/tmp/webhook/v1/update/': No such file or directory
bash-5.1$ 

@harshavardhana
Member

Oh, looks like this is not kept around after the fetch().

harshavardhana added a commit to harshavardhana/operator that referenced this issue Jun 24, 2024
these changes are to investigate the issue

minio/minio#19510

> Tar file extraction failed for file index: 2, with: EOF
@harshavardhana
Member

Hopefully with this change we can do some deeper investigation.

@pschichtel
Author

@harshavardhana if you have a rough guide on how to deploy that change as a custom container, I'd happily throw that on my home lab.

harshavardhana added a commit to harshavardhana/operator that referenced this issue Jun 24, 2024
There are situations where the tar extraction might fail,
we need to investigate why this happens.

via minio/minio#19510

> Tar file extraction failed for file index: 2, with: EOF
harshavardhana added a commit to minio/operator that referenced this issue Jun 24, 2024
There are situations where the tar extraction might fail,
we need to investigate why this happens.

via minio/minio#19510

> Tar file extraction failed for file index: 2, with: EOF
@harshavardhana
Member

@harshavardhana if you have a rough guide on how to deploy that change as a custom container, I'd happily throw that on my home lab.

What is your current operator version?

@pschichtel
Author

5.0.15

@pschichtel
Author

I guess we are waiting for the operator 6 release here, right?

@harshavardhana
Member

@pschichtel yes

@harshavardhana
Member

However, I would suggest upgrading to the latest release to fix a few things; it now waits properly.

Operator 6.0 would handle the container behavior.

@pschichtel
Author

I upgraded the operator to 5.0.16 and MinIO to RELEASE.2024-07-04T14-25-45Z. 5.0.16 indeed doesn't carry your changes, and the same issue with the same log line still occurs, but after the upgrade the cluster still seems fine so far. So your changes seem to have helped with the symptoms.

@harshavardhana
Member

I upgraded the operator to 5.0.16 and MinIO to RELEASE.2024-07-04T14-25-45Z. 5.0.16 indeed doesn't carry your changes, and the same issue with the same log line still occurs, but after the upgrade the cluster still seems fine so far. So your changes seem to have helped with the symptoms.

Yeah, they will definitely help; however, v6.0.0 would not cause the same problem as v5.0.16.

@harshavardhana
Member

I upgraded the operator to 5.0.16 and MinIO to RELEASE.2024-07-04T14-25-45Z. 5.0.16 indeed doesn't carry your changes, and the same issue with the same log line still occurs, but after the upgrade the cluster still seems fine so far. So your changes seem to have helped with the symptoms.

So this issue can be closed?

@pschichtel
Author

Yeah, we can close this, and I promise to verify that operator 6 indeed performs the upgrade correctly, or I'll file a new issue with the additional information from the logs. Should I file the new issue here or in the operator repo?

@harshavardhana
Member

Yeah, we can close this, and I promise to verify that operator 6 indeed performs the upgrade correctly, or I'll file a new issue with the additional information from the logs. Should I file the new issue here or in the operator repo?

In the operator repo please, @pschichtel. Thanks for your patience on this over the last couple of months.

@harshavardhana added the label "fixed in latest release" (this issue is already fixed and upgrade is recommended) and removed the labels "priority: medium" and "waiting for info" on Jul 8, 2024