
Fail VMIs in irrecoverable state after an interrupted postcopy migration #11479

Merged
merged 5 commits into kubevirt:main from irrecoverable_postcopy on Mar 20, 2024

Conversation

vladikr
Member
@vladikr vladikr commented Mar 11, 2024

What this PR does

Before this PR:
Virtual machine instances would get stuck in a Paused/PostcopyFailed state.
Users cannot recover their VMs from this state.
Furthermore, the virt-handler will endlessly try to unpause the VMI, which will never succeed.

After this PR:
This PR lets the virt-handler fail VMIs that have reached an irrecoverable state.
It also adds functional tests to make sure that these VMs are eventually restarted.
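
For context, the core of the change can be sketched roughly like this (a sketch with assumed helper and constant names, not the actual PR code): once virt-handler sees the domain paused because post-copy failed, it stops retrying to unpause and moves the VMI to the Failed phase so the VM controller can recreate it.

// Rough sketch only; isIrrecoverable and ReasonPausedPostcopyFailed are assumed names.
func isIrrecoverable(domain *api.Domain) bool {
	// A domain paused because a post-copy migration failed cannot be resumed.
	return domain.Status.Status == api.Paused &&
		domain.Status.Reason == api.ReasonPausedPostcopyFailed
}

// In the virt-handler sync path:
if isIrrecoverable(domain) {
	vmi.Status.Phase = v1.Failed // skip the futile unpause retries
}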

Fixes: issues.redhat.com/browse/CNV-38195

Release note:
Virtual machine instances will no longer be stuck in an irrecoverable state after an interrupted postcopy migration. Instead, they will fail and can be restarted.

@kubevirt-bot kubevirt-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. size/L sig/buildsystem Denotes an issue or PR that relates to changes in the build system. sig/compute labels Mar 11, 2024
@vladikr vladikr force-pushed the irrecoverable_postcopy branch 2 times, most recently from 9e2cbef to 9183acb Compare March 11, 2024 23:39
Contributor
@fossedihelm fossedihelm left a comment

Thanks @vladikr!
I am still reviewing the second commit, but can you please tell me whether the missing tryGracefully is relevant? Thank you!

return d.helperVmShutdown(vmi, domain, tryGracefully)
}

func (d *VirtualMachineController) helperVmShutdown(vmi *v1.VirtualMachineInstance, domain *api.Domain, tryGracefully bool) error {
Contributor

This tryGracefully is never used, so we perform a graceful shutdown whenever the ACPI feature is available.
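
For illustration, honoring the flag could look roughly like this (a sketch; isACPIEnabled, processVmShutdown and processVmDestroy stand in for whatever helpers the real code uses):

func (d *VirtualMachineController) helperVmShutdown(vmi *v1.VirtualMachineInstance, domain *api.Domain, tryGracefully bool) error {
	// Attempt the ACPI-based graceful path only when the caller asked for it
	// and the guest supports ACPI; otherwise destroy the domain directly.
	if tryGracefully && isACPIEnabled(vmi, domain) {
		return d.processVmShutdown(vmi, domain)
	}
	return d.processVmDestroy(vmi, domain)
}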

Member Author

@fossedihelm eagle eyes! Thanks.

* See the License for the specific language governing permissions and
* limitations under the License.
*
* Copyright 2024 Red Hat, Inc.
Contributor

Suggested change
* Copyright 2024 Red Hat, Inc.
* Copyright The KubeVirt Authors.

Member Author

Done

@vladikr vladikr force-pushed the irrecoverable_postcopy branch 2 times, most recently from c3bc4ba to 5380f7a Compare March 12, 2024 14:15
Contributor
@fossedihelm fossedihelm left a comment

@vladikr Sorry for the split review :)
Some comments below


})

Context("[Serial] should migrate using for postcopy", func() {
Contributor

This [Serial] is misleading, since one of the Entry items of the DescribeTable inside it will actually run in parallel due to the missing Serial decorator.
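
For reference, in Ginkgo v2 a "[Serial]" substring in the description text has no scheduling effect on its own; serialization comes from the Serial decorator, which can be applied per Entry (a sketch, not the test as written):

// Sketch: only the Entry carrying the Serial decorator runs serially.
DescribeTable("interrupting a post-copy migration", func() {
	// ...
},
	Entry("should fail the VMI and restart the VM", Serial),
)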

Member Author

Thanks. Yes, the top-level [Serial] was a mistake; it wasn't actually serial.

}, 150, 1*time.Second).Should(Equal(v1.MigrationFailed))
}

It("should make sure that VM restarts after failur", func() {
Contributor

Suggested change
It("should make sure that VM restarts after failur", func() {
It("should make sure that VM restarts after failure", func() {

runMigrationKillerPod(vmi.Status.NodeName)

By("Making sure that post-copy migration failed")
observePostCopyMigrationFailure(migration)
Contributor

This function can be dropped in favor of

Eventually(matcher.ThisMigration(migration), 150, 1*time.Second).Should(BeInPhase(v1.MigrationFailed))

Comment on lines 321 to 330
func beRestarted(oldUID types.UID) gomegatypes.GomegaMatcher {
return gstruct.PointTo(gstruct.MatchFields(gstruct.IgnoreExtras, gstruct.Fields{
"ObjectMeta": gstruct.MatchFields(gstruct.IgnoreExtras, gstruct.Fields{
"UID": Not(Equal(oldUID)),
}),
"Status": gstruct.MatchFields(gstruct.IgnoreExtras, gstruct.Fields{
"Phase": Equal(v1.Running),
}),
}))
}
Contributor

This function is great, and it's the second time I've seen it. Maybe we should start thinking about creating a new file under the matcher pkg.

Member Author

Yes, it looks like there are many functions that should be moved into the matcher package. I prefer to do that separately in a follow-up.
For now, I just made beRestarted public.
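
Once exported, usage from a test would presumably read like this (a sketch, assuming the existing ThisVMI helper in the matcher pkg):

// Sketch: poll the VMI by name until it comes back with a new UID in Running phase.
Eventually(matcher.ThisVMI(vmi), 240*time.Second, 1*time.Second).Should(matcher.BeRestarted(oldUID))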

Contributor

Yes, sure! I just wanted to mention it to share the issue :)

Comment on lines 405 to 412
func confirmMigrationMode(virtClient kubecli.KubevirtClient, vmi *v1.VirtualMachineInstance, expectedMode v1.MigrationMode) {
By("Retrieving the VMI post migration")
vmi, err := virtClient.VirtualMachineInstance(vmi.Namespace).Get(context.Background(), vmi.Name, &metav1.GetOptions{})
Expect(err).ToNot(HaveOccurred())

By("Verifying the VMI's migration mode")
Expect(vmi.Status.MigrationState.Mode).To(Equal(expectedMode))
}
Contributor

Can we move this under libmigration pkg and remove the duplicate in the swap_test? I know it's outside of this PR :(

@vladikr vladikr force-pushed the irrecoverable_postcopy branch 2 times, most recently from 90dca2f to aceca1d Compare March 12, 2024 20:46
Contributor
@fossedihelm fossedihelm left a comment

Thank you @vladikr for your patience in the review process.
Last things from my side and then we are ready to go :)

Comment on lines 206 to 209
vmi := libvmi.NewFedora(libnet.WithMasqueradeNetworking()...)
vmi.Namespace = namespace
vmi.Spec.Domain.Resources.Requests[k8sv1.ResourceMemory] = resource.MustParse("3Gi")
vmi.Spec.Domain.Devices.Rng = &v1.Rng{}
Contributor

Probably we could save some time using Alpine or Cirros instead of Fedora. Here we don't need stress-ng :)
And in general we can drop the namespace parameter, since we are creating the VM directly using:
virtClient.VirtualMachine(testsuite.NamespacePrivileged).
Last, we can also use the libvmi options:

Suggested change
vmi := libvmi.NewFedora(libnet.WithMasqueradeNetworking()...)
vmi.Namespace = namespace
vmi.Spec.Domain.Resources.Requests[k8sv1.ResourceMemory] = resource.MustParse("3Gi")
vmi.Spec.Domain.Devices.Rng = &v1.Rng{}
vmi := libvmi.NewFedora(
append(libnet.WithMasqueradeNetworking(),
libvmi.WithResourceMemory("2Gi"),
libvmi.WithRng())...,
)

Member Author

@fossedihelm, we need stress-ng here:

  1. to make sure there is enough time before the migration switches to post-copy.
  2. to make sure the memory of the guest is already "polluted", so it will take time to copy the rest of the memory to the target after the switch to post-copy.
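
(For reference, the memory-dirtying step looks roughly like the following; the exact command and console helper are a sketch, not a quote from the test:)

// Sketch: keep rewriting guest memory so pre-copy cannot converge and the
// migration is forced to switch to post-copy mode.
By("Running stress-ng inside the guest to dirty its memory")
Expect(console.RunCommand(vmi, "stress-ng --vm 1 --vm-bytes 1G --vm-keep &", 15*time.Second)).To(Succeed())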

Contributor

Right! Sorry, I missed the line! Sorry for the noise

// check VMI, confirm migration state
waitUntilMigrationMode(vmi, v1.MigrationPostCopy, 300)

// launch killer pod on every node that isn't the vmi's node
Contributor

// leftover
Actually we are doing the opposite :)
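
Presumably the comment should instead read something like (sketch):

// interrupt the in-flight post-copy migration by launching the killer pod
// on the VMI's own (source) node
runMigrationKillerPod(vmi.Status.NodeName)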

}))
}

func VMIMigrationWithGuestaAgent(virtClient kubecli.KubevirtClient, pvName string, memoryRequestSize string, migrationPolicy *migrationsv1.MigrationPolicy) {
Contributor

nit, typo:
VMIMigrationWithGuestAgent
IMO we should do something about this "test" shared between two files, but that is definitely outside the scope of this PR.
So we should do it in a follow-up :)

*
*/

package matcher
Contributor

<3

Comment on lines +41 to +46
func BeReady() gomegatypes.GomegaMatcher {
return gstruct.PointTo(gstruct.MatchFields(gstruct.IgnoreExtras, gstruct.Fields{
"Status": gstruct.MatchFields(gstruct.IgnoreExtras, gstruct.Fields{
"Ready": BeTrue(),
}),
}))
}
Contributor

There is another beReady definition in tests/hotplug/cpu.go if you want to unify everything :D

@fossedihelm
Contributor

@vladikr To me it looks good!
Once the conflicts are solved I will be happy to lgtm.
Thank you for the effort and the patience!

@vladikr vladikr force-pushed the irrecoverable_postcopy branch 2 times, most recently from ea01f70 to 6208133 Compare March 13, 2024 22:35
@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 14, 2024
@kubevirt-bot kubevirt-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 14, 2024
@fossedihelm
Contributor

It was a long journey! Thank you @vladikr 🙂
/lgtm

@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Mar 15, 2024
@kubevirt-bot kubevirt-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 18, 2024
Signed-off-by: Vladik Romanovsky <vromanso@redhat.com>
@vladikr
Member Author
vladikr commented Mar 18, 2024

/unhold

@kubevirt-bot kubevirt-bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 18, 2024
@fossedihelm
Contributor

/lgtm
/retest

@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Mar 19, 2024
@kubevirt-commenter-bot

Required labels detected, running phase 2 presubmits:
/test pull-kubevirt-e2e-windows2016
/test pull-kubevirt-e2e-kind-1.27-vgpu
/test pull-kubevirt-e2e-kind-sriov
/test pull-kubevirt-e2e-k8s-1.29-ipv6-sig-network
/test pull-kubevirt-e2e-k8s-1.27-sig-network
/test pull-kubevirt-e2e-k8s-1.27-sig-storage
/test pull-kubevirt-e2e-k8s-1.27-sig-compute
/test pull-kubevirt-e2e-k8s-1.27-sig-operator
/test pull-kubevirt-e2e-k8s-1.28-sig-network
/test pull-kubevirt-e2e-k8s-1.28-sig-storage
/test pull-kubevirt-e2e-k8s-1.28-sig-compute
/test pull-kubevirt-e2e-k8s-1.28-sig-operator

@kubevirt-commenter-bot

/retest-required
This bot automatically retries required jobs that failed/flaked on approved PRs.
Silence the bot with an /lgtm cancel or /hold comment for consistent failures.


@fossedihelm
Contributor

@vladikr the code-lint lane is complaining about adding stuff to the tests/utils file :)
/hold

@kubevirt-bot kubevirt-bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 19, 2024
Signed-off-by: Vladik Romanovsky <vromanso@redhat.com>
@kubevirt-bot kubevirt-bot added sig/storage and removed lgtm Indicates that a PR is ready to be merged. labels Mar 19, 2024
@fossedihelm
Contributor

/lgtm

@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Mar 19, 2024
@kubevirt-commenter-bot

Required labels detected, running phase 2 presubmits:
/test pull-kubevirt-e2e-windows2016
/test pull-kubevirt-e2e-kind-1.27-vgpu
/test pull-kubevirt-e2e-kind-sriov
/test pull-kubevirt-e2e-k8s-1.29-ipv6-sig-network
/test pull-kubevirt-e2e-k8s-1.27-sig-network
/test pull-kubevirt-e2e-k8s-1.27-sig-storage
/test pull-kubevirt-e2e-k8s-1.27-sig-compute
/test pull-kubevirt-e2e-k8s-1.27-sig-operator
/test pull-kubevirt-e2e-k8s-1.28-sig-network
/test pull-kubevirt-e2e-k8s-1.28-sig-storage
/test pull-kubevirt-e2e-k8s-1.28-sig-compute
/test pull-kubevirt-e2e-k8s-1.28-sig-operator

@fossedihelm
Contributor

/retest
/unhold
the code-lint lane has been fixed

@kubevirt-bot kubevirt-bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 20, 2024
@vladikr
Member Author
vladikr commented Mar 20, 2024

/test pull-kubevirt-e2e-k8s-1.27-sig-storage

@kubevirt-bot kubevirt-bot merged commit a18b070 into kubevirt:main Mar 20, 2024
38 checks passed
@vladikr
Member Author
vladikr commented Mar 26, 2024

/cherry-pick release-1.2

@kubevirt-bot
Contributor

@vladikr: #11479 failed to apply on top of branch "release-1.2":

Applying: fail VMIs when it reaches an irrecoverable state
Applying: test: vms are restarted after irrecoverable post-copy migration
Using index info to reconstruct a base tree...
M	tests/migration/BUILD.bazel
M	tests/migration/migration.go
M	tests/utils.go
Falling back to patching base and 3-way merge...
Auto-merging tests/utils.go
Auto-merging tests/migration/migration.go
CONFLICT (content): Merge conflict in tests/migration/migration.go
Auto-merging tests/migration/BUILD.bazel
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0002 test: vms are restarted after irrecoverable post-copy migration
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-1.2

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@fossedihelm
Contributor

/cherry-pick release-1.2

@vladikr Do you have the capacity to create a manual cherry-pick? Thank you!
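
For reference, a manual cherry-pick typically follows the standard git flow (branch name and commit range are placeholders, not project tooling):

# assuming 'upstream' points at kubevirt/kubevirt
git fetch upstream
git checkout -b manual-cherry-pick-11479 upstream/release-1.2
git cherry-pick -x <first-commit>^..<last-commit>
# resolve the conflict in tests/migration/migration.go, then:
git cherry-pick --continue
git push origin manual-cherry-pick-11479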

dankenigsberg added a commit to dankenigsberg/hyperconverged-cluster-operator that referenced this pull request Apr 10, 2024
With kubevirt/kubevirt#11479 we can now drop the warning about VMs reaching an inconsistent state.

Signed-off-by: Dan Kenigsberg <danken@redhat.com>
kubevirt-bot pushed a commit to kubevirt/hyperconverged-cluster-operator that referenced this pull request Apr 10, 2024
With kubevirt/kubevirt#11479 we can now drop the warning about VMs reaching an inconsistent state.

Signed-off-by: Dan Kenigsberg <danken@redhat.com>
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. lgtm Indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/buildsystem Denotes an issue or PR that relates to changes in the build system. sig/compute sig/storage size/XL