-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fail VMIs in irrecoverable state after an interrupted postcopy migration #11479
Conversation
9e2cbef
to
9183acb
Compare
9183acb
to
7373b10
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @vladikr!
I am still reviewing the second commit but can you please tell me if the missing tryGracefully
is relevant? Thank you!
return d.helperVmShutdown(vmi, domain, tryGracefully) | ||
} | ||
|
||
func (d *VirtualMachineController) helperVmShutdown(vmi *v1.VirtualMachineInstance, domain *api.Domain, tryGracefully bool) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this tryGracefully
is never used, so we perform gracefully shutdown if ACPI feature is available.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fossedihelm eagle eyes! Thanks.
tests/migration/postcopy.go
Outdated
* See the License for the specific language governing permissions and | ||
* limitations under the License. | ||
* | ||
* Copyright 2024 Red Hat, Inc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Copyright 2024 Red Hat, Inc. | |
* Copyright The KubeVirt Authors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
c3bc4ba
to
5380f7a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vladikr Sorry for the splitted review :)
Some comments below
tests/migration/postcopy.go
Outdated
|
||
}) | ||
|
||
Context("[Serial] should migrate using for postcopy", func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This [Serial]
is misleading since actually one of the Entry
of the DescribeTable
inside of it will run in parallel due to the missing Serial
decorator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Yes, the top level [Serial] was a mistake, it wasn't actually Serial
tests/migration/postcopy.go
Outdated
}, 150, 1*time.Second).Should(Equal(v1.MigrationFailed)) | ||
} | ||
|
||
It("should make sure that VM restarts after failur", func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It("should make sure that VM restarts after failur", func() { | |
It("should make sure that VM restarts after failure", func() { |
tests/migration/postcopy.go
Outdated
runMigrationKillerPod(vmi.Status.NodeName) | ||
|
||
By("Making sure that post-copy migration failed") | ||
observePostCopyMigrationFailure(migration) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function can be dropped in favor of
Eventually(matcher.ThisMigration(migration), 150, 1*time.Second).Should(BeInPhase(v1.MigrationFailed))
tests/migration/postcopy.go
Outdated
func beRestarted(oldUID types.UID) gomegatypes.GomegaMatcher { | ||
return gstruct.PointTo(gstruct.MatchFields(gstruct.IgnoreExtras, gstruct.Fields{ | ||
"ObjectMeta": gstruct.MatchFields(gstruct.IgnoreExtras, gstruct.Fields{ | ||
"UID": Not(Equal(oldUID)), | ||
}), | ||
"Status": gstruct.MatchFields(gstruct.IgnoreExtras, gstruct.Fields{ | ||
"Phase": Equal(v1.Running), | ||
}), | ||
})) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is great and it's the second time I see it. Maybe we should start thinking of creating a new file under matcher
pkg.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it looks like there are many functions that need to be separated into the matcher. I prefer to do it separately in a follow up.
For now, I just made the beRestarted
public.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes sure! I just to mention it to share the issue :)
tests/migration/postcopy.go
Outdated
func confirmMigrationMode(virtClient kubecli.KubevirtClient, vmi *v1.VirtualMachineInstance, expectedMode v1.MigrationMode) { | ||
By("Retrieving the VMI post migration") | ||
vmi, err := virtClient.VirtualMachineInstance(vmi.Namespace).Get(context.Background(), vmi.Name, &metav1.GetOptions{}) | ||
Expect(err).ToNot(HaveOccurred()) | ||
|
||
By("Verifying the VMI's migration mode") | ||
Expect(vmi.Status.MigrationState.Mode).To(Equal(expectedMode)) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move this under libmigration
pkg and remove the duplicate in the swap_test? I know it's outside of this PR :(
90dca2f
to
aceca1d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @vladikr for your patience in the review process.
Last things from my side and then we are ready to go :)
tests/migration/postcopy.go
Outdated
vmi := libvmi.NewFedora(libnet.WithMasqueradeNetworking()...) | ||
vmi.Namespace = namespace | ||
vmi.Spec.Domain.Resources.Requests[k8sv1.ResourceMemory] = resource.MustParse("3Gi") | ||
vmi.Spec.Domain.Devices.Rng = &v1.Rng{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably we could save some time using Alpine or Cirros instead of Fedora. Here we don't need stress-ng
:)
And in general we can drop the namespace parameter since we are creating the VM using directly:
virtClient.VirtualMachine(testsuite.NamespacePrivileged)
.
Last, we can also use use libvmi Options:
vmi := libvmi.NewFedora(libnet.WithMasqueradeNetworking()...) | |
vmi.Namespace = namespace | |
vmi.Spec.Domain.Resources.Requests[k8sv1.ResourceMemory] = resource.MustParse("3Gi") | |
vmi.Spec.Domain.Devices.Rng = &v1.Rng{} | |
vmi := libvmi.NewFedora( | |
append(libnet.WithMasqueradeNetworking(), | |
libvmi.WithResourceMemory("2Gi"), | |
libvmi.WithRng())..., | |
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fossedihelm, we need stress-ng here.
- to make sure there is enough time before the migration switches to post copy.
- that the memory of that guest is already "polluted" so it will take time copy the rest of the memory to target after switch to post copy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right! Sorry, I missed the line! Sorry for the noise
tests/migration/postcopy.go
Outdated
// check VMI, confirm migration state | ||
waitUntilMigrationMode(vmi, v1.MigrationPostCopy, 300) | ||
|
||
// launch killer pod on every node that isn't the vmi's node |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// leftorver
Actually we are doing the opposite :)
tests/migration/postcopy.go
Outdated
})) | ||
} | ||
|
||
func VMIMigrationWithGuestaAgent(virtClient kubecli.KubevirtClient, pvName string, memoryRequestSize string, migrationPolicy *migrationsv1.MigrationPolicy) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, typo:
VMIMigrationWithGuestAgent
IMO we should do something to this "test" shared between two files. But definitely outside of the scope of this PR.
So we should do it in a follow up :)
* | ||
*/ | ||
|
||
package matcher |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
<3
func BeReady() gomegatypes.GomegaMatcher { | ||
return gstruct.PointTo(gstruct.MatchFields(gstruct.IgnoreExtras, gstruct.Fields{ | ||
"Status": gstruct.MatchFields(gstruct.IgnoreExtras, gstruct.Fields{ | ||
"Ready": BeTrue(), | ||
}), | ||
})) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is another beReady
definition in tests/hotplug/cpu.go
if you want to unify everything :D
aceca1d
to
359df8a
Compare
@vladikr To me looks good! |
ea01f70
to
6208133
Compare
6208133
to
de72a51
Compare
de72a51
to
dce4336
Compare
It was a long journey! Thank you @vladikr 🙂 |
4938a4d
to
552f7fd
Compare
552f7fd
to
c6872da
Compare
Signed-off-by: Vladik Romanovsky <vromanso@redhat.com>
c6872da
to
832a708
Compare
/unhold |
/lgtm |
Required labels detected, running phase 2 presubmits: |
/retest-required |
1 similar comment
/retest-required |
@vladikr code-lint lane is complaining about adding stuff in |
Signed-off-by: Vladik Romanovsky <vromanso@redhat.com>
/lgtm |
Required labels detected, running phase 2 presubmits: |
/retest |
/test pull-kubevirt-e2e-k8s-1.27-sig-storage |
/cherry-pick release-1.2 |
@vladikr: #11479 failed to apply on top of branch "release-1.2":
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@vladikr Do you have the capacity to create a manual cherry-pick? Thank you! |
With kubevirt/kubevirt#11479 we can now drop the warning about VMs reaching an inconsistent state. Signed-off-by: Dan Kenigsberg <danken@redhat.com>
With kubevirt/kubevirt#11479 we can now drop the warning about VMs reaching an inconsistent state. Signed-off-by: Dan Kenigsberg <danken@redhat.com>
What this PR does
Before this PR:
Virtual machine instances would get stuck in a Paused/PostcopyFailed state.
Users cannot recover their VMs from this state.
Furthermore, the virt-handler will endlessly try to unpause the VMI, which will never succeed.
After this PR:
This PR lets the virt-handler Fail VMIs that have reached an irrecoverable state.
It is also adding functional tests to make sure that VMs will be eventually restarted.
Fixes # issues.redhat.com/browse/CNV-38195