
adding reboot functionality #378

Merged · 10 commits into EricLBuehler:master · Jun 13, 2024
Conversation

@gregszumel (Contributor)

This is something I whipped up; thoughts? The basic idea is to be able to restart the tokio runtime if it goes down for some reason.

It introduces a few new pieces, mainly (a rough sketch follows the list):

  • A RebootState, which just saves the args to `Engine::new`
  • A reboot function, which updates the MistralRs sender
  • The MistralRs sender is now wrapped in a RwLock, so we can safely swap in a new sender if the engine goes down
  • Since the sender is behind a RwLock, get_sender() can potentially fail, so get_sender() now returns a Result
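
A minimal sketch of that shape (illustration only, not the PR's actual code; `Request` is stubbed, the `RebootState` contents are elided, and lock poisoning is assumed to be the failure mode for get_sender()):

```rust
use std::sync::RwLock;
use tokio::sync::mpsc::Sender;

struct Request; // placeholder for the real request type

struct RebootState {
    // The arguments originally passed to Engine::new, kept around so a
    // replacement engine can be constructed later (contents elided).
}

struct MistralRs {
    // The RwLock lets us safely swap in a fresh sender if the engine goes down.
    sender: RwLock<Sender<Request>>,
    reboot_state: RebootState,
}

impl MistralRs {
    // Acquiring the lock can fail (e.g. if it was poisoned), so this now
    // returns a Result instead of the sender directly.
    fn get_sender(&self) -> anyhow::Result<Sender<Request>> {
        Ok(self
            .sender
            .read()
            .map_err(|e| anyhow::anyhow!("sender lock poisoned: {e}"))?
            .clone())
    }
}
```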

Before I keep going on this, I wanted to get anyone's thoughts on it!

Possibly still to do

  • I have not worked on the PyO3 stuff, and I'm not super familiar with it -- is anything needed there to support a restart?
  • Add more formal error types so the main MistralRs struct doesn't have to use anyhow errors

github-actions bot commented Jun 4, 2024
Code Metrics Report
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 Dockerfile              1           34           25            0            9
 Happy                   1          442          369            0           73
 JSON                    9           21           21            0            0
 Python                 31         1217         1038           37          142
 TOML                   16          440          400            1           39
-------------------------------------------------------------------------------
 Jupyter Notebooks       1            0            0            0            0
 |- Markdown             1           60           30           22            8
 |- Python               1           96           87            1            8
 (Total)                            156          117           23           16
-------------------------------------------------------------------------------
 Markdown               16         1135            0          836          299
 |- BASH                 5          100           97            0            3
 |- Python               6          122          110            0           12
 |- Rust                 2           80           72            3            5
 (Total)                           1437          279          839          319
-------------------------------------------------------------------------------
 Rust                  115        34276        31041          583         2652
 |- Markdown            57          641           13          594           34
 (Total)                          34917        31054         1177         2686
===============================================================================
 Total                 191        37565        32894         1457         3214
===============================================================================
  

@EricLBuehler (Owner) left a comment

Thank you for adding this functionality. My concern is that we will not know whether the engine itself has crashed, in which case there is no sender left to reboot to. I think that if the sender was closed, we should somehow check whether the engine itself is alive before restoring it.

Also, I assume you left the rebooting logic out of interactive mode on purpose? I think we should include this logic in the mistralrs_pyo3::Runner methods send_completion_request and send_chat_completion_request, though.

Review threads (resolved) on: mistralrs-core/src/lib.rs (outdated), mistralrs-core/src/scheduler.rs, mistralrs-server/src/chat_completion.rs (outdated), mistralrs-server/src/completions.rs (outdated), and mistralrs-core/src/lib.rs (outdated)
@gregszumel (Contributor, Author)

@EricLBuehler thanks for the thoughts!

> My concern is that we will not know whether the engine itself has crashed, in which case there is no sender left to reboot to. I think that if the sender was closed, we should somehow check whether the engine itself is alive before restoring it.

Agreed - let me think more about this before my next commit here

> Also, I assume you left the rebooting logic out of interactive mode on purpose? I think we should include this logic in the mistralrs_pyo3::Runner methods send_completion_request and send_chat_completion_request, though.

Will add those when I've made another stab at the above

@gregszumel (Contributor, Author)

Okay, here's my next stab; a quick overview (with a rough sketch after the list):

  • On init, save the engine parameters into the RebootState, so we can reboot a 'fresh' engine if needed (not purely fresh; it reuses the same pipeline).
  • On init, also save the engine thread handle into engine_handler, so we can check whether the thread has completed.
  • On get_sender, check whether the engine thread has completed. If it has, the engine is dead.
  • In that case, try to reboot the engine by setting up a new engine and storing a new mpsc channel sender + engine handler.
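
Roughly, the liveness check looks like this (a sketch continuing the one above, assuming `engine_handler` is a `RwLock<std::thread::JoinHandle<()>>` field next to the sender; names and error messages are approximations):

```rust
// Sketch only: if the engine thread has returned, the engine has
// definitely crashed, so reboot before handing out a sender.
fn get_sender(&self) -> anyhow::Result<Sender<Request>> {
    let finished = self
        .engine_handler
        .read()
        .map_err(|e| anyhow::anyhow!("engine handler lock poisoned: {e}"))?
        .is_finished();
    if finished {
        // reboot_engine takes the write locks itself; the read guard above
        // has already been dropped at the end of the `let` statement.
        self.reboot_engine()?;
    }
    Ok(self
        .sender
        .read()
        .map_err(|e| anyhow::anyhow!("sender lock poisoned: {e}"))?
        .clone())
}
```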

This is a little more robust than the last approach: previously, the sender could conceivably have been dropped without the engine going down, whereas in this new approach the engine has definitely crashed if the thread has returned.

Also, this now auto-reboots on get_sender(), so no extra logic is needed on the server side.

I don't believe there's any potential for deadlock with the RwLocks on the sender + engine_handler; I added that comment in reboot_engine just for clarity. It could probably be slightly more efficient: if multiple callers try to reboot the engine at the same time, we might see a bit of holdup on the locks. But I don't think it should be too drastic, since reboot_engine also checks whether the engine is alive before actually rebooting, which prevents multiple back-to-back reboots (see the sketch below).
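
That guard, sketched (again an approximation: `spawn_engine_thread` is a hypothetical helper that rebuilds the engine from the saved RebootState, and the channel capacity is arbitrary):

```rust
// Sketch only. Re-checking liveness under the write locks means a caller
// that lost the race simply returns without rebooting a second time.
fn reboot_engine(&self) -> anyhow::Result<()> {
    let mut handle = self
        .engine_handler
        .write()
        .map_err(|e| anyhow::anyhow!("engine handler lock poisoned: {e}"))?;
    let mut sender = self
        .sender
        .write()
        .map_err(|e| anyhow::anyhow!("sender lock poisoned: {e}"))?;
    // Another caller may have already rebooted while we waited on the
    // locks; if the engine thread is running again, there is nothing to do.
    if !handle.is_finished() {
        return Ok(());
    }
    let (tx, rx) = tokio::sync::mpsc::channel(10_000); // capacity arbitrary
    *handle = self.spawn_engine_thread(rx); // hypothetical: builds a new
                                            // Engine from the RebootState
    *sender = tx;
    Ok(())
}
```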

Thoughts on this newest version? There are still some to-dos on the error handling, but I'll take care of those if this looks like a solid approach.

@EricLBuehler (Owner) left a comment

Looks great; I think this is almost ready to merge after you fix the errors you mentioned.

Before merging, however, I would appreciate it if you could run the updated code to make sure it works, and maybe throw in a panic! in the engine to confirm that the reboot works.

@gregszumel (Contributor, Author)

Here's a version that adds some error types, reworks the logging (it warns when a reboot is needed), and resolves the merge conflict.

I was able to test that adding a panic into engine.run() did crash the runtime, and the runtime was able to restart (before immediately crashing again because of the panic). The test was just a temporary line inside the engine loop, something like the snippet below.
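
(Illustrative only; the exact location and message are approximate:)

```rust
// Temporarily inserted in Engine::run to force a crash and exercise the
// reboot path; removed after testing.
panic!("intentional crash to test engine reboot");
```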

Not sure why the Windows check failed; let me know if there's something I can do about that -- it looks like a tokenizer didn't have permissions?

@EricLBuehler (Owner)

Ok, great!

The Windows check is flaky for some reason.

@gregszumel (Contributor, Author)

I don't think the clippy stuff is from this PR -- let me know if there's anything else to do for this one!

@EricLBuehler (Owner) left a comment

Looks good, thank you!

@EricLBuehler merged commit 1ae5bfe into EricLBuehler:master on Jun 13, 2024
10 of 11 checks passed