
adding reboot functionality #378

Merged · 10 commits into EricLBuehler:master · Jun 13, 2024
Conversation

@gregszumel (Contributor)

This is something I whipped up; thoughts? The basic idea is to be able to restart the tokio runtime if it goes down for some reason.

It introduces a few new pieces, mainly (a rough sketch follows the list):

  • A RebootState, which just saves the args to `Engine::new`
  • A reboot function, which updates the MistralRs sender
  • The MistralRs sender is now wrapped in a RwLock, so we can safely swap in a new sender if the engine goes down
  • Since the sender is behind a RwLock, get_sender() can potentially fail, so get_sender() now returns a Result
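
A minimal sketch of that shape (illustration only, not the PR's actual code; `Request` is stubbed, the `RebootState` contents are elided, and lock poisoning is assumed to be the failure mode for get_sender()):

```rust
use std::sync::RwLock;
use tokio::sync::mpsc::Sender;

struct Request; // placeholder for the real request type

struct RebootState {
    // The arguments originally passed to Engine::new, kept around so a
    // replacement engine can be constructed later (contents elided).
}

struct MistralRs {
    // The RwLock lets us safely swap in a fresh sender if the engine goes down.
    sender: RwLock<Sender<Request>>,
    reboot_state: RebootState,
}

impl MistralRs {
    // Acquiring the lock can fail (e.g. if it was poisoned), so this now
    // returns a Result instead of the sender directly.
    fn get_sender(&self) -> anyhow::Result<Sender<Request>> {
        Ok(self
            .sender
            .read()
            .map_err(|e| anyhow::anyhow!("sender lock poisoned: {e}"))?
            .clone())
    }
}
```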

Before I keep going on this, I wanted to get anyone's thoughts on it!

Possibly still to do

  • I have not worked on the PyO3 stuff, and I'm not super familiar with it -- is anything needed there to support a restart?
  • Add more formal error types so the main MistralRs struct doesn't have to use anyhow errors

github-actions bot commented Jun 4, 2024
Code Metrics Report
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 Dockerfile              1           34           25            0            9
 Happy                   1          442          369            0           73
 JSON                    9           21           21            0            0
 Python                 31         1217         1038           37          142
 TOML                   16          440          400            1           39
-------------------------------------------------------------------------------
 Jupyter Notebooks       1            0            0            0            0
 |- Markdown             1           60           30           22            8
 |- Python               1           96           87            1            8
 (Total)                            156          117           23           16
-------------------------------------------------------------------------------
 Markdown               16         1135            0          836          299
 |- BASH                 5          100           97            0            3
 |- Python               6          122          110            0           12
 |- Rust                 2           80           72            3            5
 (Total)                           1437          279          839          319
-------------------------------------------------------------------------------
 Rust                  115        34276        31041          583         2652
 |- Markdown            57          641           13          594           34
 (Total)                          34917        31054         1177         2686
===============================================================================
 Total                 191        37565        32894         1457         3214
===============================================================================
  

@EricLBuehler (Owner) left a comment

Thank you for adding this functionality. My concern is that we will not know whether the engine itself has crashed, in which case there is no sender left to reboot to. I think that if the sender was closed, we should somehow check whether the engine itself is alive before restoring it.

Also, I assume you left the rebooting logic out of interactive mode on purpose? I think we should include this logic in the mistralrs_pyo3::Runner methods send_completion_request and send_chat_completion_request, though.

Review threads (resolved) on: mistralrs-core/src/lib.rs (outdated), mistralrs-core/src/scheduler.rs, mistralrs-server/src/chat_completion.rs (outdated), mistralrs-server/src/completions.rs (outdated), and mistralrs-core/src/lib.rs (outdated)
@gregszumel (Contributor, Author)

@EricLBuehler thanks for the thoughts!

> My concern is that we will not know whether the engine itself has crashed, in which case there is no sender left to reboot to. I think that if the sender was closed, we should somehow check whether the engine itself is alive before restoring it.

Agreed - let me think more about this before my next commit here

> Also, I assume you left the rebooting logic out of interactive mode on purpose? I think we should include this logic in the mistralrs_pyo3::Runner methods send_completion_request and send_chat_completion_request, though.

Will add those when I've made another stab at the above

@gregszumel (Contributor, Author)

Okay, here's my next stab; a quick overview (with a rough sketch after the list):

  • On init, save the engine parameters into the RebootState, so we can reboot a 'fresh' engine if needed (not purely fresh; it reuses the same pipeline).
  • On init, also save the engine thread handle into engine_handler, so we can check whether the thread has completed.
  • On get_sender, check whether the engine thread has completed. If it has, the engine is dead.
  • In that case, try to reboot the engine by setting up a new engine and storing a new mpsc channel sender + engine handler.
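
Roughly, the liveness check looks like this (a sketch continuing the one above, assuming `engine_handler` is a `RwLock<std::thread::JoinHandle<()>>` field next to the sender; names and error messages are approximations):

```rust
// Sketch only: if the engine thread has returned, the engine has
// definitely crashed, so reboot before handing out a sender.
fn get_sender(&self) -> anyhow::Result<Sender<Request>> {
    let finished = self
        .engine_handler
        .read()
        .map_err(|e| anyhow::anyhow!("engine handler lock poisoned: {e}"))?
        .is_finished();
    if finished {
        // reboot_engine takes the write locks itself; the read guard above
        // has already been dropped at the end of the `let` statement.
        self.reboot_engine()?;
    }
    Ok(self
        .sender
        .read()
        .map_err(|e| anyhow::anyhow!("sender lock poisoned: {e}"))?
        .clone())
}
```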

This is a little more robust than the last approach: previously, the sender could conceivably have been dropped without the engine going down, whereas in this new approach the engine has definitely crashed if the thread has returned.

Also, this now auto-reboots on get_sender(), so no extra logic is needed on the server side.

I don't believe there's any potential for deadlock with the RwLocks on the sender + engine_handler; I added that comment in reboot_engine just for clarity. It could probably be slightly more efficient: if multiple callers try to reboot the engine at the same time, we might see a bit of holdup on the locks. But I don't think it should be too drastic, since reboot_engine also checks whether the engine is alive before actually rebooting, which prevents multiple back-to-back reboots (see the sketch below).
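
That guard, sketched (again an approximation: `spawn_engine_thread` is a hypothetical helper that rebuilds the engine from the saved RebootState, and the channel capacity is arbitrary):

```rust
// Sketch only. Re-checking liveness under the write locks means a caller
// that lost the race simply returns without rebooting a second time.
fn reboot_engine(&self) -> anyhow::Result<()> {
    let mut handle = self
        .engine_handler
        .write()
        .map_err(|e| anyhow::anyhow!("engine handler lock poisoned: {e}"))?;
    let mut sender = self
        .sender
        .write()
        .map_err(|e| anyhow::anyhow!("sender lock poisoned: {e}"))?;
    // Another caller may have already rebooted while we waited on the
    // locks; if the engine thread is running again, there is nothing to do.
    if !handle.is_finished() {
        return Ok(());
    }
    let (tx, rx) = tokio::sync::mpsc::channel(10_000); // capacity arbitrary
    *handle = self.spawn_engine_thread(rx); // hypothetical: builds a new
                                            // Engine from the RebootState
    *sender = tx;
    Ok(())
}
```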

Thoughts on this newest version? There are still some to-dos on the error handling, but I'll take care of those if this looks like a solid approach.

@EricLBuehler (Owner) left a comment

Looks great; I think this is almost ready to merge after you fix the errors you mentioned.

Before merging, however, I would appreciate it if you could run the updated code to make sure it works, and maybe throw in a panic! in the engine to confirm that the reboot works.

@gregszumel (Contributor, Author)

Here's a version that adds some error types, reworks the logging (it warns when a reboot is needed), and resolves the merge conflict.

I was able to test that adding a panic into engine.run() did crash the runtime, and the runtime was able to restart (before immediately crashing again because of the panic). The test was just a temporary line inside the engine loop, something like the snippet below.
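
(Illustrative only; the exact location and message are approximate:)

```rust
// Temporarily inserted in Engine::run to force a crash and exercise the
// reboot path; removed after testing.
panic!("intentional crash to test engine reboot");
```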

Not sure why the Windows check failed; let me know if there's something I can do about that -- it looks like a tokenizer didn't have permissions?

@EricLBuehler (Owner)

Ok, great!

The Windows check is flaky for some reason.

@gregszumel (Contributor, Author)

I don't think the clippy stuff is from this PR -- let me know if there's anything else to do for this one!

@EricLBuehler (Owner) left a comment

Looks good, thank you!

@EricLBuehler merged commit 1ae5bfe into EricLBuehler:master on Jun 13, 2024
10 of 11 checks passed