Use local LLMs in Python
Unofficial LM Studio Python Client SDK - Pre-Pre-Release
This is an unofficial effort to port the functionality of the lmstudio.js
TypeScript SDK to Python.
Since the TypeScript SDK is in pre-release alpha, I guess this is in pre-pre-release pre-alpha while I work toward feature parity. Expect the library to be completely nonfunctional for the time being, and expect absurdly breaking changes!
Follow announcements about the base lmstudio.js
library on Twitter and Discord. Read the TypeScript docs.
Discuss all things related to developing with LM Studio in #dev-chat in LM Studio's Community Discord server. Contact me at @christianazinn
on Discord or via GitHub about this particular repository.
pip install lmstudio_sdk
from lmstudio_sdk import LMStudioClient
client = LMStudioClient()
llama3 = client.llm.load("lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF")
for fragment in llama3.respond([{"role": "user", "content": "Say hello to lmstudio.py!"}], {}):
print(fragment, end="", flush=True)
client.close()
Asynchronously, just await the constructor and the subsequent calls, as in the following example:
import asyncio
from lmstudio_sdk import LMStudioClient
async def main():
client = await LMStudioClient()
llama3 = await client.llm.load("lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF")
prediction = await llama3.respond([{"role": "user", "content": "Say hello to lmstudio.py!"}], {})
async for fragment in prediction:
print(fragment, end="", flush=True)
await client.close()
asyncio.run(main())
Note
If you wish to await the result of the prediction rather than streaming it, note that the call to model.respond()/model.complete() will take two awaits, like so:
result = await (await model.respond([{"role": "user", "content": "Say hello to lmstudio.py!"}]))
Henceforth all examples will be given synchronously; the asynchronous versions simply have await
sprinkled in all over.
This example shows how to connect to LM Studio running on a different port (e.g., 8080).
from lmstudio_sdk import LMStudioClient
client = LMStudioClient({"baseUrl": "ws://localhost:8080"})
# client.llm.load(...)
You can set an identifier for a model when loading it. This identifier can be used to refer to the model later.
client.llm.load("lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF", {"identifier": "my-model",})
# You can refer to the model later using the identifier
my_model = client.llm.get("my-model")
# my_model.complete(...)
By default, the load configuration for a model comes from the preset associated with the model (this can be changed on the "My Models" page in LM Studio).
llama3 = client.llm.load(
"lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF",
{
"config": {
"context_length": 1024,
"gpu_offload": 0.5, # Offloads 50% of the computation to the GPU
},
}
)
# llama3.complete(...)
You can track the loading progress of a model by providing an on_progress
callback.
llama3 = client.llm.load(
"lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF",
{
"verbose": False, # Disables the default progress logging
"on_progress": lambda progress: print(f"Progress: {round(progress * 100, 1)}")
}
)
If you wish to find all models that are available to be loaded, you can use the list_downloaded_models
method on the system
object.
downloaded_models = client.system.list_downloaded_models()
downloaded_llms = [model for model in downloaded_models if model.type == "llm"]
# Load the first model
model = client.llm.load(downloaded_llms[0].path)
# model.complete(...)
You can cancel a load by using an AbortSignal, ported from TypeScript. This API will probably change with time. Currently we have one AbortSignal class for each of sync and async.
from lmstudio_sdk import SyncAbortSignal, logger
signal = SyncAbortSignal()
try:
llama3 = client.llm.load(
"lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF",
{"signal": signal,}
)
    # llama3.complete(...)
except Exception as e:
logger.error(e)
# Somewhere else in your code:
signal.abort()
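For instance, "somewhere else" could be a separate thread. Here is a minimal sketch using the standard library (it assumes, as in the example above, that the abort method lives on the signal object; see also the note at the end of this README about aborting in synchronous code):

import threading

# Hypothetical: abort the in-progress load from another thread after five seconds.
threading.Timer(5.0, signal.abort).start()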
You can unload a model by calling the unload
method.
llama3 = client.llm.load("lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF", {"identifier": "my-model"})
# ...Do stuff...
client.llm.unload("my-model")
Note that, by default, all models loaded by a client are unloaded when the client disconnects. Therefore, unless you want to precisely control the lifetime of a model, you do not need to unload them manually.
To look up an already loaded model by its identifier, use the following:
my_model = client.llm.get({"identifier": "my-model"})
# Or just
my_model = client.llm.get("my-model")
# my_model.complete(...)
To look up an already loaded model by its path, use the following:
# Matches any quantization
llama3 = client.llm.get({
"path": "lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF",
})
# Or if a specific quantization is desired:
llama3 = client.llm.get({
"path": "lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf",
})
# llama3.complete(...)
If you do not have a specific model in mind, and just want to use any loaded model, you can use the unstable_get_any
method:
any_model = client.llm.unstable_get_any()
# any_model.complete(...)
To list all loaded models, use the client.llm.list_loaded
method.
loaded_models = client.llm.list_loaded()
if len(loaded_models) == 0:
    raise Exception("No models loaded")
# Use the first one
first_model = client.llm.get({
    "identifier": loaded_models[0]["identifier"],
})
# first_model.complete(...)
Example list_loaded response:
[
  {
    "identifier": "lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF",
    "path": "lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF"
  },
  {
    "identifier": "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf",
    "path": "microsoft/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf"
  }
]
To perform text completion, use the complete
method:
prediction = model.complete("The meaning of life is")
for fragment in prediction:
print(fragment, end="", flush=True)
Note that this would be an async for
in asynchronous code.
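For example, with the asynchronous client, the same completion stream looks roughly like this (a sketch following the two-await pattern described earlier):

prediction = await model.complete("The meaning of life is")
async for fragment in prediction:
    print(fragment, end="", flush=True)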
By default, the inference parameters in the preset are used for the prediction. You can override them like this:
prediction = any_model.complete("Meaning of life is", {
"contextOverflowPolicy": "stopAtLimit",
"maxPredictedTokens": 100,
"stopStrings": ["\n"],
"temperature": 0.7,
})
# ...Do stuff with the prediction...
To perform a conversation, use the respond
method:
prediction = any_model.respond([
{ "role": "system", "content": "Answer the following questions." },
{ "role": "user", "content": "What is the meaning of life?" },
])
for fragment in prediction:
print(fragment, end="", flush=True)
Similarly, you can override the inference parameters for the conversation (Note the available options are different from text completion):
prediction = any_model.respond(
[
{ "role": "system", "content": "Answer the following questions." },
{ "role": "user", "content": "What is the meaning of life?" },
],
{
"contextOverflowPolicy": "stopAtLimit",
"maxPredictedTokens": 100,
"stopStrings": ["\n"],
"temperature": 0.7,
}
)
# ...Do stuff with the prediction...
Important
Always Provide the Full History/Context
LLMs are stateless. They do not remember or retain information from previous inputs. Therefore, when predicting with an LLM, you should always provide the full history/context.
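As an illustrative sketch (the helper function, prompts, and the use of an "assistant" role are our own additions, not part of the SDK), you can keep a running message list and resend the entire history on every call to respond:

messages = [{"role": "system", "content": "Answer the following questions."}]

def ask(model, question):
    # Append the new user turn; the full history is sent with every request.
    messages.append({"role": "user", "content": question})
    reply = ""
    for fragment in model.respond(messages, {}):
        reply += fragment
    # Record the assistant's reply so the next call includes it as context.
    messages.append({"role": "assistant", "content": reply})
    return reply

print(ask(any_model, "What is the meaning of life?"))
print(ask(any_model, "And why is that?"))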
If you wish to get the prediction statistics, you can call result() on the prediction object (or await the prediction object in asynchronous code) to get a PredictionResult, through which you can access the stats via the stats property.
prediction = model.complete("The meaning of life is")
stats = prediction.result().stats
print(stats)
Asynchronously, this looks more like
prediction = await model.complete("The meaning of life is")
stats = (await prediction).stats
print(stats)
Note
No Extra Waiting
When you have already consumed the prediction stream, awaiting on the prediction object will not cause any extra waiting, as the result is cached within the prediction object.
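For example, in synchronous code (a minimal sketch): consume the stream first, then fetch the cached result.

prediction = model.complete("The meaning of life is")

# Stream the fragments as they arrive...
for fragment in prediction:
    print(fragment, end="", flush=True)

# ...then result() returns the cached PredictionResult without waiting again.
result = prediction.result()
print(result.stats)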
On the other hand, if you only care about the final result, you don't need to iterate through the stream at all; just get the result directly (via result(), or by awaiting the prediction object in asynchronous code). An example of the stats object, as JSON, is below:
{
"stopReason": "eosFound",
"tokensPerSecond": 26.644333102146646,
"numGpuLayers": 33,
"timeToFirstTokenSec": 0.146,
"promptTokensCount": 5,
"predictedTokensCount": 694,
"totalTokensCount": 699
}
LM Studio supports structured prediction, which will force the model to produce content that conforms to a specific structure. To enable structured prediction, you should set the structured
field. It is available for both complete
and respond
methods.
Here is an example of how to use structured prediction:
prediction = model.complete("Here is a joke in JSON:", {
"max_predicted_tokens": 100,
"structured": {"type": "json"},
})
result = prediction.result()
try:
# Although the LLM is guaranteed to only produce valid JSON, when it is interrupted, the
# partial result might not be. Always check for errors. (See caveats below)
parsed = json.loads(result.content)
print(parsed)
except Exception as e:
print(f"Error: {e}")
Example output:
{ "title": "The Shawshank Redemption", "genre": [ "drama", "thriller" ], "release_year": 1994, "cast": [ { "name": "Tim Robbins", "role": "Andy Dufresne" }, { "name": "Morgan Freeman", "role": "Ellis Boyd" } ] }
Sometimes, any JSON is not enough. You might want to enforce a specific JSON schema. You can do this by providing a JSON schema to the structured
field. Read more about JSON schema at json-schema.org.
schema = {
"type": "object",
"properties": {
"name": {"type": "string"},
"age": {"type": "number"},
},
"required": ["name", "age"],
}
prediction = model.complete("...", {
"max_predicted_tokens": 100,
"structured": {"type": "json", "jsonSchema": schema},
})
Important
Caveats with Structured Prediction
- Although the model is forced to generate predictions that conform to the specified structure, the prediction may be interrupted (for example, if the user stops the prediction). When that happens, the partial result may not conform to the specified structure. Thus, always check the prediction result before using it, for example by wrapping the json.loads call inside a try-except block.
- In certain cases, the model may get stuck. For example, when forcing it to generate valid JSON, it may generate an opening brace { but never generate the closing brace }. In such cases, the prediction will go on until the context length is reached, which can take a long time. Therefore, it is recommended to always set a max_predicted_tokens limit. This also contributes to the point above; a combined sketch follows below.
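Putting the two caveats together, a hedged sketch might look like the following (the prompt and field accesses are our own, reusing the schema from above):

import json

prediction = model.complete("Describe a person in JSON:", {
    "max_predicted_tokens": 100,  # keeps a stuck prediction from running to the context limit
    "structured": {"type": "json", "jsonSchema": schema},
})
result = prediction.result()
try:
    # The partial result of an interrupted prediction may not be valid JSON.
    person = json.loads(result.content)
    print(person["name"], person["age"])
except Exception as e:
    print(f"Could not parse structured output: {e}")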
A prediction may be canceled by calling the cancel
method on the prediction object.
prediction = model.complete("The meaning of life is")
# ...Do stuff...
prediction.cancel()
When a prediction is canceled, the prediction will stop normally but with stopReason set to "userStopped". You can detect cancellation like so:
for fragment in prediction:
print(fragment, end="", flush=True)
stats = prediction.result().stats
if stats.stopReason == "userStopped":
print("Prediciton was canceled by the user")
Asynchronous paradigms are typically best suited for LLM applications, considering how long inference can take. The lmstudio.js
library, for instance, is written asynchronously, which is facilitated by JavaScript favoring asynchronous paradigms in general.
Python tends to favor synchronous programming; for instance, the OpenAI client runs synchronously. I designed this library so that it could, in theory, be used as a very nearly drop-in replacement for the OpenAI client (though of course only for LM Studio), so synchronous operation is a must. Many people also don't know how to use asyncio, or don't care to for their purposes, so a synchronous client is provided for these users.
However, leveraging asyncio allows for greater control over program flow, so we also provide an asynchronous client for those who need it. Simply await the LMStudioClient constructor! The only quirk is that channel functions (predict, complete, respond, and get) will require you to await twice: once to send the request, and then once to get the result. For instance, if you just wanted the result, you could do the following:
print(await (await model.respond([{"role": "user", "content": "Hello world!"}], {})))
but you could also use the output of the first await
as an async iterable, like so:
prediction = await model.respond([{"role": "user", "content": "Hello world!"}], {})
async for fragment in prediction:
print(fragment, end="", flush=True)
Pick whichever backend best suits your use case. Incidentally, the asynchronous code runs on websockets while the synchronous code runs on websocket-client (confusingly imported as websocket).
We provide internal logging for your debugging pleasure(?). Import the lmstudio.py logger using from lmstudio_sdk import logger, then call logger.setLevel with one of the default logging levels, or one of the following (which can be imported from lmstudio_sdk as well):
- RECV = 5: debugs all packets sent and received to/from the LM Studio server.
- SEND = 7: debugs all packets sent to the LM Studio server.
- WEBSOCKET = 9: debugs WebSocket connection events.
As usual, each level logs all levels above it. Depending on the backend, you can also use the websocket-client
(sync) or websockets
(async) loggers for more granular communications logging.
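For example, a minimal sketch (assuming the level constants are exported under the names listed above):

import logging

from lmstudio_sdk import RECV, logger

# Make sure a handler is attached if you don't configure logging elsewhere.
logging.basicConfig()

# Log every packet sent to and received from the LM Studio server...
logger.setLevel(RECV)

# ...or use a standard level for less verbose output:
# logger.setLevel(logging.DEBUG)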
Currently aborting doesn't play very well with synchronous code. Preprocessors are not implemented, and please disregard TODOs.
Synchronous code is less tested than async code; it might hang and not respect SIGINT when calling result() on a model load or prediction request. This is a known issue with the way threading.Event.wait() blocks.
Documentation is also still coming, as are unit tests; maybe there's a typo somewhere in untested code that causes a method to not function properly. This library isn't released yet for a reason.