
Update AzureOpenAIConfig validation to cater for reverse proxy endpoint #875

Merged: 8 commits into microsoft:main on Nov 19, 2024

Conversation

@alexmg (Contributor) commented Nov 1, 2024

Motivation and Context (Why the change? What's the scenario?)

The intent of the change in this PR is to allow Azure OpenAI requests to be made over HTTP to a reverse proxy that is running within the same Docker Compose orchestration for local development.

We have a reverse proxy in place for Azure OpenAI that routes requests to different service instances based on available quota. When deployed to Azure, all services can use HTTPS and Kernel Memory can use the reverse proxy for Azure OpenAI requests.

Visual Studio with Docker Compose orchestration is used for local development. The reverse proxy project is started and exposes both HTTP and HTTPS ports. A local dev certificate is used automatically by the tooling and the reverse proxy can be called over HTTPS from the host machine using localhost.

The problem is that the Kernel Memory container is also started as part of the Docker Compose orchestration and uses the public Docker image with a settings file mounted from the host. When Kernel Memory is configured to call the reverse proxy container over HTTPS for Azure OpenAI the requests fail because it does not trust the dev certificate and cannot establish the SSL connection. This is because the dev certificate created by ASP.NET Core is for localhost and not the hostname of the container running within the orchestration.

Installing certificates into the containers within the Docker Compose orchestration without modifying the container image is difficult. I also don't want to build another container image with trust for a self-signed certificate, because the same container should be used for development and production. We instead use a Kernel Memory settings file mounted into the Kernel Memory container that is designed for running in the local Docker Compose orchestration. This settings file configures the AzureOpenAIConfig.Endpoint setting with a URL containing the name of the container running the reverse proxy, but the connection cannot be established due to the certificate trust issues mentioned above.

Other services do not have the same restrictions applied, making them far easier to run in containers for local development. When running SQL Server in a local container the connection string can have TrustServerCertificate set to True to avoid trust issues with the server certificate. Redis can be run in a container with a connection string that allows access to the container without SSL being enabled. RabbitMQ can even have SSL explicitly disabled using the RabbitMQConfig.SslEnabled property.

Even though the real Azure OpenAI service is only exposed over HTTPS, the Azure.AI.OpenAI library allows for HTTP requests, which I would imagine is to help with local development and testing scenarios such as this. Traffic is not being directed at localhost or 127.0.0.1 because the services are running in containers, but conceptually the scenario is the same because the connections are between containers running within an orchestration on a developer machine.

High level description (Approach, Design)

Update the validation in the AzureOpenAIConfig class:

- First, ensure the Endpoint value is a valid Uri and throw a ConfigurationException when it is invalid
- If the host domain ends with .openai.azure.com, requests are going directly to the Azure OpenAI endpoints, so ensure the scheme is https
- If the host domain is not a public Azure OpenAI endpoint, ensure the scheme is either http or https
- Allow specifying a custom list of trusted cert thumbprints. A proxy over HTTPS can be trusted even if the cert is not signed by a trusted CA, as long as the thumbprint matches.
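The validation rules above could be sketched roughly as follows. This is a minimal illustration, not the merged implementation; Kernel Memory throws its own ConfigurationException type, so ArgumentException stands in here:

```csharp
using System;

public static class EndpointValidation
{
    public static void ValidateEndpoint(string endpoint)
    {
        // 1. The Endpoint value must be a valid absolute URI
        if (!Uri.TryCreate(endpoint, UriKind.Absolute, out var uri))
        {
            throw new ArgumentException($"Invalid endpoint: {endpoint}");
        }

        bool isAzureOpenAI = uri.Host.EndsWith(".openai.azure.com", StringComparison.OrdinalIgnoreCase);
        if (isAzureOpenAI)
        {
            // 2. Requests going directly to Azure OpenAI must use HTTPS
            if (uri.Scheme != Uri.UriSchemeHttps)
                throw new ArgumentException("Azure OpenAI endpoints require HTTPS");
        }
        else
        {
            // 3. Other hosts (e.g. a reverse proxy container) may use HTTP or HTTPS
            if (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps)
                throw new ArgumentException("Endpoint scheme must be http or https");
        }
    }
}
```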

@alexmg alexmg requested a review from dluc as a code owner November 1, 2024 04:08
@dluc (Collaborator) commented Nov 1, 2024

Even when using proxies, sending traffic over HTTP without encryption is not secure. Removing the HTTPS validation in the AzureOpenAIConfig class isn’t something we can merge, as it would compromise our security baseline.

Why not configure the proxy to use HTTPS instead? This would maintain the necessary security without compromising the routing solution.

The only potential exception could be traffic on 127.0.0.1, but since it appears that traffic is going beyond the host, it’s especially important to keep this protection in place.

@alexmg (Contributor, Author) commented Nov 4, 2024

Hi @dluc. My apologies, I was in a rush to open the PR and did not fully explain the scenario, which is something I should have done for a change such as this. Hopefully, the details in the updated description better describe the problem and a workable solution can be found. The situation is like using 127.0.0.1 except the services are running within containers on the local machine, and not directly on the host.

@dluc (Collaborator) commented Nov 4, 2024

> Hi @dluc. My apologies, I was in a rush to open the PR and did not fully explain the scenario, which is something I should have done for a change such as this. Hopefully, the details in the updated description better describe the problem and a workable solution can be found. The situation is like using 127.0.0.1 except the services are running within containers on the local machine, and not directly on the host.

Hi @alexmg, thanks for the clarification, and no worries about the rush, the additional context is useful.

I’d suggest adjusting the code to allow "http" only when using 127.0.0.1. If the proxy is bound to any other IP or hostname, enforcing SSL would be best. While it adds minimal computational overhead, the added security is well worth it.

Even though your proxies are local, the code itself can’t verify that. For example, consider a proxy on 192.168.1.10, running on the same host. You know it’s safe because you’re familiar with the network topology. But someone else might set up the same IP on a different machine, creating a potential risk where LLM traffic could be intercepted or manipulated without any indication in the code.

@alexmg (Contributor, Author) commented Nov 5, 2024

> I’d suggest adjusting the code to allow "http" only when using 127.0.0.1. If the proxy is bound to any other IP or hostname, enforcing SSL would be best. While it adds minimal computational overhead, the added security is well worth it.

This will not help when running Kernel Memory and its dependent services in containers for local development. The containers communicate with each other using a hostname visible only within the private network. Limiting "http" to loopback would only be useful when running all services on the host machine. Running all services in a bridge network in Docker on the host machine has the benefit of network isolation.

I don't think there is a reliable way to know that the container is running within Docker Compose launched from Visual Studio. When adding containerized projects to the orchestration I modify the docker-compose.override.yml file to include an environment variable called RUNNING_IN_DOCKER_COMPOSE, but this is not something that is included automatically by Visual Studio. The official dotnet images have the DOTNET_RUNNING_IN_CONTAINER environment variable but this only lets you know you are running in a container and not under what circumstances. The person configuring the service knows where it will be running and should be able to configure things accordingly.

I'm not worried about the additional computational overhead of SSL, it's the management of certificates within all the containers that need to communicate with each other that is difficult. To facilitate the SSL connections self-signed certificates would need to be generated, and the containers forced to trust those certificates, with the fake CA being added to the operating system. With the many different base images and Linux distributions available getting all these certificates in place with the right DNS names would be challenging to say the least. Even with all that in place the SSL connections would not truly be trustworthy, and the traffic between containers would never have left the private network.

I am still confused why the configuration for this one service has these concerns and restrictions while the other services do not. Azure Blob, Azure Queues, MongoDB, Postgresql, Redis, and SQL Server all have connection strings that can contain any desired configuration including hostname. Ollama, OpenAI, Qdrant, S3 all have endpoints that can contain any hostname. These are all friendly to local development with containers because they allow the person configuring the service to make decisions based on the information they have. Placing similar restrictions on these services would only make Kernel Memory harder to work with and less attractive to developers. Running containers for local development with Docker Compose is already a popular option, and with .NET Aspire gaining traction the trend is likely to grow even more.

> Even though your proxies are local, the code itself can’t verify that. For example, consider a proxy on 192.168.1.10, running on the same host. You know it’s safe because you’re familiar with the network topology. But someone else might set up the same IP on a different machine, creating a potential risk where LLM traffic could be intercepted or manipulated without any indication in the code.

This is not a problem when running the containers in Docker Compose because they can only see other containers in the same bridge network and resolve them by hostname. There are no other machines running in the network other than the ones you explicitly start there as containers. The network isolation provided by the bridge network is beneficial and I feel more comfortable with HTTP communications taking place there than on the host loopback.

@dluc (Collaborator) commented Nov 5, 2024

>> I’d suggest adjusting the code to allow "http" only when using 127.0.0.1. If the proxy is bound to any other IP or hostname, enforcing SSL would be best. While it adds minimal computational overhead, the added security is well worth it.

> This will not help when running Kernel Memory and its dependent services in containers for local development. The containers communicate with each other using a hostname visible only within the private network. Limiting "http" to loopback would only be useful when running all services on the host machine. Running all services in a bridge network in Docker on the host machine has the benefit of network isolation.

> I don't think there is a reliable way to know that the container is running within Docker Compose launched from Visual Studio. If the container is built from a project in the solution a RUNNING_IN_DOCKER_COMPOSE environment variable is added, but this is not the case if the service points to an existing image in a container registry. The official dotnet images have the DOTNET_RUNNING_IN_CONTAINER environment variable but this only lets you know you are running in a container and not under what circumstances. The person configuring the service knows where it will be running and should be able to configure things accordingly.

> I'm not worried about the additional computational overhead of SSL, it's the management of certificates within all the containers that need to communicate with each other that is difficult. To facilitate the SSL connections self-signed certificates would need to be generated, and the containers forced to trust those certificates, with the fake CA being added to the operating system. With the many different base images and Linux distributions available getting all these certificates in place with the right DNS names would be challenging to say the least. Even with all that in place the SSL connections would not truly be trustworthy, and the traffic between containers would never have left the private network.

> I am still confused why the configuration for this one service has these concerns and restrictions while the other services do not. Azure Blob, Azure Queues, MongoDB, Postgresql, Redis, and SQL Server all have connection strings that can contain any desired configuration including hostname. Ollama, OpenAI, Qdrant, S3 all have endpoints that can contain any hostname. These are all friendly to local development with containers because they allow the person configuring the service to make decisions based on the information they have. Placing similar restrictions on these services would only make Kernel Memory harder to work with and less attractive to developers. Running containers for local development with Docker Compose is already a popular option, and with .NET Aspire gaining traction the trend is likely to grow even more.

>> Even though your proxies are local, the code itself can’t verify that. For example, consider a proxy on 192.168.1.10, running on the same host. You know it’s safe because you’re familiar with the network topology. But someone else might set up the same IP on a different machine, creating a potential risk where LLM traffic could be intercepted or manipulated without any indication in the code.

> This is not a problem when running the containers in Docker Compose because they can only see other containers in the same bridge network and resolve them by hostname. There are no other machines running in the network other than the ones you explicitly start there as containers. The network isolation provided by the bridge network is beneficial and I feel more comfortable with HTTP communications taking place there than on the host loopback.

👍 Thanks for the details about docker network. I'll take some time to evaluate the options

@dluc (Collaborator) commented Nov 11, 2024

@alexmg if we fixed this problem, would that be sufficient?

> When Kernel Memory is configured to call the reverse proxy container over HTTPS for Azure OpenAI the requests fail because it does not trust the dev certificate and cannot establish the SSL connection. This is because the dev certificate created by ASP.NET Core is for localhost and not the hostname of the container running within the orchestration.

To address this, we could add a configuration option to define a list of trusted hostnames for which certificate validation would be bypassed. This would enable secure, encrypted traffic while allowing connections to hosts with untrusted SSL certificates, based on a controlled hostname list. Optionally, we could implement cert pinning to further restrict which invalid certificates are allowed.

The config could look something like this:

{
  "KernelMemory": {
    // ...
    "Network": {
      "TrustedHostsWithoutSSL": [
        "localhost",
        "my-proxy.local:8080"
        // ...
      ],
      "UnsignedCertsTrustedThumbprints": [
        "ABCDEF1234567890ABCDEF1234567890ABCDEF12",
        "ABCDEF1234567890ABCDEF1234567890ABCDEF34"
        // ...
      ]
    }
    // ...
  }
}

By the way, all of these can be managed via environment variables, so you could also set the values in Docker config files:

KernelMemory__Network__TrustedHostsWithoutSSL__0=localhost
KernelMemory__Network__TrustedHostsWithoutSSL__1=my-proxy.local:8080
KernelMemory__Network__UnsignedCertsTrustedThumbprints__0=ABCDEF1234567890ABCDEF1234567890ABCDEF12
KernelMemory__Network__UnsignedCertsTrustedThumbprints__1=ABCDEF1234567890ABCDEF1234567890ABCDEF34

@alexmg (Contributor, Author) commented Nov 12, 2024

@dluc Yes, I believe this would solve the certificate trust issue and allow for HTTPS traffic to move between the containers within orchestration.

The ASP.NET Core dev certificate that is mounted into the container has a CN of localhost, so I believe that would be the most common value. I'm sure having the ability to provide additional host names would be beneficial to others too. My understanding is that the CN in a certificate should not contain a protocol, port, or path.

I also like the idea of pinning to certificate thumbprints but agree this should be optional because the dev certificate on each developer machine will have a different thumbprint. It's nice to have an F5 experience out of the box that doesn't require additional configuration.

It's possible to run pre-built container images with the dev certificate and this should work for that too.

https://learn.microsoft.com/en-us/aspnet/core/security/docker-https?view=aspnetcore-8.0#running-pre-built-container-images-with-https

@dluc (Collaborator) commented Nov 12, 2024

> The ASP.NET Core dev certificate that is mounted into the container has a CN of localhost so I believe that would be the most common value. I'm sure having the ability to provide additional host names would be beneficial to others too. My understanding is that the CN in a certificate should not have protocol, port, or path.

I would not use the CN for validation, but rather the cert thumbprint. It's very difficult, practically impossible as far as I know, to create a cert with a given thumbprint.

You can run this command to find the value we would put in the config:

dotnet dev-certs https --check

A valid certificate was found: E12B2726F0DA59D112807735CFB6F5BC2E4CADBC - CN=localhost - Valid from 2024-10-10 14:12:42Z to 2025-10-10 14:12:42Z - IsHttpsDevelopmentCertificate: true - IsExportable: true

Config value: E12B2726F0DA59D112807735CFB6F5BC2E4CADBC
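For reference, that same value is exposed in code through X509Certificate2.Thumbprint, which is the 40-hex-character SHA-1 hash of the DER-encoded certificate that dev-certs prints. A small sketch using a throwaway self-signed certificate as a stand-in for the actual dev cert:

```csharp
using System;
using System.Security.Cryptography;
using System.Security.Cryptography.X509Certificates;

// Create a throwaway self-signed cert (stand-in for the ASP.NET Core dev cert)
using var rsa = RSA.Create(2048);
var request = new CertificateRequest(
    "CN=localhost", rsa, HashAlgorithmName.SHA256, RSASignaturePadding.Pkcs1);
using var cert = request.CreateSelfSigned(
    DateTimeOffset.UtcNow, DateTimeOffset.UtcNow.AddYears(1));

// Thumbprint is the SHA-1 hash of the DER-encoded certificate,
// i.e. the 40-hex-char value that would go into the config
Console.WriteLine(cert.Thumbprint);
```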

@dluc (Collaborator) commented Nov 12, 2024

Here's some sample code if you want to try updating the PR and testing with your certs:

using System;
using System.Net.Http;
using System.Net.Security;

// thumbprint of the cert in your docker image
// TODO: this will be a list, for flexibility
string trustedThumbprint = "ABCDEF1234567890ABCDEF1234567890ABCDEF12";

var handler = new HttpClientHandler
{
    ServerCertificateCustomValidationCallback = (request, certificate, chain, sslPolicyErrors) =>
    {
        // Allow an untrusted root CA (ignores certificate errors related to untrusted roots)
        if (sslPolicyErrors == SslPolicyErrors.None || sslPolicyErrors == SslPolicyErrors.RemoteCertificateChainErrors)
        {
            string remoteCertThumbprint = certificate?.GetCertHashString();
            // TODO: check against a list
            return string.Equals(remoteCertThumbprint, trustedThumbprint, StringComparison.OrdinalIgnoreCase);
        }

        // Reject all other SSL errors
        return false;
    }
};

using var httpClient = new HttpClient(handler);
try
{
    // note: using "https"
    var response = await httpClient.GetAsync("https://your-proxy");
    // ...
}
catch (HttpRequestException ex)
{
    // ...
}

@dluc dluc added waiting for author Waiting for author to reply or address comments work in progress labels Nov 12, 2024
@alexmg alexmg force-pushed the AzureOpenAIEndpointValidation branch from e6dfc73 to 9b9324e Compare November 18, 2024 06:05
@alexmg (Contributor, Author) commented Nov 18, 2024

Hi @dluc.

I have updated the branch with the ServerCertificateCustomValidationCallback implementation and had to adjust the logic around checking the SslPolicyErrors values. Because the ASP.NET Core development certificate is self-signed and the request between containers is not to localhost, both the RemoteCertificateNameMismatch and RemoteCertificateChainErrors flags are set on the SslPolicyErrors value provided to the callback.

I renamed the option to TrustedCertificateThumbprints because it is both the certificate chain and name that can be mismatched when using the dev certificates. The dev certificate has a CN of localhost but the call to another container would be to an internal host name like proxy.

  • If there are no SSL policy errors (SslPolicyErrors.None) and the certificate is valid, the callback immediately returns true to avoid further processing
  • If the thumbprint of the certificate cannot be retrieved false is returned
  • If a match for the thumbprint is found true is returned
  • If no match for the thumbprint is found false is returned
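The steps above could be sketched like this. This is an illustration inferred from the discussion, not the merged code; the TrustedCertificateThumbprints name follows the option described in this comment:

```csharp
using System;
using System.Net.Http;
using System.Net.Security;

// Assumed to come from AzureOpenAIConfig.TrustedCertificateThumbprints
var trustedThumbprints = new[] { "ABCDEF1234567890ABCDEF1234567890ABCDEF12" };

var handler = new HttpClientHandler
{
    ServerCertificateCustomValidationCallback = (request, certificate, chain, sslPolicyErrors) =>
    {
        // 1. No policy errors and a valid certificate: accept immediately
        if (sslPolicyErrors == SslPolicyErrors.None) return true;

        // The self-signed dev cert, called via an internal container hostname,
        // triggers both RemoteCertificateNameMismatch and
        // RemoteCertificateChainErrors, so any policy error falls
        // through to the thumbprint check below.

        // 2. No retrievable thumbprint: reject
        string thumbprint = certificate?.GetCertHashString();
        if (thumbprint == null) return false;

        // 3/4. Accept only when the thumbprint matches a configured value
        return Array.Exists(trustedThumbprints,
            t => string.Equals(t, thumbprint, StringComparison.OrdinalIgnoreCase));
    }
};
```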

The HttpClient with the custom HttpClientHandler is only created if one is not already provided and the configuration has values configured in TrustedCertificateThumbprints.

I noted there was an opportunity to utilize the IHttpClientFactory to extend the functionality to other services like Anthropic but decided to leave the options on the AzureOpenAIConfig class for now. I guess they could be moved to a more central location if others are proxying additional services and would benefit from the same feature.

@alexmg alexmg force-pushed the AzureOpenAIEndpointValidation branch from 9b9324e to 330c9e4 Compare November 19, 2024 07:15
@dluc dluc merged commit 4b6b0e8 into microsoft:main Nov 19, 2024
6 checks passed
@alexmg alexmg deleted the AzureOpenAIEndpointValidation branch November 19, 2024 23:48