This repository contains a Python Streamlit application that lets users search for movies using semantic search powered by OpenAI embeddings and Azure Cosmos DB. The application demonstrates vector search, full text search, text ranking, and hybrid search. It creates three containers: one with no vector index, one with a qflat (quantizedFlat) index, and one with a DiskANN index.
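The three container configurations can be sketched as plain policy dictionaries. This is a minimal illustration, not the app's actual code: the helper name `build_container_policies`, the 1536-dimension default, the `float32` data type, and the cosine distance function are all assumptions, while the "/embedding" and "/text" paths follow the field names used by the data loader described later in this README.

```python
# Hypothetical helper: builds the policy dictionaries for one container.
# Dimensions, dataType, and distanceFunction are assumed values; adjust them
# to match the embedding model you actually deploy.
VALID_INDEX_TYPES = (None, "flat", "quantizedFlat", "diskANN")


def build_container_policies(index_type, dimensions=1536,
                             vector_path="/embedding", text_path="/text"):
    if index_type not in VALID_INDEX_TYPES:
        raise ValueError(f"unsupported vector index type: {index_type!r}")

    # Describes the vector field itself (needed even without a vector index).
    vector_embedding_policy = {
        "vectorEmbeddings": [{
            "path": vector_path,
            "dataType": "float32",
            "distanceFunction": "cosine",
            "dimensions": dimensions,
        }]
    }

    # Describes which field holds text for full text search.
    full_text_policy = {
        "defaultLanguage": "en-US",
        "fullTextPaths": [{"path": text_path, "language": "en-US"}],
    }

    indexing_policy = {
        "includedPaths": [{"path": "/*"}],
        # Keep the raw vector out of the regular range index.
        "excludedPaths": [{"path": f"{vector_path}/*"}],
        "fullTextIndexes": [{"path": text_path}],
    }
    # The "no index" container simply omits the vectorIndexes section.
    if index_type is not None:
        indexing_policy["vectorIndexes"] = [
            {"path": vector_path, "type": index_type}
        ]

    return {
        "vector_embedding_policy": vector_embedding_policy,
        "full_text_policy": full_text_policy,
        "indexing_policy": indexing_policy,
    }
```

Calling the helper with `None`, `"quantizedFlat"`, and `"diskANN"` yields one configuration per container. How the dictionaries are passed to container creation depends on the version of the azure-cosmos SDK in use, so check its documentation before relying on this shape.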
- Integration with Azure Cosmos DB for storing and querying movie data.
- Semantic search for movies using OpenAI embeddings.
- Full text search for movies.
- Hybrid search combining semantic and full text search.
- Text ranking for search results.
- Support for different indexes (No Index, Qflat Index, DiskANN Index).
- Interactive UI built with Streamlit.
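To make the hybrid search feature concrete, here is a small pure-Python sketch of Reciprocal Rank Fusion (RRF), a common way to merge a vector-search ranking with a full-text ranking. It assumes the hybrid mode fuses rankings this way; the function name and the smoothing constant `k=60` (the conventional value from the RRF literature) are illustrative choices, and the database computes its own fusion server-side.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranked list.

    Each document scores 1 / (k + rank) for every list it appears in,
    where rank starts at 1. Higher total score ranks first; ties are
    broken by document id so the output is deterministic.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Example: ids ranked by vector similarity and by full-text relevance.
vector_ranking = ["a", "b", "c"]
text_ranking = ["b", "d", "a"]
fused = reciprocal_rank_fusion([vector_ranking, text_ranking])
print(fused)  # ['b', 'a', 'd', 'c']
```

Documents that rank well in both lists ("b" and "a") rise to the top, which is the behavior hybrid search aims for.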
- Azure Cosmos DB account with NoSQL API.
- Azure OpenAI account.
- Clone the repository and navigate to the folder:
    git clone https://github.com/AzureCosmosDB/BRK193-Ignite2024.git
    cd BRK193-Ignite2024/cosmos-search-demo
- Create a file named ".env" in the "app" folder and set the required environment variables for Azure Cosmos DB and Azure OpenAI:
    AZURE_OPENAI_APIKEY=
    AZURE_OPENAI_ENDPOINT=
    AZURE_COSMOSDB_ENDPOINT=
    AZURE_COSMOSDB_KEY=
- Install the required packages:
    pip install -r requirements.txt
- Run the Streamlit application:
    streamlit run src/app/cosmos-app.py
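Before starting the app, it can help to confirm that the four settings are present. The check below is a hedged sketch using only the standard library; the function name `missing_env_vars` is hypothetical, and it reads the process environment, so values that exist only in the ".env" file must be exported or loaded first.

```python
import os

# The four variable names listed in the ".env" step of this README.
REQUIRED_VARS = (
    "AZURE_OPENAI_APIKEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_COSMOSDB_ENDPOINT",
    "AZURE_COSMOSDB_KEY",
)


def missing_env_vars(env=None):
    """Return the required variable names that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


missing = missing_env_vars()
if missing:
    print("Missing settings:", ", ".join(missing))
else:
    print("All required settings are present.")
```

Passing a dictionary instead of relying on `os.environ` makes the function easy to test in isolation.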
- Install the Azure App Service extension for Visual Studio Code:
- Open the Extensions view by clicking on the square icon in the Sidebar.
- Search for "Azure App Service" and click on the Install button.
- Deploy the application:
- Create an Azure Web App with a Linux service plan (choose the B1 SKU or higher) and Python 3.10.
- In App Service, go to Configuration and paste the following into the Startup Command field:
    python -m pip install -r requirements.txt && python -m streamlit run src/app/cosmos-app.py --server.port 8000 --server.address 0.0.0.0
- Press Ctrl + Shift + P and select "Azure App Service: Deploy to Web App".
- Select this folder on your machine.
- Select your subscription.
- Select the Azure Web App you created above.
- Wait until the app is deployed (this can take up to 5 minutes).
- The app will create the containers with required vector and text search policies and indexes. You need to load the data into the containers separately.
- The data-loader.py script is provided in the src/data folder. You can run this script on any data, as long as it is a JSON array of documents with a unique "id" field and a field of any name containing the text to be vectorized. You can also use an existing vectorized field, or re-embed that field using OpenAI embeddings if necessary. The command below loads the MovieLens dataset into the 3 containers in a database called ignite2024demo (created by the Streamlit app above). It re-embeds the "overview" field and renames it "text", names the embedding field "embedding" to match the Streamlit app, and discards the existing "vector" embedding field:
    python src/data/data-loader.py --text_field_name "overview" --path_to_json_array "https://raw.githubusercontent.com/microsoft/AzureDataRetrievalAugmentedGenerationSamples/refs/heads/main/DataSet/Movies/MovieLens-4489-256D.json" --database_name "ignite2024demo" --concurrency 20 --vector_field_name "vector" --re_embed True
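If you plan to load your own data, a quick pre-check of the input shape can save a failed run. The sketch below is not part of data-loader.py; `validate_documents` is a hypothetical helper that tests only the two requirements stated above (a unique "id" per document and a non-empty text field).

```python
import json


def validate_documents(docs, text_field):
    """Return a list of problems; an empty list means the data looks loadable."""
    if not isinstance(docs, list):
        return ["top-level JSON value must be an array"]

    problems = []
    seen_ids = set()
    for i, doc in enumerate(docs):
        if not isinstance(doc, dict):
            problems.append(f"item {i}: not a JSON object")
            continue

        doc_id = doc.get("id")
        if doc_id is None:
            problems.append(f"item {i}: missing 'id'")
        elif str(doc_id) in seen_ids:
            problems.append(f"item {i}: duplicate id {doc_id!r}")
        else:
            seen_ids.add(str(doc_id))

        text = doc.get(text_field)
        if not isinstance(text, str) or not text.strip():
            problems.append(f"item {i}: missing or empty '{text_field}'")
    return problems


# Example with a tiny in-memory JSON array (made-up sample documents).
sample = json.loads(
    '[{"id": "1", "overview": "A heist goes wrong."},'
    ' {"id": "1", "overview": ""}]'
)
for problem in validate_documents(sample, "overview"):
    print(problem)
```

For the sample above this reports a duplicate id and an empty "overview" on the second item.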