feat: simplified user administration #20

Open
wants to merge 31 commits into
base: sqlite
31 commits
0160977
chore: code cleanup
NanamiNakano Jan 4, 2025
8702017
feat(user_utils)!: manage user purely from a CSV file
NanamiNakano Jan 4, 2025
6046a80
feat(user_utils): delete user through csv
NanamiNakano Jan 4, 2025
fb9641f
chore: vocabulary
NanamiNakano Jan 4, 2025
88f03ee
docs: user administration
NanamiNakano Jan 4, 2025
9faff6b
docs: user administration
NanamiNakano Jan 4, 2025
bd72e42
feat:(database): version control
NanamiNakano Jan 5, 2025
5515a76
feat: mercury version
NanamiNakano Jan 5, 2025
7d7fd28
fix: config key is not unique
NanamiNakano Jan 5, 2025
0ef373f
feat: migrate to database with version control
NanamiNakano Jan 5, 2025
8d5adb0
chore: make code more readable
NanamiNakano Jan 5, 2025
b56587e
fix:: drop old table
NanamiNakano Jan 5, 2025
37ce30f
user admin without CSV
forrestbao Jan 14, 2025
803d84d
feat:(database): version control
NanamiNakano Jan 5, 2025
0953425
feat: mercury version
NanamiNakano Jan 5, 2025
5b8d7e3
fix: config key is not unique
NanamiNakano Jan 5, 2025
10b712d
feat: migrate to database with version control
NanamiNakano Jan 5, 2025
c143fda
chore: make code more readable
NanamiNakano Jan 5, 2025
d6f1d78
fix:: drop old table
NanamiNakano Jan 5, 2025
efb6dc3
clean up migration scripts
forrestbao Jan 14, 2025
158b7a4
chore: code cleanup
NanamiNakano Jan 4, 2025
ef4610f
feat(user_utils)!: manage user purely from a CSV file
NanamiNakano Jan 4, 2025
3d8399d
feat(user_utils): delete user through csv
NanamiNakano Jan 4, 2025
7fc191a
chore: vocabulary
NanamiNakano Jan 4, 2025
cd9376f
docs: user administration
NanamiNakano Jan 4, 2025
47874b6
docs: user administration
NanamiNakano Jan 4, 2025
645aedf
user admin without CSV
forrestbao Jan 14, 2025
1cc2864
clean up migration scripts
forrestbao Jan 14, 2025
e4580f6
chore: fix typo
NanamiNakano Jan 20, 2025
0597a4f
Merge branch 'feat/database-version-control' into feat/user-management
NanamiNakano Jan 20, 2025
6b77e61
fix: user admin
NanamiNakano Jan 20, 2025
21 changes: 12 additions & 9 deletions README.md
@@ -11,7 +11,7 @@ Currently, Mercury only supports labeling inconsistencies between the source and

![Header](usage/selection_from_highlight.png)

## Dependencies
## Dependencies and setup

> [!NOTE]
> You need Python and Node.js.
@@ -22,7 +22,9 @@ Mercury uses [`sqlite-vec`](https://github.com/asg017/sqlite-vec) to store and s

2. If you don't have `pnpm` installed, please install with `npm install -g pnpm` - you may need `sudo`. If you don't have `npm`, try `sudo apt install npm`.

3. To use `sqlite-vec` via Python's built-in `sqlite3` module, you must have SQLite>3.41 (otherwise `LIMIT` or `k=?` will not work properly with `rowid IN (?)` for vector search) installed and ensure Python's built-in `sqlite3` module is built for SQLite>3.41. Note that Python's built-in `sqlite3` module uses its own binary library that is independent of the OS's SQLite. So upgrading the OS's SQLite will not upgrade Python's `sqlite3` module.
3. Compile the frontend: `pnpm install && pnpm build`

4. To use `sqlite-vec` via Python's built-in `sqlite3` module, you must have SQLite>3.41 (otherwise `LIMIT` or `k=?` will not work properly with `rowid IN (?)` for vector search) installed and ensure Python's built-in `sqlite3` module is built for SQLite>3.41. Note that Python's built-in `sqlite3` module uses its own binary library that is independent of the OS's SQLite. So upgrading the OS's SQLite will not upgrade Python's `sqlite3` module.
To manually upgrade Python's `sqlite3` module to use SQLite>3.41, here are the steps:
* Download and compile SQLite>3.41.0 from source
```bash
@@ -48,7 +50,7 @@ Mercury uses [`sqlite-vec`](https://github.com/asg017/sqlite-vec) to store and s
* If you are using Mac and run into troubles, please follow
SQLite-vec's [instructions](https://alexgarcia.xyz/sqlite-vec/python.html#updated-sqlite).

4. To use `sqlite-vec` directly in `sqlite` prompt, simply [compile
5. To use `sqlite-vec` directly in `sqlite` prompt, simply [compile
`sqlite-vec` from source](https://alexgarcia.xyz/sqlite-vec/compiling.html) and load the compiled `vec0.o`. The usage
can be found in the SQLite-vec's [README](https://github.com/asg017/sqlite-vec?tab=readme-ov-file#sample-usage).
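
Because Python's `sqlite3` module links against its own SQLite library, it can help to check which version it actually uses before running Mercury. The following is a minimal sketch, not part of Mercury: the helper name `sqlite_is_new_enough` and the constant `MIN_SQLITE` are hypothetical, and the strict "greater than 3.41.0" comparison simply mirrors the wording of the README.

```python
import sqlite3

# Assumption: the README asks for SQLite > 3.41.0, so the comparison is strict.
MIN_SQLITE = (3, 41, 0)


def sqlite_is_new_enough(version_info=None, minimum=MIN_SQLITE):
    """Return True if the given SQLite version is strictly newer than `minimum`.

    When `version_info` is omitted, the version linked into Python's
    built-in `sqlite3` module is used (not the OS-level SQLite).
    """
    if version_info is None:
        version_info = sqlite3.sqlite_version_info
    return tuple(version_info) > tuple(minimum)


# Report the SQLite library that this Python interpreter is linked against.
print("SQLite linked into Python:", sqlite3.sqlite_version)
print("New enough for sqlite-vec vector search:", sqlite_is_new_enough())
```

If the second line prints `False`, upgrading the OS's SQLite alone will not help; the `sqlite3` module itself has to be rebuilt or replaced as described in the steps above.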

@@ -58,13 +60,14 @@ Mercury uses [`sqlite-vec`](https://github.com/asg017/sqlite-vec) to store and s

Run `python3 ingester.py -h` to see the options.

The ingester takes a CSV, JSON, or JSONL file and loads texts from two text columns (configurable via option `ingest_column_1` and `ingest_column_2` which default to `source` and `summary`) of the file. After ingestion, the data will be stored in the SQLite database, denoted as `MERCURY_DB` in the following steps.
The ingester takes a CSV, JSON, or JSONL file and loads texts from two text columns (configurable via option `ingest_column_1` and `ingest_column_2` which default to `source` and `summary`) of the file. After ingestion, the data will be stored in the SQLite database, denoted as `CORPUS_DB` in the following steps.

2. Manually set the labels for annotators to choose from in the `labels.yaml` file. Mercury supports hierarchical labels.
3. Generate and set a JWT secret key: `export SECRET_KEY=$(openssl rand -base64 32)`. You can rerun the command above to generate a new secret key when needed, especially when the old one is compromised. Note that changing the JWT token will log out all users. Optionally, you can also set `EXPIRE_MINUTES` to change the expiration time of the JWT token. The default is 7 days (10080 minutes).
4. Start the Mercury annotation server: `python3 server.py --mercury_db {CORPUS_DB} --user_db {USER_DB}`.

2. `pnpm install && pnpm build` (You need to recompile the frontend each time the UI code changes.)
3. Manually set the labels for annotators to choose from in the `labels.yaml` file. Mercury supports hierarchical labels.
4. Generate and set a JWT secret key: `export SECRET_KEY=$(openssl rand -base64 32)`. You can rerun the command above to generate a new secret key when needed, especially when the old one is compromised. Note that changing the JWT token will log out all users. Optionally, you can also set `EXPIRE_MINUTES` to change the expiration time of the JWT token. The default is 7 days (10080 minutes).
5. Administer the users: `python3 user_utils.py -h`. You need to create users before they can work on the annotation task. You can register new users, reset passwords, and delete users. User credentials are stored in a separate SQLite database, denoted as `USER_DB` in the following steps.
6. Start the Mercury annotation server: `python3 server.py --mercury_db {MERCURY_DB} --user_db {USER_DB}`. Be sure to set the candidate labels to choose from in the `labels.yaml` file.
Be sure to set the candidate labels to choose from in the `labels.yaml` file. The server will run on `http://localhost:8000` by default. The default `USER_DB`, namely `users.sqlite`, is distributed with the code repo with the default Email and password as `[email protected]` and `test`, respectively.
5. **Optional** To add/update/list users in a `USER_DB`, see [User administration in Mercury](user_admin.md) for more details.

The annotations are stored in the `annotations` table in a SQLite database (hardcoded name `mercury.sqlite`). See the
section [`annotations` table](#annotations-table-the-human-annotations) for the schema.
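
For the JWT secret key step above, a key comparable to the output of `openssl rand -base64 32` can also be produced with Python's standard library when `openssl` is not available. This is only a sketch under that assumption: the function name `generate_secret_key` and the constant `DEFAULT_EXPIRE_MINUTES` are hypothetical and do not appear in Mercury.

```python
import base64
import secrets


def generate_secret_key(num_bytes: int = 32) -> str:
    """Return `num_bytes` of cryptographically strong randomness, base64-encoded."""
    return base64.b64encode(secrets.token_bytes(num_bytes)).decode("ascii")


# The README's default token lifetime: 7 days expressed in minutes.
DEFAULT_EXPIRE_MINUTES = 7 * 24 * 60

key = generate_secret_key()
# Print a line that can be pasted into a shell before starting the server.
print(f"export SECRET_KEY='{key}'")
print(f"export EXPIRE_MINUTES={DEFAULT_EXPIRE_MINUTES}")
```

As with the `openssl` command, generating a new key invalidates tokens signed with the old one, so all users will need to log in again.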
15 changes: 9 additions & 6 deletions database.py
@@ -10,6 +10,7 @@
import sqlite_vec

from dotenv import load_dotenv
from version import __version__


class OldLabelData(TypedDict): # readable by frontend
@@ -162,6 +163,13 @@ def __init__(self, mercury_db_path: str, user_db_path: str):
# prepare the database
mercury_db = sqlite3.connect(mercury_db_path)
print("Open db at ", mercury_db_path)
version = mercury_db.execute("SELECT value FROM config WHERE key = 'version'").fetchone()
if version is None:
print("Cannot find Mercury version in the database. Please migrate the database.")
exit(1)
elif version[0] != __version__:
print (f"Mercury version mismatch between the code and the database file. The version in the database is {version[0]}, but the code version is {__version__}. Please migrate the database.")
exit(1)
mercury_db.execute("CREATE TABLE IF NOT EXISTS annotations (\
annot_id INTEGER PRIMARY KEY AUTOINCREMENT, \
sample_id INTEGER, \
@@ -446,12 +454,6 @@ def delete_annotation(self, record_id: str, annotator: str):
self.mercury_db.execute(sql_cmd, (int(record_id), annotator))
self.mercury_db.commit()

@database_lock()
def add_user(self, user_id: str, user_name: str): # TODO: remove this method since now only admin can add user
sql_cmd = "INSERT INTO users (user_id, user_name) VALUES (?, ?)"
self.mercury_db.execute(sql_cmd, (user_id, user_name))
self.mercury_db.commit()

@database_lock()
def change_user_name(self, user_id: str, user_name: str):
self.user_db.execute("UPDATE users SET user_name = ? WHERE user_id = ?", (user_name, user_id))
@@ -690,6 +692,7 @@ def get_env_id_value(env_name: str) -> int | None:
parser.add_argument("--mercury_db_path", type=str, required=True, help="Path to the Mercury SQLite database")
parser.add_argument("--user_db_path", type=str, required=True, help="Path to the user SQLite database")
parser.add_argument("--dump_file", type=str, required=True, default="mercury_annotations.json")
parser.add_argument("--version", action="version", version=__version__)
args = parser.parse_args()

# db = Database(args.annotation_corpus_id)
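
A note on the `--version` flag added in this diff: `argparse`'s `version` action prints whatever string is passed as `version=`, so a quoted `"__version__"` would print that literal text rather than the version number. The sketch below shows one way to wire in the value itself. It is illustrative only: the program name `mercury`, the helper names `build_parser` and `version_output`, and the inline `__version__` (standing in for `from version import __version__`) are assumptions.

```python
import argparse
import contextlib
import io

# Stand-in for `from version import __version__`; the value is assumed here.
__version__ = "0.1.0"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="mercury")
    # Pass the variable's value, not the string "__version__".
    parser.add_argument(
        "--version", action="version", version=f"%(prog)s {__version__}"
    )
    return parser


def version_output() -> str:
    """Capture what `--version` prints; argparse exits after printing."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        try:
            build_parser().parse_args(["--version"])
        except SystemExit:
            pass
    return buffer.getvalue().strip()


print(version_output())
```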
9 changes: 8 additions & 1 deletion ingester.py
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@

from dotenv import load_dotenv
from tqdm.auto import tqdm
from version import __version__

import struct

@@ -118,7 +119,7 @@ def prepare_db(self):
f"CREATE VIRTUAL TABLE embeddings USING vec0(embedding float[{self.embedding_dimension}])"
)
self.db.execute(
"CREATE TABLE IF NOT EXISTS config (key TEXT PRIMARY KEY, value TEXT)"
"CREATE TABLE IF NOT EXISTS config (key TEXT PRIMARY KEY UNIQUE , value TEXT)"
)
self.db.execute(
"CREATE TABLE IF NOT EXISTS sample_meta (sample_id INTEGER PRIMARY KEY, json_meta TEXT)"
@@ -138,6 +139,10 @@ def prepare_db(self):
"INSERT OR REPLACE INTO config (key, value) VALUES ('embedding_dimension', ?)",
[self.embedding_dimension],
)
self.db.execute(
"INSERT OR REPLACE INTO config (key, value) VALUES ('version', ?)",
[__version__]
)

self.db.commit()

@@ -249,9 +254,11 @@ def get_env_id_value(env_name: str) -> int | None:
default="summary",
help="The name of the 2nd column to ingest",
)
parser.add_argument("--version", action="version", version=__version__)

args = parser.parse_args()

print("Mercury version: ", __version__)
print("Ingesting data")
ingester = Ingester(
file_to_ingest=args.file_to_ingest,
File renamed without changes.
28 changes: 28 additions & 0 deletions migration/database_version_control.py
@@ -0,0 +1,28 @@
import argparse
import sqlite3


class Migrator:
def __init__(self, db_path):
conn = sqlite3.connect(db_path)
self.conn = conn
version = self.conn.execute("SELECT count(*) FROM config WHERE key = 'version'").fetchone()
if version[0] != 0:
print("Can not migrate database with existing version")
exit(0)

def migrate(self):
self.conn.execute("ALTER TABLE config RENAME TO config_old")
self.conn.execute("CREATE TABLE config(key TEXT PRIMARY KEY UNIQUE , value TEXT)")
self.conn.execute("INSERT INTO config SELECT key, value FROM config_old")
self.conn.execute("INSERT INTO config VALUES ('version', '0.1.0')")
self.conn.execute("DROP TABLE config_old")
self.conn.commit()
print("Migration completed")

if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Migrate the database to 0.1.0")
parser.add_argument("--db_path", help="Path to the database", default="../mercury.sqlite")
args = parser.parse_args()
migrator = Migrator(args.db_path)
migrator.migrate()
32 changes: 32 additions & 0 deletions migration/readme.md
@@ -0,0 +1,32 @@
# Migrating data from old versions

Mercury and its database structure are evolving rapidly.


## Adding user login (December 14, 2024)
This change enables credential-based login, which frees users from having to always use the same browser.

To migrate, use the following steps:

```bash
python3 add_login.py export --workdir {DIR_OF_SQLITE_FILES} --csv unified_users.csv
python3 add_login.py register --csv unified_users.csv --db unified_users.sqlite
```

`{DIR_OF_SQLITE_FILES}` is the directory of SQLite corpus DB files that were created before login was implemented.
The script `add_login.py` extracts `user_id` and `user_name` from the corpus DB files that contain annotations and dumps them to a CSV file.
Then, the script creates a SQLite DB file, referred to as `USER_DB`, which can be passed to the updated Mercury.

## Adding versioning (January 15, 2025)

To deal with the ever-changing database structure, we introduce versioning to Mercury. The Mercury version that a corpus DB belongs to is stored in the `config` table of that corpus DB.
The version of the Mercury code is stored in a dedicated file called `version.py`.
The first version is 0.1.0.

To migrate, use the following steps:

```bash
python3 database_version_control.py --db_path {OLD_CORPUS_DB}
```

The migration modifies the database file in place.
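
To illustrate how the stored version can be compared with the code version, here is a minimal sketch against an in-memory database. It is not Mercury's implementation: the helper names `read_db_version` and `needs_migration` are hypothetical, and only the `config` table with its `key`/`value` columns and the `'version'` key are taken from the diff.

```python
import sqlite3
from typing import Optional


def read_db_version(conn: sqlite3.Connection) -> Optional[str]:
    """Return the version stored in the `config` table, or None if absent."""
    row = conn.execute(
        "SELECT value FROM config WHERE key = 'version'"
    ).fetchone()
    return row[0] if row is not None else None


def needs_migration(conn: sqlite3.Connection, code_version: str) -> bool:
    """True when the DB has no version or a version that differs from the code's."""
    return read_db_version(conn) != code_version


# Simulate an old corpus DB that has a `config` table but no version row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE config (key TEXT PRIMARY KEY, value TEXT)")
conn.execute("INSERT INTO config VALUES ('embedding_dimension', '4')")
version_before = read_db_version(conn)

# After migration, the version row is present.
conn.execute("INSERT INTO config VALUES ('version', '0.1.0')")
conn.commit()
version_after = read_db_version(conn)

print("before:", version_before, "after:", version_after)
```

Returning `None` for a missing row keeps the "no version yet" and "wrong version" cases distinguishable, which matches the two separate messages printed in `database.py`.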
11 changes: 3 additions & 8 deletions server.py
@@ -26,6 +26,7 @@
import sqlite_vec
from ingester import Embedder
from database import Database
from version import __version__

import jwt
from jwt.exceptions import InvalidTokenError
@@ -135,14 +136,6 @@ async def get_labels() -> list: # get all candidate labels for human annotators
return labels


@app.get("/user/new") # please update the route name to be more meaningful, e.g., /user/new_user
async def create_new_user():
user_id = uuid.uuid4().hex
user_name = "New User"
database.add_user(user_id, user_name)
return {"key": user_id, "name": user_name}


@app.get("/user/me")
async def get_user(token: Annotated[str, Depends(oauth2_scheme)], config: Config = Depends(get_config)) -> User:
credentials_exception = HTTPException(
@@ -449,6 +442,7 @@ async def login():
parser.add_argument("--mercury_db", type=str, required=True, default="./mercury.sqlite")
parser.add_argument("--user_db", type=str, required=True, default="./user.sqlite")
parser.add_argument("--port", type=int, default=8000)
parser.add_argument("--version", action="version", version=__version__)
args = parser.parse_args()

env_secret_key = os.getenv("SECRET_KEY")
@@ -458,6 +452,7 @@ async def login():
expire = int(os.getenv("EXPIRE_MINUTES", 10080))
env_config = Config(secret_key=env_secret_key, expire=expire)

print("Mercury version: ", __version__)
print("Using Mercury SQLite db: ", args.mercury_db)
print("Using User SQLite db: ", args.user_db)

66 changes: 66 additions & 0 deletions user_admin.md
@@ -0,0 +1,66 @@
# User administration in Mercury

Mercury uses a SQLite DB for user info (denoted as `USER_DB`) that is separate from the main corpus DB `CORPUS_DB`. By decoupling user administration from the corpus, a single user DB can serve multiple corpora, and annotations are always de-anonymized. The default name for the user DB is `users.sqlite`.

In a Mercury `USER_DB`, the following fields are stored for each user:
* `user_id`: Hash string that uniquely identifies a user
* `user_name`: User's name (for display purpose only, not for login)
* `email`: User's email (for login)
* `hashed_password`: Hashed password (for login)

The script for user administration is `user_admin.py`.

Actions that can be performed:
* Creating a new user

There are two ways to create a new user:

1. Using interactive mode:
```bash
python user_admin.py new
```
then follow the prompts.

2. Using command line arguments:

```bash
python user_admin.py new -n <user_name> -e <email> -p <password>
```
For example, to create a user with name `Test User`, email `[email protected]` and a random password:

```bash
python user_admin.py new -n "Test User" -e "[email protected]"
```

* Listing all users

```bash
python user_admin.py list
```

* Changing the password or email of a user, including resetting password

There are two ways to update a user's info:
1. Using interactive mode:

```bash
python user_admin.py update
```
then follow the prompts.

2. Using command line arguments:
```bash
python user_admin.py update -k <field_to_locate_user> -v <value_to_locate_user> -f <field_to_update> -n <new_value_of_the_field>
```

For example, to change the password of a user with email `[email protected]` to `abcdefg`:

```bash
python user_admin.py update -k email -v [email protected] -f password -n abcdefg
```

For various reasons, Mercury does not support deleting users. However, you can simply change the password of a user to a random string to effectively disable the user.



Mercury has minimal exception handling for user administration.
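
Since users cannot be deleted, disabling one comes down to the `update` command with a password nobody knows. The sketch below builds such a command; it is only an illustration. The function name `disable_user_command` and the address `annotator@example.com` are hypothetical, while the `update -k/-v/-f/-n` flags are the ones documented above. A hex token is used so that the generated password never starts with `-`, which a command-line parser could mistake for an option.

```python
import secrets
import shlex


def disable_user_command(email: str) -> list:
    """Build the `user_admin.py update` call that sets a random password."""
    # 16 random bytes rendered as 32 hex characters; never starts with "-".
    random_password = secrets.token_hex(16)
    return [
        "python", "user_admin.py", "update",
        "-k", "email",
        "-v", email,
        "-f", "password",
        "-n", random_password,
    ]


cmd = disable_user_command("annotator@example.com")  # hypothetical address
# Print a shell-safe version of the command for review before running it.
print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run(cmd, check=True)` from the directory that contains `user_admin.py`.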