hub: cleanup data of deleted projects #7667

Draft
wants to merge 44 commits into base: master
44 commits
c3e2409
hub: cleanup data of deleted projects
haraldschilly Jul 5, 2024
f0b6d4c
Merge remote-tracking branch 'origin/master' into delete-project-data
haraldschilly Jul 7, 2024
a7a09c5
Merge remote-tracking branch 'origin/master' into delete-project-data
haraldschilly Jul 10, 2024
b546a4d
npm: updating pg + @types/pg
haraldschilly Jul 10, 2024
3bcabeb
database/cleanup: WIP bulk delete
haraldschilly Jul 10, 2024
af28b7f
fix fallout of b546a4d268e5560
haraldschilly Jul 10, 2024
4126e2c
database/bulk-delete: delete many rows without overwhelming the DB
haraldschilly Jul 10, 2024
59131f7
database/delete-projects: expand scope
haraldschilly Jul 10, 2024
5d7c4aa
database/test: attempting to actually fix testing the database in the…
haraldschilly Jul 10, 2024
b0b21a6
Merge remote-tracking branch 'origin/master' into delete-project-data
haraldschilly Jul 10, 2024
0857338
Merge remote-tracking branch 'origin/master' into delete-project-data
haraldschilly Jul 11, 2024
d8f481c
database/delete-project: refactor/fixes
haraldschilly Jul 11, 2024
39d0065
hub/delete-project: expand functionality and actually delete files
haraldschilly Jul 11, 2024
2429b15
Merge remote-tracking branch 'origin/master' into delete-project-data
haraldschilly Jul 12, 2024
f4cabc5
hub/delete-project: add explicit delete_project_data setting
haraldschilly Jul 12, 2024
74936ec
Merge remote-tracking branch 'origin/master' into delete-project-data
haraldschilly Jul 16, 2024
a792a39
Merge remote-tracking branch 'origin/master' into delete-project-data
haraldschilly Jul 16, 2024
d163c82
Merge remote-tracking branch 'origin/master' into delete-project-data
haraldschilly Jul 17, 2024
3ba695a
database/bulk-delete: add env vars and improve test
haraldschilly Jul 17, 2024
a0dea09
frontend/settings: modernize project visibility and delete controls
haraldschilly Jul 17, 2024
4826f84
hub/delete-project: refactor, delete files, bugfixes
haraldschilly Jul 17, 2024
b497866
Merge remote-tracking branch 'origin/master' into delete-project-data
haraldschilly Jul 17, 2024
55d2a11
backend/logger: tighter typing
haraldschilly Jul 18, 2024
33ddc7a
hub/delete-projects: fix homePath, to make it work for the OnPrem sit…
haraldschilly Jul 18, 2024
993c74b
Merge remote-tracking branch 'origin/master' into delete-project-data
haraldschilly Jul 19, 2024
a753951
Merge remote-tracking branch 'origin/master' into delete-project-data
haraldschilly Jul 26, 2024
b88fde6
Merge remote-tracking branch 'origin/master' into delete-project-data
haraldschilly Jul 29, 2024
bc6be27
Merge remote-tracking branch 'origin/master' into delete-project-data
haraldschilly Jul 30, 2024
8d3877f
hub/delete-projects: also blobs table
haraldschilly Jul 30, 2024
3ceb67d
Merge remote-tracking branch 'origin/master' into delete-project-data
haraldschilly Jul 30, 2024
2c27cb2
Merge remote-tracking branch 'origin/master' into delete-project-data
haraldschilly Jul 30, 2024
8617568
Merge remote-tracking branch 'origin/master' into delete-project-data
haraldschilly Aug 12, 2024
91037d5
Merge remote-tracking branch 'origin/master' into delete-project-data
haraldschilly Aug 27, 2024
83f34ea
Merge remote-tracking branch 'origin/master' into delete-project-data
haraldschilly Sep 9, 2024
0c260e4
Merge remote-tracking branch 'origin/master' into delete-project-data
haraldschilly Oct 15, 2024
1d06656
util/compute-states: catch up new project state with translations
haraldschilly Oct 15, 2024
d1989d8
frontend/project: translate project deleted banner and make it a prop…
haraldschilly Oct 15, 2024
5d797ef
hub/delete-projects: reset more fields and clear more table entries
haraldschilly Oct 15, 2024
74bb274
Merge remote-tracking branch 'origin/master' into delete-project-data
haraldschilly Oct 24, 2024
b7023da
Merge remote-tracking branch 'origin/master' into delete-project-data
haraldschilly Nov 4, 2024
c522750
Merge remote-tracking branch 'origin/master' into delete-project-data
haraldschilly Nov 5, 2024
9158f85
Merge remote-tracking branch 'origin/master' into delete-project-data
haraldschilly Nov 20, 2024
8920e28
Merge remote-tracking branch 'origin/master' into delete-project-data
haraldschilly Nov 25, 2024
ee48888
database/bulk-delete: disable "blob" table for now
haraldschilly Nov 25, 2024
17 changes: 17 additions & 0 deletions src/packages/backend/files/path-to-files.ts
@@ -0,0 +1,17 @@
/*
* This file is part of CoCalc: Copyright © 2020 Sagemath, Inc.
* License: AGPLv3 s.t. "Commons Clause" – see LICENSE.md for details
*/

// This is used to locate files of public_paths on the share server in "next",
// and also in the hub, for deleting the shared files of projects

import { join } from "node:path";

import { projects } from "@cocalc/backend/data";

// Given a project_id/path, return the directory on the file system where
// that path should be located.
export function pathToFiles(project_id: string, path: string): string {
return join(projects.replace("[project_id]", project_id), path);
}
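
For orientation, a minimal usage sketch (assuming the package-relative import path, and that "projects" from @cocalc/backend/data is a template such as "/projects/[project_id]"):

import { pathToFiles } from "@cocalc/backend/files/path-to-files";

// hypothetical project id and shared path; with the template above this yields
// "/projects/4e21e6e3-0000-4000-8000-000000000000/public/index.md"
const dir = pathToFiles(
  "4e21e6e3-0000-4000-8000-000000000000",
  "public/index.md",
);
console.log(dir);
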
54 changes: 28 additions & 26 deletions src/packages/backend/logger.ts
@@ -12,9 +12,11 @@ process.env.DEBUG_HIDE_DATE = "yes"; // since we supply it ourselves
// otherwise, maybe stuff like this works: (debug as any).inspectOpts["hideDate"] = true;

import debug, { Debugger } from "debug";
import { mkdirSync, createWriteStream, statSync, ftruncate } from "fs";
import { format } from "util";
import { dirname, join } from "path";

import { createWriteStream, ftruncate, mkdirSync, statSync } from "node:fs";
import { dirname, join } from "node:path";
import { format } from "node:util";

import { logs } from "./data";

const MAX_FILE_SIZE_BYTES = 20 * 1024 * 1024; // 20MB
@@ -25,12 +27,12 @@ let _trimLogFileSizePath = "";
export function trimLogFileSize() {
// THIS JUST DOESN'T REALLY WORK!
return;

if (!_trimLogFileSizePath) return;
let stats;
try {
stats = statSync(_trimLogFileSizePath);
} catch(_) {
} catch (_) {
// this happens if the file doesn't exist, which is fine since "trimming" it would be a no-op
return;
}
@@ -141,27 +143,27 @@ function initTransports() {

initTransports();

const DEBUGGERS = {
error: COCALC.extend("error"),
warn: COCALC.extend("warn"),
info: COCALC.extend("info"),
http: COCALC.extend("http"),
verbose: COCALC.extend("verbose"),
debug: COCALC.extend("debug"),
silly: COCALC.extend("silly"),
};

type Level = keyof typeof DEBUGGERS;

const LEVELS: Level[] = [
const LEVELS = [
"error",
"warn",
"info",
"http",
"verbose",
"debug",
"silly",
];
] as const;

type Level = (typeof LEVELS)[number];

const DEBUGGERS: { [key in Level]: Debugger } = {
error: COCALC.extend("error"),
warn: COCALC.extend("warn"),
info: COCALC.extend("info"),
http: COCALC.extend("http"),
verbose: COCALC.extend("verbose"),
debug: COCALC.extend("debug"),
silly: COCALC.extend("silly"),
} as const;

class Logger {
private name: string;
@@ -194,13 +196,13 @@ class Logger {
}

export interface WinstonLogger {
error: Function;
warn: Function;
info: Function;
http: Function;
verbose: Function;
debug: Function;
silly: Function;
error: Debugger;
warn: Debugger;
info: Debugger;
http: Debugger;
verbose: Debugger;
debug: Debugger;
silly: Debugger;
extend: (name: string) => WinstonLogger;
isEnabled: (level: Level) => boolean;
}
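
As a side note on the tighter typing above, a hedged sketch of a consumer (assuming the module's default getLogger export, which is used elsewhere in this PR):

import getLogger from "@cocalc/backend/logger";

const log = getLogger("hub:delete-projects");
// hypothetical id; each level is now a Debugger rather than a bare Function,
// so printf-style formatting such as %s is visible in the type signature
const project_id = "4e21e6e3-0000-4000-8000-000000000000";
log.debug("cleaning up data of deleted project %s", project_id);
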
4 changes: 2 additions & 2 deletions src/packages/backend/metrics.ts
@@ -1,6 +1,6 @@
import { Counter, Gauge, Histogram } from "prom-client";

type Aspect = "db" | "database" | "server" | "llm";
type Aspect = "db" | "database" | "server" | "llm";

function withPrefix(aspect: Aspect, name: string): string {
return `cocalc_${aspect}_${name}`;
@@ -13,7 +13,7 @@ export function newCounter(
name: string,
help: string,
labelNames: string[] = [],
) {
): Counter<string> {
name = withPrefix(aspect, name);
const key = `counter-${name}`;
if (cache[key] != null) {
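
A brief sketch of how the explicit Counter<string> return type is consumed (the metric name and labels are hypothetical):

import { newCounter } from "@cocalc/backend/metrics";

// hypothetical counter; the annotated return type lets callers chain
// .labels(...).inc() without casts
const deletedRows = newCounter(
  "db",
  "bulk_delete_rows_total",
  "number of rows removed by bulk-delete",
  ["table"],
);
deletedRows.labels("project_log").inc(1024);
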
14 changes: 14 additions & 0 deletions src/packages/backend/misc.ts
@@ -1,4 +1,7 @@
import { createHash } from "crypto";
import { join } from "node:path";

import { projects } from "@cocalc/backend/data";
import { is_valid_uuid_string } from "@cocalc/util/misc";

/*
@@ -69,3 +72,14 @@ export function envForSpawn() {
}
return env;
}

// return the absolute home directory of the given project (by project_id) on disk
export function homePath(project_id: string): string {
// $MOUNTED_PROJECTS_ROOT is used by OnPrem deployments; the "projects" location is only for dev/single-user setups
const projects_root = process.env.MOUNTED_PROJECTS_ROOT;
if (projects_root) {
return join(projects_root, project_id);
} else {
return projects.replace("[project_id]", project_id);
}
}
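
A small sketch of the two branches of homePath (the paths and id are hypothetical):

import { homePath } from "@cocalc/backend/misc";

const project_id = "4e21e6e3-0000-4000-8000-000000000000";
// with MOUNTED_PROJECTS_ROOT=/mnt/projects (OnPrem) this returns
// "/mnt/projects/<project_id>"; without it, the dev/single-user
// "projects" template from @cocalc/backend/data is used instead
const home = homePath(project_id);
console.log(`deleting files under ${home}`);
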
1 change: 1 addition & 0 deletions src/packages/database/jest.config.js
@@ -3,4 +3,5 @@ module.exports = {
preset: 'ts-jest',
testEnvironment: 'node',
testMatch: ['**/?(*.)+(spec|test).ts?(x)'],
setupFiles: ['./test/setup.js'], // path to the test setup file
};
6 changes: 3 additions & 3 deletions src/packages/database/package.json
@@ -21,7 +21,7 @@
"@cocalc/database": "workspace:*",
"@cocalc/util": "workspace:*",
"@types/lodash": "^4.14.202",
"@types/pg": "^8.6.1",
"@types/pg": "^8.11.10",
"@types/uuid": "^8.3.1",
"async": "^1.5.2",
"awaiting": "^3.0.0",
@@ -32,7 +32,7 @@
"lodash": "^4.17.21",
"lru-cache": "^7.18.3",
"node-fetch": "2.6.7",
"pg": "^8.7.1",
"pg": "^8.13.0",
"random-key": "^0.3.2",
"read": "^1.0.7",
"sql-string-escape": "^1.1.6",
@@ -48,7 +48,7 @@
"build": "../node_modules/.bin/tsc --build && coffee -c -o dist/ ./",
"clean": "rm -rf dist",
"tsc": "../node_modules/.bin/tsc --watch --pretty --preserveWatchOutput",
"test": "pnpm exec jest",
"test": "TZ=UTC jest --forceExit --runInBand",
"prepublishOnly": "pnpm test"
},
"repository": {
6 changes: 5 additions & 1 deletion src/packages/database/postgres-server-queries.coffee
@@ -51,7 +51,7 @@ read = require('read')
{site_license_manager_set} = require('./postgres/site-license/manager')
{matching_site_licenses, manager_site_licenses} = require('./postgres/site-license/search')
{project_datastore_set, project_datastore_get, project_datastore_del} = require('./postgres/project-queries')
{permanently_unlink_all_deleted_projects_of_user, unlink_old_deleted_projects} = require('./postgres/delete-projects')
{permanently_unlink_all_deleted_projects_of_user, unlink_old_deleted_projects, cleanup_old_projects_data} = require('./postgres/delete-projects')
{get_all_public_paths, unlist_all_public_paths} = require('./postgres/public-paths')
{get_personal_user} = require('./postgres/personal')
{set_passport_settings, get_passport_settings, get_all_passport_settings, get_all_passport_settings_cached, create_passport, passport_exists, update_account_and_passport, _passport_key} = require('./postgres/passport')
@@ -2541,6 +2541,10 @@ exports.extend_PostgreSQL = (ext) -> class PostgreSQL extends ext
unlink_old_deleted_projects: () =>
return await unlink_old_deleted_projects(@)

# async function
cleanup_old_projects_data: (max_run_m) =>
return await cleanup_old_projects_data(@, max_run_m)

# async function
unlist_all_public_paths: (account_id, is_owner) =>
return await unlist_all_public_paths(@, account_id, is_owner)
70 changes: 70 additions & 0 deletions src/packages/database/postgres/bulk-delete.test.ts
@@ -0,0 +1,70 @@
/*
* This file is part of CoCalc: Copyright © 2024 Sagemath, Inc.
* License: AGPLv3 s.t. "Commons Clause" – see LICENSE.md for details
*/

import getPool, { initEphemeralDatabase } from "@cocalc/database/pool";
import { uuid } from "@cocalc/util/misc";
import { bulkDelete } from "./bulk-delete";

beforeAll(async () => {
await initEphemeralDatabase({});
}, 15000);

afterAll(async () => {
await getPool().end();
});

describe("bulk delete", () => {
test("deleting projects", async () => {
const p = getPool();
const project_id = uuid();
const N = 100000;

// extra entry, which has to remain
const other = uuid();
await p.query(
"INSERT INTO project_log (id, project_id, time) VALUES($1::UUID, $2::UUID, $3::TIMESTAMP)",
[other, uuid(), new Date()],
);

await p.query(
`INSERT INTO project_log (id, project_id, time)
SELECT gen_random_uuid(), $1::UUID, NOW() - interval '1 second' * g.n
FROM generate_series(1, $2) AS g(n)`,
[project_id, N],
);

const num1 = await p.query(
"SELECT COUNT(*)::INT as num FROM project_log WHERE project_id = $1",
[project_id],
);
expect(num1.rows[0].num).toEqual(N);

const res = await bulkDelete({
table: "project_log",
field: "project_id",
value: project_id,
});

// if this ever fails, the "ret.rowCount" value is inaccurate.
// This must be replaced by "RETURNING 1" in the query and a "SELECT COUNT(*) ..." and so on.
// (and not only here, but everywhere in the code base)
expect(res.rowsDeleted).toEqual(N);
expect(res.durationS).toBeGreaterThan(0.1);
expect(res.totalPgTimeS).toBeGreaterThan(0.1);
expect(res.totalWaitS).toBeGreaterThan(0.1);
expect((res.totalPgTimeS * 10) / res.totalWaitS).toBeGreaterThan(0.5);

const num2 = await p.query(
"SELECT COUNT(*)::INT as num FROM project_log WHERE project_id = $1",
[project_id],
);
expect(num2.rows[0].num).toEqual(0);

const otherRes = await p.query("SELECT * FROM project_log WHERE id = $1", [
other,
]);
expect(otherRes.rows[0].id).toEqual(other);
}, 10000);
});
98 changes: 98 additions & 0 deletions src/packages/database/postgres/bulk-delete.ts
@@ -0,0 +1,98 @@
import { escapeIdentifier } from "pg";

import getLogger from "@cocalc/backend/logger";
import { envToInt } from "@cocalc/backend/misc/env-to-number";
import getPool from "@cocalc/database/pool";
import { SCHEMA } from "@cocalc/util/schema";

const log = getLogger("db:bulk-delete");
const D = log.debug;

type Field =
| "project_id"
| "account_id"
| "target_project_id"
| "source_project_id";

const MAX_UTIL_PCT = envToInt("COCALC_DB_BULK_DELETE_MAX_UTIL_PCT", 10);
// adjust the time limits: by default, we aim to keep each delete query between 0.05 and 0.1 secs
const MAX_TIME_TARGET_MS = envToInt(
"COCALC_DB_BULK_DELETE_MAX_TIME_TARGET_MS",
100,
);
const MAX_TARGET_S = MAX_TIME_TARGET_MS / 1000;
const MIN_TARGET_S = MAX_TARGET_S / 2;
const DEFAULT_LIMIT = envToInt("COCALC_DB_BULK_DELETE_DEFAULT_LIMIT", 16);
const MAX_LIMIT = envToInt("COCALC_DB_BULK_DELETE_MAX_LIMIT", 32768);

interface Opts {
table: string; // e.g. project_log, etc.
field: Field; // for now, we only support a few
id?: string; // default "id", the ID field in the table, which identifies each row uniquely
value: string; // a UUID
  limit?: number; // default DEFAULT_LIMIT (16, or $COCALC_DB_BULK_DELETE_DEFAULT_LIMIT)
maxUtilPct?: number; // 0-100, percent
}

type Ret = Promise<{
rowsDeleted: number;
durationS: number;
totalWaitS: number;
totalPgTimeS: number;
}>;

function deleteQuery(table: string, field: string, id: string) {
const T = escapeIdentifier(table);
const F = escapeIdentifier(field);
const ID = escapeIdentifier(id);

return `
DELETE FROM ${T}
WHERE ${ID} IN (
SELECT ${ID} FROM ${T} WHERE ${F} = $1 LIMIT $2
)`;
}

export async function bulkDelete(opts: Opts): Ret {
const { table, field, value, id = "id", maxUtilPct = MAX_UTIL_PCT } = opts;
let { limit = DEFAULT_LIMIT } = opts;
// assert table name is a key in SCHEMA
if (!(table in SCHEMA)) {
throw new Error(`table ${table} does not exist`);
}

if (maxUtilPct < 1 || maxUtilPct > 99) {
throw new Error(`maxUtilPct must be between 1 and 99`);
}

const q = deleteQuery(table, field, id);
const pool = getPool();
const start_ts = Date.now();

let rowsDeleted = 0;
let totalWaitS = 0;
let totalPgTimeS = 0;
while (true) {
const t0 = Date.now();
const ret = await pool.query(q, [value, limit]);
const dt = (Date.now() - t0) / 1000;
rowsDeleted += ret.rowCount ?? 0;
totalPgTimeS += dt;

const next =
dt > MAX_TARGET_S ? limit / 2 : dt < MIN_TARGET_S ? limit * 2 : limit;
limit = Math.max(1, Math.min(MAX_LIMIT, Math.round(next)));

// wait for a bit, but not more than 1 second; this aims for a max DB utilization of maxUtilPct (10% by default)
const waitS = Math.min(1, dt * ((100 - maxUtilPct) / maxUtilPct));
await new Promise((done) => setTimeout(done, 1000 * waitS));
totalWaitS += waitS;

D(`deleted ${ret.rowCount} | dt=${dt} | wait=${waitS} | limit=${limit}`);

if (ret.rowCount === 0) break;
}

const durationS = (Date.now() - start_ts) / 1000;
return { durationS, rowsDeleted, totalWaitS, totalPgTimeS };
}
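
To round this off, a hedged sketch of calling bulkDelete from hub code (the import path and project id are assumptions; the test above exercises the same call):

import { bulkDelete } from "@cocalc/database/postgres/bulk-delete";

async function cleanupProjectLog(project_id: string) {
  // deletes in adaptively sized batches, sleeping between queries so the
  // database stays around the configured maxUtilPct (10% by default)
  const res = await bulkDelete({
    table: "project_log",
    field: "project_id",
    value: project_id,
  });
  console.log(`deleted ${res.rowsDeleted} rows in ${res.durationS}s`);
}
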