Performance investigation #7278

Closed
wants to merge 17 commits into from

Conversation


@wouterdb wouterdb commented Feb 28, 2024

For Reviewers

I changed some things; I would like reviewer input on what is acceptable:

  1. more fine-grained triggering of agents on put_partial: only notify agents that are in the increment
  2. put_version and put_partial no longer wait for auto deploy to complete; this may break wait conditions in tests everywhere
  3. the increment cache is pre-sorted per agent (slower on release, which is done once; faster for every agent pull, which is done often)
  4. the increment cache now refuses to move back to older versions (see the sketch below)
  5. micro-optimizations to use the DB more efficiently
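To illustrate items 3 and 4, here is a minimal sketch of a per-agent, monotonic increment cache, assuming a cache keyed per environment; the names (`_increment_cache`, `update_increment_cache`, `get_increment_for_agent`) are illustrative, not the actual inmanta API:

```python
import uuid
from collections.abc import Mapping, Set

# Illustrative cache: environment id -> (version, increment pre-sorted per agent)
_increment_cache: dict[uuid.UUID, tuple[int, Mapping[str, Set[str]]]] = {}


def update_increment_cache(
    env: uuid.UUID, version: int, increment_per_agent: Mapping[str, Set[str]]
) -> None:
    """Store a freshly computed increment, grouped per agent, but never move back to an older version."""
    cached = _increment_cache.get(env)
    if cached is not None and cached[0] >= version:
        # Monotonic invalidation: an older (or equal) version never replaces a newer one.
        return
    _increment_cache[env] = (version, increment_per_agent)


def get_increment_for_agent(env: uuid.UUID, agent: str) -> Set[str]:
    """Each agent pull becomes a dict lookup instead of a scan over all resources."""
    _version, per_agent = _increment_cache[env]
    return per_agent.get(agent, set())
```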

Do I need more tests anywhere?

close #7262

Self Check:

Strike through any lines that are not applicable (~~line~~) then check the box

  • Attached issue to pull request
  • Changelog entry
  • Type annotations are present
  • Code is clear and sufficiently documented
  • No (preventable) type errors (check using make mypy or make mypy-diff)

src/inmanta/config.py:219: error: Argument 1 to "CronTab" has incompatible type "str | int"; expected "str"  [arg-type]
src/inmanta/config.py:311: error: Incompatible default for argument "validator" (default has type "Callable[[str], str]", argument has type "Callable[[str | T], T]")  [assignment]
src/inmanta/data/__init__.py:4982: error: Argument 1 to "loads" has incompatible type "object"; expected "str | bytes | bytearray"  [arg-type]
src/inmanta/data/__init__.py:5441: error: Signature of "get_list" incompatible with supertype "BaseDocument"  [override]
src/inmanta/data/__init__.py:5441: note:      Superclass:
src/inmanta/data/__init__.py:5441: note:          @classmethod
src/inmanta/data/__init__.py:5441: note:          def get_list(cls, *, order_by_column: str | None = ..., order: str | None = ..., limit: int | None = ..., offset: int | None = ..., no_obj: bool | None = ..., lock: RowLockMode | None = ..., connection: Connection | None = ..., **query: object) -> Coroutine[Any, Any, list[ConfigurationModel]]
src/inmanta/data/__init__.py:5441: note:      Subclass:
src/inmanta/data/__init__.py:5441: note:          @classmethod
src/inmanta/data/__init__.py:5441: note:          def get_list(cls, *, order_by_column: str | None = ..., order: str | None = ..., limit: int | None = ..., offset: int | None = ..., no_obj: bool | None = ..., lock: RowLockMode | None = ..., connection: Connection | None = ..., no_status: bool = ..., **query: object) -> Coroutine[Any, Any, list[ConfigurationModel]]
src/inmanta/data/__init__.py:5509: error: Argument 2 to "_get_status_field" of "ConfigurationModel" has incompatible type "object"; expected "str"  [arg-type]
src/inmanta/data/__init__.py:5513: error: Argument 1 to "append" of "list" has incompatible type "ConfigurationModel"; expected "dict[str, object]"  [arg-type]
src/inmanta/data/__init__.py:5514: error: Incompatible return value type (got "list[dict[str, object]]", expected "list[ConfigurationModel]")  [return-value]
src/inmanta/data/__init__.py:5796: error: "object" has no attribute "__iter__"; maybe "__dir__" or "__str__"? (not iterable)  [attr-defined]
src/inmanta/data/__init__.py:5886: error: "object" has no attribute "__iter__"; maybe "__dir__" or "__str__"? (not iterable)  [attr-defined]
src/inmanta/data/__init__.py:5962: error: Incompatible types in assignment (expression has type "object", variable has type "int")  [assignment]
src/inmanta/server/services/orchestrationservice.py:849: error: Argument 1 to "add_background_task" of "TaskHandler" has incompatible type "Coroutine[Any, Any, tuple[int, dict[str, Any] | None]]"; expected "Coroutine[object, None, Result | None]"  [arg-type]
src/inmanta/server/services/compilerservice.py:795: error: Incompatible types in assignment (expression has type "bool | int | float | str | dict[str, str | int | bool]", variable has type "int")  [assignment]
Generated HTML report (via XSLT): /home/wouter/projects/inmanta/mypy/index.html
  • Sufficient test cases (reproduces the bug/tests the requested feature)
  • Correct, in line with design
  • End user documentation is included or an issue is created for end-user documentation (add ref to issue here: )
  • If this PR fixes a race condition in the test suite, also push the fix to the relevant stable branch(es) (see test-fixes for more info)

Preliminary results

(on 15k resources)

  1. we cause a storm of agent pulls (each put_partial makes every agent pull, and we recompile fast)
    • make it smarter
  2. increment calculation is on both the agent pull path and the version release path
    • very performance sensitive
    • it pulled in all attributes, so a large config meant a slow increment
    • it interferes with itself somehow or mucks up the cache (to be investigated)
      • make cache invalidation monotonic (only ever allow newer versions)
    • auto deploy triggering is done in-line: the compile has to wait for it, we could change that
  3. pyinstrument works somewhat as an async profiler, but not to the point where the numbers add up
  4. the slow query log, lock timing log and all-query log on PostgreSQL help somewhat
  5. we still have some slow queries, basically everything related to the whole version or the part that remains the same
  • UPDATE resource SET status='deployed' WHERE environment=$1 AND model=$2 AND resource_id =ANY($3)
INSERT INTO resource(
                   environment,
                   model,
                   resource_id,
                   resource_type,
                   resource_id_value,
                   agent,
                   status,
                   attributes,
                   attribute_hash,
                   resource_set,
                   provides
               )(
                   SELECT
                       r.environment,
                       $3,
                       r.resource_id,
                       r.resource_type,
                       r.resource_id_value,
                       r.agent,
                       (
                           CASE WHEN r.status='undefined'::resourcestate
                           THEN 'undefined'::resourcestate
                           ELSE 'available'::resourcestate
                           END
                       ) AS status,
                       r.attributes AS attributes,
                       r.attribute_hash,
                       r.resource_set,
                       r.provides
                   FROM resource AS r
                   WHERE r.environment=$1 AND r.model=$2 AND r.resource_set IS NOT NULL AND NOT r.resource_set=ANY($4)
               )
               RETURNING resource_id, resource_set
  SUM(CASE WHEN r.status NOT IN($1,$2) THEN 1 ELSE 0 END) AS done,
	                           to_json(array(SELECT jsonb_build_object('status', r2.status, 'id', r2.resource_id)
	                                         FROM resource AS r2
	                                         WHERE c.environment=r2.environment AND c.version=r2.model
	                                        )
	                           ) AS status
	                    FROM configurationmodel AS c LEFT OUTER JOIN resource AS r
	                    ON c.environment = r.environment AND c.version = r.model
	                    WHERE c.environment=$3 
	                    GROUP BY c.environment, c.version

INSERT INTO public.resourceaction_resource (resource_id, resource_version, environment, resource_action_id) SELECT unnest($1::text[]), unnest($2::int[]), $3, $4

Current time taken over parts of put_partial

2024-02-28 17:41:57,308 performance              WARNING STARTING PUT PARTIAL
2024-02-28 17:41:57,312 performance              WARNING INPUT VALIDATION: 0.0035941700043622404
2024-02-28 17:41:57,441 performance              WARNING LOAD STAGE: 0.1291558850207366
2024-02-28 17:41:57,802 performance              WARNING MERGE STAGE: 0.3613146049901843
2024-02-28 17:41:59,651 performance              WARNING PUT STAGE: 1.849367157992674
2024-02-28 17:42:01,870 performance              WARNING AUTO DEPLOY STAGE: 2.218535807012813

@@ -4796,27 +4826,38 @@ async def get_resources_for_version_raw_with_persistent_state(
version: int,
projection: Optional[list[str]],
projection_presistent: Optional[list[str]],
project_attributes: Optional[list[str]] = None,
wouterdb (author):

I added this so we don't have to pull in all attributes for the resources when not needed; this can be a LOT of data.
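A minimal sketch of the idea, assuming the projection lists are trusted column names that get baked directly into the SELECT clause; the helper name and table layout are illustrative, not the exact inmanta query builder:

```python
from typing import Optional, Sequence


def build_resource_query(
    projection: Optional[Sequence[str]],
    project_attributes: Optional[Sequence[str]] = None,
) -> str:
    """Select only the requested columns; pull individual keys out of the attributes JSON
    instead of fetching the whole (potentially huge) attributes document."""
    columns = list(projection) if projection else ["*"]
    if project_attributes:
        columns += [f"attributes->'{attr}' AS {attr}" for attr in project_attributes]
    return f"SELECT {', '.join(columns)} FROM resource WHERE environment=$1 AND model=$2"
```

Because these names end up in the query text itself, they must never come from user input, which is what the LiteralString suggestion further down in this conversation is about.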

@@ -838,7 +847,7 @@ async def _trigger_auto_deploy(
agent_trigger_method_on_autodeploy = cast(str, await env.get(data.AGENT_TRIGGER_METHOD_ON_AUTO_DEPLOY))
agent_trigger_method_on_autodeploy = const.AgentTriggerMethod[agent_trigger_method_on_autodeploy]
await self.release_version(
wouterdb (author), Feb 28, 2024:

Only push agents that are/might be affected

def on_agent(res: ResourceIdStr) -> bool:
idr = Id.parse_id(res)
return idr.get_agent_name() == agent
return ON_AGENT_REGEX.match(res) is not None
wouterdb (author):

Optimization: this is called on every resource for every agent (about O(n**2)).
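A hedged sketch of that optimization: instead of parsing every resource id and comparing agent names, precompile one regex per agent and match the id string directly. The regex shape below is an assumption based on the usual `type[agent,key=value]` resource id format; `make_agent_filter` is an illustrative helper, not the actual code:

```python
import re


def make_agent_filter(agent: str):
    # In ids like "std::File[vm1,path=/etc/motd],v=3" the agent name sits between
    # "[" and the first ",". Match on that directly instead of parsing the full id.
    on_agent_regex = re.compile(r"^[^\[]+\[" + re.escape(agent) + r",")

    def on_agent(res: str) -> bool:
        return on_agent_regex.match(res) is not None

    return on_agent


# Usage: filter one agent's resources out of the increment without per-resource parsing.
is_mine = make_agent_filter("vm1")
increment = ["std::File[vm1,path=/etc/motd],v=3", "std::Service[vm2,name=sshd],v=3"]
mine = [r for r in increment if is_mine(r)]  # -> only the vm1 resource
```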

only_update_from_states=only_update_from_states,
connection=connection,
)
resources_id = [res_id for res_id in resources_id if filter(res_id)]
wouterdb (author):

I completely inlined Resource Action Update and optimized a LOT

async with inner_connection.transaction():
# validate resources
if only_update_from_states is not None:
resources = await data.Resource.get_resource_ids_with_status(
wouterdb (author):

Avoid having to fetch full resources; only the resource ids and their status are needed here (see the sketch below).
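A sketch of the kind of query this refers to, assuming only the ids of resources currently in certain states are needed rather than full rows; this is an illustration, not the actual `get_resource_ids_with_status` implementation:

```python
import uuid
from collections.abc import Sequence

import asyncpg


async def get_resource_ids_with_status(
    connection: asyncpg.Connection,
    environment: uuid.UUID,
    version: int,
    states: Sequence[str],
) -> list[str]:
    # Fetch only the id column instead of hydrating full Resource objects with attributes.
    rows = await connection.fetch(
        "SELECT resource_id FROM resource "
        "WHERE environment=$1 AND model=$2 AND status::text=ANY($3)",
        environment,
        version,
        list(states),
    )
    return [r["resource_id"] for r in rows]
```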

resource_action = data.ResourceAction(
environment=env.id,
version=version,
resource_version_ids=resources_version_ids,
wouterdb (author):

this remains very heavy on the DB, as this blows up into a join table

"Setting deployed due to known good status",
)

await data.Resource.set_deployed_multi(env.id, resources_id, version, connection=inner_connection)
wouterdb (author):

One query instead of a loop
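A minimal sketch of what a batched `set_deployed_multi` boils down to, assuming asyncpg-style placeholders; one `UPDATE ... ANY($3)` (the same statement listed under the slow queries above) replaces a per-resource loop. The body shown here is an illustration, not the actual inmanta implementation:

```python
import uuid
from collections.abc import Sequence

import asyncpg


async def set_deployed_multi(
    connection: asyncpg.Connection,
    environment: uuid.UUID,
    resource_ids: Sequence[str],
    version: int,
) -> None:
    # One round trip for the whole batch instead of one UPDATE per resource.
    await connection.execute(
        "UPDATE resource SET status='deployed' "
        "WHERE environment=$1 AND model=$2 AND resource_id=ANY($3)",
        environment,
        version,
        list(resource_ids),
    )
```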

)

await data.Resource.set_deployed_multi(env.id, resources_id, version, connection=inner_connection)
# Resource persistent state should not be affected
wouterdb (author), Feb 28, 2024:

  1. I don't update resource persistent state; this is slightly off spec, I may need to update the definition there
  2. I don't send out event notifications to agents (because these resources were already deployed)

Contributor (reviewer):

  1. Why don't we need to update it? Because we assume they're already in that state?

wouterdb (author):

Correct, this is a copy

def post_deploy_update() -> None:
# Make sure tasks are scheduled AFTER the tx is done.
# This method is only called if the transaction commits successfully.
self.add_background_task(data.ConfigurationModel.mark_done_if_done(env.id, version))
wouterdb (author):

Even this may not be required?

@@ -69,6 +72,8 @@
from inmanta.types import Apireturn, JsonType, PrimitiveTypes, ReturnTupple

LOGGER = logging.getLogger(__name__)
PLOGGER = logging.getLogger("performance")
wouterdb (author):

todo: remove

Contributor (reviewer):

todo alert

@wouterdb wouterdb marked this pull request as ready for review February 29, 2024 09:59
Resolved review threads: tests/utils.py, tests/test_server.py, changelogs/unreleased/7262-performance.yml, src/inmanta/data/__init__.py
@@ -4796,27 +4830,38 @@ async def get_resources_for_version_raw_with_persistent_state(
version: int,
projection: Optional[list[str]],
projection_presistent: Optional[list[str]],
project_attributes: Optional[list[str]] = None,
Contributor (reviewer):

Suggested change:
- project_attributes: Optional[list[str]] = None,
+ project_attributes: Optional[Sequence[LiteralString]] = None,

See typing.LiteralString docs. I think we should use this ideally because we bake it into the query itself, which is not safe with any user input. Ideally we'd do the same for the other lists but I'll leave that decision up to you.
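For context, a rough illustration of what `LiteralString` buys here (Python 3.11+, or `typing_extensions` on older versions); a type checker that supports it flags column names derived from user input:

```python
from typing import LiteralString, Optional, Sequence


def build_projection(projection: Optional[Sequence[LiteralString]]) -> str:
    # Only string literals (or values built from literals) may end up in the SQL text.
    return ", ".join(projection) if projection else "*"


build_projection(["resource_id", "status"])  # OK: literals
user_input = input()
build_projection([user_input])  # flagged by a LiteralString-aware type checker
```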

wouterdb (author), Mar 1, 2024:

Mypy doesn't support it: python/mypy#12554

Contributor (reviewer):

Oh, shame

Resolved review threads: src/inmanta/data/__init__.py, src/inmanta/server/config.py, src/inmanta/server/services/resourceservice.py
(version_cache_entry, incr, neg_incr) = cache_entry
if version_cache_entry != version:
(version_cache_entry, incr, neg_incr, neg_incr_per_agent, run_ahead_lock) = cache_entry
if version_cache_entry > version:
Contributor (reviewer):

Why don't we need to wait if version_cache_entry == version?

wouterdb (author), Mar 1, 2024:

Right, that should be impossible, but still.

increment: Optional[tuple[abc.Set[ResourceIdStr], abc.Set[ResourceIdStr]]] = _get_cache_entry()
increment: Optional[
tuple[int, abc.Set[ResourceIdStr], abc.Set[ResourceIdStr], abc.Mapping[str, abc.Set[ResourceIdStr]]]
] = await _get_cache_entry()
if increment is None or (connection is not None and connection.is_in_transaction()):
lock = self._increment_cache_locks[env.id]
async with lock:
Contributor (reviewer):

Can't this lock create a deadlock with the run_ahead_lock? In the release flow the run_ahead_lock is acquired first, but if a fast agent calls get_increment before the release flow does, it acquires this lock first. Then it starts waiting for the run_ahead_lock, which will in turn start waiting for this lock.

wouterdb (author):

The release flow never acquires the run_ahead_lock; it creates it.

  • the release flow is always the head, so it passes the lock in for others, but doesn't ever block on it
  • the run_ahead_lock is placed just prior to leaving the critical section here, so it will be resolved

Contributor (reviewer):

Doesn't it acquire it in order to populate the associated cache? i.e. here. So it creates the lock, and then eventually populates the cache by calling into the same flow that acquires both this lock and the other one.

Contributor (reviewer):

Ah no, I see. The context manager does not place the lock, it's the get_increment call itself that places it. And indeed it is only placed after populating the cache (even if it is created before).
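To summarize the mechanism this thread settles on, here is a hedged sketch (asyncio-based, with illustrative names; only `run_ahead_lock` comes from the conversation): the release flow creates the lock before releasing a new version but never waits on it, agents that ask for a newer version than the cached one wait on it, and it is only resolved after the new increment has been computed and cached.

```python
import asyncio


class IncrementCache:
    def __init__(self) -> None:
        self.version: int = 0
        self.increment: set[str] = set()
        self.run_ahead_lock: asyncio.Event | None = None  # set once the pending version is cached

    async def get_increment(self, version: int) -> set[str]:
        # A request for a newer version than the cached one waits for the release flow
        # instead of recomputing (recomputation path omitted); the cache itself never
        # moves back to an older version.
        if version > self.version and self.run_ahead_lock is not None:
            await self.run_ahead_lock.wait()
        return self.increment

    def start_release(self) -> asyncio.Event:
        # Called by the release flow: it creates the lock but never waits on it itself.
        self.run_ahead_lock = asyncio.Event()
        return self.run_ahead_lock

    def publish(self, version: int, increment: set[str]) -> None:
        # Populate the cache, then resolve the lock just before leaving the critical section.
        self.version, self.increment = version, increment
        if self.run_ahead_lock is not None:
            self.run_ahead_lock.set()
            self.run_ahead_lock = None
```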

@wouterdb wouterdb requested a review from sanderr March 1, 2024 13:23
@wouterdb wouterdb added the merge-tool-ready label Mar 4, 2024
@inmantaci

Processing this pull request

inmantaci pushed a commit that referenced this pull request Mar 4, 2024
@inmantaci

Merged into branches master in 3c05ea2

@inmantaci inmantaci closed this Mar 4, 2024
inmantaci pushed a commit that referenced this pull request Mar 4, 2024
@inmantaci inmantaci deleted the issue/optimize_put_partial branch March 4, 2024 07:45
@inmantaci

Processing #7297.

inmantaci pushed a commit that referenced this pull request Mar 4, 2024
Pull request opened by the merge tool on behalf of #7278
@sanderr sanderr mentioned this pull request May 10, 2024
inmantaci pushed a commit that referenced this pull request May 13, 2024
…od (Issue #7612, PR #7612)

# Description

Fixes a bug in the autostarted agent manager that occasionally causes agents' sessions to time out, dropping calls and resulting in a "stuck" orchestrator until a redeploy is triggered.

~~The bug has been there forever, but the conditions for it to trigger happen to be set up by #7278.~~

~~The core of the bug is the following. When we need a certain agent to be up, we call `AutostartedAgentManager._ensure_agents`. This makes sure an agent process is running for these agents and waits for them to be up. However, instead of starting a process to track the autostarted agent map and to trust that it would keep up with changes to it, we instead start a process with an explicit list of agent endpoints.~~ If a new call comes in to `_ensure_agents`, and the agent is not yet up, we would kill the current process and start a new one. In killing the process, we would not expire its sessions, letting them time out on the 30s heartbeat timeout, losing any calls made to it in that window.

EDIT: Scratched some of the above: I wrote a test case for the first bullet point below, which I thought to be the root cause of the reason for killing the old agent process. However, turns out the test also succeeds on master. The agent's `on_reconnect` actually updates the process' agent map. So, I think my root cause analysis may have been wrong (at least less likely), but the second and third bullet points should fix the issue anyway (doesn't matter exactly *how* we got in the situation where only part of the agent endpoints were up, as long as we handle it correctly), and even if part of the issue persists, the logging improvements should help future investigation. And the first bullet point is still a good consistency fix imo.

This PR makes the following changes:
- An agent process for autostarted agents (`use_autostart_agent_map` in the config file) now ignores any agent endpoints in the config file. Instead it runs purely in autostarted agent mode, starting instances for each of the endpoints in the agent map. It then tracks any changes made to the agent map. This last part was already in place, and resulted in a small inconsistency where the process would respect agent names in the config at start, and then suddenly move to consider the agent map the authority as soon as a change comes in. This inconsistency is now resolved by considering the agent map the authority for the entire lifetime of the agent process.
- The autostarted agent manager now trusts in its processes to track the agent map. If a process is already running for an environment, but the agent endpoints are not yet up, it waits for them rather than to kill the process and start fresh.
- When an explicit restart is requested, the autostarted agent manager now expires all sessions for the agents in the agent map, i.e. the endpoints for the killed process.
- Improved robustness of wait condition for an agent to be up, specifically make sure we don't consider expired sessions to be up.
- Made some log messages a bit more informative, no major changes.

# Self Check:

Strike through any lines that are not applicable (`~~line~~`) then check the box

- [x] ~Attached issue to pull request~
- [x] Changelog entry
- [x] Type annotations are present
- [x] Code is clear and sufficiently documented
- [x] No (preventable) type errors (check using make mypy or make mypy-diff)
- [x] Sufficient test cases (reproduces the bug/tests the requested feature)
- [x] Correct, in line with design
- [x] ~End user documentation is included or an issue is created for end-user documentation (add ref to issue here: )~
- [x] ~If this PR fixes a race condition in the test suite, also push the fix to the relevant stable branch(es) (see [test-fixes](https://internal.inmanta.com/development/core/tasks/build-master.html#test-fixes) for more info)~
inmantaci pushed a commit that referenced this pull request May 13, 2024
sanderr added a commit that referenced this pull request May 15, 2024
arnaudsjs pushed a commit that referenced this pull request May 23, 2024
arnaudsjs pushed a commit that referenced this pull request May 23, 2024
Labels: merge-tool-ready (This ticket is ready to be merged in)
Projects: none yet
Linked issue: Investigate performance of very large deployment
4 participants