
Autosharding tests: Sharding & Replication #159

Closed · 6 tasks done

donhardman opened this issue Sep 26, 2023 · 2 comments

donhardman commented Sep 26, 2023

  • Confirm the correct number of shard creations.
  • Validate that the distributed table is created correctly.
  • Check the random distribution of shards across nodes with an average weight per node.
  • Ensure accurate data replication according to the replication factor.
  • Assert sharding and replication within a single node.
  • Validate on different sets of nodes: 2, 3, 5.
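
As a concrete illustration of the first two checks, here is a minimal sketch (not the actual test code) of how shard creation and the distributed table could be verified over the MySQL protocol that Manticore exposes. The pymysql connection details, the port, and the `<table>_s<N>` shard-naming pattern (matching the t_s2/t_s8 names in the Galera log further down) are assumptions for illustration only.

```python
# Hypothetical sketch: confirm the number of created shards and the presence
# of the distributed "entry point" table on a node. The port, credentials and
# the <table>_s<N> shard naming are assumptions, not the real test config.
import pymysql


def fetch_tables(port: int) -> set[str]:
    conn = pymysql.connect(host="127.0.0.1", port=port, user="root", password="")
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW TABLES")
            return {row[0] for row in cur.fetchall()}
    finally:
        conn.close()


def assert_sharded_table(port: int, table: str, shards: int) -> None:
    tables = fetch_tables(port)
    # The distributed table must be visible on the node...
    assert table in tables, f"distributed table {table} is missing"
    # ...and the expected number of shard tables must have been created.
    shard_tables = {t for t in tables if t.startswith(f"{table}_s")}
    assert len(shard_tables) == shards, (
        f"expected {shards} shards, found {len(shard_tables)}: {sorted(shard_tables)}"
    )
```

On a multi-node cluster the same idea applies per node, with the local shard counts summed across nodes.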
donhardman commented:

Here's what I've done during this task:

  • I have increased the waiting timeout for table creation from 15 seconds to 300 seconds. This change was necessary because creating a large number of shards takes considerably longer.
  • I have implemented validations on clusters of 2, 3, and 5 nodes, as well as on a single node. These validations ensure that the data is accurate and reliable.
  • I have validated that the distributed table is created correctly and contains all the required shards on each node. This ensures that the data is properly distributed across the system.
  • I have also validated that when we require 6 shards and 2 replicas, we see exactly 6 shards and 2 replicas. To further ensure accuracy, I have added tests for the 10/1, 6/2, and 60/3 (shards/replicas) scenarios.
  • Lastly, I have checked the count of shards on each node and validated that the weight is balanced, so the data distribution is fair and no node carries a disproportionate share (a sketch of this check follows below).
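
A similar sketch for the balance check in the last point: count the local shards per node and assert the spread stays within a small tolerance. NODE_PORTS, the tolerance, and the shard-naming pattern are illustrative assumptions, not the real test configuration.

```python
# Hypothetical sketch: check that shards (and their replicas) are spread
# evenly across the cluster. NODE_PORTS and the tolerance are illustrative.
import pymysql

NODE_PORTS = [9306, 9307, 9308]  # one SQL port per node in the test cluster


def local_shard_count(port: int, table: str) -> int:
    conn = pymysql.connect(host="127.0.0.1", port=port, user="root", password="")
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW TABLES")
            return sum(1 for (name, *_) in cur.fetchall() if name.startswith(f"{table}_s"))
    finally:
        conn.close()


def assert_balanced(table: str, shards: int, rf: int, tolerance: int = 1) -> None:
    counts = [local_shard_count(port, table) for port in NODE_PORTS]
    # Every shard copy (shards * rf) must be hosted somewhere in the cluster...
    assert sum(counts) == shards * rf, f"expected {shards * rf} shard copies, got {sum(counts)}"
    # ...and no node should carry noticeably more than its fair share.
    assert max(counts) - min(counts) <= tolerance, f"unbalanced distribution: {counts}"
```

For example, assert_balanced('t', 6, 2) would cover the 6 shards / 2 replicas case from the list above.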

During the test on 5 nodes I discovered an issue with Galera; I am preparing a reproducible case for it.

[Tue Oct  3 07:41:18.002 2023] [695] WARNING: unknown table, or wrong type of table 't_s2', removed from cluster 'bbbb0f03c7eee87847f09a9b562933af'
[Tue Oct  3 07:41:18.002 2023] [695] WARNING: unknown table, or wrong type of table 't_s8', removed from cluster 'bbbb0f03c7eee87847f09a9b562933af'
[Tue Oct  3 07:41:23.833 2023] [711] WARNING: no nodes coming from prim view, prim not possible
[Tue Oct  3 07:41:28.833 2023] [711] WARNING: no nodes coming from prim view, prim not possible
[Tue Oct  3 07:41:28.842 2023] [753] WARNING: Quorum: No node with complete state:
[Tue Oct  3 07:41:28.842 2023] [753] WARNING: Member 4.0 (node_127.0.0.1_c_713) requested state transfer from '*any*', but it is impossible to select State Transfer donor: Resource temporarily unavailable
[Tue Oct  3 07:41:48.524 2023] [695] FATAL: failed to open gcomm backend connection: 110: failed to reach primary view (pc.wait_prim_timeout): 110 (Connection timed out)
	 at /__w/manticoresearch/manticoresearch/build/galera-build/galera_populate-prefix/src/galera_populate/gcomm/src/pc.cpp:connect():159
[Tue Oct  3 07:41:48.524 2023] [695] FATAL: /__w/manticoresearch/manticoresearch/build/galera-build/galera_populate-prefix/src/galera_populate/gcs/src/gcs_core.cpp:gcs_core_open():209: Failed to open backend connection: -110 (Connection timed out)
[Tue Oct  3 07:41:48.524 2023] [695] FATAL: /__w/manticoresearch/manticoresearch/build/galera-build/galera_populate-prefix/src/galera_populate/gcs/src/gcs.cpp:gcs_open():1514: Failed to open channel 'bbbb0f03c7eee87847f09a9b562933af' at 'gcomm://127.0.0.1:19328,127.0.0.1:39326': -110 (Connection timed out)
[Tue Oct  3 07:41:48.524 2023] [695] FATAL: gcs connect failed: Connection timed out
[Tue Oct  3 07:41:48.524 2023] [695] FATAL: 'bbbb0f03c7eee87847f09a9b562933af' cluster start error: replication connection failed: 7 'error in node state, must reinit', nodes '127.0.0.1:19328,127.0.0.1:39326'

And the first error shows:

[Tue Oct  3 07:41:12.459 2023] [665] FATAL: Failed to apply trx: source: 38fb4e00-61c0-11ee-be82-1619b2fcf5a4 version: 4 local: 0 state: COMMITTING flags: 65 conn_id: 204 trx_id: -1 seqnos (l: 5, g: 3, s: 2, d: 2, ts: 26783447202105136)
[Tue Oct  3 07:41:12.459 2023] [665] FATAL: Commit failed. Trx: 0x7f8bfc003010 (FATAL)
         at /__w/manticoresearch/manticoresearch/build/galera-build/galera_populate-prefix/src/galera_populate/galera/src/replicator_smm.cpp:apply_trx():516
[Tue Oct  3 07:41:12.459 2023] [665] FATAL: Node consistency compromised, aborting...
[Tue Oct  3 07:41:17.460 2023] [665] WARNING: abort from cluster 'bbbb0f03c7eee87847f09a9b562933af'
[Tue Oct  3 07:41:17.462 2023] [471] watchdog: main process 472 killed dirtily with signal 6, will be restarted
[Tue Oct  3 07:41:17.462 2023] [471] watchdog: main process 694 forked ok

The current tests are pushed to this pull request: manticoresoftware/manticoresearch#1478

donhardman commented:

All tests are pushed to the tests/sharding branch; I have also added tests for cluster replication without using sharding.

Created the following issue: https://github.com/orgs/manticoresoftware/projects/3/views/2?pane=issue&itemId=40675680

All tests are done for this task.
