
Autosharding tests: Sharding & Replication #159

Closed · 6 tasks done

donhardman opened this issue Sep 26, 2023 · 2 comments

donhardman commented Sep 26, 2023

  • Confirm the correct number of shard creations.
  • Validate that the distributed table is created correctly.
  • Check the random distribution of shards across nodes with an average weight per node.
  • Ensure accurate data replication according to the replication factor.
  • Assert sharding and replication within a single node.
  • Validate on different sets of nodes: 2, 3, 5.
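
As a concrete illustration of the first two checks, here is a minimal sketch (not the actual test code) of how shard creation and the distributed table could be verified over the MySQL protocol that Manticore exposes. The pymysql connection details, the port, and the `<table>_s<N>` shard-naming pattern (matching the t_s2/t_s8 names in the Galera log further down) are assumptions for illustration only.

```python
# Hypothetical sketch: confirm the number of created shards and the presence
# of the distributed "entry point" table on a node. The port, credentials and
# the <table>_s<N> shard naming are assumptions, not the real test config.
import pymysql


def fetch_tables(port: int) -> set[str]:
    conn = pymysql.connect(host="127.0.0.1", port=port, user="root", password="")
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW TABLES")
            return {row[0] for row in cur.fetchall()}
    finally:
        conn.close()


def assert_sharded_table(port: int, table: str, shards: int) -> None:
    tables = fetch_tables(port)
    # The distributed table must be visible on the node...
    assert table in tables, f"distributed table {table} is missing"
    # ...and the expected number of shard tables must have been created.
    shard_tables = {t for t in tables if t.startswith(f"{table}_s")}
    assert len(shard_tables) == shards, (
        f"expected {shards} shards, found {len(shard_tables)}: {sorted(shard_tables)}"
    )
```

On a multi-node cluster the same idea applies per node, with the local shard counts summed across nodes.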
donhardman commented:

Here's what I've done during this task:

  • I have increased the waiting timeout for table creation from 15 seconds to 300 seconds. This change was necessary because creating a large number of shards takes considerably longer.
  • I have implemented validations on clusters of 2, 3, and 5 nodes, as well as on a single node. These validations ensure that the data is accurate and reliable.
  • I have validated that the distributed table is created correctly and contains all the required shards on each node. This ensures that the data is properly distributed across the system.
  • I have also validated that when we require 6 shards and 2 replicas, we see exactly 6 shards and 2 replicas. To further ensure accuracy, I have added tests for the 10/1, 6/2, and 60/3 (shards/replicas) scenarios.
  • Lastly, I have checked the count of shards on each node and validated that the weight is balanced, so the data distribution is fair and no node carries a disproportionate share (a sketch of this check follows below).
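
A similar sketch for the balance check in the last point: count the local shards per node and assert the spread stays within a small tolerance. NODE_PORTS, the tolerance, and the shard-naming pattern are illustrative assumptions, not the real test configuration.

```python
# Hypothetical sketch: check that shards (and their replicas) are spread
# evenly across the cluster. NODE_PORTS and the tolerance are illustrative.
import pymysql

NODE_PORTS = [9306, 9307, 9308]  # one SQL port per node in the test cluster


def local_shard_count(port: int, table: str) -> int:
    conn = pymysql.connect(host="127.0.0.1", port=port, user="root", password="")
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW TABLES")
            return sum(1 for (name, *_) in cur.fetchall() if name.startswith(f"{table}_s"))
    finally:
        conn.close()


def assert_balanced(table: str, shards: int, rf: int, tolerance: int = 1) -> None:
    counts = [local_shard_count(port, table) for port in NODE_PORTS]
    # Every shard copy (shards * rf) must be hosted somewhere in the cluster...
    assert sum(counts) == shards * rf, f"expected {shards * rf} shard copies, got {sum(counts)}"
    # ...and no node should carry noticeably more than its fair share.
    assert max(counts) - min(counts) <= tolerance, f"unbalanced distribution: {counts}"
```

For example, assert_balanced('t', 6, 2) would cover the 6 shards / 2 replicas case from the list above.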

During the test on 5 nodes I discovered an issue with Galera; I am preparing a reproducible case for it.

[Tue Oct  3 07:41:18.002 2023] [695] WARNING: unknown table, or wrong type of table 't_s2', removed from cluster 'bbbb0f03c7eee87847f09a9b562933af'
[Tue Oct  3 07:41:18.002 2023] [695] WARNING: unknown table, or wrong type of table 't_s8', removed from cluster 'bbbb0f03c7eee87847f09a9b562933af'
[Tue Oct  3 07:41:23.833 2023] [711] WARNING: no nodes coming from prim view, prim not possible
[Tue Oct  3 07:41:28.833 2023] [711] WARNING: no nodes coming from prim view, prim not possible
[Tue Oct  3 07:41:28.842 2023] [753] WARNING: Quorum: No node with complete state:
[Tue Oct  3 07:41:28.842 2023] [753] WARNING: Member 4.0 (node_127.0.0.1_c_713) requested state transfer from '*any*', but it is impossible to select State Transfer donor: Resource temporarily unavailable
[Tue Oct  3 07:41:48.524 2023] [695] FATAL: failed to open gcomm backend connection: 110: failed to reach primary view (pc.wait_prim_timeout): 110 (Connection timed out)
	 at /__w/manticoresearch/manticoresearch/build/galera-build/galera_populate-prefix/src/galera_populate/gcomm/src/pc.cpp:connect():159
[Tue Oct  3 07:41:48.524 2023] [695] FATAL: /__w/manticoresearch/manticoresearch/build/galera-build/galera_populate-prefix/src/galera_populate/gcs/src/gcs_core.cpp:gcs_core_open():209: Failed to open backend connection: -110 (Connection timed out)
[Tue Oct  3 07:41:48.524 2023] [695] FATAL: /__w/manticoresearch/manticoresearch/build/galera-build/galera_populate-prefix/src/galera_populate/gcs/src/gcs.cpp:gcs_open():1514: Failed to open channel 'bbbb0f03c7eee87847f09a9b562933af' at 'gcomm://127.0.0.1:19328,127.0.0.1:39326': -110 (Connection timed out)
[Tue Oct  3 07:41:48.524 2023] [695] FATAL: gcs connect failed: Connection timed out
[Tue Oct  3 07:41:48.524 2023] [695] FATAL: 'bbbb0f03c7eee87847f09a9b562933af' cluster start error: replication connection failed: 7 'error in node state, must reinit', nodes '127.0.0.1:19328,127.0.0.1:39326'

And the first error shows:

[Tue Oct  3 07:41:12.459 2023] [665] FATAL: Failed to apply trx: source: 38fb4e00-61c0-11ee-be82-1619b2fcf5a4 version: 4 local: 0 state: COMMITTING flags: 65 conn_id: 204 trx_id: -1 seqnos (l: 5, g: 3, s: 2, d: 2, ts: 26783447202105136)
[Tue Oct  3 07:41:12.459 2023] [665] FATAL: Commit failed. Trx: 0x7f8bfc003010 (FATAL)
         at /__w/manticoresearch/manticoresearch/build/galera-build/galera_populate-prefix/src/galera_populate/galera/src/replicator_smm.cpp:apply_trx():516
[Tue Oct  3 07:41:12.459 2023] [665] FATAL: Node consistency compromised, aborting...
[Tue Oct  3 07:41:17.460 2023] [665] WARNING: abort from cluster 'bbbb0f03c7eee87847f09a9b562933af'
[Tue Oct  3 07:41:17.462 2023] [471] watchdog: main process 472 killed dirtily with signal 6, will be restarted
[Tue Oct  3 07:41:17.462 2023] [471] watchdog: main process 694 forked ok

The current tests are pushed to this pull request: manticoresoftware/manticoresearch#1478

donhardman commented:

All tests are pushed to the tests/sharding branch; I have also added tests for cluster replication without using sharding.

Created the following issue: https://github.com/orgs/manticoresoftware/projects/3/views/2?pane=issue&itemId=40675680

All tests are done for this task.
