Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

api: support to query whether pd has loaded region #8749

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

lhy1024
Copy link
Contributor

@lhy1024 lhy1024 commented Oct 28, 2024

What problem does this PR solve?

Issue Number: Close #8748, close #8426

What is changed and how does it work?

Check List

Tests

  • Unit test
  • Integration test

Release note

None.

@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. dco-signoff: yes Indicates the PR's author has signed the dco. labels Oct 28, 2024
Copy link
Contributor

ti-chi-bot bot commented Oct 28, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from lhy1024, ensuring that each of them provides their approval before proceeding. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Oct 28, 2024
Copy link

codecov bot commented Oct 28, 2024

Codecov Report

Attention: Patch coverage is 77.77778% with 6 lines in your changes missing coverage. Please review.

Project coverage is 69.63%. Comparing base (0402e15) to head (df4ac32).
Report is 30 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8749      +/-   ##
==========================================
- Coverage   70.10%   69.63%   -0.48%     
==========================================
  Files         517      516       -1     
  Lines       79983    79706     -277     
==========================================
- Hits        56076    55500     -576     
- Misses      20339    20694     +355     
+ Partials     3568     3512      -56     
Flag Coverage Δ
unittests 69.63% <77.77%> (-0.48%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Comment on lines 81 to 82
regionLoadedFromDefault bool
regionLoadedFromStorage bool
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to distinguish these two kinds?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to avoid problems after ps.useRegionStorage switch

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using string?

regionLoadedFrom: "", "etcd", "leveldb"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, we just have these two ways now

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer just using a single field to store this info like:

type regionSource int

const (
	unloaded regionSource = iota
	etcd
	leveldb
)

Signed-off-by: lhy1024 <[email protected]>
@@ -39,11 +40,13 @@ func newStatusHandler(svr *server.Server, rd *render.Render) *statusHandler {
// @Success 200 {object} versioninfo.Status
// @Router /status [get]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure whether it is suitable. ref #8748 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about /member or /health?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/members return all member info according etcd.ListEtcdMembers
/health is to query the status of each node through the /ping interface to return information about all the nodes, if we don't add a new api, we won't be able to get the result of whether the load region is complete or not.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

status is more like some fixed information to me, at least for the current definition.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put it under /status, which is appropriate from the name, but currently status returns compiled information.

Yes, as I said in the issue.

Version string `json:"version"`
GitHash string `json:"git_hash"`
StartTimestamp int64 `json:"start_timestamp"`
AreRegionsLoaded bool `json:"are_regions_loaded"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prefer to use loaded

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Loaded bool json:"loaded"

?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

Signed-off-by: lhy1024 <[email protected]>
Signed-off-by: lhy1024 <[email protected]>
@@ -31,6 +31,7 @@ type Status struct {
Version string `json:"version"`
GitHash string `json:"git_hash"`
StartTimestamp int64 `json:"start_timestamp"`
Loaded bool `json:"loaded"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer RegionLoaded

Comment on lines 81 to 82
regionLoadedFromDefault bool
regionLoadedFromStorage bool
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using string?

regionLoadedFrom: "", "etcd", "leveldb"

Comment on lines 140 to 154
if atomic.LoadInt32(&ps.useRegionStorage) == 0 {
return ps.Storage.LoadRegions(ctx, f)
err := ps.Storage.LoadRegions(ctx, f)
if err == nil {
ps.regionLoadedFromDefault = true
}
return err
}

ps.mu.Lock()
defer ps.mu.Unlock()
if !ps.regionLoaded {
if !ps.regionLoadedFromStorage {
if err := ps.regionStorage.LoadRegions(ctx, f); err != nil {
return err
}
ps.regionLoaded = true
ps.regionLoadedFromStorage = true
}
return nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems we can use *coreStorage.LoadRegions() directly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't check ps.regionLoadedFromStorage

Signed-off-by: lhy1024 <[email protected]>
Signed-off-by: lhy1024 <[email protected]>
useRegionStorage int32
regionLoaded bool
mu syncutil.Mutex
useRegionStorage int32
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that safe to switch between using or not using RegionStore in the master branch or the latest releases now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

defaultUseRegionStorage is true and only check switch when it was just elected leader.

Signed-off-by: lhy1024 <[email protected]>
useRegionStorage int32
regionLoaded bool
mu syncutil.Mutex
useRegionStorageFlag int32
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use bool?

Signed-off-by: lhy1024 <[email protected]>
@lhy1024
Copy link
Contributor Author

lhy1024 commented Nov 18, 2024

/test pull-integration-realcluster-test

@@ -44,6 +45,7 @@ func (h *statusHandler) GetPDStatus(w http.ResponseWriter, _ *http.Request) {
GitHash: versioninfo.PDGitHash,
Version: versioninfo.PDReleaseVersion,
StartTimestamp: h.svr.StartTimestamp(),
RegionLoaded: storage.AreRegionsLoaded(h.svr.GetStorage()),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have a more general field here like "Ready" which we can extend in the future if we find more conditions under which PD can't serve?

Does PDStatus API exposed by all components of disaggregated PD?

Any chance we introduce a new API /ready which we can use across all TiDB components in the future?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Tema, thanks for your advice
Q1: A general field will be helpful to expand different scenarios, but we found that currently there is only RegionLoaded is needed. If there are other situations in the future, we tend to add new fields with more accurate meanings.
Q2: Standalone PD or Scheduling Service will expose this API(especially the RegionLoaded field), others such as TSO Service don't.
Q3: IMO, it's good we can use the same interface to query service status, but the fields may be different. The benefits of unification may not be obvious, but it is indeed a better way.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally you want to decouple data plane (pd) from control plane (tidb-operator, tiUp), so that if you make changes to one, you don't have to upgrade other and it can start leveraging any improvements right away. That is why usually the common convention is to use /ready across all components. This way the control-plane can have a common logic for rolling restart of all components, without need to necessarily overcustomize each of them. This is not just about PD microservices but many other components of TiDB cluster (tidb, dm, cdc, tikv). Some of them still require additional customization, but it would be nice to keep it to the minimum. Did you talk to tidb-operator team? Don't they want to have a /ready across all components. I've herd there is a work on tidb-operator v2 or something like that. This might be a good opportunity to standartize protocol between control plane and data place as once it is out there, it is hard to change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Tema, for TiDB Operator, it's better to use a general API (like /ready) to check the status in most cases. If this API can't meet some other special cases in the future, then we can consider to add/use some other APIs.

For this RegionLoaded case, in fact, TiDB Operator just wants to know when it can restart the next instance safely.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dco-signoff: yes Indicates the PR's author has signed the dco. release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve availability during pd rolling restart Alleviate the latency impact when reloading PD
7 participants