Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add fqdnCacheMinTTL configuration parameter #6808

Merged
merged 18 commits into from
Dec 11, 2024

Conversation

hkiiita
Copy link
Member

@hkiiita hkiiita commented Nov 13, 2024

This PR introduces a fqdnCacheMinTTL setting which would help address the problem of applications caching DNS response IPs indefinitely. Cluster administrators should be able to configure this value, ideally setting it to be equal to or greater than the maximum TTL value of the application's DNS cache.

This feature is a work towards resolving the issue of indefinite caching of DNS response IPs by certain applications.

pkg/config/agent/config.go Outdated Show resolved Hide resolved
build/yamls/antrea.yml Show resolved Hide resolved
Comment on lines 608 to 609
// If minTTL is greater than 1, it indicates that the value has been set externally by a cluster admin, and should be respected.
if o.config.MinTTL > 1 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why 1 and not 0?

Comment on lines 309 to 311
# The minTTL setting helps address the problem of applications caching DNS response IPs indefinitely.
# The Cluster administrators should configure this value, ideally setting it to be equal to or greater than the maximum TTL
# value of the application's DNS cache.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would remove "indefinitely". If the application does indeed cache the DNS entry forever, there is not much we can do.
We should also mention that this is for FQDN policy enforcement.
So maybe something like this:

The minTTL setting helps address the problem of applications caching DNS response IPs beyond the TTL value for the DNS record.
It is used to enforce FQDN policy rules, ensuring that resolved IPs are included in datapath rules for as long as the application is caching them.
This value should ideally be set to the maximum caching duration across all applications.

@@ -90,6 +90,7 @@ type Options struct {
nplEndPort int
dnsServerOverride string
nodeType config.NodeType
minTTL uint32
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this field doesn't seem necessary?

Comment on lines 641 to 647
getMaxTTL := func(ttl1, ttl2 uint32) uint32 {
if ttl1 > ttl2 {
return ttl1
} else {
return ttl2
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is a max built in in Golang: https://pkg.go.dev/builtin#max
It was introduced in Go 1.21

currentTime := f.clock.Now()
for _, ans := range msg.Answer {
switch r := ans.(type) {
case *dns.A:
if f.ipv4Enabled {
responseIPs[r.A.String()] = ipWithExpiration{
ip: r.A,
expirationTime: currentTime.Add(time.Duration(r.Header().Ttl) * time.Second),
expirationTime: currentTime.Add(time.Duration(getMaxTTL(f.minTTL, r.Header().Ttl)) * time.Second),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like technically, minTTL should only apply to DNS responses which were not initiated by Antrea, but intercepted by Antrea (responses to DNS queries generated by the application). However, this code applies to responses to DNS queries sent by Antrea (when an override DNS server is configured). Maybe we need to introduce a flag to distinguish between the 2 cases?

cc @tnqn @Dyanngg

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We actually spent the length of last syncup meeting to discuss this, because I had the same suggestion but @tnqn’s opinion was that we don’t have to differentiate between the two cases. For security purpose we advise users to use FQDN rules only in allowlists for Antrea-native policies, and he thinks it’s okay that the clients’ TTL for a FQDN goes “out of sync” with the antrea agent since that’s not the gaurentee we want: we only want to enforce that client cannot access unintended addresses. So having an address for a domain which has longer TTL in antrea cache compared to the client is ok. I’ll let Quan chime in to see if I’m summarizing this correctly, but the end result of the discussion was we told Hemant to not worry about differentiating Antrea and client initiated dns queries

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For security purpose we advise users to use FQDN rules only in allowlists for Antrea-native policies

Got it, makes sense to me

@hkiiita please ignore this comment

- Remove the redundant minTTL field from Options struct.
- Use in built max function for comparison of max TTL values.
- Improve descriptive comment about minTTL in config file.

Signed-off-by: Hemant <[email protected]>
# The minTTL setting helps address the problem of applications caching DNS response IPs beyond the TTL value for the DNS record.
# It is used to enforce FQDN policy rules, ensuring that resolved IPs are included in datapath rules for as long as the application is caching them.
# This value should ideally be set to the maximum caching duration across all applications.
minTTL: {{ .Values.minTTL }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the name is a bit generic, maybe we should go with fqdnCacheMinTTL?

Comment on lines 608 to 609
// Ensure that the minTTL is not negative.
o.config.MinTTL = max(o.config.MinTTL, 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be better to return an error if o.config.MinTTL < 0 IMO

while this code is in validateK8sNodeOptions, I am not sure this configuration parameter is specific to the "K8sNode" case. Maybe the check should be in the parent function? cc @tnqn

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it can be used in ExternalNode too but guess no one validates it. I'm fine with keeping it in the parent function for now. We might create a validate function for networkpolicy related configurations later.

@@ -160,7 +161,7 @@ type fqdnController struct {
clock clock.Clock
}

func newFQDNController(client openflow.Client, allocator *idAllocator, dnsServerOverride string, dirtyRuleHandler func(string), v4Enabled, v6Enabled bool, gwPort uint32, clock clock.WithTicker) (*fqdnController, error) {
func newFQDNController(client openflow.Client, allocator *idAllocator, dnsServerOverride string, dirtyRuleHandler func(string), v4Enabled, v6Enabled bool, gwPort uint32, clock clock.WithTicker, minTTL int) (*fqdnController, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if minTTL should be uint32 here. This way we know it is a positive number. The conversion from int to uint32 can happen in cmd/antrea-agent.

# It is used to enforce FQDN policy rules, ensuring that resolved IPs are included in datapath rules for as long as the application is caching them.
# This value should ideally be set to the maximum caching duration across all applications.
minTTL: {{ .Values.minTTL }}
fqdnCacheMinTTL: {{ .Values.minTTL }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
fqdnCacheMinTTL: {{ .Values.minTTL }}
fqdnCacheMinTTL: {{ .Values.fqdnCacheMinTTL }}

This is why the manifests are not generated correctly (fqdnCacheMinTTL: instead of fqdnCacheMinTTL: 0)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry , my bad. Will correct that.

Comment on lines 163 to 171
switch o.config.NodeType {
case config.ExternalNode.String():
o.nodeType = config.ExternalNode
return o.validateExternalNodeOptions()
} else if o.config.NodeType == config.K8sNode.String() {
case config.K8sNode.String():
o.nodeType = config.K8sNode
return o.validateK8sNodeOptions()
} else {
default:
return fmt.Errorf("unsupported nodeType %s", o.config.NodeType)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not a bad change, but I would avoid doing it in this PR as it is unrelated

if o.config.NodeType == config.ExternalNode.String() {
// validate FqdnCacheMinTTL
if o.config.FqdnCacheMinTTL < 0 {
return fmt.Errorf("fqdnCacheMinTTL set to an invalid value, its must be a positive integer")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return fmt.Errorf("fqdnCacheMinTTL set to an invalid value, its must be a positive integer")
return fmt.Errorf("fqdnCacheMinTTL must be greater than or equal to 0")

@@ -158,7 +158,7 @@ type AgentConfig struct {
// The minTTL setting helps address the problem of applications caching DNS response IPs indefinitely.
// The Cluster administrators should configure this value, ideally setting it to be equal to or greater than the maximum TTL
// value of the application's DNS cache.
MinTTL int `yaml:"minTTL,omitempty"`
FqdnCacheMinTTL int `yaml:"fqdnCacheMinTTL,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the field name here should be FQDNCacheMinTTL per our conventions

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment not addressed

Copy link
Contributor

@Dyanngg Dyanngg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should also add unit test cases to verify that with fqdnController with minTTL set will set correct dnsEntry expiration times in cache

@@ -306,6 +306,11 @@ kubeAPIServerOverride: {{ .Values.kubeAPIServerOverride | quote }}
# 10.96.0.10:53, [fd00:10:96::a]:53).
dnsServerOverride: {{ .Values.dnsServerOverride | quote }}

# The fqdnCacheMinTTL setting helps address the problem of applications caching DNS response IPs beyond the TTL value for the DNS record.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/The fqdnCacheMinTTL setting helps address/fqdnCacheMinTTL helps address

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hkiiita not addressed correctly, the current sentence is not grammatically correct

@@ -744,3 +747,140 @@ func TestOnDNSResponse(t *testing.T) {
})
}
}
func TestFQDNCacheMinTTL(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like in theory we only need to test parseDNSResponse, because this is the only place where minTTL is actually used. We just need to make sure that minTTL can override the TTL included in the DNS response. cc @Dyanngg

However, I don't feel super strongly about it, so if others think it is better to test onDNSResponseMsg "end-to-end", it is fine by me.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am awaiting a comment on this one to implement the nit change asked above in previous comment , moreover, i had also pushed the commit below refactor to test just the parseDNSResponse considering the feedback.

name: "Response TTL less than FQDNCacheTTL",
expectedTTL: currentTime.Add(10 * time.Second),
fqdnCacheMinTTL: 10,
dnsMsg: &dns.Msg{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a lot of repetition: we use the same DNS message every time as far as I can tell (with the only change being the ttl value).
One option is to use a closure such as this one:

getDNSMsg := func(ttl in) *dns.Msg {
        return &dns.Msg{ ... }
}

fakeClock := newFakeClock(currentTime)
controller := gomock.NewController(t)
f, _ := newMockFQDNController(t, controller, nil, fakeClock, tc.fqdnCacheMinTTL)
require.Zero(t, fakeClock.TimersAdded())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this useful in the context of this test? Please add a short comment if you think it is necessary.

f, _ := newMockFQDNController(t, controller, nil, fakeClock, tc.fqdnCacheMinTTL)
require.Zero(t, fakeClock.TimersAdded())
_, responseIPs, err := f.parseDNSResponse(tc.dnsMsg)
assert.NoError(t, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this one should be require.NoError

Copy link
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a couple of nits, otherwise LGTM

Edit: the e2e test is not correct, it doesn't actually set "minTTL"

please resolve conflicts as well

@@ -186,6 +186,10 @@ kubeAPIServerOverride: ""
# -- Address of DNS server, to override the kube-dns Service. It's used to
# resolve hostnames in a FQDN policy.
dnsServerOverride: ""
# -- The minTTL setting helps address the problem of applications caching DNS response IPs indefinitely.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment out of date

ttl uint32
}{
{
name: "Response TTL less than FQDNCacheTTL",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename all occurrences of FQDNCacheTTL to fqdnCacheMinTTL

}{
{
name: "Response TTL less than FQDNCacheTTL",
expectedTTL: currentTime.Add(10 * time.Second),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to improve readability of the test case definitions, it would be nice if expectedTTL followed the same "format" as fqdnCacheMinTTL and ttl

I would suggest using a plain integer for expectedTTL (like for the other 2 fields), and making the necessary conversion in the test implementation (t.Run)

name string
expectedTTL time.Time
fqdnCacheMinTTL uint32
ttl uint32
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: rename ttl to responseTTL

- Added mocking of fqdnCacheMinTTl set and unset in the e2e test.
- Refactored fqdn test by enclosing the DNS message construction in a closure and
- setting the TTL value in DNS message in t.Run()
- updated outdated comment in values.yaml for the fqdnCacheMinTTL.

Signed-off-by: Hemant <[email protected]>
@hkiiita hkiiita marked this pull request as ready for review November 27, 2024 15:47
@@ -5383,8 +5388,8 @@ spec:
# Using "kubectl logs/exec/attach/cp" doesn't have to specify "-c antrea-agent" when troubleshooting.
kubectl.kubernetes.io/default-container: antrea-agent
# Automatically restart Pods with a RollingUpdate if the ConfigMap changes
# See https://helm.sh/docs/howto/charts_tips_and_tricks/#automatically-roll-deployments
checksum/config: ee43fc8c8c5ac8097a757da0545e43b25f98d8f2d831842f7ee76e1ed7581267
# See https://helm.sh/docs/howto/charts_tips_and_tricks/#automatically-roll-deploymentstl
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is an intentional change?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not addressed

dnsMsg := getDNSMsg(tc.responseTTL)
_, responseIPs, err := f.parseDNSResponse(dnsMsg)
require.NoError(t, err)
expectedTTL := currentTime.Add(tc.expectedTTL * time.Second)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: s/expectedTTL/expectedExpirationTime for this variable name

Comment on lines 5282 to 5283
testWithFQDNCacheMinTTL(t, 0)
testWithFQDNCacheMinTTL(t, 10)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make them independent subtests with their own name. For example:

t.Run("FQDNCacheMinTTL unset", func(t *testing.T) { testWithFQDNCacheMinTTL(t, 0) })
t.Run("FQDNCacheMinTTL set to 10s", func(t *testing.T) { testWithFQDNCacheMinTTL(t, 10) })

Comment on lines 5298 to 5302
if fqdnCacheMinTTL == 0 {
t.Logf("Running the test with FQDNCacheMinTTL unset")
} else {
t.Logf("Running the test with FQDNCacheMinTTL set to %d ", fqdnCacheMinTTL)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of this, each case should be a subtest, with its own name (see comment above)
After making that change, you can remove these logs

@@ -5342,6 +5359,7 @@ func TestFQDNCacheMinTTL(t *testing.T) {
}
return stdout, nil
}
setFQDNCacheMinTTLInAntrea(t, data, fqdnCacheMinTTL)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's better if you only mutate the Antrea config once. So you need to find a way to "merge" this with setDNSServerAddressInAntrea

additionally, the config should be restored at the end of the test

Comment on lines 5393 to 5396
assert.EventuallyWithT(t, func(t *assert.CollectT) {
_, err := curlFQDN(fqdnIP)
assert.Error(t, err)
}, 10*time.Second, 1*time.Second)
}, 20*time.Second, 1*time.Second)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the assertion should be different based on whether fqdnCacheMinTTL is set or not? otherwise there is something wrong with the test.

// The wait time here should be slightly longer than the reload value specified in the custom DNS configuration.
// TODO: This assertion currently verifies the issue described in https://github.com/antrea-io/antrea/issues/6229.
// It will need to be updated once minTTL support is implemented.
// TODO: This assertion verifies the fix to the issue described in https://github.com/antrea-io/antrea/issues/6229.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is no longer a TODO

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not addressed

@@ -155,6 +155,11 @@ func (o *Options) validate(args []string) error {
return fmt.Errorf("nodeType %s requires feature gate ExternalNode to be enabled", o.config.NodeType)
}

// validate FqdnCacheMinTTL
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to have the comment as the code is obvious and the comment doesn't provide more information

@@ -196,7 +196,7 @@ func NewNetworkPolicyController(antreaClientGetter client.AntreaClientProvider,
gwPort, tunPort uint32,
nodeConfig *config.NodeConfig,
podNetworkWait *utilwait.Group,
l7Reconciler *l7engine.Reconciler) (*Controller, error) {
l7Reconciler *l7engine.Reconciler, fqdnCacheMinTTL uint32) (*Controller, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use a separate line for the new argument like other args.

@@ -158,7 +158,7 @@ type AgentConfig struct {
// The minTTL setting helps address the problem of applications caching DNS response IPs indefinitely.
// The Cluster administrators should configure this value, ideally setting it to be equal to or greater than the maximum TTL
// value of the application's DNS cache.
MinTTL int `yaml:"minTTL,omitempty"`
FqdnCacheMinTTL int `yaml:"fqdnCacheMinTTL,omitempty"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment not addressed

Comment on lines 5278 to 5279
// applications might cache them, thereby preventing intermittent network connectivity issues to the FQDN concerned.
// Actual test logic runs in testWithFQDNCacheMinTTL, which gets called by TestFQDNCacheMinTTL with 2 fqdnCacheMinTTL values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two lines are obviously longer than other lines.

And remove the extra space between "by TestFQDNCacheMinTTL"

@@ -5383,8 +5388,8 @@ spec:
# Using "kubectl logs/exec/attach/cp" doesn't have to specify "-c antrea-agent" when troubleshooting.
kubectl.kubernetes.io/default-container: antrea-agent
# Automatically restart Pods with a RollingUpdate if the ConfigMap changes
# See https://helm.sh/docs/howto/charts_tips_and_tricks/#automatically-roll-deployments
checksum/config: ee43fc8c8c5ac8097a757da0545e43b25f98d8f2d831842f7ee76e1ed7581267
# See https://helm.sh/docs/howto/charts_tips_and_tricks/#automatically-roll-deploymentstl
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not addressed

Comment on lines 310 to 314
# fqdnCacheMinTTL helps address the issue of applications caching DNS response IPs beyond the TTL value for the DNS record.
# It is used to enforce FQDN policy rules, ensuring that resolved IPs are included in datapath rules for as long as the application caches them.
# Ideally, this value should be set to the maximum caching duration across all applications.
fqdnCacheMinTTL: {{ .Values.fqdnCacheMinTTL }}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be good to wrap the lines here as well

Comment on lines 190 to 192
# fqdnCacheMinTTL helps address the issue of applications caching DNS response IPs beyond the TTL value for the DNS record.
# It is used to enforce FQDN policy rules, ensuring that resolved IPs are included in datapath rules for as long as the application caches them.
# Ideally, this value should be set to the maximum caching duration across all applications.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Comment on lines 5283 to 5284
t.Run("FQDNCacheMinTTL-unset", func(t *testing.T) { testWithFQDNCacheMinTTL(t, 0) })
t.Run("FQDNCacheMinTTL-set-to-10s", func(t *testing.T) { testWithFQDNCacheMinTTL(t, 10) })
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you adding dashes to the sub-test names?

Note that the go testing framework takes care of escaping whitespaces automatically.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@antoninbas Actually , i was getting the following error , hence i had done the said change .

fixtures.go:288: Creating 'testfqdncacheminttl-fqdncacheminttl_unset-c2njzgdw' K8s Namespace
    fixtures.go:355: Exporting test logs to '/tmp/antrea-test-2503355158/TestFQDNCacheMinTTL_FQDNCacheMinTTL_unset/afterSetupTest.Dec04-00-40-40'
    
    antreapolicy_test.go:5301: Error when setting up test: error when creating 'testfqdncacheminttl-fqdncacheminttl_unset-c2njzgdw'
     Namespace: Namespace "testfqdncacheminttl-fqdncacheminttl_unset-c2njzgdw" is invalid: metadata.name: Invalid value: "testfqdncacheminttl-fqdncacheminttl_unset-c2njzgdw": 
     a lowercase RFC 1123 label must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name',  or '123-abc', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?')

After you pointed this out, Initially i thought maybe i should look at the test framework code to check if we are doing this replacement of whitespace with underscores but i found its not being done there as the setupTest function gets the name from *testing.T type, hence its being modified by go.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolved as per guidance in slack.

// The wait time here should be slightly longer than the reload value specified in the custom DNS configuration.
// TODO: This assertion currently verifies the issue described in https://github.com/antrea-io/antrea/issues/6229.
// It will need to be updated once minTTL support is implemented.
// TODO: This assertion verifies the fix to the issue described in https://github.com/antrea-io/antrea/issues/6229.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not addressed

@@ -5322,8 +5334,8 @@ func TestFQDNCacheMinTTL(t *testing.T) {
createCustomDNSPod(t, data, configMap.Name)

// set the custom DNS server IP address in Antrea ConfigMap.
setDNSServerAddressInAntrea(t, data, dnsServiceIP)
defer setDNSServerAddressInAntrea(t, data, "") //reset after the test.
setDNSServerAddressInAntrea(t, data, dnsServiceIP, fqdnCacheMinTTL)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename setDNSServerAddressInAntrea to configureFQDNPolicyEnforcement, which is more generic and more suitable since you are not configuring 2 different things

Comment on lines 5396 to 5399
assert.EventuallyWithT(t, func(t *assert.CollectT) {
_, err := curlFQDN(fqdnIP)
assert.NoError(t, err)
}, time.Duration(fqdnCacheMinTTL)*time.Second, 1*time.Second)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it feels like this should be assert.Never. I don't understand how assert.Eventually is valid.

- Fixed final e2e assertion by using assert.Never
- Fixed various formatting issues as pointe di review
- Renamed the function which modifies DNS ConfigMap

Signed-off-by: Hemant <[email protected]>
assert.Never(t, func() bool {
_, err := curlFQDN(fqdnIP)
return err != nil
}, time.Duration(fqdnCacheMinTTL)*time.Second, 1*time.Second)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels like the test is likely to be "flaky" in practice. The assertion will verify that the condition is not met for fqdnCacheMinTTL seconds. However, by the time we start running the assertion, Antrea may have cached the IP for a few seconds already? So assuming that fqdnCacheMinTTL is 10s (as is the case currently), could the assertion actually fail after 7 or 8s?

Unless I am missing something, we should probably have some logic like this:

// get timestamp before the Pod resolves the FQDN for the first time
startCacheTime := time.Now()

// this is the first time we make the Pod curl the FQDN
assert.EventuallyWithT(t, func(t *assert.CollectT) {
        _, err := curlFQDN(testFQDN)
        assert.NoError(t, err)
}, 2*time.Second, 100*time.Millisecond, "failed to curl test FQDN: ", testFQDN)

// rest of the test code

// when we get to the "Never" assertion, we need to deduct some time from fqdnCacheMinTTL
// the extra 1s is for "safety"
waitFor := time.Duration(fqdnCacheMinTTL)*time.Second - time.Since(startCacheTime) - 1*time.Second
assert.Never(t, func() bool {
        _, err := curlFQDN(fqdnIP)
        return err != nil
}, waitFor, 1*time.Second)

It may make sense to increase the minTTL value from 10s to 20s for the test IMO. I am not exactly sure how much time elapses between startCacheTime and this assertion being executed.

@@ -5368,26 +5380,34 @@ func TestFQDNCacheMinTTL(t *testing.T) {
require.NoError(t, data.setPodAnnotation(data.testNamespace, "custom-dns-server", "test.antrea.io/random-value",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not introduced in this PR, but I just noticed the following assertion above:

	assert.EventuallyWithT(t, func(t *assert.CollectT) {
		_, err := curlFQDN(testFQDN)
		assert.NoError(t, err)
	}, 2*time.Second, 1*time.Millisecond, "failed to curl test FQDN: ", testFQDN)

1ms is much too frequent to be executing the condition function. Please update to 100ms.

@@ -5368,26 +5380,34 @@ func TestFQDNCacheMinTTL(t *testing.T) {
require.NoError(t, data.setPodAnnotation(data.testNamespace, "custom-dns-server", "test.antrea.io/random-value",
randSeq(8)), "failed to update custom DNS Pod annotation.")

// finally verify that Curling the previously cached IP fails after DNS update.
// finally verify that Curling the previously cached IP does not fail after DNS update.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// finally verify that Curling the previously cached IP does not fail after DNS update.
// finally verify that Curling the previously cached IP does not fail after DNS update, as long as fqdnCacheMinTTL is set.

- added a waitFor time equivalent to the difference of time elapsed since last DNS request.
- updated manifests using `make manifests` for build fail.
- updated helm chart docs.

Signed-off-by: Hemant <[email protected]>
Comment on lines 5390 to 5400
// Calculate `waitFor` to determine the duration to wait for the 'Never' assertion.
// This accounts for the elapsed time since the initial DNS request was made from the Pod
// and the start of the FQDN cache's minimum TTL (fqdnCacheMinTTL). The duration is reduced
// by 1 second as a buffer acting as a safety margin.
safetyMargin := 1 * time.Second
waitFor := (time.Duration(fqdnCacheMinTTL)*time.Second - time.Since(startCacheTime)) - safetyMargin

if waitFor <= 0 {
t.Logf("Invalid waitFor computed: %v. Clamping to 1 second.", waitFor)
waitFor = 1 * time.Second
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be moved to the else block of the code below. We only need that for the case where fqdnCacheMinTTL is greater than 0.
I would also remove the sanity checking code for waitFor and replace it with require.GreaterOrEqual(t, waitFor, 5*time.Second). I think the test is only meaningful if waitFor is large enough. If this requirement fails, it either means that there is an issue with the test or that fqdnCacheMinTTL should be set to a larger value.

Copy link
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Only a handful of comments to address.
@tnqn @Dyanngg do you want to take another look?

func TestFQDNCacheMinTTL(t *testing.T) {
t.Run("minTTLUnset", func(t *testing.T) { testWithFQDNCacheMinTTL(t, 0) })
t.Run("minTTL10s", func(t *testing.T) { testWithFQDNCacheMinTTL(t, 10) })
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
t.Run("minTTL10s", func(t *testing.T) { testWithFQDNCacheMinTTL(t, 10) })
t.Run("minTTL20s", func(t *testing.T) { testWithFQDNCacheMinTTL(t, 20) })

In #6808 (comment), this is what I was actually recommending

assert.EventuallyWithT(t, func(t *assert.CollectT) {
_, err := curlFQDN(fqdnIP)
assert.Error(t, err)
}, 20*time.Second, 1*time.Second)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can keep the timeout as 10s here. I think there was some confusion with my request in #6808 (comment)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, got it.

waitFor := (time.Duration(fqdnCacheMinTTL)*time.Second - time.Since(startCacheTime)) - safetyMargin
require.GreaterOrEqual(t, waitFor, 5*time.Second)

// fqdnCacheMinTTL is set hence we expect no error at least till the period equivalent to fqdnCacheMinTTL's value.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// fqdnCacheMinTTL is set hence we expect no error at least till the period equivalent to fqdnCacheMinTTL's value.
// fqdnCacheMinTTL is set hence we expect no error at least until fqdnCacheMinTTL expires.

@@ -350,7 +353,7 @@ func TestDeleteFQDNRule(t *testing.T) {
func TestLookupIPFallback(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hkiiita this test is failing for your PR. The test itself can probably be improved, but right now it is failing because you are enabling IPv6 unconditionally in newMockFQDNController, and this test cannot handle IPv6.
My recommendation for this PR would be to revert the change in newMockFQDNController so that IPv6 stays disabled by default. Then in your new test function (TestParseDNSResponseOnFQDNCacheMinTTL), which requires IPv6, you manually enabled it like this:

f, _ := newMockFQDNController(t, controller, nil, fakeClock, tc.fqdnCacheMinTTL)
f.ipv6Enabled = true // this test needs IPv6 enabled

I think this is good enough for now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, did the changes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The integration test seems to be failing , am not sure but it seems to be failing at

gobgp_test.go:108: Getting peers of BGP server1 and verifying them while verifying peers for BGP Server 1 , i checked the test and though i cant find it but i am wondering if its related to above change in any way ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can ignore the integration test failure here. It is unrelated to your change. The test is known to be flaky and we already have an issue to track it: #6501

Copy link
Contributor

@Dyanngg Dyanngg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall once Antonin's comments are addressed

@@ -155,6 +155,10 @@ func (o *Options) validate(args []string) error {
return fmt.Errorf("nodeType %s requires feature gate ExternalNode to be enabled", o.config.NodeType)
}

if o.config.FQDNCacheMinTTL < 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: not necessarily need to be part of this PR, but I just want to note @salv-orlando's comment in the community meeting: we should consider setting a sensible max value for this field, so that Antrea will not just cache each FQDN ip forever

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking on it (i am inexperienced in a real world scenario for this though) and thought that IPs mapped to FQDN's wont probably update/change at a rate that it might make us ending up with too many datapath rules for each IP ? and maybe can this be left to the cluster admin to determine and set it ? What if the admin wishes to place a large value , for his own use case, which surpasses the 'max' that we set ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know about a max value for FQDNCacheMinTTL, but we should have a upper bound on the size of the cache (which would also be a upper bound on the number of datapath rules). This is independent of this feature IMO, because I could also have a DNS server returning records with a very high TTL, and it seems that the issue would be the same.

Copy link
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for all the work

@antoninbas
Copy link
Contributor

/test-all

@antoninbas antoninbas changed the title Implementation for minTTL Add fqdnCacheMinTTL configuration parameter Dec 11, 2024
@antoninbas antoninbas added the area/network-policy Issues or PRs related to network policies. label Dec 11, 2024
@antoninbas antoninbas merged commit e404ccc into antrea-io:main Dec 11, 2024
58 of 60 checks passed
@antoninbas antoninbas added the action/release-note Indicates a PR that should be included in release notes. label Dec 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
action/release-note Indicates a PR that should be included in release notes. area/network-policy Issues or PRs related to network policies.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants