High memory usage when using coraza plugin #9

Open
mwantia opened this issue May 30, 2024 · 14 comments

@mwantia

mwantia commented May 30, 2024

I am currently trying to implement the coraza plugin into traefik, which sits behind a cloudflare tunnel for external access.

As soon as I activate the middleware for the services traefik starts using a lot of memory.
I increased Traefik's memory limit to 4 GB, which was consumed immediately after navigating just twice.
After the third navigation, Traefik fails with an OOM error and restarts.

I can't imagine that this kind of memory usage is expected.
There also seems to be an ongoing discussion about the same topic here, where Traefik even seems to consume about 32 GB of memory.

I removed most of the other configuration, since it shouldn't be relevant, but this is the config Traefik is running with:

# /secrets/traefik.yaml
experimental:
  plugins:
    coraza:
      moduleName: github.com/jcchavezs/coraza-http-wasm-traefik
      version: v0.2.1
providers:
  file:
    directory: /local/config

# /local/config/waf.yaml
http:
  middlewares:
    waf:
      plugin:
        coraza:
          directives:
            - SecRuleEngine On
            - SecDebugLog /dev/stdout
#           - Include @owasp_crs/**.conf
#           - Include /directives/*.conf

I intentionally commented out the other two Include directives, but even with such a bare-bones configuration I get an OOM after a handful of requests.

@mwantia
Author

mwantia commented May 31, 2024

Some updates:

I tinkered with my setup, mostly adjusting the LogLevel, AccessLog and directives, and it now seems to run smoothly at between 2.5 and 2.8 GB.
I'm not sure yet why it behaves like this, so I will have to experiment with my settings again later to see if there are any noticeable changes.

I also set the LogLevel to debug, since none of the other Coraza options seem to change anything, and noticed that the following output is repeated every few requests.

2024-05-31T08:26:19Z DBG github.com/traefik/traefik/v3/pkg/logs/wasm.go:31 > Initializing WAF with directives:
SecRuleEngine On
SecDebugLog /dev/stdout
SecDebugLogLevel 2
Include @crs-setup.conf.example
SecRule REQUEST_URI "@streq /xyz" "id:101,phase:1,log,deny,status:403"

Isn't this part of the main function and only used during initialization?
I'm not that knowledgeable in Go programming, and even less so when it comes to Traefik plugins, but I would expect to see this log only at startup and not on every request.
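
To put a number on how often the WAF is re-initialized, the debug line can be counted in a captured log. The following is only a minimal sketch: it assumes the log lines start with an RFC 3339 timestamp as in the output above, and the function name `count_inits_per_minute` is made up for illustration.

```python
from collections import Counter

# Marker text taken from the debug output quoted above.
INIT_MARKER = "Initializing WAF with directives"


def count_inits_per_minute(lines):
    """Count WAF initialization messages, grouped by minute.

    Assumes each matching line begins with a timestamp such as
    2024-05-31T08:26:19Z followed by a space.
    """
    per_minute = Counter()
    for line in lines:
        if INIT_MARKER in line:
            timestamp = line.split(" ", 1)[0]
            per_minute[timestamp[:16]] += 1  # keep YYYY-MM-DDTHH:MM
    return per_minute
```

Passing an open log file (for example `count_inits_per_minute(open(path))`, with `path` pointing at wherever the Traefik output was saved) and comparing the counts with the request rate in the access log would show whether initialization really happens every few requests.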

Additionally, this is the configuration Traefik is currently running with.
I will try to see if there are any noticeable changes or spikes in usage over the weekend.

experimental:
  plugins:
    traefik-real-ip:
      modulename: github.com/soulbalz/traefik-real-ip
      version: v1.0.3
    geoblock:
      moduleName: github.com/PascalMinder/GeoBlock
      version: v0.2.2
    coraza:
      moduleName: github.com/jcchavezs/coraza-http-wasm-traefik
      version: v0.2.1

entrypoints:
  websecure:
    address: ':443'
    forwardedHeaders:
      insecure: true
    http:
      tls: true
      middlewares:
        - 'realip@file'
        - 'geoblock-de@file'
        - 'waf@file'

global:
  sendAnonymousUsage: false
  checkNewVersion: false

api:
  dashboard: true
  insecure: true

metrics:
  prometheus:
    addRoutersLabels: true
    addServicesLabels: true

ping: {}
log:
  level: DEBUG
accessLog: {}

providers:
  file:
    directory: /local/config
  consulcatalog:
    endpoint:
      address: 'consul.service.consul:8501'
      scheme: https
      token: '${CONSUL_TOKEN}'
      tls:
        insecureSkipVerify: false
    connectAware: true
    connectByDefault: true
    exposedByDefault: false
    defaultRule: 'Host(`{{ .Name }}.${DOMAIN}`)'
    constraints: 'TagRegex(`cloudflare.enable=true`)'
http:
 middlewares:
   waf:
     plugin:
       coraza:
         directives:
           - SecRuleEngine On
           - SecDebugLog /dev/stdout
           - SecDebugLogLevel 9
           - Include @crs-setup.conf.example
           - SecRule REQUEST_URI "@streq /xyz" "id:101,phase:1,log,deny,status:403" # Testing
#          - Include @owasp_crs/**.conf
           - Include /directives/*.conf

@attrib

attrib commented May 31, 2024

Can confirm this issue.

With

    waf:
      plugin:
        coraza:
          directives:
            - SecRuleEngine On
            - SecDebugLog /dev/stdout
            - SecDebugLogLevel 9
            - SecRule REQUEST_URI "@streq /wp-admin" "id:101,phase:1,log,deny,status:403"   

the Traefik process uses a bit under 2GB; without it, 100MB.

@elkinaguas

elkinaguas commented Jun 4, 2024

Hello, I noticed a similar behavior using Traefik in binary mode and the Coraza middleware. Here is my experience in case it can be useful to someone.

The system's RAM usage is 2.4G before starting Traefik with Coraza and 2.7G after starting it.

First test:
I set up a Python server behind Traefik and, using another Python script, sent 100 requests (all of which reach the Python server) to the Traefik entrypoint, with a 100ms sleep between requests. After this test the RAM increased to 4.8G. What is interesting in my opinion is that the RAM doesn't seem to go back down to 2.7G. I waited 10 minutes without sending any traffic and the RAM only came down to 3.9G. I ran another test with 200 requests instead of 100 and this didn't seem to change the RAM usage: it went up to 4.8G again.

Second test:
I changed my Python script to send 100 requests to 5 different URLs (1 URL that reaches the Python server and 4 URLs that are filtered out by the Coraza middleware) one after the other, which will make a total of 500 requests, with a 100ms sleep time between requests. After running the script three times this is what I got:

First time ------------ RAM: 6.0G
Second time ------- RAM: 7.3G
Third time ----------- RAM: 8.3G

After waiting 10 minutes without sending traffic the RAM came down to 6.0G.

I ran the same tests without the Coraza middleware and the RAM didn't even budge: it stayed at 2.4G before starting Traefik, after starting Traefik, and during the traffic tests.
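
For anyone who wants to repeat these tests, a load script along the following lines might do. It is a sketch and not the script used above: the names `send_get` and `run_load` are invented for illustration, and it assumes "one after the other" means finishing all requests to one URL before moving on to the next.

```python
import time
import urllib.error
import urllib.request


def send_get(url, timeout=5.0):
    """Send one GET request and return the HTTP status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # A 403 from the WAF still counts as a completed request.
        return err.code


def run_load(urls, requests_per_url=100, delay_s=0.1, send=send_get):
    """Send requests_per_url requests to each URL in turn, sleeping between them.

    Returns a dict mapping status code to the number of responses seen.
    """
    counts = {}
    for url in urls:
        for _ in range(requests_per_url):
            status = send(url)
            counts[status] = counts.get(status, 0) + 1
            time.sleep(delay_s)
    return counts
```

Calling `run_load` with a single URL corresponds to the first test, and with five URLs (one allowed, four blocked) to the second. The returned status counts are a quick check that the blocked URLs were in fact rejected by the middleware.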

Here is my config:

waf:
  plugin:
    coraza:
      directives:
        - SecRuleEngine On
        - SecDebugLog /dev/stdout
        - SecDebugLogLevel 9
        - Include @crs-setup.conf.example
        - Include @owasp_crs/**.conf

@jcchavezs
Owner

Hi everyone, thanks for coming by this repository.

The problem seems to be very similar to what we experienced in corazawaf/coraza-proxy-wasm#249. Although the code is different, what the two have in common is the GC, and that could be the issue.

One way to slice and dice this issue is, first, to compare requests with and without a payload and, second, to set SecRequestBodyAccess Off in the directives before Include @owasp_crs/**.conf, because the main hunch here is that the space we allocate for request bodies is the source of the problem.
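
The first half of that experiment could be prepared roughly as follows. This is a sketch under assumptions: the helper names `build_request` and `build_batches` are hypothetical, and the payload is arbitrary filler bytes rather than anything the CRS rules would react to.

```python
import urllib.request


def build_request(url, payload=None):
    """Build a GET request without a body, or a POST request carrying payload bytes."""
    if payload is None:
        return urllib.request.Request(url, method="GET")
    headers = {"Content-Type": "application/octet-stream"}
    return urllib.request.Request(url, data=payload, headers=headers, method="POST")


def build_batches(url, count, payload_size):
    """Return two equally sized batches: one without bodies, one with bodies."""
    body = b"x" * payload_size
    without_body = [build_request(url) for _ in range(count)]
    with_body = [build_request(url, body) for _ in range(count)]
    return without_body, with_body
```

Sending each batch separately with `urllib.request.urlopen` and noting Traefik's memory after each one would show whether request bodies make a difference.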

@jcchavezs
Owner

jcchavezs commented Jun 4, 2024

In the meantime I released https://github.com/jcchavezs/coraza-http-wasm-traefik/releases/tag/v0.2.2, which attempts to introduce minor performance improvements. It would be amazing if any of you could test it.

@markuskirch

markuskirch commented Jun 17, 2024

We're currently facing the same issue while testing the coraza Traefik plugin.

Problem Description

The Traefik Coraza Plugin leads to very high memory usage on our servers.

The memory used by the Traefik container grows with the container lifetime until the server is out of memory (16GB), and docker restarts the container. This currently happens roughly every hour.

Configuration

Traefik v3.0
coraza-http-wasm-traefik v0.2.2

We run the following directives:

- SecRuleEngine On
- SecDebugLog /dev/stdout
- SecDebugLogLevel 3

- SecRequestBodyAccess On
- SecResponseBodyAccess Off

# set default error handling
- SecDefaultAction "phase:1,log,auditlog,deny,status:403"
- SecDefaultAction "phase:2,log,auditlog,deny,status:403"

# whitelist a trusted server used for end-to-end testing
- SecRule REMOTE_ADDR "@ipMatch 100.100.100.100" "id:1237,phase:1,allow"

# block access to specific paths
- SecRule REQUEST_URI "@rx \/web\/database\/.*" "id:1239,phase:1,log,deny,status:403,msg:'Access Denied'"

# Limit the size of the request body
- SecRequestBodyLimit 5242880 #5M
- SecRequestBodyNoFilesLimit 1048576 #1M
- SecRequestBodyInMemoryLimit 1048576 #1M

# Block SQL injection and XSS attacks
- SecRule ARGS "@detectSQLi" "id:1234,phase:2,log,deny,status:403,msg:'SQL Injection Detected'"
- SecRule ARGS "@detectXSS" "id:1235,phase:2,log,deny,status:403,msg:'XSS Attack Detected'"

# Block upload of files with dangerous extensions
- SecRule FILES_TMPNAMES "@rx \.(exe|bat|cmd|sh|php|pl|py)$" "id:1236,phase:2,log,deny,status:403,msg:'File Type Denied'"

Troubleshooting

We tried the following measures:

  • Disabled request body access SecRequestBodyAccess Off
  • Removed all phase 2 rules (SecDefaultAction "phase:2[...]", SecRule ARGS "@detectSQLi", SecRule ARGS "@detectXSS", SecRule FILES_TMPNAMES)
  • Reduced body size limit
  • Reduced log level to 2

None of the measures above led to Traefik leveling off below 16GB of memory usage, although disabling request body access and all phase 2 rules made the container gain memory more slowly (about 1 hour between container restarts, compared to about 30 minutes with request body access).

/proc/meminfo indicates that lots of memory is reserved but inactive.
We're wondering if there's a connection between the max body size and the reserved memory.
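
To compare these numbers across runs, the relevant /proc/meminfo fields can be extracted with a few lines of Python. This is a sketch only: the function names `parse_meminfo` and `summarize` are invented here, and the default selection of fields is just one reasonable guess at what is worth watching.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo content into a dict of field name -> integer value.

    Most values are in kB; fields without a unit (e.g. HugePages_Total)
    are plain counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not "Name: value"
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def summarize(fields, keys=("MemTotal", "MemAvailable", "Active", "Inactive")):
    """Return the selected kB fields converted to GiB, skipping missing ones."""
    return {key: fields[key] / (1024 * 1024) for key in keys if key in fields}
```

On a Linux host, reading /proc/meminfo and passing its contents through `summarize(parse_meminfo(...))` at intervals, once with a large and once with a small SecRequestBodyLimit, would give a first indication of whether the body size limit and the inactive memory are connected.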

Any thoughts on the issue?

Thanks for your dedication to the project!

@david-garcia-garcia

david-garcia-garcia commented Jul 2, 2024

Same issue here, memory usage grows even with minimal usage until pod is killed.

v0.2.2 has the same issue, tested.

@Mike-the-one

Plan to use this plugin... is the memory leak still an issue?

@markuskirch

Yes, unfortunately the problem hasn't been found or fixed.

There is an interesting idea for a workaround here: traefik/yaegi#1590 (comment)

@Mike-the-one

@jcchavezs
Owner

This PR is up http-wasm/http-wasm-host-go#86 and hopefully it will help in here.

@sourabh-agrawal

"This PR is up http-wasm/http-wasm-host-go#86 and hopefully it will help in here."

Hey @jcchavezs how can I get the fix running?
Do you have a timeline for releasing the new version (0.2.3) of the coraza waf plugin with this fix?

@ravenolf

ravenolf commented Nov 7, 2024

We tested the newest version v0.3.0 on a small infrastructure. Typically, Traefik uses less than 100 MB of RAM in this setup. Once we enabled the plugin and configured the CRS and OWASP rules, it seemed to exhibit the same behaviour with significantly higher memory usage, going up to 2.5 GB with only a few HTTP requests reaching the reverse proxy. I assume this means the memory issue still persists?

Adding some details for reference:

  • Traefik version: v3.2.0
  • Plugin Version: v0.3.0
  • Deployment Type: Kubernetes
  • Configuration:
  plugin:
    coraza:
      crsEnabled: true
      directives:
        - Include @coraza.conf-recommended
        - Include @crs-setup.conf.example
        - Include @owasp_crs/*.conf
        - SecRuleEngine On

I'm not very experienced with Go or memory profiling, so this is the extent of what I can do.
But I wouldn't mind testing it again once there are more fixes!

@lva-itscope

I am not quite sure if it is related, but when I tested v0.3.0 with Traefik v3.2.0, the Coraza plugin significantly increased response time and CPU usage.
Without the plugin, a normal request takes about 10ms (95th percentile).
With the plugin involved, requests take about 1000ms (95th percentile).
CPU usage is around 10% without the plugin; with the plugin involved it spikes to around 70% while requests are being processed. (I tested on a virtual machine with 12 cores.)

Memory is also increasing, which is why I think it might all be related in some way.
Interestingly enough, when requesting the same URL several times in a row, response time decreases (is there some sort of caching?), and after trying again a few minutes later it is back to 1000ms.
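
For comparing latency figures between setups, it helps to agree on how the 95th percentile is computed. The sketch below uses the nearest-rank definition, which is an assumption and not necessarily how the numbers above were obtained; the function name `percentile` is illustrative.

```python
import math


def percentile(samples, pct):
    """Return the nearest-rank percentile (0 < pct <= 100) of a list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based nearest rank
    return ordered[rank - 1]
```

Per-request durations measured with `time.perf_counter()` around each call can be collected into a list and passed to `percentile(durations, 95)`, once with the middleware enabled and once without.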

Details:

  • Traefik Version: v3.2.0
  • Plugin Version: v0.3.0
  • Deployment: Docker
  • Configuration:
      plugin:
        coraza-waf:
          directives:
          # - SecDebugLog /dev/stdout
          # - SecDebugLogLevel 9
          - SecRule REQUEST_URI "@streq /admin" "id:101,phase:1,log,deny,status:403" 
          #  Allow some additional HTTP methods:
          # - SecAction "id:900200,phase:1,pass,t:none,nolog,setvar:'tx.allowed_methods=GET HEAD POST OPTIONS PUT PATCH DELETE CHECKOUT COPY LOCK MERGE MKACTIVITY MKCOL MOVE PROPFIND PROPPATCH UNLOCK REPORT'"
          # Allow some additional request content-types:
          - SecAction "id:900220,phase:1,pass,t:none,nolog,setvar:'tx.allowed_request_content_type=|application/x-www-form-urlencoded| |multipart/form-data| |multipart/related| |text/xml| |application/xml| |application/soap+xml| |application/json| |application/cloudevents+json| |application/cloudevents-batch+json| |text/plain| |application/proto|'"
          - SecRequestBodyAccess Off #Fix according to https://github.com/jcchavezs/coraza-http-wasm-traefik/issues/9#issuecomment-2146919384
          - Include @coraza.conf-recommended
          - Include @crs-setup.conf.example
          - Include @owasp_crs/**.conf
          - SecRuleEngine On


10 participants