[BUG] Compactor in AWS CN EKS with S3 - crashes with InvalidToken: The provided token is malformed or otherwise invalid #15379

Open
gn-hiro-v opened this issue Dec 12, 2024 · 0 comments

gn-hiro-v commented Dec 12, 2024

Describe the bug

  • I deployed Loki in both international AWS regions and AWS China. The international deployments work fine.
  • In AWS China, however, only the compactor pod crashes, with the following error:
    level=info ts=2024-12-12T04:21:14.588906833Z caller=main.go:126 msg="Starting Loki" version="(version=release-3.3.x-60f2af3, branch=release-3.3.x, revision=60f2af32)"
    level=info ts=2024-12-12T04:21:14.588954984Z caller=main.go:127 msg="Loading configuration file" filename=/etc/loki/config/config.yaml
    level=info ts=2024-12-12T04:21:14.59177017Z caller=server.go:351 msg="server listening on addresses" http=:3100 grpc=:9095
    level=info ts=2024-12-12T04:21:14.595201715Z caller=memberlist_client.go:439 msg="Using memberlist cluster label and node name" cluster_label= node=loki-compactor-0-d7b3df81
    level=info ts=2024-12-12T04:21:14.597680315Z caller=memberlist_client.go:549 msg="memberlist fast-join starting" nodes_found=1 to_join=4
    level=info ts=2024-12-12T04:21:14.61287991Z caller=memberlist_client.go:569 msg="memberlist fast-join finished" joined_nodes=6 elapsed_time=15.202955ms
    level=info ts=2024-12-12T04:21:14.612927801Z caller=memberlist_client.go:581 phase=startup msg="joining memberlist cluster" join_members=loki-memberlist
    level=info ts=2024-12-12T04:21:14.628418061Z caller=memberlist_client.go:588 phase=startup msg="joining memberlist cluster succeeded" reached_nodes=6 elapsed_time=15.48115ms
    init compactor: failed to init delete store: failed to get s3 object: InvalidToken: The provided token is malformed or otherwise invalid.
        status code: 400, request id: <>, host id: <>

    error initialising module: compactor
    github.com/grafana/dskit/modules.(*Manager).initModule
        /src/loki/vendor/github.com/grafana/dskit/modules/modules.go:138
    github.com/grafana/dskit/modules.(*Manager).InitModuleServices
        /src/loki/vendor/github.com/grafana/dskit/modules/modules.go:108
    github.com/grafana/loki/v3/pkg/loki.(*Loki).Run
        /src/loki/pkg/loki/loki.go:492
    main.main
        /src/loki/cmd/loki/main.go:129
    runtime.main
        /usr/local/go/src/runtime/proc.go:272
    runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:1700
    level=error ts=2024-12-12T04:21:19.868145407Z caller=log.go:216 msg="error running loki" err="init compactor: failed to init delete store: failed to get s3 object: InvalidToken: The provided token is malformed or otherwise invalid
  • Other pods look fine

To Reproduce
Steps to reproduce the behavior:

  1. Install Loki from the Helm repository https://grafana.github.io/helm-charts, using loki chart version 6.23.0
  2. Deploy to EKS in AWS China
  3. Use the following rendered config:
    auth_enabled: false
    bloom_build:
      builder:
        planner_address: loki-bloom-planner-headless.monitoring.svc.cluster.local:9095
      enabled: false
    bloom_gateway:
      client:
        addresses: dnssrvnoa+_grpc._tcp.loki-bloom-gateway-headless.monitoring.svc.cluster.local
      enabled: false
    chunk_store_config:
      chunk_cache_config:
        background:
          writeback_buffer: 500000
          writeback_goroutines: 1
          writeback_size_limit: 500MB
        default_validity: 0s
        memcached:
          batch_size: 4
          parallelism: 5
        memcached_client:
          addresses: dnssrvnoa+_memcached-client._tcp.loki-chunks-cache.monitoring.svc
          consistent_hash: true
          max_idle_conns: 72
          timeout: 2000ms
    common:
      compactor_address: 'http://loki-compactor:3100'
      path_prefix: /var/loki
      replication_factor: 3
      storage:
        s3:
          bucketnames: loki-aws-dev-chunks-3
          insecure: false
          region: us-east-1
          s3forcepathstyle: false
    compactor:
      delete_request_store: s3
      retention_enabled: true
    frontend:
      scheduler_address: loki-query-scheduler.monitoring.svc.cluster.local:9095
      tail_proxy_url: http://loki-querier.monitoring.svc.cluster.local:3100
    frontend_worker:
      scheduler_address: loki-query-scheduler.monitoring.svc.cluster.local:9095
    index_gateway:
      mode: simple
    ingester:
      chunk_encoding: snappy
    limits_config:
      allow_structured_metadata: true
      max_cache_freshness_per_query: 10m
      query_timeout: 300s
      reject_old_samples: true
      reject_old_samples_max_age: 168h
      retention_period: 672h
      split_queries_by_interval: 15m
      volume_enabled: true
    memberlist:
      join_members:
      - loki-memberlist
    pattern_ingester:
      enabled: true
    querier:
      max_concurrent: 4
    query_range:
      align_queries_with_step: true
      cache_results: true
      results_cache:
        cache:
          background:
            writeback_buffer: 500000
            writeback_goroutines: 1
            writeback_size_limit: 500MB
          default_validity: 12h
  4. Check the compactor pod's logs
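
One detail in the rendered config above is that common.storage.s3.region is us-east-1, while the cluster runs in cn-northwest-1. Whether the compactor's delete store actually reads that value is an assumption here, not something this report confirms. A minimal Python sketch for spotting such leftover regions, run against a hand-copied excerpt of the config (the helper names `find_regions` and `mismatched_regions` are hypothetical, made up for illustration):

```python
from typing import Any, Iterator


def find_regions(node: Any, path: str = "") -> Iterator[tuple[str, str]]:
    """Yield (dotted_path, value) for every string-valued `region` key."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else str(key)
            if key == "region" and isinstance(value, str):
                yield child, value
            else:
                yield from find_regions(value, child)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_regions(item, f"{path}[{index}]")


def mismatched_regions(config: dict, expected: str) -> list[tuple[str, str]]:
    """Return every region setting that differs from the expected region."""
    return [(p, r) for p, r in find_regions(config) if r != expected]


# Excerpt of the rendered config from step 3 (only the relevant keys).
rendered = {
    "common": {
        "storage": {
            "s3": {
                "bucketnames": "loki-aws-dev-chunks-3",
                "insecure": False,
                "region": "us-east-1",
                "s3forcepathstyle": False,
            }
        }
    },
    "compactor": {"delete_request_store": "s3", "retention_enabled": True},
}

for path, region in mismatched_regions(rendered, expected="cn-northwest-1"):
    print(f"{path}: {region} (expected cn-northwest-1)")
```

On this excerpt the only setting reported is common.storage.s3.region. Loading the full config file instead of a literal dict would need a YAML parser, which is left out to keep the sketch dependency-free.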

Expected behavior

  • The Loki compactor pod starts and runs without crashing

Environment:

  • Infrastructure: EKS in AWS China (cn-northwest-1)
  • Deployment tool: helm

Screenshots, Promtail config, or terminal output

  • Policy attached to the role that is associated through EKS Pod Identity
{
    "Statement": [
        {
            "Action": "s3:ListBucket",
            "Effect": "Allow",
            "Resource": "arn:aws-cn:s3:::<bucket_name>"
        },
        {
            "Action": "s3:*Object",
            "Effect": "Allow",
            "Resource": "arn:aws-cn:s3:::<bucket_name>/*"
        }
    ],
    "Version": "2012-10-17"
}
  • Helm values
# https://github.com/grafana/loki/blob/main/production/helm/loki/values.yaml
loki:
  auth_enabled: false # "error from loki: no org id" - https://community.grafana.com/t/error-connecting-loki-data-source-to-kube-prometheus-stack/133296
  schemaConfig:
    configs:
      - from: "2024-04-01"
        store: tsdb
        object_store: s3
        schema: v13
        index:
          prefix: loki_index_
          period: 24h
  storage_config:
    aws:
      region: us-east-1 # will be overridden
      bucketnames: loki-aws-dev-chunks-1 # will be overridden
      s3forcepathstyle: false
  ingester:
    chunk_encoding: snappy
  pattern_ingester:
    enabled: true
  limits_config:
    allow_structured_metadata: true
    volume_enabled: true
    retention_period: 672h # 28 days retention
  compactor:
    retention_enabled: true
    delete_request_store: s3
  ruler:
    enable_api: true
    storage:
      type: s3
      s3:
        region: us-east-1 # will be overridden
        bucketnames: loki-aws-dev-chunks-2 # will be overridden
        s3forcepathstyle: false
      alertmanager_url: http://prom:9093 # The URL of the Alertmanager to send alerts (Prometheus, Mimir, etc.)

  querier:
    max_concurrent: 4

  storage:
    type: s3
    bucketNames:
      chunks: "loki-aws-dev-chunks-3" # will be overridden
      ruler: "loki-aws-dev-chunks-4" # will be overridden
      region: "us-east-1" # will be overridden
      s3forcepathstyle: false
      # admin: "<Insert s3 bucket name>" # Your actual S3 bucket name (loki-aws-dev-admin) - GEL customers only
    s3:
      region: us-east-1 # will be overridden
      #insecure: false
    # s3forcepathstyle: false

serviceAccount:
  create: true
  name: loki-sa
  annotations:
    "eks.amazonaws.com/role-arn": "arn:aws:iam::<Account ID>:role/LokiServiceAccountRole" # will be overridden

deploymentMode: Distributed

ingester:
  replicas: 1 # 3
  persistence:
    storageClass: gp3
    accessModes:
      - ReadWriteOnce
    size: 10Gi

querier:
  replicas: 1 # 3
  # maxUnavailable: 2
  persistence:
    storageClass: gp3
    accessModes:
      - ReadWriteOnce
    size: 10Gi

queryFrontend:
  replicas: 1 # 2
  maxUnavailable: 1

queryScheduler:
  replicas: 1 # 2

distributor:
  replicas: 1 # 3
  # maxUnavailable: 2

compactor:
  replicas: 1
  persistence:
    storageClass: gp3
    accessModes:
      - ReadWriteOnce
    size: 10Gi
  shared_store: s3
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h
  retention_delete_worker_count: 150
  working_directory: /var/loki/compactor

indexGateway:
  replicas: 1 # 2
  # maxUnavailable: 1
  persistence:
    storageClass: gp3
    accessModes:
      - ReadWriteOnce
    size: 10Gi

ruler:
  replicas: 1
  # maxUnavailable: 1
  persistence:
    storageClass: gp3
    accessModes:
      - ReadWriteOnce
    size: 10Gi

backend:
  replicas: 0
read:
  replicas: 0
write:
  replicas: 0

singleBinary:
  replicas: 0

# https://github.com/grafana/loki/issues/9849
test:
  enabled: false

lokiCanary:
  enabled: false
  • Our EKS cluster has no active OIDC provider URL because we use EKS Pod Identity.
  • I noticed that loki.storage.bucketNames.region is still us-east-1. Could that affect the compactor's connection to the bucket? Note that I do override loki.storage_config.aws.region and loki.ruler.storage.s3.region to cn-northwest-1, which is the correct region.
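
Two facts make a region mix-up plausible here: the China regions belong to their own `aws-cn` partition, and their S3 endpoints end in `amazonaws.com.cn` rather than `amazonaws.com`. The Python sketch below compares the partition of each ARN quoted above with the partition implied by the target region. The helper names are hypothetical and the partition rule is deliberately simplified (other partitions exist and are ignored). The role ARN in the values file is marked as overridden at deploy time, so an `aws` result for it may only reflect the placeholder.

```python
def partition_for_region(region: str) -> str:
    """Simplified mapping from a region name to its AWS partition."""
    if region.startswith("cn-"):
        return "aws-cn"
    if region.startswith("us-gov-"):
        return "aws-us-gov"
    return "aws"


def arn_partition(arn: str) -> str:
    """Extract the partition field from arn:partition:service:region:account:resource."""
    parts = arn.split(":")
    if len(parts) < 6 or parts[0] != "arn":
        raise ValueError(f"not an ARN: {arn!r}")
    return parts[1]


def s3_regional_endpoint(region: str) -> str:
    """Regional S3 endpoint host; China regions use the .com.cn DNS suffix."""
    suffix = "amazonaws.com.cn" if partition_for_region(region) == "aws-cn" else "amazonaws.com"
    return f"s3.{region}.{suffix}"


target_region = "cn-northwest-1"

# ARNs copied from the policy and the Helm values in this report.
arns = {
    "policy bucket": "arn:aws-cn:s3:::<bucket_name>",
    "serviceAccount role": "arn:aws:iam::<Account ID>:role/LokiServiceAccountRole",
}

expected = partition_for_region(target_region)
for name, arn in arns.items():
    status = "ok" if arn_partition(arn) == expected else "partition mismatch"
    print(f"{name}: {arn_partition(arn)} ({status})")

# Endpoints implied by the configured region versus the target region.
print(s3_regional_endpoint("us-east-1"))
print(s3_regional_endpoint(target_region))
```

If the compactor signs its request for the us-east-1 endpoint with credentials issued in the China partition, an InvalidToken response would be consistent with that, though this is a guess and not verified against the Loki source.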
gn-hiro-v changed the title from "[BUG] Loki compactor in AWS CN EKS crashes with InvalidToken: The provided token is malformed or otherwise invalid" to "[BUG] Compactor in AWS CN EKS with S3 - crashes with InvalidToken: The provided token is malformed or otherwise invalid" on Dec 12, 2024