[Feature Request] Field level statistics of lucene index files #12113

Open
rishabhmaurya opened this issue Jan 31, 2024 · 5 comments
Labels
enhancement Enhancement or improvement to existing feature or request good first issue Good for newcomers Indexing:Performance Indexing Indexing, Bulk Indexing and anything related to indexing Other

Comments

@rishabhmaurya
Contributor

Is your feature request related to a problem? Please describe

There currently seems to be no way to get per-field statistics, such as the disk consumption of individual Lucene segment files broken down by field. The existing index stats API (https://opensearch.org/docs/latest/api-reference/index-apis/stats/) does report segment file sizes aggregated at the shard, index, and cluster level, but only per file type. For example:

/<index_name>/_stats/segments?level=shards&include_segment_file_sizes&pretty
  "file_sizes" : {
    "nvm" : {
      "size_in_bytes" : 5486,
      "description" : "Norms"
    },
    "fnm" : {
      "size_in_bytes" : 127478,
      "description" : "Fields"
    },
    "kdd" : {
      "size_in_bytes" : 1119640726,
      "description" : "Others"
    },
    "tmd" : {
      "size_in_bytes" : 59232,
      "description" : "Others"
    },
    "fdm" : {
      "size_in_bytes" : 9175,
      "description" : "Others"
    },
    "kdi" : {
      "size_in_bytes" : 2926068,
      "description" : "Others"
    },
    "dvd" : {
      "size_in_bytes" : 1687766934,
      "description" : "DocValues"
    },
    "kdm" : {
      "size_in_bytes" : 10398,
      "description" : "Others"
    },
    "pos" : {
      "size_in_bytes" : 6051226,
      "description" : "Positions"
    },
    "si" : {
      "size_in_bytes" : 1176,
      "description" : "Segment Info"
    },
    "fdt" : {
      "size_in_bytes" : 9388206796,
      "description" : "Field Data"
    },
    "doc" : {
      "size_in_bytes" : 2964755700,
      "description" : "Frequencies"
    },
    "tim" : {
      "size_in_bytes" : 483838256,
      "description" : "Term Dictionary"
    },
    "dvm" : {
      "size_in_bytes" : 117015,
      "description" : "DocValues"
    },
    "tip" : {
      "size_in_bytes" : 20778232,
      "description" : "Term Index"
    },
    "fdx" : {
      "size_in_bytes" : 519052,
      "description" : "Field Index"
    },
    "nvd" : {
      "size_in_bytes" : 1534,
      "description" : "Norms"
    }
  }
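
To make the limitation concrete, here is a minimal sketch that groups a few of the entries above by their `description`. The `file_sizes` dictionary is copied from the output shown above; the helper name `bytes_by_description` is made up for illustration. The totals are per file type only, and nothing in this response allows them to be split by field.

```python
from collections import defaultdict

# A subset of the "file_sizes" object from the response above.
file_sizes = {
    "kdd": {"size_in_bytes": 1119640726, "description": "Others"},
    "kdi": {"size_in_bytes": 2926068, "description": "Others"},
    "dvd": {"size_in_bytes": 1687766934, "description": "DocValues"},
    "dvm": {"size_in_bytes": 117015, "description": "DocValues"},
    "fdt": {"size_in_bytes": 9388206796, "description": "Field Data"},
}


def bytes_by_description(sizes):
    """Sum size_in_bytes for every entry sharing the same description."""
    totals = defaultdict(int)
    for entry in sizes.values():
        totals[entry["description"]] += entry["size_in_bytes"]
    return dict(totals)


totals = bytes_by_description(file_sizes)
# Print the largest groups first.
for description, size in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{description}: {size} bytes")
```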

Describe the solution you'd like

Introduce a similar API, or a query parameter on the existing index stats API, that provides this information at the field level, so that it can be aggregated per field at the shard, index, and cluster level.
This would help in understanding usage statistics per field. Today the only option is to write a script that reads the Lucene index files and computes this information.
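
As a rough illustration of the aggregation described above, the sketch below rolls per-shard, per-field byte counts up to the index and cluster level. Everything here is an assumption: the input shape, the field names, and the numbers are invented, since no such API output exists yet.

```python
from collections import Counter

# Hypothetical per-shard, per-field sizes in bytes, keyed by (index, shard id).
# This is NOT real API output; the shape and values are made up.
shard_field_sizes = {
    ("logs", 0): {"message": 700, "timestamp": 120},
    ("logs", 1): {"message": 650, "timestamp": 110, "status": 30},
    ("metrics", 0): {"timestamp": 90, "value": 200},
}


def rollup_by_index(shards):
    """Sum per-field sizes across all shards of each index."""
    per_index = {}
    for (index, _shard_id), fields in shards.items():
        per_index.setdefault(index, Counter()).update(fields)
    return {index: dict(sizes) for index, sizes in per_index.items()}


def rollup_cluster(shards):
    """Sum per-field sizes across every shard in the cluster."""
    total = Counter()
    for fields in shards.values():
        total.update(fields)
    return dict(total)


index_level = rollup_by_index(shard_field_sizes)
cluster_level = rollup_cluster(shard_field_sizes)
```

Because sizes are simple sums, the same roll-up works at every level, which is what would let one parameter serve the shard, index, and cluster views.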

Related component

Other

Describe alternatives you've considered

No response

Additional context

No response

@rishabhmaurya rishabhmaurya added enhancement Enhancement or improvement to existing feature or request untriaged labels Jan 31, 2024
@github-actions github-actions bot added the Other label Jan 31, 2024
@rishabhmaurya rishabhmaurya added the good first issue Good for newcomers label Jan 31, 2024
@rishabhmaurya
Contributor Author

It would also be useful to show additional information for numeric fields, such as the min and max value per segment, shard, index, etc.
One use case is issues like opensearch-project/opensearch-benchmark#398, where it could be used to understand what range of values a segment contains.
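
A minimal sketch of how such ranges could roll up, assuming each segment reports a min and max for a numeric field: the shard-level range is the smallest min and the largest max over its segments, and the same merge applies again at the index level. The `Range` type, segment names, and values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    min_value: int
    max_value: int


def merge_ranges(ranges):
    """Combine ranges from a lower level into one enclosing range."""
    ranges = list(ranges)
    if not ranges:
        return None  # the field has no values at this level
    return Range(
        min(r.min_value for r in ranges),
        max(r.max_value for r in ranges),
    )


# Hypothetical per-segment ranges for one numeric field on one shard.
segment_ranges = {
    "_0": Range(10, 250),
    "_1": Range(200, 900),
    "_2": Range(5, 40),
}

shard_range = merge_ranges(segment_ranges.values())
```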

@bbarani
Member

bbarani commented Feb 1, 2024

+1. Adding a disk usage stats API that reports the usage of each field would also help surface these metrics for benchmarking. We can add it as a telemetry device to OSB. CC: @rishabh6788

@peternied
Member

[Triage - attendees 1 2 3 4]
@rishabhmaurya Thanks for filing, looking forward to seeing this developed.

@peternied peternied added Storage Issues and PRs relating to data and metadata storage Storage:Performance labels Feb 7, 2024
@msfroh
Collaborator

msfroh commented Feb 14, 2024

  1. We need to decide on the possible API -- proposing something is probably a good first issue, but we would need to get consensus.
  2. @lukas-vlcek called out that remote store may make this more challenging (needing to fetch files from remote store).
  3. We may want to calculate when writing segments.
  4. Also, when addressing this, we should do something about all the files labeled Other, since we can provide better descriptions.
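
For point 4, here is one possible set of replacement descriptions for the extensions currently reported as Others in the example output. The wording is only a suggestion based on how recent Lucene codecs name their points, terms, and stored-fields files; it is not what OpenSearch reports today, and the helper `describe` is hypothetical.

```python
# Suggested descriptions for extensions labeled "Others" in the example.
# These strings are a proposal, not existing OpenSearch labels.
PROPOSED_DESCRIPTIONS = {
    "kdd": "Points Data",
    "kdi": "Points Index",
    "kdm": "Points Metadata",
    "tmd": "Term Dictionary Metadata",
    "fdm": "Field Metadata",
}


def describe(extension, current_description):
    """Return a more specific description when the current one is 'Others'."""
    if current_description == "Others":
        return PROPOSED_DESCRIPTIONS.get(extension, current_description)
    return current_description


print(describe("kdd", "Others"))
print(describe("tim", "Term Dictionary"))
```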

@sirish26

sirish26 commented Mar 2, 2024

By adding a new parameter, such as include_field_sizes, to the existing index stats API (/_stats/segments), we could provide users with detailed disk consumption statistics for individual fields within an index.
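
A request using such a parameter might be built as in the sketch below. Note that `include_field_sizes` is only the name proposed in this comment and does not exist in OpenSearch; the index name `my-index` and the helper `stats_url` are placeholders.

```python
from urllib.parse import urlencode


def stats_url(index_name, **params):
    """Build the path and query string for a segments stats request."""
    path = f"/{index_name}/_stats/segments"
    query = urlencode(params)
    return f"{path}?{query}" if query else path


# include_field_sizes is the proposed, not yet existing, parameter.
url = stats_url("my-index", level="shards", include_field_sizes="true")
print(url)
```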

@gbbafna gbbafna added Indexing Indexing, Bulk Indexing and anything related to indexing and removed Storage Issues and PRs relating to data and metadata storage Storage:Performance labels Apr 4, 2024