test: add utf8 string operation tests to highlight substr inconsistencies #3699

Open
wants to merge 1 commit into
base: main

f4t4nt (Contributor) commented Jan 17, 2025

The current implementation of UTF-8 substr operations behaves differently from the equivalent operations in other data frameworks such as pandas, Polars, and Spark. This PR adds tests that document these inconsistencies, which will need to be fixed to ensure compatibility.

The test files:

  • tests/table/utf8/test_substr.py: Documents current behavior and expected behavior
  • tests/table/utf8/test_concat.py: Related UTF-8 string operation tests
  • tests/table/utf8/test_length.py: Related UTF-8 string operation tests

Note: test_substr_baseline.py is a temporary file used to demonstrate the behavior in other frameworks and should be removed once the implementation is fixed.
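
For readers unfamiliar with how substr results can diverge on UTF-8 data, here is a minimal sketch in plain Python. It is not taken from the test files in this PR, and it does not claim to reproduce the specific inconsistency the tests document. It only assumes one common source of divergence: whether offsets count Unicode code points or UTF-8 bytes. The helper names `substr_by_chars` and `substr_by_bytes` are hypothetical.

```python
from typing import Optional


def substr_by_chars(s: str, start: int, length: Optional[int] = None) -> str:
    """Slice by Unicode code points (what Python's own str slicing does)."""
    end = None if length is None else start + length
    return s[start:end]


def substr_by_bytes(s: str, start: int, length: Optional[int] = None) -> str:
    """Slice the UTF-8 encoding by byte offsets (hypothetical alternative).

    A multi-byte character that gets split is replaced with U+FFFD
    instead of raising an error.
    """
    raw = s.encode("utf-8")
    end = None if length is None else start + length
    return raw[start:end].decode("utf-8", errors="replace")


text = "héllo"  # "é" occupies two bytes in UTF-8
print(substr_by_chars(text, 1, 3))  # éll
print(substr_by_bytes(text, 1, 3))  # él
```

The two helpers agree on pure-ASCII input and disagree as soon as a multi-byte character falls inside or before the requested range, which is why tests for this kind of operation usually need non-ASCII inputs.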

@f4t4nt f4t4nt closed this Jan 17, 2025
@f4t4nt f4t4nt reopened this Jan 17, 2025
@f4t4nt f4t4nt force-pushed the nishant-utf8-substr-framework-parity branch from a973300 to fbea790 Compare January 17, 2025 01:08

codspeed-hq bot commented Jan 17, 2025

CodSpeed Performance Report

Merging #3699 will degrade performances by 16.95%

Comparing nishant-utf8-substr-framework-parity (a973300) with main (5549d16)

Summary

⚡ 1 improvements
❌ 1 regressions
✅ 25 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

| Benchmark | main | nishant-utf8-substr-framework-parity | Change |
|---|---|---|---|
| test_iter_rows_first_row[100 Small Files] | 198 ms | 238.4 ms | -16.95% |
| test_show[100 Small Files] | 23.8 ms | 16.4 ms | +45.16% |

@f4t4nt f4t4nt changed the title from "Current UTF-8 substr functionality does not match behavior of other data frameworks" to "test: add utf8 string operation tests to highlight substr inconsistencies" Jan 17, 2025
@github-actions github-actions bot added the test label Jan 17, 2025
@f4t4nt f4t4nt linked an issue Jan 17, 2025 that may be closed by this pull request

Successfully merging this pull request may close these issues.

UTF-8 substr behavior inconsistent with pandas/polars/spark