xgettext: don't strip by default, ignore empty strings and all whitespace strings, process special characters such as the "bell character" #11

Open
armijnhemel opened this issue Mar 15, 2024 · 1 comment

Comments

@armijnhemel

I have some experience with building large databases with strings extracted from source code. Some of my findings:

  • ignore empty strings: many of the extracted strings will be the empty string, and these are useless for anything related to matching.
  • some strings will be whitespace only (before stripping). These tend to be useless as well.
  • quite a few characters cannot be printed, such as the ASCII bell character. You might want to remove these. A test example would be the file libbb/lineedit.c from a recent version of BusyBox. The full list of characters that I am currently removing:
 ['\a', '\b', '\v', '\f', '\x01', '\x02', '\x03', '\x04',
  '\x05', '\x06', '\x0e', '\x0f', '\x10', '\x11', '\x12',
  '\x13', '\x14', '\x15', '\x16', '\x17', '\x18', '\x19',
  '\x1a', '\x1b', '\x1c', '\x1d', '\x1e', '\x1f', '\x7f']

Currently you are not doing those cleanups. On the other hand, you are stripping regular strings, where (I think) whitespace could be relevant. If you want to clean up, then you should at least be consistent :-)

My advice: do not strip strings, ignore empty or whitespace-only strings, and remove non-printable characters.
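
A minimal sketch of what that could look like in Python, not taken from the project's code. The function name `clean_strings`, the `remove_non_printable` parameter, and the choice to treat only space, tab, carriage return and newline as whitespace are all assumptions made for illustration. The character list is the one given above. The sketch also assumes that non-printable characters are removed before the empty/whitespace check, so a string consisting only of such characters is dropped.

```python
# Non-printable characters from the list above.
NON_PRINTABLE = [
    '\a', '\b', '\v', '\f', '\x01', '\x02', '\x03', '\x04',
    '\x05', '\x06', '\x0e', '\x0f', '\x10', '\x11', '\x12',
    '\x13', '\x14', '\x15', '\x16', '\x17', '\x18', '\x19',
    '\x1a', '\x1b', '\x1c', '\x1d', '\x1e', '\x1f', '\x7f',
]

# Translation table for str.translate(): mapping a code point to None deletes it.
_DELETE_TABLE = {ord(char): None for char in NON_PRINTABLE}

# Assumption: only these characters count as whitespace for the
# "whitespace-only" check.
_WHITESPACE = ' \t\r\n'


def clean_strings(strings, remove_non_printable=True):
    """Yield strings worth keeping, without stripping their whitespace.

    Hypothetical helper: removes non-printable characters (if enabled),
    then skips strings that are empty or whitespace only.
    """
    for string in strings:
        if remove_non_printable:
            string = string.translate(_DELETE_TABLE)
        # The strip is used only for the check; the string itself is kept as-is.
        if not string.strip(_WHITESPACE):
            continue
        yield string


if __name__ == '__main__':
    extracted = ['', '   ', '\t\n', '  hello ', 'beep\a', '\x1b']
    print(list(clean_strings(extracted)))
```

Leading and trailing whitespace of the strings that are kept is preserved, so `'  hello '` comes out unchanged while `'beep\a'` becomes `'beep'`.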

@armijnhemel
Author

After having thought about it a bit more: you might want to make removing non-printable characters optional (but enabled by default), in case you would like to use symbols for source-to-source matching, as they could be relevant there.

pombredanne added a commit that referenced this issue Mar 15, 2024
Reference: #11
Reported-by: Armijn Hemel @armijnhemel
Signed-off-by: Philippe Ombredanne <[email protected]>