I have some experience with building large databases with strings extracted from source code. Some of my findings:
- Ignore empty strings: you will find that many strings are the empty string. These are quite useless for anything related to matching.
- Some strings will be whitespace only (before stripping). These tend to be useless as well.
- There are quite a few characters that cannot be printed, such as the ASCII bell character. You might want to remove these. A test example would be the file `libbb/lineedit.c` from a recent version of BusyBox. The whole list of characters that I am currently removing:
Currently you are not doing those cleanups. On the other hand, you are stripping regular strings, where (I think) whitespace could be relevant. If you want to clean up, then at least you should be consistent :-)
My advice: do not strip strings, ignore empty or whitespace-only strings, and remove non-printable characters.
After having thought about it a bit more, you might want to make removing non-printable characters optional (but enabled by default) in case you would like to use symbols for source-to-source matching, as they could be relevant there.
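To make the suggestion concrete, here is a rough sketch of the cleanup with the optional switch. The function name `clean_strings`, the `remove_non_printable` parameter, and the character class are all assumptions for illustration: the regex simply treats ASCII control characters other than tab, newline, and carriage return as non-printable, which is not necessarily the list I am using myself.

```python
import re

# Assumption: "non-printable" here means ASCII control characters except
# tab (\x09), newline (\x0a) and carriage return (\x0d), plus DEL (\x7f).
# This is an illustrative choice, not the list referred to in the issue.
_NON_PRINTABLE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def clean_strings(strings, remove_non_printable=True):
    """Filter extracted strings without stripping the ones that are kept.

    Hypothetical helper: drops empty and whitespace-only strings, and
    optionally removes non-printable characters (enabled by default).
    """
    cleaned = []
    for s in strings:
        if remove_non_printable:
            # Remove control characters such as the ASCII bell (\x07).
            s = _NON_PRINTABLE.sub("", s)
        # Skip strings that are empty or consist of whitespace only.
        if not s.strip():
            continue
        # Keep the string as-is: leading/trailing whitespace is preserved.
        cleaned.append(s)
    return cleaned
```

Removing the control characters before the emptiness check means a string consisting only of a bell character is dropped as well. Passing `remove_non_printable=False` keeps such characters, which may be what you want for source-to-source matching.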