NeuralCI ((Neural modeling-based Compiler Identification) is a project that achieves fine-grained compiler identificaiton on the function-level with deep learning methods. It infers compiler details including compiler family, optimization level and compiler version on individual functions, with the basic idea of formulating sequence-oriented neural networks to process normalized instruction sequences generated using a lightweight function abstraction strategy. Details of our methods are discussed in a paper that has been sumbited for peer-review.
To facilitate other researchers conduct experiments and present their findings, we now made public the dataset we consturcted and NeuralCI’s source code. Also, with the expectation of more researchers participating in the compiler identification and related researches, and gradually refine and enrich the dataset with them all.
The dataset consists of 854,858 unique functions collected from 19 widely used real-world projects, where the compiled binaries with different compilers and compiler settings are accessbile and download from the GoogleDrive: https://drive.google.com/file/d/1q8-4b7iGlnWM_N4sOMpdHlg6Ga8lQ1pu/view?usp=sharing
The source code of our implementation are accessbile on github: https://github.com/zztian007/NeuralCI