- Eugene H Spafford and Stephen A Weeber. Software Forensics: Can we track code to its authors?. Technical Report CSD-TR 92-010, Purdue, 19 February 1992.
- Section 2.1 discusses features in compiled code: Data Structures and Algorithms, Compiler and System Information, Programming skill and System knowledge, Choice of System Calls (aka libraries and APIs), and Errors.
- Section 2.2 discusses features in source code and may be applicable to scripting languages: (several not included here...), Use of language features, Scoping, Execution paths, Bugs, and Metrics.
- Sections 3 and 4 discuss some of the problems that may be encountered with authorship analysis, such as code reuse and multi-author code projects.
- David B Hull. Computer Viruses: Naming and Classification. Virus Bulletin, September 1995.
- discussion of classification approaches, evolution and non-evolution changes
- relevant article that references this and other early work
- Ivan Krsul and Eugene H. Spafford. Authorship analysis: Identifying the author of a program. Computers & Security 16, no. 3 (1997): 233-257.
- Diego Doval, Spiros Mancoridis, and Brian S. Mitchell. Automatic clustering of software systems using a genetic algorithm. Proceedings of Software Technology and Engineering Practice (STEP'99), IEEE, 1999.
- The goal in this paper appears to be to chunk large software systems into "like" components, making them easier to analyze. They do this by using genetic algorithms and module-dependency graphs, or control flow graphs.
- Data used for testing includes: "Mini-Tunis, the RCS source code control system, a Turing compiler, the ispell spell checking tool, the Boxer graphical editor, and the Bison compiler-compiler."
- BinDiff software - an IDA (v6.95 and v7) plugin and a standalone tool
- BinDiff manual
- Halvar Flake. Structural comparison of executable objects. Proceedings of the International GI Workshop on Detection of Intrusions and Malware & Vulnerability Assessment (DIMVA), number P-46 in Lecture Notes in Informatics, 2004.
- Thomas Dullien and Rolf Rolles. Graph-based comparison of Executable Objects (english version). Proceedings of Symposium sur la Securite des Technologies de l'Information et des Communications (SSTIC), 2005.
- Thomas Dullien, Ero Carrera, Soeren-Meyer Eppler, and Sebastian Porst. Automated attacker correlation for malicious code. BOCHUM UNIV (GERMANY FR), 2010.
- describes the MD-Index hash algorithm mentioned in the BinDiff manual
- discusses scaling BinDiff by using function-level MD-Index matching as a pre-filter, to avoid expensive comparison of everything to everything
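
The pre-filter idea above can be sketched with an inverted index. This is a minimal illustration, not the authors' implementation: the per-function hash values are treated as opaque strings (no MD-Index is computed), and `candidate_pairs`, `min_shared`, and the sample names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(function_hashes, min_shared=1):
    """Return sample pairs that share at least `min_shared` function hashes.

    `function_hashes` maps a sample name to the set of per-function hash
    values for that sample. The hashes are opaque strings here; computing
    a real MD-Index is out of scope for this sketch.
    """
    # Inverted index: function hash -> samples containing that function.
    index = defaultdict(set)
    for sample, hashes in function_hashes.items():
        for h in hashes:
            index[h].add(sample)

    # Count shared hashes only for samples that co-occur in some bucket,
    # so unrelated samples are never compared with each other.
    shared = defaultdict(int)
    for bucket in index.values():
        for a, b in combinations(sorted(bucket), 2):
            shared[(a, b)] += 1

    return {pair: n for pair, n in shared.items() if n >= min_shared}


# Hypothetical input: three samples with made-up function hashes.
samples = {
    "a.exe": {"f1", "f2", "f3"},
    "b.exe": {"f2", "f3", "f9"},
    "c.exe": {"f7", "f8"},
}
pairs = candidate_pairs(samples)
```

Only the pairs returned here would be handed to the expensive full comparison. A function hash shared by very many samples (for example a common library routine) produces a large bucket, so a practical system would likely cap or discard such buckets.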
- Ero Carrera and Gergely Erdélyi. Digital Genome Mapping - Advanced Binary Malware Analysis. Virus Bulletin Conference, September 2004.
- Andrew Walenstein and Arun Lakhotia. The Software Similarity Problem in Malware Analysis. In Proceedings Dagstuhl Seminar 06301: Duplication, Redundancy, and Similarity in Software, 10 pp., Dagstuhl, 2006.
- Jesse Kornblum. Identifying almost identical files using context triggered piecewise hashing. In Proceedings of the 6th Annual Digital Forensic Research Workshop (DFRWS), 2006.
- the ssdeep paper
- Andrew Walenstein, Michael Venable, Matthew Hayes, Christopher Thompson, and Arun Lakhotia. Exploiting Similarity Between Variants to Defeat Malware: "Vilo" Method for Comparing and Searching Binary Programs. Blackhat DC, 2007.
- Konrad Rieck. Malheur: Automatic analysis of malware behavior. Accessed March 2017.
- reference implementation
- Philipp Trinius, Carsten Willems, Thorsten Holz, and Konrad Rieck. A Malware Instruction Set for Behavior-Based Analysis. Technical report TR-2009-07, University of Mannheim, 2009.
- Konrad Rieck, Philipp Trinius, Carsten Willems, and Thorsten Holz. Automatic Analysis of Malware Behavior using Machine Learning. Journal of Computer Security (JCS), 19 (4) 639-668, 2011.
- Georg Wicherski. peHash: a novel approach to fast malware clustering. In Proceedings of the 2nd USENIX conference on Large-scale exploits and emergent threats: botnets, spyware, worms, and more (LEET'09), USENIX Association, 2009.
- No reference implementation was released, resulting in four known incompatible implementations: Totalhash, Endgame, CRITS, and AnyMaster. In addition, AnyMaster attempted to improve upon the original algorithm and released a reference implementation under the project name peHashNG.
- Xin Hu, Tzi-cker Chiueh, and Kang G. Shin. Large-Scale Malware Indexing Using Function-Call Graphs. In Proceedings of ACM CCS, 2009.
- Peng Li, Limin Liu, Debin Gao, and Michael K. Reiter. On challenges in evaluating malware clustering. In International Workshop on Recent Advances in Intrusion Detection (RAID), pp. 238-255, Springer Berlin Heidelberg, September 2010.
- Roberto Perdisci, Wenke Lee, and Nick Feamster. Behavioral Clustering of HTTP-Based Malware and Signature Generation Using Malicious Network Traces. NSDI Vol. 10, 2010.
- the validation technique in this paper is interesting
- CMU/SEI CERT, Position Independent Code function-level hashing (PICHash) and variations
- 2011 blog post describing function-level fuzzy hashing
- public implementation of `pic_hash` and `composite_pic_hash`
- Lakshmanan Nataraj, S Karthikeyan, Grégoire Jacob, and Bangalore S Manjunath. Malware images: visualization and automatic classification. In Proceedings of the 8th International Symposium on Visualization for Cyber Security (VizSec), ACM, 2011.
- Jiyong Jang, David Brumley, and Shobha Venkataraman. BitShred: feature hashing malware for scalable triage and semantic analysis. In Proceedings of the 18th ACM Conference on Computer and Communications Security (CCS), pp. 309-320, 2011.
- Vassil Roussev. An Evaluation of Forensic Similarity Hashes. In Proceedings of the Eleventh Annual DFRWS Conference, pp. S34-41, Aug 2011.
- the sdhash paper
- Wesley Jin, Sagar Chaki, Cory Cohen, Arie Gurfinkel, Jeffrey Havrilla, Charles Hines, and Priya Narasimhan. Binary Function Clustering using Semantic Hashes. Proceedings of 11th International Conference on Machine Learning and Applications (ICMLA), IEEE, 2012.
- The authors use a semantic hashing algorithm, MinHash, as their hash function to avoid quadratic growth of comparisons.
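
As a rough illustration of why MinHash helps avoid all-pairs comparison, the sketch below builds MinHash signatures and splits them into bands so that only items sharing a band become comparison candidates. It assumes each function is already reduced to a set of string features; the mnemonic sets, `minhash_signature`, `estimate_jaccard`, and `band_keys` are hypothetical names for this example, and the paper's own semantic features are not reproduced.

```python
import hashlib


def _slot_hash(seed, item):
    """Seeded 64-bit hash of a string, built from SHA-256 (stdlib only)."""
    data = seed.to_bytes(4, "little") + item.encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def minhash_signature(items, num_hashes=64):
    """Keep the minimum hash value of a non-empty set under each seed."""
    return [min(_slot_hash(seed, item) for item in items)
            for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching slots approximates the Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def band_keys(signature, rows_per_band=4):
    """Split a signature into bands; equal bands land in the same bucket."""
    return {
        (start, tuple(signature[start:start + rows_per_band]))
        for start in range(0, len(signature), rows_per_band)
    }


# Hypothetical feature sets: instruction mnemonics seen in three functions.
func_a = {"push", "mov", "call", "test", "jz", "ret"}
func_b = {"push", "mov", "call", "test", "jnz", "ret"}
func_c = {"fld", "fmul", "fstp"}

sig_a, sig_b, sig_c = (minhash_signature(f) for f in (func_a, func_b, func_c))
similar = estimate_jaccard(sig_a, sig_b)    # true Jaccard is 5/7
unrelated = estimate_jaccard(sig_a, sig_c)  # true Jaccard is 0
```

Grouping functions by `band_keys` and comparing only within a bucket is the usual way to keep the number of comparisons well below quadratic; the band width trades recall against bucket size.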
- André Ricardo Abed Grégio, Paulo Lício de Geus, Christopher Kruegel, and Giovanni Vigna. Tracking memory writes for malware classification and code reuse identification. International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment (DIMVA), Springer, 2012.
- "In this paper, we propose a novel approach to capture and model malware behavior that is based on the monitoring of the data values that a certain subset of instructions writes to memory during program execution. We have implemented a malware clustering component and a component to detect code reuse between different malware families. To validate our proposed techniques, we analyzed 16,248 malware samples."
- Grégoire Jacob, Paolo Milani Comparetti, Matthias Neugschwandtner, Christopher Kruegel, and Giovanni Vigna. A static, packer-agnostic filter to detect similar malware samples. In Proceedings of the 9th international conference on Detection of Intrusions and Malware, and Vulnerability Assessment (DIMVA), 2012.
- Steve Hanna, Ling Huang, Edward Wu, Saung Li, Charles Chen, and Dawn Song. Juxtapp: a scalable system for detecting code reuse among android applications. In Proceedings of the 9th international conference on Detection of Intrusions and Malware, and Vulnerability Assessment (DIMVA), 2012.
- Ashkan Rahimian, Philippe Charland, Stere Preda, and Mourad Debbabi. RESource: a framework for online matching of assembly with open source code. International Symposium on Foundations and Practice of Security, Springer, 2012.
- given a repository of source code and a binary (i.e. compiled) program, attempt to match functions in the binary with functions in the source code repository
- reference code
- Silvio Cesare and Yang Xiang. Software Similarity and Classification. Springer, 2012.
- "Software similarity and classification is an emerging topic with wide applications. It is applicable to the areas of malware detection, software theft detection, plagiarism detection, and software clone detection. Extracting program features, processing those features into suitable representations, and constructing distance metrics to define similarity and dissimilarity are the key methods to identify software variants, clones, derivatives, and classes of software."
- Presents a survey of research to date and methodology for organizing and understanding the core concepts.
- Jonathan Oliver, Chun Cheng, and Yanggui Chen. TLSH--a locality sensitive hash. Fourth Cybercrime and Trustworthy Computing Workshop, IEEE, 2013.
- reference implementation and presentation
- Frank Breitinger, Georgios Stivaktakis, and Harald Baier. FRASH: A framework to test algorithms of similarity hashing. Digital Investigation Volume 10, pp. S50-S58, August 2013.
- Silvio Cesare, Yang Xiang, and Wanlei Zhou. Malwise—an effective and efficient classification system for packed and polymorphic malware. IEEE Transactions on Computers Volume 62 Issue 6, pp. 1193-1206, 2013.
- Charles LeDoux, Arun Lakhotia, Craig Miles, Vivek Notani, and Avi Pfeffer. FuncTracker: Discovering Shared Code to Aid Malware Forensics. Presented as part of the 6th USENIX Workshop on Large-Scale Exploits and Emergent Threats (LEET), 2013.
- Arun Lakhotia, Mila Dalla Preda, and Roberto Giacobazzi. Fast location of similar code fragments using semantic 'juice'. Proceedings of the 2nd ACM SIGPLAN Program Protection and Reverse Engineering Workshop, 2013.
- Lakshmanan Nataraj, Dhilung Kirat, B.S Manjunath, and Giovanni Vigna. SARVAM: Search And RetrieVAl of Malware. In Proceedings of The Next Generation Malware Attacks and Defense Workshop (NGMAD), 2013.
- Battista Biggio, Ignazio Pillai, Samuel Rota Bulò, Davide Ariu, Marcello Pelillo, and Fabio Roli. Is data clustering in adversarial settings secure?. Proceedings of the 2013 ACM workshop on Artificial Intelligence and Security (AISec), 2013.
- Wei Ming Khoo, Alan Mycroft, and Ross Anderson. Rendezvous: A search engine for binary code. Proceedings of the 10th Working Conference on Mining Software Repositories, IEEE Press, 2013.
- Carl Sabottke, Eddie Tanner, and Richard Johnson. Improved Malware Clustering Using VirusTotal Metadata. Class project for University of Maryland, Fall 2013.
- Orestis Kostakis. Classy: fast clustering streams of call-graphs. Data Mining and Knowledge Discovery Volume 28, pp. 1554-1585, 2014.
- Tracking Malware with Import Hashing. Mandiant Blog, Published January 23, 2014.
- commonly used implementation
- There are differences in how imphash is calculated with respect to import-by-ordinal. If two different implementations use different databases of ordinal-to-name, then they may generate different imphashes for the same PE file.
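
The ordinal issue can be seen in a small sketch. It assumes the commonly described imphash recipe (lowercase `dll.function` strings joined with commas, then MD5) and works on an already-parsed import list rather than a real PE file. The `imphash` function, the `ordinal_names` table, and the example imports are illustrative assumptions, not any particular tool's API.

```python
import hashlib


def imphash(imports, ordinal_names):
    """Compute an imphash-style digest from already-parsed imports.

    `imports` is an ordered list of (dll_name, symbol) pairs, where `symbol`
    is a function name (str) or an ordinal (int). `ordinal_names` maps
    (dll_stem, ordinal) to a function name; it stands in for whatever
    ordinal database a given implementation ships with.
    """
    parts = []
    for dll, symbol in imports:
        stem = dll.lower()
        for ext in (".dll", ".ocx", ".sys"):
            if stem.endswith(ext):
                stem = stem[: -len(ext)]
                break
        if isinstance(symbol, int):
            # Fall back to a synthetic name when the ordinal is unknown.
            name = ordinal_names.get((stem, symbol), "ord%d" % symbol)
        else:
            name = symbol
        parts.append("%s.%s" % (stem, name.lower()))
    return hashlib.md5(",".join(parts).encode("utf-8")).hexdigest()


# Illustrative imports: one by name, one by ordinal.
imports = [("KERNEL32.dll", "CreateFileA"), ("WS2_32.dll", 115)]

# Two "implementations" that differ only in their ordinal database.
with_table = imphash(imports, {("ws2_32", 115): "WSAStartup"})
without_table = imphash(imports, {})
```

Here `with_table` and `without_table` differ even though the input imports are identical, which is the incompatibility described above.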
- ZongXian Shen. MeltingPot. Repository hosted on GitHub, 2014-2019.
- "MeltingPot is an automated common binary signature extractor and pattern generator. For the given sample set with the same file format, it slices each file into small pieces of binary sequences and correlates the files sharing the similar sequences. To show the result, MeltingPot generates a set of YARA formatted patterns each of which represents the common signature of a certain file cluster. Such patterns can be directly applied by YARA scan engine."
- Contains C++ code for a couple of simple malware/string comparison techniques:
- ssdeep
- ngram-based bloom filters with Jaccard similarity
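
The second technique can be sketched as follows. This is a Python illustration written for this list, not a port of the repository's C++ code: byte n-grams are inserted into a Bloom filter held in a Python integer, and similarity is the Jaccard ratio over set bits. The parameter defaults and function names are assumptions.

```python
import hashlib


def ngram_bloom(data, n=4, num_bits=1 << 16, num_hashes=2):
    """Insert every byte n-gram of `data` into a Bloom filter stored as an int."""
    bits = 0
    for i in range(len(data) - n + 1):
        digest = hashlib.sha256(data[i:i + n]).digest()
        for k in range(num_hashes):
            # Carve several bit positions out of a single digest.
            position = int.from_bytes(digest[4 * k:4 * k + 4], "big") % num_bits
            bits |= 1 << position
    return bits


def bloom_jaccard(a, b):
    """Approximate Jaccard similarity as |A AND B| / |A OR B| over set bits."""
    union = bin(a | b).count("1")
    if union == 0:
        return 1.0  # two empty filters are treated as identical
    return bin(a & b).count("1") / union


original = ngram_bloom(b"the quick brown fox jumps over the lazy dog")
variant = ngram_bloom(b"the quick brown fox jumps over the lazy cat")
other = ngram_bloom(b"completely unrelated payload bytes 0123456789")

close = bloom_jaccard(original, variant)
far = bloom_jaccard(original, other)
```

The filter size matters: a small `num_bits` saturates on large inputs and pushes every score toward 1, so it would need to scale with file size in practice.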
- Steven Jilcott. Scalable malware forensics using phylogenetic analysis. Proceedings of IEEE International Symposium on Technologies for Homeland Security (HST), 2015.
- Hyrum Anderson. MinHash vs. Bitwise Set Hashing: Jaccard Similarity Showdown. Endgame Blog, Published September 23, 2015.
- Brian Wallace. Optimizing ssDeep for use at scale. VirusBulletin Blog, Published November 27, 2015.
- Chris Giannella and Eric Bloedorn. Spectral Malware Behavior Clustering. International Conference on Intelligence and Security Informatics (ISI), 2015.
- Joshua Saxe and Konstantin Berlin. Deep Neural Network Based Malware Detection Using Two Dimensional Binary Program Features. Invincea, 2015.
- xorpd. FCatalog. Public website, Accessed March 2017.
- reference code for server, reference code for client, first commit Sept 2015.
- Peter M Wrench and Barry VW Irwin. Towards a PHP webshell taxonomy using deobfuscation-assisted similarity analysis. Information Security for South Africa (ISSA), IEEE, 2015.
- Alex Kantchelian, Michael Carl Tschantz, Sadia Afroz, Brad Miller, Vaishaal Shankar, Rekha Bachwani, Anthony Douglas Joseph, and J Doug Tygar. Better malware ground truth: Techniques for weighting anti-virus vendor labels. Proceedings of the 8th ACM Workshop on Artificial Intelligence and Security (AISec), ACM, 2015.
- Vikas Gupta and Frank Breitinger. How cuckoo filter can improve existing approximate matching techniques. International Conference on Digital Forensics and Cyber Crime, Springer, 2015.
- see the mrsh_cuckoo tool
- Jason Upchurch and Xiaobo Zhou. Variant: a malware similarity testing framework. 10th International Conference on Malicious and Unwanted Software (MALWARE), 2015.
- Created a manually labelled dataset for ground truth. Takes 5 previously published algorithms/tools (ssdeep, sdhash, TLSH, BitShred, FirstByte) and compares their peak performance on this dataset against the published results. According to Table V, recall and precision are well below those originally reported.
- Dynetics. Malfunction. 2015.
- Sean McVey. Malware Fingerprinting: Analysis of Tool Marks and Other Characteristics of Windows Malware. National Cybersecurity Institute Journal, pp. 55-64, 2015.
- Bruce Ediger. An algorithm to calculate malware phylogeny, an example and counterexample. Original content appears to be late 2016 with updates through early 2019.
- JPCERT/CC. Classifying Malware using Import API and Fuzzy Hashing – impfuzzy –. Blog post, Published May 25, 2016.
- Saed Alrabaee, Lingyu Wang, and Mourad Debbabi. BinGold: Towards robust binary analysis by extracting the semantics of binary code as semantic flow graphs (SFGs). DFRWS USA 2016.
- Qian Feng, Rundong Zhou, Chengcheng Xu, Yao Cheng, Brian Testa, and Heng Yin. Scalable graph-based bug search for firmware images. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security, pp. 480-491, October 2016.
- Sebastian Eschweiler, Khaled Yakdan, and Elmar Gerhards-Padilla. discovRE: Efficient Cross-Architecture Identification of Bugs in Binary Code. In Proceedings of the Network and Distributed System Security Symposium (NDSS), 2016.
- Marion Marschalek. The Kings In Your Castle Part #2 – Dataset and feature extraction. Blog post published October 31, 2016.
- "A goal of our research was to extract more granular file features from different domains than the usual IOCs cover, in a sense, more 'expensive' features, that we considered less volatile than domain names. This way we expected to be able to find links among different events contained in MISP, that the usual indicators miss. In a targeted operation, it is considered expensive to change a toolset, rewrite malware or switch infection vectors like for example the purchase of a new exploit. Currently used indicators lack capabilities to describe 'expensive metrics', hence the idea to widen the feature space."
- This blog post appears to be related to this presentation at Troopers16.
- Joxean Koret. Mal Tindex. Presented at EuskalHack 2017. First and last commit June 2017.
- "Mal Tindex is an Open Source tool for indexing binaries and help attributing malware campaigns".
- It uses IDA and Diaphora to export to a database a set of signatures for each function found in each binary indexed. Then, the most "rare" functions are stored in various tables and these are used to find "rare" coincidences between malware samples.
- MACHOC fuzzy hashing algorithm
- part of the Polichombr collaborative malware analysis framework. Presented at SSTIC 2016 as "Challenges of collaborative malware analysis" by Berre et al.
- an implementation called machoke by CERT-Conix
- only needs radare2/r2pipe, not an IDA license
- provides an algorithm to turn a function call graph into a string, then runs MurmurHash3 on it
- has a known issue
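
The general shape of "serialize a graph to a string, then hash it" can be sketched as below. This is not the MACHOC algorithm itself: the string format is a hypothetical one chosen for the example, and BLAKE2s from the standard library is swapped in for MurmurHash3, so the resulting values are not comparable with Polichombr or machoke output.

```python
import hashlib


def graph_to_string(graph):
    """Serialize a graph as 'index:successor,successor;...' in address order.

    `graph` maps a node address to a list of successor addresses. Nodes are
    renumbered by ascending address so the string does not depend on where
    the code was loaded. (Hypothetical format, for illustration only.)
    """
    order = {addr: i + 1 for i, addr in enumerate(sorted(graph))}
    parts = []
    for addr in sorted(graph):
        successors = ",".join(str(order[s]) for s in graph[addr] if s in order)
        parts.append("%d:%s" % (order[addr], successors))
    return ";".join(parts)


def graph_hash(graph):
    """Hash the serialized graph; BLAKE2s stands in for MurmurHash3 here."""
    text = graph_to_string(graph)
    return hashlib.blake2s(text.encode("ascii"), digest_size=4).hexdigest()


# The same shape at two different base addresses, plus a different shape.
func_low = {0x1000: [0x1010, 0x1020], 0x1010: [0x1020], 0x1020: []}
func_high = {0x4000: [0x4010, 0x4020], 0x4010: [0x4020], 0x4020: []}
func_loop = {0x1000: [0x1010], 0x1010: [0x1000, 0x1020], 0x1020: []}

same = graph_hash(func_low) == graph_hash(func_high)
```

Because only the graph shape is hashed, two unrelated functions with the same small shape collide by design; per-function hashes of this kind are usually combined across a whole binary before samples are compared.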
- Marcos Sebastián, Richard Rivera, Platon Kotzias, and Juan Caballero. AVClass: A tool for Massive Malware Labeling. In Proceedings of the International Symposium on Research in Attacks, Intrusions and Defenses (RAID), September 2016.
- Yaniv David, Nimrod Partush, and Eran Yahav. Similarity of binaries through re-optimization. Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, June 2017.
- Fei Zuo, Xiaopeng Li, Zhexin Zhang, Patrick Young, Lannan Luo, and Qiang Zeng. Neural Machine Translation Inspired Binary Code Similarity Comparison beyond Function Pairs. arXiv:1808.04706, Submitted on 8 Aug 2018.
- Xiaojun Xu, Chang Liu, Qian Feng, Heng Yin, Le Song, and Dawn Song. Neural network-based graph embedding for cross-platform binary code similarity detection. Proceedings of the ACM SIGSAC Conference on Computer and Communications Security, 2017.
- focuses on vulnerability discovery in firmware
- Paria Shirani, Lingyu Wang, and Mourad Debbabi. BinShape: Scalable and Robust Binary Library Function Identification Using Function Shape. International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment (DIMVA), 2017.
- Guillermo Suarez-Tangil, Santanu Kumar Dash, Mansour Ahmadi, Johannes Kinder, Giorgio Giacinto, and Lorenzo Cavallaro. DroidSieve: Fast and accurate classification of obfuscated android malware. In Proceedings of the Seventh ACM on Conference on Data and Application Security and Privacy (CODASPY), pp. 309-320, March 2017.
- Focused on detection, but also provides analysis of multi-class classification using the same features that are proposed for detection
- Médéric Hurier, Guillermo Suarez-Tangil, Santanu Kumar Dash, Tegawendé F. Bissyandé, Yves Le Traon, Jacques Klein, and Lorenzo Cavallaro. Euphony: Harmonious Unification of Cacophonous Anti-Virus Vendor Labels for Android Malware. 14th International Conference on Mining Software Repositories (MSR), IEEE, 2017.
- attempts to group samples based on AV labels, similarly to AVClass but without the need for a priori knowledge or training
- reference implementation, requires proprietary database
- Chariton Karamitas and Athanasios Kehagias. Efficient features for function matching between binary executables. 25th International Conference on Software Analysis, Evolution and Reengineering (SANER), IEEE, 2018.
- Joxean Koret. A new Control Flow Graph based heuristic for Diaphora. Blog post published November 2018.
- Proposes a new algorithm, called KOKA, using ideas from the Karamitas and Kehagias 2018 paper
- "based on the idea of 'different basic blocks and edges are different interesting pieces of information', I have created a new heuristic for Diaphora that gets features at function, basic block, edge and instruction level, assigns a different prime value to each different feature and then generates a hash by just mutiplying all the values (a small-primes-product, SPP)."
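
The small-primes-product idea in the quote can be sketched in a few lines. This is an illustration of the principle only: the feature strings, the `PrimeProductHasher` class, and the way primes are handed out are assumptions, not Diaphora's actual feature set or prime table.

```python
from itertools import count


def prime_stream():
    """Yield primes in increasing order using trial division (fine for a sketch)."""
    found = []
    for candidate in count(2):
        if all(candidate % p for p in found):
            found.append(candidate)
            yield candidate


class PrimeProductHasher:
    """Assign a distinct prime to each distinct feature and multiply them."""

    def __init__(self):
        self._primes = prime_stream()
        self._prime_for = {}

    def hash(self, features):
        product = 1
        for feature in features:
            if feature not in self._prime_for:
                # First time this feature is seen: give it the next prime.
                self._prime_for[feature] = next(self._primes)
            product *= self._prime_for[feature]
        return product


# Hypothetical features at block, edge, and instruction level.
hasher = PrimeProductHasher()
f1 = hasher.hash(["block:cmp", "edge:true", "insn:call", "insn:call"])
f2 = hasher.hash(["insn:call", "edge:true", "insn:call", "block:cmp"])
f3 = hasher.hash(["block:cmp", "edge:true", "insn:ret"])
```

Because multiplication is commutative, `f1` and `f2` are equal even though the features arrive in a different order, which makes the value insensitive to instruction or block reordering. Unique factorization means two products match only when the feature multisets match.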
- Joxean Koret. Pigaios: A tool for matching and diffing source codes directly against binaries. Repository hosted on GitHub, 2018.
- "The idea is to point a tool to a code base, regardless of it being compilable or not (for example, partial source code or source code for platforms not at your hand), extract information from that code base and, then, import in an IDA database function names (symbols), structures and enumerations. It uses the Python CLang bindings (which are very limited, but still better than using pycparser)."
- Saed Alrabaee, Paria Shirani, Lingyu Wang, and Mourad Debbabi. FOSSIL: A Resilient and Efficient System for Identifying FOSS Functions in Malware Binaries. ACM Transactions on Privacy and Security (TOPS) Volume 21 Issue 2, February 2018.
- Fabio Pagani, Matteo Dell'Amico, and Davide Balzarotti. Beyond Precision and Recall: Understanding Uses (and Misuses) of Similarity Hashes in Binary Analysis. Proceedings of the Eighth ACM Conference on Data and Application Security and Privacy (CODASPY), 2018.
- Halvar Flake. Some Experiments with Code Similarity. Hack in the Box Beijing, 2018.
- "This talk discusses some adventures and lessons-learned building code to recognize third-party libraries in executables. The content touches on topics such as fast approximate nearest-neighbor search over bit vectors, mistakes one makes as machine learning beginner, and specific difficulties that often get glossed over in promising academic research."
- reference code
- Mohammed Abuhamad, Tamer AbuHmed, Aziz Mohaisen, and DaeHun Nyang. Large-Scale and Language-Oblivious Code Authorship Identification. ACM Conference on Computer and Communications Security (CCS), 2018.
- Daniel Plohmann. ApiScout. Repository hosted on GitHub, 2018.
- This project aims at simplifying Windows API import recovery on arbitrary memory dumps.
- Code Cartographer's Diary slides
- Mila Dalla Preda and Vanessa Vidali. Abstract Similarity Analysis. Electronic Notes in Theoretical Computer Science, 2017.
- Code similarity analysis using CFGs. The authors propose a general framework for the similarity of programs, expressed in terms of abstractions of their control flow graph representation.
- Noam Shalev and Nimrod Partush. Binary Similarity Detection Using Machine Learning. In Proceedings of the 13th Workshop on Programming Languages and Analysis for Security (PLAS), 2018.
- Edward Raff and Charles K. Nicholas. Lempel-Ziv Jaccard Distance, an Effective Alternative to Ssdeep and Sdhash. Digital Investigation Volume 24, pp. 34-49, March 2018.
- Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. code2vec: Learning Distributed Representations of Code. Last revised Oct 2018.
- online demonstration of principles shown in the paper
- code available and ACM certified reusable
- Seamus Burke, Kevin Bilzer, and RJ Joyce. RichPE: Malware Attribution Using the Rich Header. Shmoocon, 2019.
- Shinho Lee, Wookhyun Jung, Sangwon Kim, Jihyun Lee, and Jun-Seob Kim. Dexofuzzy: Android malware similarity clustering method using opcode sequence. VirusBulletin blog, November 2019.
- Paul Black, Iqbal Gondal, Peter Vamplew, and Arun Lakhotia. Evolved Similarity Techniques in Malware Analysis. IEEE Conference on Trust, Security, and Privacy in Computing and Communications (TrustCom), 2019.
- Irfan Ul Haq and Juan Caballero. A Survey of Binary Code Similarity. arXiv:1909.11424, 25 Sep 2019.