A SURVEY OF SOFTWARE PLAGIARISM DETECTION USINGDYNAMIC BIRTHMARKSMcnilly Goshomi1 Research paper for students

A SURVEY OF SOFTWARE PLAGIARISM DETECTION USINGDYNAMIC BIRTHMARKSMcnilly Goshomi1

A SURVEY OF SOFTWARE PLAGIARISM DETECTION USINGDYNAMIC BIRTHMARKSMcnilly Goshomi1, Calvin Mugauri2
Information security & assurance Harare Institute of Technology1 mcnillygoshomi @gmail.com2 [email protected] of Information sciences, Harare Institute of Technology, Zimbabwe
Abstract—Software plagiarism is still quite a prominent issue in software industry for protection of intellectual property and software license. It’s quite common in open source projects. However, software Plagiarism detection faces two main challenges: the absence of source code files and intense code obfuscation techniques that aim to hide characteristics of a program. Software birth marking proves to be a reliable approach to detect software plagiarism by determining the similarity of unique characteristics between the two programs in question. There are two types of software birthmarks, static and dynamic. The former does not require program execution, but is more susceptible to attacks by semantic-preserving transformations. A dynamic software birthmark however pertains to executables, in this paper we’ll report our findings on, software plagiarism, the efforts being pooled in to level out the problem, the systems, techniques, approaches and technologies being used to deal with this issue.

Keywords: software plagiarism detection, dynamic birthmark.
INTRODUCTION
There are generally two types of plagiarism that occur, textual and source code plagiarism. Research on detecting textual plagiarism is more recent and ongoing whereas the field of source code plagiarism has generally been lying dormant for quite some time. Open source software projects enable users to use, modify and distribute software under certain licenses like the well-known GPL. However owing to huge commercial interests, some individuals & companies include third party software and/or libraries into their products without paying attention to the licensing terms. A Software birthmark is a characteristic extracted from a program that reflects the program’s intrinsic properties and can be used to uniquely identify the program. Extracting software birthmarks is a promising way for solving the plagiarism detection hurdle. However, in spite of the notable progress of birthmark based plagiarism detection approaches, not so many tools are publically available. From our research we managed to find that only a few exist and these include SandMark 1, Stigmata2 and Birthmarking 3. The former two are static birthmark based which are believed to be limited faced with semantic preserving code obfuscation techniques, and the last one is dynamic birthmark based and believed to have better resilience than the first two, yet they all suffer the problem of language dependence (they’re only valid for java programs). Also, there are some mature tools such as the JPlag 4 GPlag, MOSS that target at source code which is not always available, since plagiarists probably will not provide their source code before strong evidences are collected. Thus more powerful and practical tools are urgently needed to fill the gap of birthmark based plagiarism detection. It’s a generally accepted fact that dynamic birthmarks being abstractions of runtime behaviors are more accurate reflections of program semantics than static birthmarks.

We Will Write a Custom Essay Specifically
For You For Only $13.90/page!

order now

RELATED WORK
Software watermarking 5 is amongst the earliest and widely known methods of detecting software Watermarks are categorized into four types based functionality 19: authorship mark (for identifying authors), fingerprinting mark (serial number or purchaser), validation mark (for verifying that the software has not been altered since being authored) and licensing mark (for dictating how the software can be used). Watermarks are highly fragile in the face of 7, 25 semantics preserving obfuscations 13. A sufficiently determined person will eventually be able to defeat any watermark 14. Static Source Code Based BirthmarksFour types of static birthmarks were proposed by Tamada 18 that include, the same values in field variables, sequence of method calls, inheritance structure and used classes. The mean similarity scores of the four birthmarks are then used in determining plagiarism. These birthmarks prove to be susceptible to obfuscations and also they are only applicable to Java based programs. Birthmarks proposed by Prechelt and Ji 8, 9 were detected by making use of token sequences produced by parsing source code. Such simple approaches are too weak to dummy code insertion and statement restructuring. Plagiarism was determined by mining program dependency graphs (PDGs) in Gplag 10 and level of similarity between PDGs was computed by graph isomorphism algorithms. By using dynamic birthmarks we look forward to get over such hurdles.Static Binary Code Based BirthmarksMyles and Collberg 7 came up with k-gram based static birthmarks, each of which was actually a set of Java bytecode sequences of a length k. The level of similarity between two birthmarks was determined through set operations ignoring the frequency level of occurrence of elements within the set. Although certainly being more effective than those birthmarks proposed by Tamada 18, the birthmarks were still susceptible to code transformations. Weighted k-gram based static birthmarks 22 improved upon Myles and Collberg’s 7 by taking into consideration the frequency of each k-length operation code sequence. However, the improvement in detection capability seems considerably minor while bringing about extra costs in computing the change rate of k-gram frequencies. A static birthmark based on windows API calls disassembled from executables was proposed by Choi 16 to detect plagiarism of windows applications. The requirement for deobfuscating binaries before employing their method is far too restrictive and therefore reduces its availability.Lim 24 employed use of control flow information which reflected runtime behaviors to supplement static approaches. Again recently he came up with the idea to analyze stack flows obtained by simulating operand stack movements to detect duplicates 23. Yet still they are only applicable for Java programs. Hemel et. al. 26 suggested three methods to locate potential cloned binaries in a program repository by treating binaries as ordinary files. Precisely, similarity between two given binaries were evaluated by calculating the ratio of shared string literals, by calculating the compression ratio, and by calculating binary deltas. Since no syntactic or semantic attributes of binary executables are taken into consideration, quite low detection accuracy is to be expected. An obfuscation-resilient method based on longest common subsequence of semantically equivalent rudimentary blocks was first proposed by Luo et. al. 27. They employed use of symbolic execution to extract from basic blocks symbolic formulas, whose pairwise equivalence were checked via a theorem prover. Yet this method has considerable difficulty in handling indirect branches. Also symbolic execution combined with theorem proving is definitely not scalable.

Dynamic Software Birthmarks
Myles and Collberg 6 proposed to utilize the complete dynamic control graph of an execution as a birthmark. Even with compression our study findings prove that such a method does not scale. Schuler 11 decided to treat Java standard API call sequences at basic object level as birthmarks for Java programs. The same principle had been applied Tamada’s works 15, 28 where API call sequences of windows executables were used to derive birthmarks. Apparently as according to our findings API based birthmarks are all language dependent. To address this problem Wang et. al. 12 proposed two dynamic birthmarks based on system calls: System Call Short Sequence Birthmark (SCSSB) and Input Dependent System Call Subsequence Birthmark (IDSCSB). SCSSB treated and recognized the sets of k-length system call sequences as birthmarks. IDSCSB was introduced specifically to avoid system call insertion attack. However both birthmarks have proved to have limited applicability to software that has only few system calls e.g. scientific computing programs. In Lu’s work 17, a full dynamic instruction trace was recorded during a program’s execution, from which a dynamic birthmark was extracted by applying the k-gram algorithm. However such a type birthmark could not even at least recognize two versions generated from one program with differentiated compiler optimization levels. By considering a slice rather than the whole instruction sequence as program characterizations, Bai 26 came up with a dynamic birthmark for Java based on MSIL instructions rather than assembly instructions. In addition to being language and/or operating system dependent, the approach is also weak against code obfuscations. By bringing into the picture data flow and control flow dependency analysis, Wang et. al. 21 proposed a system call dependency graph based birthmark, and graph isomorphism is used for computing similarity between birthmarks. Patrick et al. 20 came up with a heap graph birthmark for JavaScript making use of heap memory analysis, and graph monomorphism algorithm was used for similarity detection. But best results these graph based birthmarks entail that the programs under protection have clear and concise referencing structures. Also, since graph isomorphism and monomorphism algorithms are NP-complete
generally, several thousand nodes will render the methods impractical for use.

III. CONCLUSION
In this paper we have observed how each of the successive techniques that have been employed to detect software plagiarism all seemed to have one limitation after the other therefore we propose to take a wholesome approach to software plagiarism detection by designing a system that: for the two programs being compared it collects key instructions that are both value updating and input correlated., monitors and analyses stack operations and systemcalls.

IV. REFERENCES 1 SandMark. http://sandmark.cs.arizona.edu/.2 Stigmata. http://stigmata.sourceforge.jp/implementation.html.3 Birthmarking. https://www.st.cs.uni-saarland.de/birthmarking/.4 JPlag. http://jplag.ipd.kit.edu/.5 C. Collberg and C. Thomborson, “Software watermarking: Models and dynamic embeddings,” in Proc. ACM SIGPLANSIGACT Symp. Principles of Programming Languages (POPL ’99).ACM, 1999, pp. 311–324.6 G. Myles and C. S. Collberg, “Detecting software theft via whole program path birthmarks,” in Proc. Int. Conf. Information Security (ISC ’04), 2004, pp. 404–415.7 G. Myles and C. Collberg, “K-gram based software birthmarks,” in Proc. ACM Symp. Applied Computing (SAC ’05), 2005,pp. 314–318.8 L. Prechelt, G. Malpohl, and M. Philippsen, “Finding plagiarisms among a set of programs with jplag,” Journal of Universal Computer Science, vol. 8, no. 11, pp. 1016 1038, 2002.9 J.-H. Ji, G. Woo, and H.-G. Cho, “A source code linearization technique for detecting plagiarized programs,” in Proc. Annual SIGCSE Conf. Innovation and Technology in Computer Science Education (ITiCSE ’07), 2007, pp. 73–77.10 C. Liu, C. Chen, J. Han, and P. S. Yu, “GPLAG: detection of software plagiarism by program dependence graph analysis,” in Proc. ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (KDD ’06), 2006, pp. 872–881.11 D. Schuler, V. Dallmeier, and C. Lindig, “A dynamic birthmark for Java,” in Proc. IEEE/ACM Int. Conf. Automated Software Engineering (ASE ’07), 2007, pp. 274–283.12 X. Wang, Y.-C. Jhi, S. Zhu, and P. Liu, “Detecting software theft via system call based birthmarks,” in Annual Computer Security Applications Conference (ACSAC ’09), 2009, pp. 149–158.13 C. Collberg, G. Myles, and A. Huntwork, “Sandmark-a tool for software protection research,” IEEE Security and Privacy, vol. 1, no. 4, pp. 40–49, 2003.

14 C. S. Collberg, E. Carter, S. K. Debray, A. Huntwork, J. D. Kececioglu, C. Linn, and M. Stepp, “Dynamic path-based software watermarking,” in Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation (PLDI ’04), 2004, pp. 107–118.15 H. Tamada, K. Okamoto, M. Nakamura, A. Monden, and K. Ichi Matsumoto, “Design and evaluation of dynamic software birthmarks based on api calls,” Info. Science Technical Report NAIST-IS-TR2007011, pp. 0919–9527, 2007.16 S. Choi, H. Park, H. il Lim, and T. Han, “A static api birthmark for windows binary executables,” Journal of Systems and Software, vol. 82, no. 5, pp. 862–873, 2009.17 B. Lu, F. Liu, X. Ge, B. Liu, and X. Luo, “A software birthmark based on dynamic opcode n-gram,” in Int. Conf. Semantic Computing (ICSC ’07), 2007, pp. 37–44.

18 H. Tamada, M. Nakamura, and A. Monden, “Design and evaluation of birthmarks for detecting theft of Java programs,” in IASTED Conf. on Software Engineering (IASTEDSE ’04), 2004, pp. 569–574.

19 J. Nagra, C. D. Thomborson, and C. S. Collberg, “A functional taxonomy for software watermarking,” in Proc. Austr. Conf. Computer Science (ACSC ’02), 2002, pp. 177–186.20 P. P. F. Chan, L. C. K. Hui, and S.-M. Yiu, “Heap graph based software theft detection,” IEEE Trans. Information Forensics and Security, vol. 8, no. 1, pp. 101–110, 2013.
21 X. Wang, Y.-C. Jhi, S. Zhu, and P. Liu, “Behavior based software theft detection,” in Proc. ACM Conf. Computer and Communications Security (CCS ’09). ACM, 2009, pp. 280–290.

22 X. Xie, F. Liu, B. Lu, and L. Chen, “A software birthmark based on weighted k gram,” in IEEE Int. Conf. Intelligent Computing and Intelligent Systems (ICIS ’10), vol. 1, 2010, pp. 400–405.

23 H. il Lim and T. Han, “Analyzing stack flows to compare Java programs,” IEICE Trans. Information and Systems, vol. 95-D, no. 2, pp. 565–576, 2012.

24 H. il Lim, H. Park, S. Choi, and T. Han, “A method for detecting the theft of Java programs through analysis of the control flow information,” Information and Software Technology,
vol. 51, no. 9, pp. 1338–1350, 2009.

25 C. Collberg and C. Thomborson, “On the limits of software
watermarking,” Department of Computer Science, The University of Auckland, New Zealand, Tech. Rep., 1998.

26 A. Hemel, K. T. Kalleberg, R. Vermaas, and E. Dolstra, “Finding software license violations through binary code clone detection,” in Proc. Working Conf. Mining Software Repositories(MSR ’11), 2011, pp. 63–72.

27 L. Luo, J. Ming, D. Wu, P. Liu, and S. Zhu, “Semantics-based obfuscation-resilient binary code similarity comparison with applications to software plagiarism detection,” in Proc. ACM SIGSOFT Symp. Found. Softw. Eng. (FSE ’14), 2014, pp. 389–400.

28 H. Tamada, K. Okamoto, M. Nakamura, A. Monden, and K.-i. Matsumoto, “Dynamic software birthmarks to detect the theft of windows applications,” in Int. Symp. Future Software Technology (ISFST ’04), 2004.

A SURVEY OF SOFTWARE PLAGIARISM DETECTION USINGDYNAMIC BIRTHMARKSMcnilly Goshomi1