DOI: 10.1145/2771783.2771795 (research article)

WuKong: a scalable and accurate two-phase approach to Android app clone detection

Published: 13 July 2015

ABSTRACT

Repackaged Android applications (app clones) have been found in many third-party markets. They not only infringe the copyright of the original authors, but also threaten the security and privacy of mobile users. Both fine-grained and coarse-grained approaches have been proposed to detect app clones. However, fine-grained techniques that employ complicated clone detection algorithms are difficult to scale to hundreds of thousands of apps, while coarse-grained techniques based on simple features are scalable but less accurate. This paper proposes WuKong, a two-phase detection approach that includes a coarse-grained detection phase, which identifies suspicious apps by comparing lightweight static semantic features, and a fine-grained phase, which compares more detailed features only for the apps flagged in the first phase. To further improve detection speed and accuracy, we also introduce an automated clustering-based preprocessing step that filters third-party libraries before app clone detection is conducted. Experiments on more than 100,000 Android apps collected from five Android markets demonstrate the effectiveness and scalability of our approach.
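
The two-phase idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not specify the features, the similarity measures, or the thresholds, so everything below is an assumption made for illustration: the coarse features are modelled as sparse counts of API calls compared with a normalized Manhattan distance, the fine features as sets of method fingerprints compared with Jaccard similarity, and the names `coarse_similarity`, `fine_similarity`, `detect_clones`, and both threshold values are hypothetical. The library-filtering preprocessing step is assumed to have already run.

```python
from itertools import combinations


def coarse_similarity(a, b):
    """Cheap similarity in [0, 1] between two sparse count vectors (assumed feature form)."""
    keys = set(a) | set(b)
    diff = sum(abs(a.get(k, 0) - b.get(k, 0)) for k in keys)
    total = sum(a.values()) + sum(b.values())
    return 1.0 if total == 0 else 1.0 - diff / total


def fine_similarity(x, y):
    """Jaccard similarity between two sets of detailed features (assumed feature form)."""
    union = x | y
    return 1.0 if not union else len(x & y) / len(union)


def detect_clones(apps, coarse_threshold=0.8, fine_threshold=0.7):
    """Return app pairs that pass both phases.

    apps maps an app name to {"coarse": dict[str, int], "fine": set[str]}.
    Both thresholds are illustrative values, not taken from the paper.
    """
    # Phase 1: cheap comparison over all pairs keeps only suspicious pairs.
    suspicious = [
        (p, q)
        for p, q in combinations(sorted(apps), 2)
        if coarse_similarity(apps[p]["coarse"], apps[q]["coarse"]) >= coarse_threshold
    ]
    # Phase 2: the more detailed comparison runs only on the suspicious pairs.
    return [
        (p, q)
        for p, q in suspicious
        if fine_similarity(apps[p]["fine"], apps[q]["fine"]) >= fine_threshold
    ]


if __name__ == "__main__":
    # Toy data, invented for this example.
    apps = {
        "orig": {"coarse": {"getDeviceId": 2, "openConnection": 3},
                 "fine": {"m1", "m2", "m3", "m4"}},
        "clone": {"coarse": {"getDeviceId": 2, "openConnection": 3, "sendTextMessage": 1},
                  "fine": {"m1", "m2", "m3", "m4", "ad"}},
        "lookalike": {"coarse": {"getDeviceId": 2, "openConnection": 3},
                      "fine": {"x1", "x2", "x3"}},
        "other": {"coarse": {"draw": 5}, "fine": {"y1"}},
    }
    print(detect_clones(apps))
```

In this toy data, "lookalike" passes the coarse phase because its API counts match those of "orig", but the fine phase rejects it, which is the trade-off the abstract describes: the expensive comparison is paid only for pairs that the cheap one cannot rule out. The all-pairs loop in phase 1 is kept for readability and would itself need indexing or hashing to scale to a corpus of the size reported in the paper.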


        Published in

        ISSTA 2015: Proceedings of the 2015 International Symposium on Software Testing and Analysis
        July 2015
        447 pages
        ISBN: 9781450336208
        DOI: 10.1145/2771783
        General Chair: Michal Young
        Program Chair: Tao Xie

        Copyright © 2015 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Acceptance Rates

        Overall Acceptance Rate: 58 of 213 submissions, 27%
