ABSTRACT
Repackaged Android applications (app clones) have been found in many third-party markets. They not only infringe the copyright of the original authors but also threaten the security and privacy of mobile users. Both fine-grained and coarse-grained approaches have been proposed to detect app clones. However, fine-grained techniques that employ complicated clone detection algorithms are difficult to scale to hundreds of thousands of apps, while coarse-grained techniques based on simple features scale well but are less accurate. This paper proposes WuKong, a two-phase detection approach: a coarse-grained phase identifies suspicious apps by comparing lightweight static semantic features, and a fine-grained phase compares more detailed features for only those apps flagged in the first phase. To further improve detection speed and accuracy, we also introduce an automated clustering-based preprocessing step that filters out third-party libraries before app clone detection is performed. Experiments on more than 100,000 Android apps collected from five Android markets demonstrate the effectiveness and scalability of our approach.
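The two-phase idea can be illustrated with a minimal sketch. It is not WuKong's actual implementation: the feature representation (per-app API call counts for the coarse phase, sets of method identifiers for the fine phase), the distance and similarity measures, the thresholds, and every name below are assumptions chosen only for illustration. Third-party library filtering is left out.

```python
from itertools import combinations


def coarse_distance(a, b):
    """Normalized Manhattan distance between two feature-count dicts (assumed measure)."""
    keys = set(a) | set(b)
    diff = sum(abs(a.get(k, 0) - b.get(k, 0)) for k in keys)
    total = sum(a.values()) + sum(b.values())
    return diff / total if total else 0.0


def fine_similarity(a, b):
    """Jaccard similarity between two sets of detailed features (assumed measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def detect_clones(apps, coarse_threshold=0.1, fine_threshold=0.8):
    """Return app pairs that pass both phases; thresholds are hypothetical."""
    # Phase 1: cheap comparison over all pairs keeps only suspicious ones.
    suspicious = [
        (x, y)
        for x, y in combinations(sorted(apps), 2)
        if coarse_distance(apps[x]["coarse"], apps[y]["coarse"]) <= coarse_threshold
    ]
    # Phase 2: costlier comparison runs only on the pairs kept by phase 1.
    return [
        (x, y)
        for x, y in suspicious
        if fine_similarity(apps[x]["fine"], apps[y]["fine"]) >= fine_threshold
    ]


# Toy data: hypothetical API call counts and method identifiers.
apps = {
    "original": {
        "coarse": {"getDeviceId": 2, "openConnection": 5},
        "fine": {"m1", "m2", "m3", "m4", "m5"},
    },
    "repack": {
        "coarse": {"getDeviceId": 2, "openConnection": 5, "sendTextMessage": 1},
        "fine": {"m1", "m2", "m3", "m4", "m5", "ad1"},
    },
    "lookalike": {
        "coarse": {"getDeviceId": 2, "openConnection": 5},
        "fine": {"m1", "x1", "x2"},
    },
    "unrelated": {
        "coarse": {"getLastKnownLocation": 9},
        "fine": {"u1", "u2"},
    },
}

clones = detect_clones(apps)
print(clones)
```

In this toy data, "lookalike" passes the coarse phase because its API call counts match those of "original", but the fine phase rejects it. This mirrors the trade-off described above: the first phase is fast but less precise, and the second phase is applied only to the pairs it keeps.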
Index Terms
- WuKong: a scalable and accurate two-phase approach to Android app clone detection