publication . Article . Preprint . 2015

Towards a useful definition of normalized compression distance for the classification of large files

Borbely, Rebecca Schuller
Open Access
  • Published: 02 Sep 2015 Journal: Journal of Computer Virology and Hacking Techniques, volume 12, pages 235-242 (eissn: 2263-8733)
  • Publisher: Springer Nature
Abstract
Normalized Compression Distance (NCD) is a popular tool that uses compression algorithms to cluster and classify data in a wide range of applications. Existing discussions of NCD's theoretical merit rely on certain theoretical properties of compression algorithms. However, we demonstrate that many popular compression algorithms do not seem to satisfy these theoretical properties. We explore the relationship between some of these properties and file size, demonstrating that this theoretical problem is actually a practical problem for classifying malware with large file sizes, and we then introduce some variants of NCD that mitigate this problem.
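To make the file-size issue concrete, the following is a minimal sketch rather than the paper's own code or experimental setup. It assumes the commonly used definition NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length, and it assumes zlib as the compressor. The helper names `compressed_size`, `ncd`, and `pseudo_random_bytes` are hypothetical, introduced only for this illustration. It compares a file with itself at two sizes, since an ideal compressor would give a distance near 0 in both cases.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (compression level 9)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def pseudo_random_bytes(n: int, seed: int = 0) -> bytes:
    """Reproducible, essentially incompressible test data (hypothetical helper)."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n))


# Assumption: synthetic random data stands in for real files here.
small = pseudo_random_bytes(1_000)    # fits well inside zlib's 32 KiB window
large = pseudo_random_bytes(100_000)  # far larger than the window

print(f"NCD(small, small) = {ncd(small, small):.3f}")
print(f"NCD(large, large) = {ncd(large, large):.3f}")
```

Under these assumptions one would expect the first value to be close to 0 and the second close to 1. DEFLATE can only refer back 32 KiB, so for the large input the second copy of the file is out of reach of the first and the concatenation compresses to roughly twice the size of a single copy. This is one way a compressor can fail to behave as the theory assumes once files grow large.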
Subjects
free text keywords: Computer Science (miscellaneous), Hardware and Architecture, Computational Theory and Mathematics, Software, Data mining, Malware, Computer science, Data compression, File size, Normalized compression distance, Computer Science - Cryptography and Security