
Abstract In digital forensics, file carving is the process of recovering files from storage media, in part or in whole, without any file system information. An important problem in file carving is identifying the file type of each fragment. Many fragment classification studies in the literature employ inflexible and indiscernible feature selection methods, such as different statistics of byte frequency distributions. Moreover, assessing the strengths and weaknesses of some approaches is difficult because they are specific to certain file types, such as graphics. In this paper, we propose a novel feature generation model using byte embeddings (Byte2Vec), which maps fragments to dense vector representations. The proposed model extends the word2vec and doc2vec embedding models to bytes and fragments, respectively. We use Byte2Vec for feature extraction and k-Nearest Neighbors (kNN) for classification. We demonstrate the effectiveness of Byte2Vec+kNN in file fragment classification using a publicly available digital forensics dataset and a random web search dataset. Our experimental results show that Byte2Vec+kNN reaches an accuracy of 72%, along with 74% precision and 72% recall. Compared to other feature extraction techniques, such as n-grams, byte distributions, byte statistics, byte distances, and sparse dictionaries for byte n-grams, combined with different classifiers, Byte2Vec+kNN achieves an absolute improvement of 3% in accuracy and 12% in precision.
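The shape of the pipeline described in the abstract (map each fragment to a dense vector, then classify with kNN) can be illustrated with a minimal Python sketch. This is not the Byte2Vec model itself: the byte vectors here come from a fixed, seeded random table that stands in for learned embeddings, and a fragment vector is simply the average of its byte vectors. The names `make_placeholder_byte_table`, `fragment_vector`, and `knn_predict`, the dimension of 16, and the toy fragments are all assumptions made for illustration.

```python
import math
import random
from collections import Counter

EMBED_DIM = 16  # assumed dimension, chosen only for this sketch


def make_placeholder_byte_table(dim=EMBED_DIM, seed=0):
    """Return 256 fixed random vectors, one per byte value.

    Placeholder only: a trained embedding model would learn these vectors.
    """
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(256)]


def fragment_vector(fragment, table):
    """Average the vectors of all bytes in a fragment into one dense vector."""
    if not fragment:
        raise ValueError("fragment must not be empty")
    dim = len(table[0])
    total = [0.0] * dim
    for byte_value in fragment:
        vec = table[byte_value]
        for i in range(dim):
            total[i] += vec[i]
    return [x / len(fragment) for x in total]


def knn_predict(query, train_vectors, train_labels, k=1):
    """Majority vote among the k nearest training vectors (Euclidean distance)."""
    order = sorted(
        range(len(train_vectors)),
        key=lambda i: math.dist(query, train_vectors[i]),
    )
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy labelled fragments (hypothetical data, not from either dataset).
table = make_placeholder_byte_table()
train = [
    (b"the quick brown fox " * 10, "text"),
    (bytes(range(200, 256)) * 4, "binary"),
]
train_vectors = [fragment_vector(frag, table) for frag, _ in train]
train_labels = [label for _, label in train]

query = fragment_vector(b"the quick brown fox " * 5, table)
predicted = knn_predict(query, train_vectors, train_labels, k=1)
```

With learned embeddings in place of the random table, the same two steps (vectorize, then vote among neighbors) would apply unchanged; only `make_placeholder_byte_table` would need to be replaced.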
