Precursor-free and fast spectral library search using approximate nearest neighbor techniques

Related identifiers: doi: 10.5281/zenodo.56002 
<p><strong>Precursor-free and fast spectral library search using approximate nearest neighbor techniques</strong></p>
<p>In a mass-spectrometry proteomics experiment, only a minority of spectra can be confidently identified, and many of the unidentified spectra stem from unconsidered modifications. We present a spectral library search engine that uses an approximate nearest neighbor scheme to perform precursor-free searches, obtaining mass-tolerant spectrum identifications while significantly reducing computation time.</p>
<p><strong>Introduction</strong></p>
<p>Generally, fewer than half of all spectra are identified in a mass-spectrometry proteomics experiment. However, recent research has shown that a mass-tolerant database search can identify a large proportion of the unassigned spectra as modified peptides [Chick2015]. Unfortunately, opening up the search space in this way means that an extremely high number of candidates has to be checked to determine a peptide-spectrum match (PSM) for each spectrum, resulting in excessive computation time.</p>
<p>Alternatively, spectral library search engines can be used instead of sequence database search engines to identify spectra. Because spectral libraries use previously observed spectra to determine PSMs, they offer a reduced search space and very effective similarity matching. Here we apply the idea of mass-tolerant peptide-spectrum matching to a spectral library, using an approximate nearest neighbor technique to quickly and effectively reduce the enlarged search space.</p>
<p><strong>Methods</strong></p>
<p>Although spectral libraries by definition have a smaller search space than sequence databases, a mass-tolerant search still requires checking tens to hundreds of thousands of candidate matches, up to almost the entire spectral library, as indicated in Figure 1. However, because all library spectra are known beforehand, we can exploit this limited search space to retrieve only the most relevant candidates.</p>
<p>Spectral library search engines mostly use the cosine similarity to score candidate matches, so each spectrum can be considered a vector in a (very) high-dimensional space. Conventionally, the similarity of a query spectrum to all library spectra within the precursor mass window has to be computed. By using approximate nearest neighbor techniques in this vector space, the number of candidate matches to be considered can be drastically reduced. Approximate nearest neighbor techniques based on the locality-sensitive hashing principle partition the data into 'buckets' of very similar vectors. This is done by iteratively hashing vectors to buckets based on their position relative to random split vectors. In this way the data space is subdivided until only a few, very similar, vectors remain in each bucket. Then, for each query spectrum, instead of examining the whole data space, only the bucket(s) with the most similar library spectra have to be retrieved to determine the best PSM.</p>
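<p>The bucketing idea can be illustrated with a minimal sketch. It is not the implementation described here: it assumes a simple fixed-width m/z binning to turn peaks into vectors, and it uses a single flat table of random-hyperplane sign hashes rather than an iterative splitting scheme. All names and parameters (<code>BIN_WIDTH</code>, <code>MAX_MZ</code>, <code>NUM_BITS</code>, <code>vectorize</code>, <code>build_index</code>, <code>search</code>) and the toy spectra are hypothetical. Note that the precursor mass plays no role in candidate retrieval.</p>

```python
import math
import random
from collections import defaultdict

# Illustrative parameters (assumptions, not taken from the abstract).
BIN_WIDTH = 1.0   # m/z bin width in Da
MAX_MZ = 2000.0   # peaks at or above this m/z are ignored
NUM_BITS = 8      # number of random hyperplanes, i.e. bits per hash key


def vectorize(peaks):
    """Bin (m/z, intensity) peaks into a sparse unit-length vector {bin: value}."""
    vec = defaultdict(float)
    for mz, intensity in peaks:
        if 0.0 <= mz < MAX_MZ:
            vec[int(mz / BIN_WIDTH)] += intensity
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {b: v / norm for b, v in vec.items()} if norm > 0 else {}


def cosine(u, v):
    """Dot product of two sparse unit vectors, which equals their cosine similarity."""
    if len(u) > len(v):
        u, v = v, u
    return sum(val * v.get(b, 0.0) for b, val in u.items())


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random Gaussian vectors of length dim (seeded for repeatability)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def hash_key(vec, hyperplanes):
    """One bit per hyperplane: the side of the plane on which the vector lies."""
    return tuple(
        sum(val * plane[b] for b, val in vec.items()) >= 0.0
        for plane in hyperplanes
    )


def build_index(library, hyperplanes):
    """Map each hash key to the list of (name, vector) library entries in that bucket."""
    index = defaultdict(list)
    for name, peaks in library.items():
        vec = vectorize(peaks)
        index[hash_key(vec, hyperplanes)].append((name, vec))
    return index


def search(query_peaks, index, hyperplanes):
    """Score only the query's own bucket; return (name, cosine) or None if it is empty."""
    query = vectorize(query_peaks)
    candidates = index.get(hash_key(query, hyperplanes), [])
    if not candidates:
        return None
    return max(((name, cosine(query, vec)) for name, vec in candidates),
               key=lambda pair: pair[1])


# Toy library with made-up peak lists.
library = {
    "SPECTRUM_A": [(175.1, 50.0), (304.2, 100.0), (417.2, 80.0)],
    "SPECTRUM_B": [(147.1, 100.0), (260.2, 40.0), (989.5, 70.0)],
}
planes = make_hyperplanes(NUM_BITS, int(MAX_MZ / BIN_WIDTH))
index = build_index(library, planes)
best = search(library["SPECTRUM_A"], index, planes)
```

<p>With a single table, a true neighbor can land in a different bucket and be missed. Using fewer bits or several independent tables raises recall at the cost of scoring more candidates, which is one way the trade-off between speed and accuracy can be tuned.</p>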
<p><strong>Results & Discussion</strong></p>
<p>We have implemented a mass-tolerant approximate nearest neighbor spectral library search engine in Python. Preliminary results show that approximate nearest neighbor techniques can drastically reduce the search space and speed up queries. Furthermore, the speedup can be tuned at the expense of some accuracy.</p>
<p>Additionally, because candidate spectra are no longer filtered on precursor mass in this approach, precursor-free, mass-tolerant searches are implicitly supported. Figure 2 shows that most PSMs correspond to unmodified peptides (a mass difference around 0 Da), while various modified peptides can be identified as well; for these, the modification(s) can be determined from the precursor mass difference.</p>
<p>Using approximate nearest neighbor techniques to speed up spectral library search engines appears to be a promising way to perform mass-tolerant searches and identify modified peptides, yielding a large number of spectrum identifications in a short amount of time.</p>
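<p>As a rough illustration of how a precursor mass difference might be mapped to a modification, the sketch below compares the neutral masses of a query and its matched library spectrum against a small table of well-known monoisotopic mass shifts. The table contents, the tolerance of 0.01 Da, and the function names are assumptions made for this example; a real analysis would consult a full modification database and also consider negative and combined shifts.</p>

```python
# Monoisotopic mass shifts in Da for a few common modifications (illustrative subset).
KNOWN_SHIFTS = {
    "Oxidation": 15.994915,
    "Phospho": 79.966331,
    "Acetyl": 42.010565,
    "Carbamidomethyl": 57.021464,
    "Deamidated": 0.984016,
}
PROTON_MASS = 1.007276  # Da


def neutral_mass(precursor_mz, charge):
    """Convert a precursor m/z and charge state to a neutral mass."""
    return (precursor_mz - PROTON_MASS) * charge


def annotate_shift(query_mz, query_charge, library_mz, library_charge, tol=0.01):
    """Return (mass difference, label) for a query matched to a library spectrum.

    The tolerance `tol` (Da) is a hypothetical default chosen for this sketch.
    """
    delta = (neutral_mass(query_mz, query_charge)
             - neutral_mass(library_mz, library_charge))
    if abs(delta) <= tol:
        return delta, "unmodified"
    for name, shift in KNOWN_SHIFTS.items():
        if abs(delta - shift) <= tol:
            return delta, name
    return delta, "unknown"


# A doubly charged query that is 15.994915 Da heavier than its library match.
delta, label = annotate_shift(500.0 + 15.994915 / 2, 2, 500.0, 2)
```

<p>In this example the mass difference matches the oxidation shift within the tolerance, so the query would be labeled accordingly; a difference near 0 Da is labeled as unmodified.</p>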
<p><strong>References</strong></p>
<p>Chick, J. M. <em>et al</em>. A mass-tolerant database search identifies a large proportion of unassigned spectra in shotgun proteomics as modified peptides. <em>Nature Biotechnology</em> <strong>33</strong>, 743–749 (2015).</p>