Powered by OpenAIRE graph
Found an issue? Give us feedback
ZENODOarrow_drop_down
ZENODO
Software . 2024
License: CC BY
Data sources: Datacite
ZENODO
Software . 2024
License: CC BY
Data sources: Datacite
ZENODO
Software . 2024
License: CC BY
Data sources: Datacite
ZENODO
Software . 2024
License: CC BY
Data sources: Datacite
versions View all 4 versions
addClaim

This Research product is the result of merged Research products in OpenAIRE.

You have already added 0 works in your ORCID record related to the merged Research product.

Fasta_Seq_Prepare_Bash

Authors: Aramayo, Rodolfo;

Fasta_Seq_Prepare_Bash

Abstract

AramayoLab Motivation The main motivation for generating this script was to be able to pre-process both transcriptome and/or proteome fasta files before subjecting them to computational analysis. By default the (-d flag 'Dust Gremlings') of the script will remove characters that if present in the sequence have the potential to interfere and or abort the processing of the file by certain software packages. This is particularly true for InterProScan, which is extremely sensitive to the presence of the character '*", which can usually be found in ENSEMBL protein sequences. These characters, when found, are replaced by the IUPAC 'X' (unknown). In addition, the '-r', and '-s' flags, control the sorting of the transcriptome and/or proteome fasta sequences from large to small (-r flag), or from small to large (-s flag. By default, the (-l flag) of the script removes transcripts '= to a given size provided. If activate, the (-f flag) of the script will remove sequences belonging to the following biotypes: IG_D_gene, IG_J_gene, TR_D_gene, and TR_J_gene, as defined by the fasta headers of ENSEMBL transcriptome and/or protein sequences. A fasta file could also be split into files containing a given number of sequences (e.g., 50000). This behavior (suppressed by default) is controlled by the (-i flag). A transcriptome and/or proteome file can also be clustered using CD-HIT-EST or CD-HIT, respectively, as controlled by the (-c flag). This behavior is suppressed by default. If splitting and clustering of a given fasta file is simultaneously requested, the file will be first clustered and then the resulting clustered file will be split. The (-x flag), controls the number of cores used in the clustering process. Finally, one can define the TMPDIR to be written to the same directory where the script is being executed, if so desired. Documentation ######################################################################################################################################################################################################## ARAMAYO_LAB This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . SCRIPT_NAME: Fasta_Seq_Prepare_v1.0.0.sh SCRIPT_VERSION: 1.0.0 USAGE: Fasta_Seq_Prepare_v1.0.0.sh -p Homo_sapiens.GRCh38.pep.all.fa # REQUIRED if -t Not Provided (Proteins File - Proteome) -t Homo_sapiens.GRCh38.cds.all.fa # REQUIRED if -p Not Provided (Transcripts File - Transcriptome) -l Sequences Lower Size # OPTIONAL (default = 50 - proteins | 150 - transcripts) -u Sequences Upper Size # OPTIONAL (default = No Limit) -f Filter Biotypes # OPTIONAL (default = No) -d Dust Gremlings # OPTIONAL (default = Yes)(Yes: Converts * to X - proteins and * to n - nucleotides) -s Sort File From Shorter to Larger # OPTIONAL (default = No) (Yes: Sequences will be sorted from shorter to larger) -r Sort File From Larger to Shorter # OPTIONAL (default = No) (Yes: Sequences will be sorted from larger to shorter) -c Cluster Sequences # OPTIONAL (default = No) (Yes: Requires cd-hit Installed) -w Fasta Width # OPTIONAL (default = 80) -i Split fasta file # OPTIONAL (default = No = 1) -x Number of Cores # OPTIONAL (default = 2) -z TMPDIR Location # OPTIONAL (default=0='TMPDIR Run') TYPICAL COMMANDS: Fasta_Seq_Prepare_v1.0.0.sh -p Homo_sapiens.GRCh38.pep.all.fa Fasta_Seq_Prepare_v1.0.0.sh -p Homo_sapiens.GRCh38.pep.all.fa -c yes Fasta_Seq_Prepare_v1.0.0.sh -t Homo_sapiens.GRCh38.cdna.all.fa Fasta_Seq_Prepare_v1.0.0.sh -t Homo_sapiens.GRCh38.cdna.all.fa -c yes INPUT01: -p FLAG REQUIRED input ONLY if the '-t' flag associated file is not provided INPUT01_FORMAT: Proteome Fasta File INPUT01_DEFAULT: No default INPUT02: -t FLAG REQUIRED input ONLY if the '-p' flag associated file is not provided INPUT02_FORMAT: Transcriptome Fasta File INPUT02_DEFAULT: No default INPUT03: -l FLAG OPTIONAL input. Sequence Lower Size Filtering Value INPUT03_FORMAT: Numeric INPUT03_DEFAULT: 50 (proteins) | 150 (transcripts) | 1 = No Limit (Do Not Filter) INPUT03_NOTES: If provided, this number will be used to reject sequences whose length are equal to, or shorter than, the number provided INPUT03_NOTES: Note that the default value is a function of the class of file provided (i.e., proteome or transcriptome) INPUT03_NOTES: Also note that any numerical value will be accepted INPUT04: -u FLAG OPTIONAL input. Sequence Upper Size Filtering Value INPUT04_FORMAT: Numeric INPUT04_DEFAULT: 0 = No Limit (Do Not Filter) INPUT04_NOTES: If provided, this number will be used to reject sequences whose length are equal to, or larger than, the number provided INPUT04_NOTES: Note that any numerical value will be accepted INPUT05: -f FLAG OPTIONAL input. Filter Biotype INPUT05_FORMAT: yes | no INPUT05_DEFAULT: no (Do Not Filter) INPUT05_NOTES: If Activated, sequences belonging to the following biotypes will be filtered out: IG_D_gene, IG_J_gene, TR_D_gene, and TR_J_gene INPUT06: -d FLAG OPTIONAL input. Dust Gremlings INPUT06_FORMAT: yes | no INPUT06_DEFAULT: yes (Dust Gremlings) INPUT06_NOTES: Dusting will convert '*' to 'X' (for proteins), and '*' to 'n' (for transcripts) INPUT07: -s FLAG OPTIONAL input. Sort file INPUT07_FORMAT: yes | no INPUT07_DEFAULT: no (Do Not Sort) INPUT07_NOTES: If provided, sequences will be sorted from shorter to larger INPUT08: -r FLAG OPTIONAL input. Sort file INPUT08_FORMAT: yes | no INPUT08_DEFAULT: no (Do Not Sort) INPUT08_NOTES: If provided, sequences will be sorted from larger to shorter INPUT09: -c FLAG OPTIONAL input. Cluster Sequences INPUT09_FORMAT: yes | no INPUT09_DEFAULT: no (Do Not Cluster) INPUT09_NOTES: If provided, sequences will be clustered by cd-hit according to stringent parameters INPUT10: -w FLAG OPTIONAL input INPUT10_FORMAT: Numeric INPUT10_DEFAULT: 80 INPUT10_NOTES: The number sets the width of the output fasta file INPUT11: -i FLAG OPTIONAL input INPUT11_FORMAT: Numeric INPUT11_DEFAULT: 1 (Do Not Split) INPUT11_NOTES: Determines the number of fasta sequences requested to be present on each resulting splitted file INPUT11_NOTES: This option is only valid for fasta files containing >= 2 files INPUT12: -x FLAG OPTIONAL input INPUT12_FORMAT: Numeric INPUT12_DEFAULT: 2 INPUT12_NOTES: The maximum number of cores requested should be equal to N-1; where N is the total number of cores available in the computer performing the analysis INPUT12_NOTES: Number of Cores INPUT13: -z FLAG OPTIONAL input INPUT13_FORMAT: Numeric: 0 == TMPDIR Run | 1 == Normal Run INPUT13_DEFAULT: 0 == TMPDIR Run INPUT13_NOTES: 0 Processes the data in the $TMPDIR directory of the computer used or of the node assigned by the SuperComputer scheduler INPUT13_NOTES: Processing the data in the $TMPDIR directory of the node assigned by the SuperComputer scheduler reduces the possibility of file error generation due to network traffic INPUT13_NOTES: 1 Processes the data in the same directory where the script is being run DEPENDENCIES: GNU AWK: Required (https://www.gnu.org/software/gawk/) GNU COREUTILS: Required (https://www.gnu.org/software/coreutils/) CD-HIT: Required if clustering is invoked (https://github.com/weizhongli/cdhit; PMID: 16731699; PMID: 23060610) Author: Rodolfo Aramayo WORK_EMAIL: raramayo@tamu.edu PERSONAL_EMAIL: rodolfo@aramayo.org ######################################################################################################################################################################################################## Development/Testing Environment: Distributor ID: Apple, Inc. Description: Apple M1 Max Release: 14.4.1 Codename: Sonoma Distributor ID: Ubuntu Description: Ubuntu 22.04.3 LTS Release: 22.04 Codename: jammy Required Script Dependencies: GNU AWK (https://www.gnu.org/software/gawk/) Version Number: 5.3.0, API 4.0 GNU Awk 5.3.0, API 4.0 Copyright (C) 1989, 1991-2023 Free Software Foundation. This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/. GNU COREUTILS (https://www.gnu.org/software/coreutils/) Version Number: 8.30 (GNU coreutils) 9.4 Copyright (C) 2023 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later . This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Written by Richard M. Stallman and David MacKenzie. CDHIT (https://github.com/weizhongli/cdhit) Version Number: 4.8.1 ====== CD-HIT version 4.8.1 (built on Jan 21 2024) ====== Usage: cd-hit [Options] Options -i input filename in fasta format, required, can be in .gz format -o output filename, required -c sequence identity threshold, default 0.9 this is the default cd-hit's "global sequence identity" calculated as: number of identical amino acids or bases in alignment divided by the full length of the shorter sequence -G use global sequence identity, default 1 if set to 0, then use local sequence identity, calculated as : number of identical amino acids or bases in alignment divided by the length of the alignment NOTE!!! don't use -G 0 unless you use alignment coverage controls see options -aL, -AL, -aS, -AS -b band_width of alignment, default 20 -M memory limit (in MB) for the program, default 800; 0 for unlimitted; -T number of threads, default 1; with 0, all CPUs will be used -n word_length, default 5, see user's guide for choosing it -l length of throw_away_sequences, default 10 -t tolerance for redundance, default 2 -d length of description in .clstr file, default 20 if set to 0, it takes the fasta defline and stops at first space -s length difference cutoff, default 0.0 if set to 0.9, the shorter sequences need to be at least 90% length of the representative of the cluster -S length difference cutoff in amino acid, default 999999 if set to 60, the length difference between the shorter sequences and the representative of the cluster can not be bigger than 60 -aL alignment coverage for the longer sequence, default 0.0 if set to 0.9, the alignment must covers 90% of the sequence -AL alignment coverage control for the longer sequence, default 99999999 if set to 60, and the length of the sequence is 400, then the alignment must be >= 340 (400-60) residues -aS alignment coverage for the shorter sequence, default 0.0 if set to 0.9, the alignment must covers 90% of the sequence -AS alignment coverage control for the shorter sequence, default 99999999 if set to 60, and the length of the sequence is 400, then the alignment must be >= 340 (400-60) residues -A minimal alignment coverage control for the both sequences, default 0 alignment must cover >= this value for both sequences -uL maximum unmatched percentage for the longer sequence, default 1.0 if set to 0.1, the unmatched region (excluding leading and tailing gaps) must not be more than 10% of the sequence -uS maximum unmatched percentage for the shorter sequence, default 1.0 if set to 0.1, the unmatched region (excluding leading and tailing gaps) must not be more than 10% of the sequence -U maximum unmatched length, default 99999999 if set to 10, the unmatched region (excluding leading and tailing gaps) must not be more than 10 bases -B 1 or 0, default 0, by default, sequences are stored in RAM if set to 1, sequence are stored on hard drive !! No longer supported !! -p 1 or 0, default 0 if set to 1, print alignment overlap in .clstr file -g 1 or 0, default 0 by cd-hit's default algorithm, a sequence is clustered to the first cluster that meet the threshold (fast cluster). If set to 1, the program will cluster it into the most similar cluster that meet the threshold (accurate but slow mode) but either 1 or 0 won't change the representatives of final clusters -sc sort clusters by size (number of sequences), default 0, output clusters by decreasing length if set to 1, output clusters by decreasing size -sf sort fasta/fastq by cluster size (number of sequences), default 0, no sorting if set to 1, output sequences by decreasing cluster size this can be very slow if the input is in .gz format -bak write backup cluster file (1 or 0, default 0) -h print this help Questions, bugs, contact Weizhong Li at liwz@sdsc.edu For updated versions and information, please visit: http://cd-hit.org or https://github.com/weizhongli/cdhit cd-hit web server is also available from http://cd-hit.org If you find cd-hit useful, please kindly cite: "CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences", Weizhong Li & Adam Godzik. Bioinformatics, (2006) 22:1658-1659 "CD-HIT: accelerated for clustering the next generation sequencing data", Limin Fu, Beifang Niu, Zhengwei Zhu, Sitao Wu & Weizhong Li. Bioinformatics, (2012) 28:3150-3152

If you use this software, please cite it as below.

  • BIP!
    Impact byBIP!
    citations
    This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
    0
    popularity
    This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
    Average
    influence
    This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
    Average
    impulse
    This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.
    Average
  • citations
    This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
    0
    popularity
    This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
    Average
    influence
    This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
    Average
    impulse
    This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.
    Average
    Powered byBIP!BIP!
Powered by OpenAIRE graph
Found an issue? Give us feedback
citations
This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
BIP!Citations provided by BIP!
popularity
This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
BIP!Popularity provided by BIP!
influence
This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
BIP!Influence provided by BIP!
impulse
This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.
BIP!Impulse provided by BIP!
0
Average
Average
Average