Immunotools package includes:
./prepare_cfg
and:
make
./diversity_analyzer.py --test
If the installation is successful, you will find the following information at the end of the log:
Thank you for using Diversity Analyzer!
Log was written to <your_installation_dir>/divan_test/ig_repertoire_constructor.log
Diversity Analyzer takes full-length immunosequencing reads or a constructed repertoire in FASTQ/FASTA format as input and analyzes diversity characteristics: VJ combinations, CDRs, and SHMs.
To run Diversity Analyzer, type:
./diversity_analyzer.py [options] -i <input_sequences> -o <output_dir>
-i <input_sequences>
-o / --output <output_dir>
-t / --threads <int>
Default value is 16.
--test
./diversity_analyzer.py -i test_dataset/merged_reads.fastq -o divan_test
--help
--domain <str>: imgt or kabat.
Default value is imgt.
-l / --loci <str>: IGH / IGL / IGK / IG (for all BCRs) /
TRA / TRB / TRG / TRD / TR (for all TCRs) or all.
Default value is IG.
--organism <str>: human.
Support for mouse, pig,
rabbit, rat and rhesus_monkey will be added in future versions.
Default value is human.
--skip-plots
For example, to run Diversity Analyzer on input_reads.fastq,
type:
./diversity_analyzer.py -i input_reads.fastq -o divan_test
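When Diversity Analyzer is called from a script, it can help to assemble the command line programmatically. The following is a minimal sketch, not part of the package: the helper name build_command is hypothetical, it uses only the options listed above, and it assumes that 16 is the default number of threads.

```python
import shlex

# Values accepted by --domain and --loci, as listed in the options above.
VALID_DOMAINS = {"imgt", "kabat"}
VALID_LOCI = {"IGH", "IGL", "IGK", "IG", "TRA", "TRB", "TRG", "TRD", "TR", "all"}


def build_command(input_path, output_dir, threads=16, domain="imgt",
                  loci="IG", organism="human", skip_plots=False):
    """Return the argument list for a Diversity Analyzer run (hypothetical helper)."""
    if domain not in VALID_DOMAINS:
        raise ValueError(f"unknown domain: {domain}")
    if loci not in VALID_LOCI:
        raise ValueError(f"unknown loci: {loci}")
    cmd = [
        "./diversity_analyzer.py",
        "-i", input_path,
        "-o", output_dir,
        "-t", str(threads),
        "--domain", domain,
        "--loci", loci,
        "--organism", organism,
    ]
    if skip_plots:
        cmd.append("--skip-plots")
    return cmd


cmd = build_command("input_reads.fastq", "divan_test",
                    threads=8, loci="IGH", skip_plots=True)
# Print the command in a form that can be pasted into a shell.
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the installation directory.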
Diversity Analyzer creates the output directory (specified by -o)
and outputs the following files there:
--skip-plots.
Read_name — names of reads from cleaned_sequences.fasta file.
Rows in cdr_details.txt and sequences in cleaned_sequences.fasta are consistently ordered.
Chain_type — type of chain of a sequence: IGH / IGK / IGL / TRA / TRB / TRD or TRG.
V_hit, J_hit — names of V and J gene segments that provide the best alignments.
AA_seq — amino acid sequence.
Has_stop_codon — indicator of presence of a stop codon in a sequence:
1 - sequence contains a stop codon, 0 - sequence does not contain a stop codon.
In-frame — indicator showing whether a sequence is in-frame or not.
Productive — indicator of sequence productiveness.
A sequence is considered productive if it is in-frame and does not contain stop codons.
CDR1_nucls, CDR1_start, CDR1_end —
nucleotide sequence, start and end positions of CDR1.
CDR2_nucls, CDR2_start, CDR2_end —
nucleotide sequence, start and end positions of CDR2.
CDR3_nucls, CDR3_start, CDR3_end —
nucleotide sequence, start and end positions of CDR3.
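The productivity rule above can be re-derived from the In-frame and Has_stop_codon columns. The sketch below is illustrative only: it assumes that cdr_details.txt is tab-separated with a header row named after the fields above and that In-frame uses the same 1/0 encoding as Has_stop_codon. The sample rows are made up.

```python
import csv
import io


def is_productive(in_frame, has_stop_codon):
    """A sequence is productive if it is in-frame and has no stop codon."""
    return in_frame and not has_stop_codon


def productive_reads(handle):
    """Return names of productive reads from a tab-separated table (assumed layout)."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        row["Read_name"]
        for row in reader
        if is_productive(row["In-frame"] == "1", row["Has_stop_codon"] == "1")
    ]


# Made-up rows standing in for a few columns of cdr_details.txt.
SAMPLE = (
    "Read_name\tIn-frame\tHas_stop_codon\n"
    "read_a\t1\t0\n"
    "read_b\t1\t1\n"
    "read_c\t0\t0\n"
)

names = productive_reads(io.StringIO(SAMPLE))
print(names)  # ['read_a']
```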
shm_details.txt contains a list of SHMs that are consecutively written for each sequence from cleaned_sequences.fasta. Records in shm_details.txt are consistently ordered with respect to cleaned_sequences.fasta.
SHMs are reported separately for each sequence and V / J hit. SHMs in different hits are separated by a line containing information about name and length of sequence, name and length of gene, type of segment (V / J) and chain (IGH / IGK / IGL / TRA / TRB / TRG / TRD):
Read_name:1_merged_read Read_length:354 Gene_name:IGHV3-20*01 Gene_length:296 Segment:V Chain_type:IGH
For a given hit, SHMs are written in order of increasing position. Each line corresponds to a single SHM and contains the following fields:
SHM_type — type of SHM.
Diversity Analyzer distinguishes three possible types of SHMs: substitution (S),
insertion (I) and deletion (D).
Please note that Diversity Analyzer does not join consecutive deletions or insertions and reports each inserted or deleted nucleotide as a separate SHM.
E.g., for the alignment of ACGTATC and AC---TC, three SHMs will be reported.
Read_pos, Gene_pos — position of SHM on read and gene, respectively.
Please note that positions are 1-based.
Read_nucl, Gene_nucl — nucleotide corresponding to SHM on read and gene, respectively.
If SHM corresponds to deletion, value of Read_nucl field will be '-'.
If SHM corresponds to insertion, value of Gene_nucl field will be '-'.
Read_aa, Gene_aa — amino acid corresponding to SHM on read and gene, respectively.
Is_synonymous — indicator showing whether SHM does not change amino acid.
To_stop_codon — indicator showing whether SHM changes amino acid into stop codon.
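The rule that consecutive indels are not merged can be illustrated with a short sketch. This is not the tool's implementation: the function list_shms is hypothetical, it works on two equal-length gapped alignment rows, and it treats the gapped row AC---TC as the read.

```python
def list_shms(read_aln, gene_aln):
    """List SHMs column by column from two gapped alignment rows (illustrative)."""
    if len(read_aln) != len(gene_aln):
        raise ValueError("alignment rows must have equal length")
    shms = []
    for read_nucl, gene_nucl in zip(read_aln, gene_aln):
        if read_nucl == gene_nucl:
            continue  # matching column, no SHM
        if read_nucl == "-":
            shm_type = "D"  # nucleotide missing from the read
        elif gene_nucl == "-":
            shm_type = "I"  # extra nucleotide in the read
        else:
            shm_type = "S"  # substitution
        # Every column is reported on its own, so a gap of length 3
        # yields three SHMs rather than one.
        shms.append((shm_type, read_nucl, gene_nucl))
    return shms


shms = list_shms("AC---TC", "ACGTATC")
print(len(shms))  # 3
print(shms)       # [('D', '-', 'G'), ('D', '-', 'T'), ('D', '-', 'A')]
```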
SHM_type Read_pos Gene_pos Read_nucl Gene_nucl Read_aa Gene_aa Is_synonymous To_stop_codon
Read_name:1_merged_read Read_length:354 Gene_name:IGHV3-20*01 Gene_length:296 Segment:V Chain_type:IGH
S 20 20 C T S S 1 0
S 29 29 C T G G 1 0
S 35 35 C A V V 1 0
S 37 37 A G Q R 0 0
S 45 45 A G R G 0 0
Read_name:1_merged_read Read_length:354 Gene_name:IGHJ3*02 Gene_length:50 Segment:J Chain_type:IGH
S 30 335 C A T T 1 0
S 32 337 C T T M 0 0
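A file in this layout can be read by treating every line that starts with Read_name: as the header of a new hit and every other line as one SHM. The parser below is a sketch under the assumption that fields are separated by whitespace and that every SHM line has all nine fields; parse_shm_details is a hypothetical helper, and the sample rows are copied from the example above.

```python
from collections import namedtuple

SHM = namedtuple(
    "SHM",
    "shm_type read_pos gene_pos read_nucl gene_nucl "
    "read_aa gene_aa is_synonymous to_stop_codon",
)

SAMPLE = """\
SHM_type Read_pos Gene_pos Read_nucl Gene_nucl Read_aa Gene_aa Is_synonymous To_stop_codon
Read_name:1_merged_read Read_length:354 Gene_name:IGHV3-20*01 Gene_length:296 Segment:V Chain_type:IGH
S 20 20 C T S S 1 0
S 37 37 A G Q R 0 0
Read_name:1_merged_read Read_length:354 Gene_name:IGHJ3*02 Gene_length:50 Segment:J Chain_type:IGH
S 32 337 C T T M 0 0
"""


def parse_shm_details(lines):
    """Group SHM lines under the hit header that precedes them."""
    hits = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] == "SHM_type":
            continue  # skip blank lines and the column header
        if fields[0].startswith("Read_name:"):
            # Header fields look like "Key:value"; split on the first colon only.
            header = dict(field.split(":", 1) for field in fields)
            hits.append((header, []))
        else:
            t, rp, gp, rn, gn, ra, ga, syn, stop = fields
            hits[-1][1].append(
                SHM(t, int(rp), int(gp), rn, gn, ra, ga, syn == "1", stop == "1")
            )
    return hits


hits = parse_shm_details(SAMPLE.splitlines())
for header, shms in hits:
    non_synonymous = sum(1 for shm in shms if not shm.is_synonymous)
    print(header["Gene_name"], header["Segment"], len(shms), non_synonymous)
```

Keeping the header as a dictionary makes it easy to filter hits by Segment or Chain_type before counting SHMs.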
>INDEX:1|READ:1_merged_read|START_POS:0|END_POS:49
CAGGTGCAGCTGGTGGAGTCTGGGGGAGGTGTGGTACGGCCTG-GGGGTC
>INDEX:1|GENE:IGHV3-20*01|START_POS:0|END_POS:50|CHAIN_TYPE:IGH
CAGGTGCAGCTGGTGGAGTCTGGGGGAGGTGTGGTACGGCCTGGGGGGTC
Please note that the start position (START_POS field in a header) and the end position (END_POS field) are inclusive.
IgScout takes CDR3s in FASTA format as an input and reports inferred sequences of D genes in FASTA format.
To run IgScout, type:
python igscout.py [options] -i <CDR3_sequences> -o <output_dir>
-i <CDR3_sequences>
-o / --output <output_dir>
-k <int>
-v <V_genes.fasta>
-j <J_genes.fasta>
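Since IgScout expects CDR3s in FASTA format, CDR3 sequences held in another form (for example, values of the CDR3_nucls column) first need to be written as FASTA records. The helper below is a hypothetical sketch of that conversion; the read names and sequences are made up.

```python
import io


def write_cdr3_fasta(records, handle):
    """Write (name, CDR3) pairs as FASTA records, skipping empty CDR3s."""
    written = 0
    for name, cdr3 in records:
        if not cdr3:
            continue  # nothing to write for this read
        handle.write(f">{name}\n{cdr3}\n")
        written += 1
    return written


# Made-up records; in practice handle would be a file opened for writing.
buffer = io.StringIO()
count = write_cdr3_fasta([("read_a", "GCGAGAGATTAC"), ("read_b", "")], buffer)
print(count)
print(buffer.getvalue(), end="")
```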
python igscout_analyzer.py <inferred_D_genes.fasta> <known_D_genes.fasta> <output_dir>
IgScout Analyzer aligns inferred D genes to known D genes and classifies high-quality alignments into 5 categories:
Tandem CDR3 Finder takes CDR3s and D genes in FASTA format as an input and computes usage of D genes in regular (single) and tandem CDR3s.
To run Tandem CDR3 Finder, type:
python igscout.py <D_genes> <CDR3_sequences> <output_dir>
Note: we recommend ordering D genes according to their positions in the genome: from V genes to J genes.
If D genes are not ordered, the tool will still be able to compute tandem CDR3s, but the tandem matrix will not be accurate.
Ordered human D genes can be found in immunotools_dir/data/ordered_d_genes/human_IGHD.fa.