Immunotools package includes:
./prepare_cfg
and:
make
./diversity_analyzer.py --test
If the installation is successful, you will find the following information at the end of the log:
Thank you for using Diversity Analyzer!
Log was written to <your_installation_dir>/divan_test/ig_repertoire_constructor.log
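The check described above can be automated. The following is a minimal sketch, assuming only that the success message quoted above appears near the end of the log; the helper name `run_succeeded` and the `tail_lines` parameter are hypothetical and not part of the package.

```python
# Message that the manual says appears at the end of a successful run.
SUCCESS_LINE = "Thank you for using Diversity Analyzer!"


def run_succeeded(log_text, tail_lines=5):
    """Return True if the success message is among the last few log lines.

    `tail_lines` is an illustrative choice, not a value taken from the tool.
    """
    tail = log_text.splitlines()[-tail_lines:]
    return any(SUCCESS_LINE in line for line in tail)


example_log = (
    "...\n"
    "Thank you for using Diversity Analyzer!\n"
    "Log was written to <your_installation_dir>/divan_test/ig_repertoire_constructor.log\n"
)
print(run_succeeded(example_log))
```

For a real run, the log text can be read with `pathlib.Path(log_path).read_text()` and passed to the same function.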
Diversity Analyzer takes full-length immunosequencing reads or a constructed repertoire in FASTQ/FASTA format as input and analyzes diversity characteristics: VJ combinations, CDRs and SHMs.
To run Diversity Analyzer, type:
./diversity_analyzer.py [options] -i <input_sequences> -o <output_dir>
-i <input_sequences>
-o / --output <output_dir>
-t / --threads <int>: 16.
--test
./diversity_analyzer.py -i test_dataset/merged_reads.fastq -o divan_test
--help
--domain <str>: imgt or kabat. Default value is imgt.
-l / --loci <str>: IGH / IGL / IGK / IG (for all BCRs) / TRA / TRB / TRG / TRD / TR (for all TCRs) or all. Default value is IG.
--organism <str>: human. Diversity Analyzer will later be extended to mouse, pig, rabbit, rat and rhesus_monkey. Default value is human.
--skip-plots
To run Diversity Analyzer on input_reads.fastq, type:
./diversity_analyzer.py -i input_reads.fastq -o divan_test
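When runs are scripted, the command line can be assembled programmatically. The sketch below uses only the options listed in this manual; the function `build_command` and its parameter names are hypothetical helpers, and the example value of 8 threads is arbitrary.

```python
import shlex


def build_command(input_path, output_dir, threads=None, domain=None,
                  loci=None, organism=None, skip_plots=False):
    """Build an argument list for diversity_analyzer.py (illustrative helper)."""
    cmd = ["./diversity_analyzer.py", "-i", input_path, "-o", output_dir]
    if threads is not None:
        cmd += ["-t", str(threads)]
    if domain is not None:
        # The manual lists imgt and kabat as the accepted values.
        if domain not in ("imgt", "kabat"):
            raise ValueError("domain must be 'imgt' or 'kabat'")
        cmd += ["--domain", domain]
    if loci is not None:
        cmd += ["-l", loci]
    if organism is not None:
        cmd += ["--organism", organism]
    if skip_plots:
        cmd.append("--skip-plots")
    return cmd


cmd = build_command("input_reads.fastq", "divan_test",
                    threads=8, loci="IGH", skip_plots=True)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the installation directory; building a list rather than a single string avoids shell quoting issues with unusual file names.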
Diversity Analyzer creates the output directory (specified by -o) and outputs the following files there:
--skip-plots
.
cdr_details.txt contains the following columns:
Read_name: name of the read from the cleaned_sequences.fasta file. Rows in cdr_details.txt and sequences in cleaned_sequences.fasta are consistently ordered.
Chain_type: chain type of the sequence: IGH / IGK / IGL / TRA / TRB / TRD or TRG.
V_hit, J_hit: names of the V and J gene segments that provide the best alignments.
AA_seq: amino acid sequence.
Has_stop_codon: indicator of a stop codon in the sequence: 1 if the sequence contains a stop codon, 0 if it does not.
In-frame: indicator showing whether the sequence is in-frame.
Productive: indicator of sequence productiveness. A sequence is considered productive if it is in-frame and does not contain stop codons.
CDR1_nucls, CDR1_start, CDR1_end: nucleotide sequence, start and end positions of CDR1.
CDR2_nucls, CDR2_start, CDR2_end: nucleotide sequence, start and end positions of CDR2.
CDR3_nucls, CDR3_start, CDR3_end: nucleotide sequence, start and end positions of CDR3.
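The productivity rule above (in-frame and no stop codons) can be checked with a few lines of Python. This is a sketch under stated assumptions: the file is taken to be tab-separated with a header row, and In-frame and Productive are taken to be encoded as 1/0 like Has_stop_codon; neither is confirmed by this manual. The sample rows are made up for illustration.

```python
import csv
import io

# Made-up rows using a subset of the documented column names.
# Tab separation and 1/0 encoding of In-frame/Productive are assumptions.
SAMPLE = (
    "Read_name\tChain_type\tHas_stop_codon\tIn-frame\tProductive\n"
    "read_1\tIGH\t0\t1\t1\n"
    "read_2\tIGH\t1\t1\t0\n"
    "read_3\tIGK\t0\t0\t0\n"
)


def is_productive(row):
    """Apply the documented rule: in-frame and no stop codon."""
    return row["In-frame"] == "1" and row["Has_stop_codon"] == "0"


rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))
productive = [row["Read_name"] for row in rows if is_productive(row)]
print(productive)
```

To process a real file, replace `io.StringIO(SAMPLE)` with an open file handle for cdr_details.txt, after confirming the delimiter.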
shm_details.txt contains a list of SHMs written consecutively for each sequence from cleaned_sequences.fasta. Records in shm_details.txt are consistently ordered with respect to cleaned_sequences.fasta.
SHMs are reported separately for each sequence and V / J hit. SHMs in different hits are separated by a line containing the name and length of the sequence, the name and length of the gene, the segment type (V / J) and the chain type (IGH / IGK / IGL / TRA / TRB / TRG / TRD):
Read_name:1_merged_read Read_length:354 Gene_name:IGHV3-20*01 Gene_length:296 Segment:V Chain_type:IGH
For a given hit, SHMs are written in order of increasing position. Each line corresponds to a single SHM and contains the following fields:
SHM_type: type of SHM. Diversity Analyzer distinguishes three possible types of SHMs: substitution (S), insertion (I) and deletion (D). Please note that Diversity Analyzer does not join consecutive deletions or insertions and reports each of them as a separate SHM. E.g., for the alignment ACGTATC & AC---TC, three SHMs will be reported.
Read_pos, Gene_pos: position of the SHM on the read and the gene, respectively. Please note that positions are 1-based.
Read_nucl, Gene_nucl: nucleotide corresponding to the SHM on the read and the gene, respectively. If the SHM is a deletion, the value of the Read_nucl field is '-'. If the SHM is an insertion, the value of the Gene_nucl field is '-'.
Read_aa, Gene_aa: amino acid corresponding to the SHM on the read and the gene, respectively.
Is_synonymous: indicator showing whether the SHM leaves the amino acid unchanged.
To_stop_codon: indicator showing whether the SHM changes the amino acid into a stop codon.
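The rule that consecutive gaps are not merged can be illustrated with a short sketch. It is not the tool's implementation: the function `list_shms` is hypothetical, it assumes the two rows of an alignment are given as equal-length strings, and it treats the gapped string AC---TC as the read, following the rule that Read_nucl is '-' for a deletion. Positions are omitted because this manual does not say how they are assigned to gap columns.

```python
def list_shms(read_aln, gene_aln):
    """List (SHM_type, Read_nucl, Gene_nucl) for every differing column.

    Each gap column becomes its own record, so runs of gaps are not merged.
    """
    if len(read_aln) != len(gene_aln):
        raise ValueError("aligned sequences must have equal length")
    shms = []
    for read_nucl, gene_nucl in zip(read_aln, gene_aln):
        if read_nucl == gene_nucl:
            continue  # matching column, no SHM
        if read_nucl == "-":
            shms.append(("D", read_nucl, gene_nucl))  # deletion in the read
        elif gene_nucl == "-":
            shms.append(("I", read_nucl, gene_nucl))  # insertion in the read
        else:
            shms.append(("S", read_nucl, gene_nucl))  # substitution
    return shms


shms = list_shms("AC---TC", "ACGTATC")
for shm in shms:
    print(*shm)
```

For the alignment from the example above, this yields three separate deletion records, one per gap column.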
SHM_type Read_pos Gene_pos Read_nucl Gene_nucl Read_aa Gene_aa Is_synonymous To_stop_codon
Read_name:1_merged_read Read_length:354 Gene_name:IGHV3-20*01 Gene_length:296 Segment:V Chain_type:IGH
S 20 20 C T S S 1 0
S 29 29 C T G G 1 0
S 35 35 C A V V 1 0
S 37 37 A G Q R 0 0
S 45 45 A G R G 0 0
Read_name:1_merged_read Read_length:354 Gene_name:IGHJ3*02 Gene_length:50 Segment:J Chain_type:IGH
S 30 335 C A T T 1 0
S 32 337 C T T M 0 0
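A file in the layout shown above can be grouped by hit with a small parser. The sketch below assumes whitespace-separated fields and that every hit header starts with `Read_name:`, as in the example; the function `parse_shm_details` is a hypothetical helper, and all values are kept as strings.

```python
SHM_FIELDS = ["SHM_type", "Read_pos", "Gene_pos", "Read_nucl", "Gene_nucl",
              "Read_aa", "Gene_aa", "Is_synonymous", "To_stop_codon"]

# The example records from this manual.
EXAMPLE = """\
SHM_type Read_pos Gene_pos Read_nucl Gene_nucl Read_aa Gene_aa Is_synonymous To_stop_codon
Read_name:1_merged_read Read_length:354 Gene_name:IGHV3-20*01 Gene_length:296 Segment:V Chain_type:IGH
S 20 20 C T S S 1 0
S 29 29 C T G G 1 0
S 35 35 C A V V 1 0
S 37 37 A G Q R 0 0
S 45 45 A G R G 0 0
Read_name:1_merged_read Read_length:354 Gene_name:IGHJ3*02 Gene_length:50 Segment:J Chain_type:IGH
S 30 335 C A T T 1 0
S 32 337 C T T M 0 0
"""


def parse_shm_details(lines):
    """Group SHM records under the hit header that precedes them."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("SHM_type"):
            continue  # skip blank lines and the column header
        if line.startswith("Read_name:"):
            # Header tokens look like Key:Value; split on the first colon only.
            header = dict(token.partition(":")[::2] for token in line.split())
            hits.append({"header": header, "shms": []})
        else:
            if not hits:
                raise ValueError("SHM record found before any hit header")
            hits[-1]["shms"].append(dict(zip(SHM_FIELDS, line.split())))
    return hits


hits = parse_shm_details(EXAMPLE.splitlines())
for hit in hits:
    nonsyn = sum(shm["Is_synonymous"] == "0" for shm in hit["shms"])
    print(hit["header"]["Gene_name"], len(hit["shms"]), nonsyn)
```

Keeping the header as a dictionary makes it easy to filter hits by Segment or Chain_type before counting SHMs.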
>INDEX:1|READ:1_merged_read|START_POS:0|END_POS:49
CAGGTGCAGCTGGTGGAGTCTGGGGGAGGTGTGGTACGGCCTG-GGGGTC
>INDEX:1|GENE:IGHV3-20*01|START_POS:0|END_POS:50|CHAIN_TYPE:IGH
CAGGTGCAGCTGGTGGAGTCTGGGGGAGGTGTGGTACGGCCTGGGGGGTC
Please note that the start position (START_POS field in a header) and the end position (END_POS field) are inclusive.
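The header fields shown above can be extracted with simple string splitting. This is a minimal sketch, assuming fields are separated by `|` and each field has the form KEY:VALUE, as in the two example headers; `parse_alignment_header` is a hypothetical helper name.

```python
def parse_alignment_header(header):
    """Split a header such as '>INDEX:1|READ:name|START_POS:0|END_POS:49'."""
    fields = {}
    for token in header.lstrip(">").strip().split("|"):
        # Split on the first colon only, so values may contain other symbols.
        key, _, value = token.partition(":")
        fields[key] = value
    # Numeric fields are converted for convenience.
    for key in ("INDEX", "START_POS", "END_POS"):
        fields[key] = int(fields[key])
    return fields


read = parse_alignment_header(
    ">INDEX:1|READ:1_merged_read|START_POS:0|END_POS:49")
gene = parse_alignment_header(
    ">INDEX:1|GENE:IGHV3-20*01|START_POS:0|END_POS:50|CHAIN_TYPE:IGH")
print(read["READ"], read["START_POS"], read["END_POS"])
print(gene["GENE"], gene["CHAIN_TYPE"])
```

Read and gene records that belong to the same alignment share the same INDEX value, so it can serve as a key when pairing them.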
IgScout takes CDR3s in FASTA format as input and reports inferred sequences of D genes in FASTA format.
To run IgScout, type:
python igscout.py [options] -i <CDR3_sequences> -o <output_dir>
-i <CDR3_sequences>
-o / --output <output_dir>
-k <int>
-v <V_genes.fasta>
-j <J_genes.fasta>
python igscout_analyzer.py <inferred_D_genes.fasta> <known_D_genes.fasta> <output_dir>
IgScout Analyzer aligns inferred D genes to known D genes and classifies high-quality alignments into 5 categories:
Tandem CDR3 Finder takes CDR3s and D genes in FASTA format as input and computes the usage of D genes in regular (single) and tandem CDR3s.
To run Tandem CDR3 Finder, type:
python igscout.py <D_genes> <CDR3_sequences> <output_dir>
Note: we recommend ordering D genes according to their order in the genome: from V genes to J genes.
If D genes are not ordered, the tool will still be able to compute tandem CDR3s, but the tandem matrix will not be accurate.
Ordered human D genes can be found in immunotools_dir/data/ordered_d_genes/human_IGHD.fa.