ImmunoTools 1.0 manual

1. What are ImmunoTools?
2. Installation
    2.1. Verifying your installation
3. Diversity Analyzer
    3.1. Basic options
    3.2. Advanced options
    3.3. Examples
    3.4. Output files
    3.5. Output file formats
      3.5.1. CDR details file
      3.5.2. SHM details file
      3.5.3. V alignments file
4. IgScout
    4.1. IgScout options
    4.2. IgScout Analyzer
5. Tandem CDR3 Finder
6. Feedback and bug reports

1. What are ImmunoTools?

The ImmunoTools package includes:

DiversityAnalyzer, a tool for diversity analysis of adaptive immune repertoires. It takes full-length immunosequencing reads or a constructed repertoire as input and performs the following steps:

Alignment of input sequences against germline V and J segments
CDR labeling of aligned reads
Computation of SHM positions
Computation of diversity indices for computed CDRs
Visualization of computed diversity statistics

IgScout, a tool for de novo inference of diversity (IGHD) genes from immunoglobulin CDR3s.
Tandem CDR3 Finder, a tool for finding CDR3s with tandem fusion of D-D genes.

2. Installation

64-bit Linux or MacOS system
g++ (version 4.7 or higher) or clang compiler
cmake (version 2.8.8 or higher)
Python 2 (version 2.7 or higher), including:

    
    ./prepare_cfg

    
    make

2.1. Verifying your installation


    ./diversity_analyzer.py --test

    
    Thank you for using Diversity Analyzer!
    Log was written to <your_installation_dir>/divan_test/ig_repertoire_constructor.log

3. Diversity Analyzer

Diversity Analyzer takes full-length immunosequencing reads or a constructed repertoire in FASTQ/FASTA format as input and analyzes its diversity characteristics: VJ combinations, CDRs, and SHMs.

    
    ./diversity_analyzer.py [options] -i <input_sequences> -o <output_dir>

3.1. Basic options

-i <input_sequences>

-o / --output <output_dir>

-t / --threads <int>

Default: 16.

--test

    
    ./diversity_analyzer.py -i test_dataset/merged_reads.fastq -o divan_test

--help

3.2. Advanced options

--domain <str>

Possible values: imgt, kabat. Default: imgt.

-l / --loci <str>

Possible values: IGH, IGL, IGK, IG, TRA, TRB, TRG, TRD, TR, all. Default: IG.

--organism <str>

Possible values: human, mouse, pig, rabbit, rat, rhesus_monkey. Default: human.

--skip-plots

3.3. Examples

input_reads.fastq

    
    ./diversity_analyzer.py -i input_reads.fastq -o divan_test
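
For scripted runs, the same command line can be assembled programmatically. The following is a minimal Python 3 sketch, not part of ImmunoTools: the helper name build_divan_command is hypothetical, it assumes the script is launched from the installation directory, and it uses only options listed in sections 3.1 and 3.2.

```python
import subprocess


def build_divan_command(input_path, output_dir, threads=None, loci=None,
                        organism=None, domain=None, skip_plots=False):
    """Assemble a diversity_analyzer.py command line as a list of arguments."""
    cmd = ["./diversity_analyzer.py", "-i", input_path, "-o", output_dir]
    if threads is not None:
        cmd += ["-t", str(threads)]
    if loci is not None:
        cmd += ["--loci", loci]
    if organism is not None:
        cmd += ["--organism", organism]
    if domain is not None:
        cmd += ["--domain", domain]
    if skip_plots:
        cmd.append("--skip-plots")
    return cmd


cmd = build_divan_command("input_reads.fastq", "divan_test",
                          loci="IGH", organism="human", skip_plots=True)
print(" ".join(cmd))
# To actually launch the tool (requires a working installation):
# subprocess.run(cmd, check=True)
```

Options left as None are omitted, so the tool falls back to its own defaults.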

3.4. Output files

-o

cleaned_sequences.fasta — input sequences that align well against the V and J germline database. Diversity Analyzer also crops each input sequence at the start of its V segment and orients it in the V(D)J direction.
cdr_details.txt — detailed information about CDR labeling of sequences from cleaned_sequences.fasta. The file format is described in section 3.5.1.
shm_details.txt — detailed information about SHM labeling of sequences from cleaned_sequences.fasta. The file format is described in section 3.5.2.
v_alignment.fasta — alignment of sequences from cleaned_sequences.fasta against V segments in FASTA format. The file format is described in section 3.5.3.

cdr1s.fasta — FASTA file with all computed CDR1 sequences.
cdr2s.fasta — FASTA file with all computed CDR2 sequences.
cdr3s.fasta — FASTA file with all computed CDR3 sequences.
compressed_cdr3s.fasta — FASTA file with unique CDR3 sequences. Abundances of unique CDR3 sequences are specified in header lines.
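
As an illustration of how the CDR3 files can be post-processed, the sketch below counts unique CDR3 sequences and derives two common diversity indices from them. It is an independent Python 3 example: which indices Diversity Analyzer itself reports is not specified here, so the choice of Shannon and Simpson indices is an assumption, and the function names and toy records are hypothetical.

```python
import math
from collections import Counter


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def cdr3_diversity(sequences):
    """Return (abundances, Shannon index, Simpson index) for CDR3 sequences."""
    counts = Counter(sequences)
    total = sum(counts.values())
    freqs = [n / total for n in counts.values()]
    shannon = -sum(p * math.log(p) for p in freqs)  # natural logarithm
    simpson = sum(p * p for p in freqs)
    return counts, shannon, simpson


# Hypothetical toy records standing in for the contents of cdr3s.fasta.
toy = [">r1", "AAA", ">r2", "AAA", ">r3", "CCC", ">r4", "GGG"]
counts, shannon, simpson = cdr3_diversity(seq for _, seq in read_fasta(toy))
```

For real output, pass an open file handle for cdr3s.fasta to read_fasta instead of the toy list.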

annotation_report.html — summary report in HTML format.
plots — directory containing plots with diversity statistics. The plots directory is not created if the --skip-plots option is specified.

diversity_analyzer.log — a full log of Diversity Analyzer tool.

3.5. Output file formats

3.5.1. CDR details file

cdr_details.txt contains the following columns:

Read_name — names of reads from cleaned_sequences.fasta file. Rows in cdr_details.txt and sequences in cleaned_sequences.fasta are consistently ordered.
Chain_type — type of chain of a sequence: IGH / IGK / IGL / TRA / TRB / TRD or TRG.
V_hit, J_hit — names of V and J gene segments that provide the best alignments.
AA_seq — amino acid sequence.
Has_stop_codon — indicator of the presence of a stop codon in a sequence: 1 if the sequence contains a stop codon, 0 if it does not.
In-frame — indicator showing whether a sequence is in-frame.
Productive — indicator of sequence productivity. A sequence is considered productive if it is in-frame and does not contain stop codons.
CDR1_nucls, CDR1_start, CDR1_end — nucleotide sequence, start and end positions of CDR1.
CDR2_nucls, CDR2_start, CDR2_end — nucleotide sequence, start and end positions of CDR2.
CDR3_nucls, CDR3_start, CDR3_end — nucleotide sequence, start and end positions of CDR3.
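
The columns above can be read with Python's csv module. The sketch below is a Python 3 illustration under two assumptions that this manual does not confirm: the file is tab-separated with a header row, and the In-frame indicator uses 1 / 0 like Has_stop_codon. The sample rows are hypothetical and show only a subset of the columns.

```python
import csv
import io


def read_cdr_details(handle):
    """Return a list of dicts, one per row of a cdr_details.txt-like table."""
    # Assumption: tab-separated values with a header row.
    return list(csv.DictReader(handle, delimiter="\t"))


def is_productive(row):
    """Apply the productivity rule: in-frame and free of stop codons."""
    # Assumption: both indicators are written as "1" / "0".
    return row["In-frame"] == "1" and row["Has_stop_codon"] == "0"


# Hypothetical two-row sample; real files carry all the columns listed above.
sample = io.StringIO(
    "Read_name\tChain_type\tHas_stop_codon\tIn-frame\n"
    "read_1\tIGH\t0\t1\n"
    "read_2\tIGK\t1\t1\n"
)
rows = read_cdr_details(sample)
productive = [r["Read_name"] for r in rows if is_productive(r)]
print(productive)
```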

3.5.2. SHM details file

shm_details.txt lists the SHMs of each sequence from cleaned_sequences.fasta in turn. Records in shm_details.txt follow the order of sequences in cleaned_sequences.fasta.

SHMs are reported separately for each sequence and each V / J hit. The SHMs of each hit are preceded by a line giving the name and length of the sequence, the name and length of the gene, the type of segment (V / J) and the chain (IGH / IGK / IGL / TRA / TRB / TRG / TRD):

    
  Read_name:1_merged_read     Read_length:354     Gene_name:IGHV3-20*01   Gene_length:296     Segment:V   Chain_type:IGH

For a given hit, SHMs are written in order of increasing position. Each line corresponds to a single SHM and contains the following fields:

SHM_type — type of SHM. Diversity Analyzer distinguishes three types of SHMs: substitution (S), insertion (I) and deletion (D). Note that Diversity Analyzer does not merge consecutive deletions or insertions: each gapped position is reported as a separate SHM. E.g., for the alignment ACGTATC & AC---TC, three SHMs are reported.
Read_pos, Gene_pos — position of the SHM on the read and the gene, respectively. Positions are 1-based.
Read_nucl, Gene_nucl — nucleotide corresponding to the SHM on the read and the gene, respectively. For a deletion, the value of Read_nucl is '-'. For an insertion, the value of Gene_nucl is '-'.
Read_aa, Gene_aa — amino acid corresponding to the SHM on the read and the gene, respectively.
Is_synonymous — indicator showing whether the SHM is synonymous, i.e. does not change the amino acid.
To_stop_codon — indicator showing whether the SHM changes the amino acid into a stop codon.
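
The per-position reporting described for SHM_type can be sketched as a walk over the columns of a pairwise alignment. This is a Python 3 illustration, not the tool's implementation. It assumes that the first row of the ACGTATC & AC---TC example is the gene and the second is the read. The position recorded on the gapped side (the last nucleotide before the gap) is also an assumption.

```python
def list_shms(gene_aln, read_aln):
    """List per-column SHMs from two gapped, equal-length alignment rows.

    Returns (type, read_pos, gene_pos, read_nucl, gene_nucl) tuples with
    1-based positions; consecutive gap columns are NOT merged.
    """
    if len(gene_aln) != len(read_aln):
        raise ValueError("alignment rows must have equal length")
    shms = []
    read_pos = gene_pos = 0
    for g, r in zip(gene_aln, read_aln):
        # Advance each coordinate only when that row has a nucleotide.
        if r != "-":
            read_pos += 1
        if g != "-":
            gene_pos += 1
        if g == r:
            continue
        if r == "-":
            shm_type = "D"  # nucleotide present in gene, missing in read
        elif g == "-":
            shm_type = "I"  # nucleotide present in read, missing in gene
        else:
            shm_type = "S"
        shms.append((shm_type, read_pos, gene_pos, r, g))
    return shms


shms = list_shms("ACGTATC", "AC---TC")
for shm in shms:
    print(shm)
```

Under these assumptions the three gap columns yield three separate deletion records rather than one merged event.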

Example of shm_details.txt:


  SHM_type    Read_pos    Gene_pos    Read_nucl   Gene_nucl   Read_aa     Gene_aa     Is_synonymous   To_stop_codon
  Read_name:1_merged_read     Read_length:354     Gene_name:IGHV3-20*01   Gene_length:296     Segment:V   Chain_type:IGH
  S       20      20      C       T       S       S       1       0
  S       29      29      C       T       G       G       1       0
  S       35      35      C       A       V       V       1       0
  S       37      37      A       G       Q       R       0       0
  S       45      45      A       G       R       G       0       0
  Read_name:1_merged_read     Read_length:354     Gene_name:IGHJ3*02      Gene_length:50      Segment:J   Chain_type:IGH
  S       30      335     C       A       T       T       1       0
  S       32      337     C       T       T       M       0       0
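
A file laid out like the example above can be grouped by hit with a few lines of Python 3. The sketch below assumes whitespace-separated fields, a single column header line starting with SHM_type, and read and gene names without spaces. The function name parse_shm_details is hypothetical.

```python
def parse_shm_details(lines):
    """Group SHM records by hit; returns a list of (hit_info, shms) pairs."""
    columns = None
    hits = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "SHM_type":
            columns = fields  # column header line
        elif fields[0].startswith("Read_name:"):
            # Hit separator line: KEY:VALUE fields.
            hits.append((dict(f.split(":", 1) for f in fields), []))
        else:
            if columns is None or not hits:
                raise ValueError("SHM record before header or hit line")
            hits[-1][1].append(dict(zip(columns, fields)))
    return hits


example = """\
SHM_type Read_pos Gene_pos Read_nucl Gene_nucl Read_aa Gene_aa Is_synonymous To_stop_codon
Read_name:1_merged_read Read_length:354 Gene_name:IGHV3-20*01 Gene_length:296 Segment:V Chain_type:IGH
S 20 20 C T S S 1 0
S 29 29 C T G G 1 0
S 35 35 C A V V 1 0
S 37 37 A G Q R 0 0
S 45 45 A G R G 0 0
Read_name:1_merged_read Read_length:354 Gene_name:IGHJ3*02 Gene_length:50 Segment:J Chain_type:IGH
S 30 335 C A T T 1 0
S 32 337 C T T M 0 0
"""
hits = parse_shm_details(example.splitlines())
for info, shms in hits:
    synonymous = sum(s["Is_synonymous"] == "1" for s in shms)
    print(info["Gene_name"], len(shms), synonymous)
```

All values are kept as strings; convert positions with int() where arithmetic is needed.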

3.5.3. V alignments file

v_alignment.fasta


  >INDEX:1|READ:1_merged_read|START_POS:0|END_POS:49
  CAGGTGCAGCTGGTGGAGTCTGGGGGAGGTGTGGTACGGCCTG-GGGGTC
  >INDEX:1|GENE:IGHV3-20*01|START_POS:0|END_POS:50|CHAIN_TYPE:IGH
  CAGGTGCAGCTGGTGGAGTCTGGGGGAGGTGTGGTACGGCCTGGGGGGTC

START_POS

END_POS
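
The record layout shown in the example can be read pairwise. The Python 3 sketch below assumes that each alignment occupies exactly four lines (read header, read row, gene header, gene row), that header fields are separated by '|' and written as KEY:VALUE, and that rows are not wrapped. The function names are hypothetical.

```python
def parse_header(header):
    """Split '>KEY:VALUE|KEY:VALUE' into a dict (values kept as strings)."""
    return dict(field.split(":", 1) for field in header.lstrip(">").split("|"))


def read_alignment_pairs(lines):
    """Yield (read_info, read_row, gene_info, gene_row) for each alignment."""
    records = [line.strip() for line in lines if line.strip()]
    # Assumption: four lines per alignment, each row on a single line.
    for i in range(0, len(records) - 3, 4):
        yield (parse_header(records[i]), records[i + 1],
               parse_header(records[i + 2]), records[i + 3])


def count_differences(read_row, gene_row):
    """Count alignment columns where the read and the gene differ."""
    return sum(1 for r, g in zip(read_row, gene_row) if r != g)


example = [
    ">INDEX:1|READ:1_merged_read|START_POS:0|END_POS:49",
    "CAGGTGCAGCTGGTGGAGTCTGGGGGAGGTGTGGTACGGCCTG-GGGGTC",
    ">INDEX:1|GENE:IGHV3-20*01|START_POS:0|END_POS:50|CHAIN_TYPE:IGH",
    "CAGGTGCAGCTGGTGGAGTCTGGGGGAGGTGTGGTACGGCCTGGGGGGTC",
]
pairs = list(read_alignment_pairs(example))
read_info, read_row, gene_info, gene_row = pairs[0]
diffs = count_differences(read_row, gene_row)
print(read_info["READ"], gene_info["GENE"], diffs)
```

In the example pair the only differing column is the gap in the read row.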

4. IgScout

IgScout takes CDR3s in FASTA format as an input and reports inferred sequences of D genes in FASTA format.

    
    python igscout.py [options] -i <CDR3_sequences> -o <output_dir>

4.1. IgScout options

-i <CDR3_sequences>

-o / --output <output_dir>

-k <int>

-v <V_genes.fasta>

-j <J_genes.fasta>

Note

4.2. IgScout Analyzer

    
    python igscout_analyzer.py <inferred_D_genes.fasta> <known_D_genes.fasta> <output_dir>

A known gene is a substring of a known D gene.
An inaccurate gene is a concatenation of a substring of a known D gene and 1-2 nucleotides at the start or the end of the segment. Typically, inaccurate inferences result from highly frequent non-genomic or palindromic insertions.
An ambiguous allele is a substring of several alleles of the same D gene.
An ambiguous gene is a substring of several known D genes.
A novel allele is a novel allele of a known D gene.
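
The substring-based categories above can be illustrated with a small Python 3 sketch. It is not the IgScout Analyzer implementation: it covers only the known gene, ambiguous allele and ambiguous gene cases, uses exact substring search, and assumes allele names of the form GENE*NN. The function name and the toy sequences are hypothetical.

```python
def classify_inferred(segment, known_alleles):
    """Classify an inferred D segment by exact substring search.

    known_alleles maps allele names (e.g. 'D1*01') to sequences; the gene
    name is assumed to be the part of the allele name before '*'.
    """
    hits = [name for name, seq in known_alleles.items() if segment in seq]
    genes = {name.split("*")[0] for name in hits}
    if not hits:
        # Inaccurate genes, novel alleles and novel genes are not handled here.
        return "unclassified"
    if len(hits) == 1:
        return "known gene"
    if len(genes) == 1:
        return "ambiguous allele"
    return "ambiguous gene"


# Hypothetical toy alleles, not real IGHD sequences.
known = {"D1*01": "ACGTACGT", "D1*02": "ACGTACGA", "D2*01": "TTGACGTT"}
labels = {s: classify_inferred(s, known)
          for s in ["GTACGT", "CGTAC", "ACGT", "GGGG"]}
print(labels)
```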

novel gene

segment_alignment.txt

segment_alignment.fasta

5. Tandem CDR3 Finder

Tandem CDR3 Finder takes CDR3s and D genes in FASTA format as an input and computes usage of D genes in regular (single) and tandem CDR3s.

    
    python igscout.py <D_genes> <CDR3_sequences> <output_dir>

Note:

immunotools_dir/data/ordered_d_genes/human_IGHD.fa

A single CDR3 contains a single match of a D gene.
A tandem CDR3 contains two non-overlapping matches of D genes.
A non-traceable CDR3 does not contain any D gene match.
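
The three categories can be illustrated with a simplified Python 3 sketch. How Tandem CDR3 Finder actually defines a D gene match is not described here, so the sketch makes its own assumptions: a match is an exact occurrence of the full D gene sequence, non-overlapping matches are selected greedily by earliest end, and two or more selected matches count as tandem. The function name and the toy sequences are hypothetical.

```python
def classify_cdr3(cdr3, d_genes):
    """Label a CDR3 as 'single', 'tandem' or 'non-traceable'.

    d_genes maps gene names to sequences. Simplification: a match is an
    exact occurrence of the full D gene sequence inside the CDR3.
    """
    matches = []
    for name, seq in d_genes.items():
        start = cdr3.find(seq)
        while start != -1:
            matches.append((start, start + len(seq), name))
            start = cdr3.find(seq, start + 1)
    # Greedy selection: keep the matches that end earliest and do not overlap.
    matches.sort(key=lambda m: m[1])
    chosen, last_end = [], 0
    for start, end, name in matches:
        if start >= last_end:
            chosen.append(name)
            last_end = end
    if not chosen:
        return "non-traceable", chosen
    if len(chosen) == 1:
        return "single", chosen
    return "tandem", chosen


# Hypothetical toy D genes and CDR3s, not real sequences.
toy_d = {"D_a": "GGTATA", "D_b": "CTGGGG"}
tandem = classify_cdr3("TGTGCGGGTATAACCTGGGGTTTGAC", toy_d)
single = classify_cdr3("TGTGCGGGTATAACTTTGAC", toy_d)
none = classify_cdr3("TGTGCGAAATTTGAC", toy_d)
print(tandem, single, none)
```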

single_d_usage.txt and single_d_usage.pdf show usage of D genes in single CDR3s.
single_d_usage directory consists of PDF files showing coverage of D genes by single CDR3s.
tandem_cdr3s.txt contains information about detected tandem CDR3s.
tandem_dd_matrix.pdf shows a matrix of D-D fusions. If D genes are ordered, most tandem CDR3s should appear in the upper triangle of the matrix.

6. Feedback and bug reports

Yana Safonova