Computational methods for T cell vaccine target discovery have focused on predicting the binding of human leukocyte antigen to highly conserved peptides identified across pathogen variants. This approach compresses host and pathogen diversity into smaller sets of antigen targets. However, studies have shown that T cell epitopes in highly variable viral pathogens are generally not well conserved, and this diversity may need to be addressed through polyvalent vaccine designs [1]. We have developed a method for antigen assessment and selection for polyvalent vaccines, based on block conservation analysis that identifies immune epitope candidates from a multiple sequence alignment.

Antigen conservation and variability analysis for immunological applications is traditionally performed by calculating sequence similarity from local sequence alignments [2], or by calculating the frequency of nucleotides or amino acids at each position in a multiple sequence alignment (MSA) of homologous pathogen genes or proteins [3]. Regions in which several consecutive residues show high conservation are then further analyzed for immunogenic potential by computational prediction, experimental testing, or a combination thereof.
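
As an illustration of the traditional per-position approach, the following minimal Python sketch computes the relative frequency of each residue in every column of an MSA. The function name column_frequencies and the toy alignment are assumptions made for this example and are not part of the web server.

```python
from collections import Counter

def column_frequencies(msa):
    """Return, for each alignment column, a dict mapping residue -> relative frequency."""
    if not msa:
        return []
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")
    result = []
    for i in range(length):
        # Count the residues observed at column i across all sequences.
        counts = Counter(seq[i] for seq in msa)
        result.append({res: n / len(msa) for res, n in counts.items()})
    return result

# Toy alignment (hypothetical data): four aligned sequences of length 4.
msa = ["MKTA", "MKSA", "MRTA", "MKTA"]
freqs = column_frequencies(msa)
```

Each column is scored independently here, so the sketch says nothing about which residues co-occur within the same peptide.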

A major drawback of vaccine target selection using existing methods is the systematic exclusion of low-frequency variants. The frequency with which a given peptide occurs among pathogen variants does not affect its potential as an immunogen. Potential T cell-mediated immunogenicity is assessed by binding affinity to the human leukocyte antigen (HLA). A systematic analysis of immune epitope diversity involves selecting all epitope targets based on known immunogenic properties (HLA binding or existence of neutralizing antibodies), and then assembling a suitable set of candidates to cover both population and pathogen diversity in a polyvalent vaccine construct. Since HLA recognizes epitopes as peptides rather than as individual amino acids, it is more appropriate to perform conservation analyses of continuous peptides rather than of their individual amino acids (15). Such analysis is performed on suitably sized sliding windows (from here on termed "blocks"), each spanning a fixed range of columns across the rows of sequences in an MSA.
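
The block idea can be sketched in a few lines of Python: slide a window of fixed size along the alignment and, for each window position, count the distinct peptides found across all rows. The function name block_peptide_frequencies and the toy sequences are hypothetical, and gap handling is deliberately left out of this sketch.

```python
from collections import Counter

def block_peptide_frequencies(msa, block_size):
    """For each window start, return a Counter of the peptides seen in that block."""
    if not msa:
        return []
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")
    blocks = []
    for start in range(length - block_size + 1):
        # One peptide per row: the slice of that row covered by the window.
        blocks.append(Counter(seq[start:start + block_size] for seq in msa))
    return blocks

# Toy alignment (hypothetical data): four sequences of length 10, 9-mer blocks.
msa = ["MKTAYIAKQR", "MKTAYIAKQR", "MKTAYIAKQK", "MRTAYIAKQR"]
blocks = block_peptide_frequencies(msa, 9)
```

With this toy input there are two blocks. The first contains two distinct peptides and the second three, including variants that occur only once and would be invisible to a consensus-only analysis.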


1. Paste in a multiple sequence alignment (MSA) of amino acid sequences in the text field or use the "Choose file" button to upload a dataset. The MSA must be in FASTA format. To generate an MSA from a sequence set, web services, such as MAFFT, can be used.

2. Select the sliding window, or block, size for the analysis. Choose a block size of 8-11 amino acids for class I binding analysis.

3. Select the conservation threshold. This threshold is the minimum accumulated frequency of peptides in a block required for the block to be considered adequately covered. Most blocks will contain a large number of very low frequency variants, which can be filtered from the block if desired. For example, a peptide present in only a small fraction of the examined viral proteins may be considered evolutionarily unstable (one or a few occurrences isolated in time and geographic location), and may be excluded from the analysis if desired. For this purpose, a conservation threshold of, for example, 99% can be chosen. The same threshold also acts as an immunological conservation threshold, i.e. the minimum accumulated frequency of HLA binders in a block required for the block to be considered adequately conserved in terms of potential immunological function.

4. Select whether the immunological conservation threshold should filter output - i.e. only blocks in which the minimum accumulated frequency of HLA binders is above the threshold are displayed. This is likely to significantly reduce the size of the output, but will only display regions of highly conserved HLA binding.

5. Select parameters for HLA binding predictions. The binding predictor NetMHC 3.4 has been integrated into this web server and can be used to predict the affinity between a given peptide and user-defined HLAs. When this option is selected, HLA class I binders can be predicted if the selected block size is between 8 and 11.

6. See examples of output.
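
The thresholds in steps 3 and 4 can be illustrated with a minimal Python sketch. It assumes that peptides are accumulated from most to least frequent until the conservation threshold is reached, and that the set of HLA binders is already known; the actual server obtains binders from NetMHC, which is not called here. The function names and the peptide counts are hypothetical.

```python
from collections import Counter

def peptides_to_reach_threshold(block, threshold):
    """Pick peptides, most frequent first, until their accumulated frequency
    reaches the threshold. Returns (selected peptides, accumulated frequency)."""
    total = sum(block.values())
    if total == 0:
        return [], 0.0
    selected, accumulated = [], 0
    for peptide, count in block.most_common():
        if accumulated / total >= threshold:
            break
        selected.append(peptide)
        accumulated += count
    return selected, accumulated / total

def binder_coverage(block, binders):
    """Accumulated frequency of the peptides in the block that are predicted binders."""
    total = sum(block.values())
    if total == 0:
        return 0.0
    return sum(n for pep, n in block.items() if pep in binders) / total

# Hypothetical block: peptide -> number of sequences in which it occurs.
block = Counter({"KTAYIAKQR": 90, "KTAYIAKQK": 9, "RTAYIAKQR": 1})
selected, covered = peptides_to_reach_threshold(block, 0.95)

# Hypothetical binder set standing in for external HLA binding predictions.
hypothetical_binders = {"KTAYIAKQR", "RTAYIAKQR"}
coverage = binder_coverage(block, hypothetical_binders)
passes_filter = coverage >= 0.95
```

In this toy block the two most frequent peptides are enough to reach a 95% conservation threshold, while the accumulated frequency of binders is 91%, so the block would be hidden if the immunological filter of step 4 were enabled at 95%.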

Gap insertions in highly variable protein sequences

When aligning highly variable protein sequences, such as proteins from influenza virus or HIV, MSAs will invariably contain a high proportion of gaps. Gaps, typically denoted by a dash "-", are artifacts of the MSA algorithms and distort the analysis of peptides in blocks derived from MSAs. We applied an algorithm based on short read local alignments to remove gaps from blocks before further analysis where appropriate. If a peptide contained one or more gaps, the gaps were removed, thus shortening the peptide. The peptide was subsequently extended upstream and downstream with the same number of residues as lost by gap deletion, thus creating a number of alternative peptides to be included in the block. If one gap was removed, this operation resulted in two alternative peptides (unless the block was the first or last block in the MSA, in which case only one alternative arose). Basic local alignment [2] based on BLOSUM62 matching [4] was performed between the alternative peptides and the existing peptides in the block. The alternative peptide resulting in the highest identity local alignment, i.e. the one most homologous to the remaining peptides in the block, was included in the block for further analysis. If regions of multiple gaps (we here define multiple as the length of the selected peptide size or more) are found in an MSA, this may signify an indel region. Because indels are true biological features, they are not removed from the MSA: the heightened variability of the corresponding region should be considered when evaluating its potential for vaccine inclusion. Where a block cannot reach the selected conservation threshold because of a large number of indels, no bar is shown in the plot for the corresponding positions.
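
The gap-handling procedure can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the server's implementation: a position-wise identity count stands in for the BLOSUM62-based local alignment, and for peptides with several gaps the lost residues are assumed to be split between the upstream and downstream side in every possible way. All function names and sequences are hypothetical.

```python
def ungapped_alternatives(row, start, size, gap="-"):
    """Return the gap-free alternative peptides for the window row[start:start+size]."""
    window = row[start:start + size]
    core = window.replace(gap, "")
    missing = size - len(core)
    if missing == 0 or missing == size:
        # No gaps, or a window made only of gaps (treated as an indel and kept as is).
        return [window]
    upstream = row[:start].replace(gap, "")
    downstream = row[start + size:].replace(gap, "")
    alternatives = []
    # Assumption: 'up' residues are borrowed upstream and the rest downstream.
    for up in range(missing + 1):
        down = missing - up
        if up <= len(upstream) and down <= len(downstream):
            left = upstream[len(upstream) - up:]
            alternatives.append(left + core + downstream[:down])
    return alternatives

def identity(a, b):
    """Number of positions at which two peptides carry the same residue."""
    return sum(x == y for x, y in zip(a, b))

def best_alternative(alternatives, references):
    """Pick the alternative most similar to the gap-free peptides already in the block.
    Simplification: summed identity replaces BLOSUM62-scored local alignment."""
    return max(alternatives, key=lambda alt: sum(identity(alt, ref) for ref in references))

# Hypothetical aligned row with one gap inside the 5-residue window starting at column 2.
alternatives = ungapped_alternatives("ACD-FGHIK", 2, 5)
chosen = best_alternative(alternatives, ["DEFGH"])

# First block of an alignment: no upstream residues, so only one alternative arises.
first_block = ungapped_alternatives("A-CDEFG", 0, 3)
```

A single gap yields two alternatives, one extended downstream and one extended upstream, and the upstream extension is chosen here because it matches the reference peptide at more positions.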


[1] Hertz, T., D. Nolan, I. James, M. John, S. Gaudieri, E. Phillips, J. C. Huang, G. Riadi, S. Mallal, and N. Jojic. 2011. Mapping the landscape of host-pathogen coevolution: HLA class I binding and its relationship with evolutionary conservation in human and viral proteins. Journal of virology 85: 1310–21.
[2] Altschul, S. F., W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. 1990. Basic local alignment search tool. Journal of molecular biology 215: 403–10.
[3] Schneider, T. D., and R. M. Stephens. 1990. Sequence logos: a new way to display consensus sequences. Nucleic acids research 18: 6097–100.
[4] Henikoff, S., and J. G. Henikoff. 1992. Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences of the United States of America 89: 10915–9.

Developed by Bioinformatics Core at Cancer Vaccine Center, Dana-Farber Cancer Institute.

Version 1.3, May 2013.