FASTA HERDER: Help & Supplement



This page contains supplementary information to the manuscript describing FASTA HERDER and help on how to run the web server:

How to use this server

To use FASTA HERDER, you must upload a file of protein sequences in FASTA format (example file). A FASTA file taken from a multiple alignment is also accepted. The validity of the format is checked with FASTA Validator.

If your file is bigger than 1,000 Kb you will get an error. One strategy in that case is to split the original FASTA into several parts, each smaller than 1,000 Kb, and run FASTA HERDER on each one. The resulting compressed parts can then be resubmitted together, as long as they amount to less than 1,000 Kb in total. If you need to run FASTA HERDER on a larger file, you can also contact us.
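The splitting strategy above can be sketched in Python as follows. This is a minimal illustration, not part of FASTA HERDER: the function name `split_fasta` is hypothetical, and the limit is assumed to mean 1,000,000 bytes, since the server's exact definition of a Kb is not stated here. Records are kept whole, so no sequence is cut between two parts.

```python
def split_fasta(text, max_bytes=1_000_000):
    """Split FASTA text into parts that each stay below max_bytes.

    max_bytes is an assumed reading of the server's 1,000 Kb limit.
    A record (header line plus its sequence lines) is never split.
    """
    # First pass: group lines into records, one per '>' header.
    records, current = [], []
    for line in text.splitlines(keepends=True):
        if line.startswith(">") and current:
            records.append("".join(current))
            current = []
        current.append(line)
    if current:
        records.append("".join(current))

    # Second pass: pack records greedily into parts under the limit.
    parts, buffer, size = [], [], 0
    for record in records:
        record_size = len(record.encode("utf-8"))
        if record_size >= max_bytes:
            raise ValueError("a single record reaches the size limit")
        if buffer and size + record_size >= max_bytes:
            parts.append("".join(buffer))
            buffer, size = [], 0
        buffer.append(record)
        size += record_size
    if buffer:
        parts.append("".join(buffer))
    return parts
```

Each returned part can be saved to its own file and submitted separately. Concatenating the parts in order reproduces the original text.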

Once your FASTA file is uploaded, FASTA HERDER will run when you click the button Run FH!. Some parameters of the clustering can be controlled with the Options provided:

  • Threshold tolerance:
    This parameter controls the stringency of the clustering. Unlike other automatic approaches that only aggregate sequences with high identity, FASTA HERDER clusters near-full length homologs, allowing for lower sequence identity thresholds. The default value (tolerance 0) allows for the smallest differences in length between the sequences that will be compared:

    Length of sequence (amino acids)    Maximum difference at each end (amino acids)
    Longer than 200                     16
    (100, 200]                          10
    (60, 100]                           5
    (40, 60]                            3.5
    (20, 40]                            2.5

Increasing the tolerance value allows sequences with larger differences in length to be clustered together, which increases the compression. For tolerance i, a value of 2i is added to the default (tolerance 0) values, so that 2i additional amino acids of difference are allowed at each end of the protein sequence. In our benchmark, we verified that moderate tolerance values (we tested up to 4) did not cluster together sequences that were not homologous. However, the server lets the user try larger values. This could be useful, for example, when compressing a FASTA in which all proteins are known to belong to the same family.

  • LCR masked:
    Refers to the detection and masking of Low Complexity Regions (LCRs) prior to clustering. If this box is checked, FASTA HERDER applies the SEG method to the FASTA file. SEG has a number of parameters that control how LCRs are detected. FASTA HERDER lets you tune 'Window size W', which corresponds to the minimum size of a first-stage segment, and 'complexity K1', which sets the maximum complexity of a first-stage segment. For more details, see the SEG web site.
  • To see how the Options parameters affect the clustering in our benchmarks, see the Supplementary material. If you would like to try on your FASTA a range of parameters not covered by these options, you can contact us.
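As a worked illustration of the Threshold tolerance option, the allowed difference at each end can be computed from the default table plus 2i. The sketch below is not the server's code: the names `BASE_THRESHOLDS` and `max_end_difference` are hypothetical, and it takes a single length as input without assuming which sequence of a compared pair the server uses. Lengths of 20 or less are not covered by the table, so the sketch rejects them rather than guess.

```python
# Default (tolerance 0) values from the table in this section:
# (lower bound, exclusive; upper bound, inclusive; max difference at each end)
BASE_THRESHOLDS = [
    (200, float("inf"), 16),
    (100, 200, 10),
    (60, 100, 5),
    (40, 60, 3.5),
    (20, 40, 2.5),
]


def max_end_difference(length, tolerance=0):
    """Return the allowed difference (amino acids) at each end.

    For tolerance i, 2*i is added to the tolerance 0 value.
    """
    for low, high, base in BASE_THRESHOLDS:
        if low < length <= high:
            return base + 2 * tolerance
    raise ValueError("sequence length is not covered by the table")


print(max_end_difference(150))               # tolerance 0
print(max_end_difference(150, tolerance=3))  # adds 2 * 3 amino acids
```

For a sequence of length 150, tolerance 0 gives 10 amino acids at each end and tolerance 3 gives 16.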

    FASTA HERDER will cluster the input FASTA to produce two different output files:

    • A cluster file consisting of one row per cluster. The format is tab-separated text: Cluster leader \t List of cluster members.
    • A FASTA file containing only the cluster leaders.
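The cluster file can be read with a few lines of Python. This is a hedged sketch: the function name `parse_clusters` is hypothetical, and because the separator used inside the list of members is not specified here, the code assumes members are separated by commas or whitespace (including further tabs). Adjust the pattern if your output uses a different separator.

```python
import re


def parse_clusters(text):
    """Map each cluster leader to its list of members.

    Assumes one cluster per line: leader, a tab, then the member list.
    The member separator (commas or whitespace) is an assumption.
    """
    clusters = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        leader, _, rest = line.partition("\t")
        members = [m for m in re.split(r"[,\s]+", rest.strip()) if m]
        clusters[leader.strip()] = members
    return clusters


example = "P1\tP1 P2 P3\nP4\tP4\n"
for leader, members in parse_clusters(example).items():
    print(leader, len(members))
```

The identifiers in `example` are made up for illustration. The sketch makes no assumption about whether the leader also appears in its own member list.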

    Supplementary material

    We benchmarked different aspects of the algorithm using the OrthoBench: