This page contains supplementary information to the manuscript describing
FASTA HERDER and help on how to run the web server:
How to use this server
To use FASTA HERDER, you must upload a file of protein sequences in FASTA format
(example file). A FASTA file taken from a multiple alignment is also accepted. A check on
the validity of the format will be carried out with
If your file is larger than 1,000 Kb, you will get an error. One strategy
in that case is to split the original FASTA into several parts, each smaller
than 1,000 Kb, and run FASTA HERDER on each one. The resulting compressed parts can
then be resubmitted together, as long as they total less than 1,000 Kb.
If you need to run FASTA HERDER on a larger file, you can also contact us.
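The splitting strategy described above can be sketched in a few lines of Python. This is only an illustration, not part of FASTA HERDER: the function name `split_fasta` is hypothetical, and the limit is assumed to mean 1,000,000 bytes (the server's exact byte count is not stated here, so you may want to choose a smaller value to be safe).

```python
def split_fasta(text, max_bytes=1_000_000):
    """Split FASTA text into chunks of at most max_bytes (UTF-8 encoded).

    Records are never split across chunks. A single record larger than
    max_bytes ends up alone in its own (oversized) chunk.
    max_bytes=1_000_000 is an assumed reading of the 1,000 Kb limit.
    """
    # First pass: group lines into records, each starting with a '>' header.
    records, current = [], []
    for line in text.splitlines(keepends=True):
        if line.startswith(">") and current:
            records.append("".join(current))
            current = []
        current.append(line)
    if current:
        records.append("".join(current))

    # Second pass: pack whole records into chunks without exceeding the limit.
    chunks, buffer, size = [], [], 0
    for record in records:
        record_size = len(record.encode("utf-8"))
        if buffer and size + record_size > max_bytes:
            chunks.append("".join(buffer))
            buffer, size = [], 0
        buffer.append(record)
        size += record_size
    if buffer:
        chunks.append("".join(buffer))
    return chunks
```

Each returned chunk can be saved to its own file and uploaded separately. Joining the chunks in order gives back the original text.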
Once your FASTA file is uploaded, FASTA HERDER will run when you click the button
Run FH!. You can control some parameters of the clustering
using the Options provided:
- Threshold tolerance:
This parameter controls the stringency of the clustering.
Unlike other automatic approaches, which only aggregate sequences with high identity,
FASTA HERDER clusters near-full-length homologs, allowing for lower sequence identity thresholds.
The default value (tolerance 0) allows the smallest differences in length between the
sequences being compared:
|Length of sequence|| Maximum difference at each end|
Increasing the tolerance value allows sequences with larger
differences in length to be clustered together, which increases the compression.
For tolerance i, a value of 2i is added to the default (tolerance 0) values, so that an additional 2i amino acids of difference are allowed at each end of the protein sequence. In our benchmark, we verified that moderate tolerance values (we tested up to 4) did not cluster together sequences that were not homologous. However, larger values can also be tried here. This could be useful, for example, when compressing a FASTA file in which all proteins are known to belong to the same family.
- LCR masked:
Refers to the detection and masking of Low Complexity Regions (LCRs) prior to
clustering. If this box is checked, FASTA HERDER applies the SEG method to the FASTA file.
SEG has a number of parameters that control how LCRs are detected. FASTA HERDER lets you tune
'Window size W', which corresponds to the minimum size of a first-stage segment, and 'complexity K1',
which sets the maximum complexity of a first-stage segment. For more details, see the
SEG web site.
To see how the Options parameters affect the clustering in our benchmarks, see
the Supplementary material. If you would like to try a range of parameters
not covered by these options on your FASTA file, you can contact us.
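The effect of the Threshold tolerance option can be written as a small calculation: for tolerance i, 2i amino acids are added to the tolerance 0 limit at each end. The sketch below assumes you already know the tolerance 0 limit that applies to your sequence length. The function name `max_end_difference` and the value used in the usage note are hypothetical.

```python
def max_end_difference(base_difference, tolerance):
    """Maximum length difference allowed at each end for a given tolerance.

    base_difference: the tolerance 0 limit for the sequence length in
        question (supplied by the caller; not hard-coded here).
    tolerance: the tolerance value i; 2 * i amino acids are added.
    """
    if tolerance < 0:
        raise ValueError("tolerance must be zero or positive")
    return base_difference + 2 * tolerance
```

For example, with a purely illustrative tolerance 0 limit of 10 amino acids, tolerance 3 would allow 10 + 2 * 3 = 16 amino acids of difference at each end.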
FASTA HERDER will cluster the input FASTA to produce two different output files:
- A cluster file consisting of one row per cluster. The format is tab-separated text: Cluster leader \\t
List of cluster members.
- A FASTA file containing only the cluster leaders.
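The cluster file described above can be read with a short script such as the following sketch. The separator used inside the list of cluster members is not specified on this page, so the code assumes members are separated by tabs, commas, semicolons, or spaces. The function name `parse_clusters` and the identifiers in the usage note are hypothetical.

```python
import re


def parse_clusters(text, member_sep=r"[\t,; ]+"):
    """Parse cluster file text into a dict: leader -> list of members.

    Each non-empty line is expected to be 'leader<TAB>members'.
    member_sep is an assumed separator pattern for the member list;
    adjust it to match the actual output.
    """
    clusters = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Everything before the first tab is the cluster leader.
        leader, _, rest = line.partition("\t")
        members = [m for m in re.split(member_sep, rest.strip()) if m]
        clusters[leader] = members
    return clusters
```

For instance, `parse_clusters("P1\tP1 P2 P3\n")` gives a dictionary with the single key `"P1"` mapped to the three listed identifiers. The sizes of the member lists then give a quick view of how strongly the input was compressed.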
We benchmarked different aspects of the algorithm using the OrthoBench: