SonicParanoid


SonicParanoid is a stand-alone software tool for the identification of orthologous relationships among multiple species.

For more details refer to the paper below:

From version 1.3.0 the execution time was substantially reduced when using high-sensitivity settings.

Fast

SonicParanoid, executed in the fast mode, predicted orthologous relationships for 40 eukaryotic proteomes in about 80 minutes, and for 26 prokaryotic proteomes in less than 6 minutes, using only 6 CPUs on a desktop computer. Moreover, it processed the InParanoid8 input dataset, composed of 273 proteomes (246 eukaryotes), in less than 2 days (44 hours).


Accurate

SonicParanoid was tested using a benchmark proteome dataset from the Quest for Orthologs consortium, and the correctness of its predictions was evaluated using a public Orthology Benchmarking service. When compared with 13 other orthology prediction tools, SonicParanoid showed a balanced trade-off between precision and recall, with an accuracy comparable to that of well-established inference methods.


Easy to use

SonicParanoid only requires the Python programming language and a GNU GCC compiler to be installed on your laptop/server in order to work. The low hardware requirements make it possible to run SonicParanoid on modern laptop computers, while the "update" feature allows users to easily maintain collections of orthologs that can be updated by adding or removing species.





Execution times

The latest version of SonicParanoid uses machine learning to reduce the time required for all-versus-all alignments.
The following real-life examples show the reduction in execution time obtained with the ML-based alignments (essentials) compared with the normal alignments (complete).
Tests were performed on different datasets on a high-performance computing (HPC) server and on a desktop computer.

Execution times in hours on an HPC server using 128 physical CPU cores.

QfO 2020 Dataset

Alignment tool  SonicParanoid mode  Ex. time (Complete)  Ex. time (Essentials)  Saved hours  Time saved (%)
MMseqs2         Fast                0.33                 0.26                   0.07         21.21
MMseqs2         Default             0.64                 0.46                   0.18         28.13
MMseqs2         Sensitive           2.56                 1.65                   0.91         35.55
MMseqs2         Most-sensitive      6.72                 3.91                   2.81         41.82
Diamond         Fast                0.21                 0.17                   0.04         19.05
Diamond         Default             0.39                 0.30                   0.09         23.08
Diamond         Sensitive           0.50                 0.38                   0.12         24.00
Diamond         Most-sensitive      1.64                 1.47                   0.17         10.37
BLAST           Default             18.36                13.32                  5.04         27.45
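
The last two columns of the table above appear to be derived from the first two: saved hours as the difference between the complete and essentials times, and the percentage as that difference relative to the complete time. The short Python sketch below reproduces the arithmetic for one row; the function name time_saved is hypothetical and not part of SonicParanoid.

```python
def time_saved(complete_hours, essentials_hours):
    """Return (saved hours, saved percentage), both rounded to 2 decimals."""
    saved = complete_hours - essentials_hours
    percent = 100.0 * saved / complete_hours
    return round(saved, 2), round(percent, 2)


# Values copied from the MMseqs2 "Most-sensitive" row of the table above.
saved, percent = time_saved(6.72, 3.91)
print(f"saved {saved} hours ({percent}%)")
```

The same function applied to the other rows gives the values listed in their "Saved hours" and "Time saved (%)" columns.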

2000 Microbial MAGs

Alignment tool  SonicParanoid mode  Ex. time (Complete)  Ex. time (Essentials)  Saved hours  Time saved (%)
MMseqs2         Default             67.38                58.09                  9.29         13.79
Diamond         Default             33.20                28.30                  4.90         14.76

NOTE: these results are currently being updated!

  • Computer: High performance computing (HPC) server
  • CPU: 128 Cores @ 2.25~3.40 GHz from 2 AMD EPYC 7742 (Rome) sockets
  • Memory: 2 Terabytes of DDR4 shared memory
  • Storage: Intel D3-S4610 solid-state disk
  • OS: Ubuntu 20.04.3 LTS (Linux 5.11.0)

Execution times in hours on a desktop computer with 8 physical CPU cores.

QfO 2020 Dataset

Alignment tool  SonicParanoid mode  Ex. time (Complete)  Ex. time (Essentials)  Saved hours  Time saved (%)
MMseqs2         Fast                0.0                  0.0                    0.0          0.0
MMseqs2         Default             0.0                  0.0                    0.0          0.0
MMseqs2         Sensitive           0.0                  0.0                    0.0          0.0
MMseqs2         Most-sensitive      0.0                  0.0                    0.0          0.0
Diamond         Fast                0                    0                      0            0
Diamond         Default             0                    0                      0            0
Diamond         Sensitive           0                    0                      0            0
Diamond         Most-sensitive      0                    0                      0            0
BLAST           Default             0                    0                      0            0

NOTE: these results are currently being updated!

  • Computer: WIP
  • CPU: WIP
  • Memory: WIP
  • Storage: WIP
  • OS: WIP

Datasets

Dataset              Proteomes  Eukaryotes; Prokaryotes  Sequences (thousands)  Required alignments  Description
QfO 2020             78         50; 28                   984.14                 6,084                Curated proteomes from the Quest for Orthologs consortium
2000 Microbial MAGs  2000       0; 2000                  5,091.98               4,000,000            Subset of high quality MAGs obtained from Nayfach et al. (2020)
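
The "Required alignments" values in the table above are consistent with N × N for N input proteomes (78 × 78 = 6,084), that is, every proteome aligned against every proteome, itself included. This reading is inferred from the numbers and is not stated explicitly by the documentation, so the sketch below should be taken as an assumption; the function name required_alignments is hypothetical.

```python
def required_alignments(n_proteomes):
    """All-vs-all alignment count, assuming each ordered pair (self included) is one alignment."""
    return n_proteomes * n_proteomes


# Proteome counts taken from the datasets table above.
for name, n in [("QfO 2020", 78), ("2000 Microbial MAGs", 2000)]:
    print(f"{name}: {required_alignments(n):,} alignments")
```

Because the count grows quadratically, doubling the number of proteomes roughly quadruples the alignment work under this assumption.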

Installation

Hardware requirements

SonicParanoid requires a system with a 64-bit multi-core CPU (at least 4 cores) and 8 gigabytes of memory.


Supported operating systems


Software requirements

Before installing SonicParanoid, make sure that the following software is installed on your system:


Installation and test

Installation


Usage

Input format, proteomes and proteins naming

SonicParanoid input files must be valid FASTA-formatted files containing protein sequences. Each FASTA file must have a unique name, and its content must differ from that of all the other input files.
It is good practice to keep the species names short (fewer than 10 letters if possible). For example, a proteome file named Homo_sapiens.faa would be better renamed to hsapiens. Doing this makes the final output tables much easier to read.
The same applies to protein names. These should be short where possible, because the final ortholog table contains each protein name multiple times, and very long protein names make the output difficult to read.
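
The input rules above can be checked before starting a run. The following Python sketch is only an illustration and is not part of SonicParanoid: the function name check_proteomes is hypothetical, the FASTA test is deliberately minimal (it only verifies that the first non-blank line is a header starting with ">"), and the 10-letter threshold is the recommendation given above rather than a hard limit.

```python
import hashlib
from pathlib import Path


def check_proteomes(input_dir):
    """Return a list of human-readable warnings for the files in input_dir."""
    warnings = []
    seen = {}  # content digest -> name of the first file with that content
    for path in sorted(Path(input_dir).iterdir()):
        if not path.is_file():
            continue
        data = path.read_bytes()
        # Minimal FASTA check: the first non-blank line must be a header.
        first = next((ln for ln in data.splitlines() if ln.strip()), b"")
        if not first.startswith(b">"):
            warnings.append(f"{path.name}: does not look like FASTA")
        # Files with identical content are reported as duplicates.
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen:
            warnings.append(f"{path.name}: same content as {seen[digest]}")
        else:
            seen[digest] = path.name
        # Names of 10 or more letters are flagged as worth shortening.
        if len(path.stem) >= 10:
            warnings.append(f"{path.name}: consider a shorter name")
    return warnings
```

Calling check_proteomes on the input directory returns an empty list when no issue is found; each warning names the file it refers to.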

Disk space requirements

To further speed up the computation of all-vs-all alignments, MMseqs2 and Diamond can generate index files for the input proteome files. These index files are relatively big (about 1 Gigabyte per input proteome), but SonicParanoid removes them automatically once the execution is completed.
Nevertheless, when running SonicParanoid on a laptop computer the available storage might be an issue. SonicParanoid addresses this problem in two ways, one of which is an optional parameter called '--no-indexing':

  • SonicParanoid automatically skips the creation of the index files if the available storage is lower than that required to store them.
  • Use the --no-indexing option to prevent MMseqs2/Diamond from indexing the input files, at the cost of about 5~10% slower all-vs-all alignments.
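
As a rough aid for deciding whether to pass --no-indexing, the Python sketch below compares the free space on a disk with the approximate figure of 1 Gigabyte per input proteome quoted above. It is not SonicParanoid's internal check: the function names and the constant are assumptions made for illustration.

```python
import shutil

GIB = 1024 ** 3
# Rough per-proteome index size quoted in the text above, not a measured value.
INDEX_BYTES_PER_PROTEOME = 1 * GIB


def estimated_index_bytes(n_proteomes):
    """Approximate space needed by the index files of n_proteomes inputs."""
    return n_proteomes * INDEX_BYTES_PER_PROTEOME


def should_skip_indexing(n_proteomes, output_dir="."):
    """True if the disk holding output_dir has less free space than the estimate."""
    free = shutil.disk_usage(output_dir).free
    return free < estimated_index_bytes(n_proteomes)


# Example: 4 input proteomes, output written to the current directory.
print(should_skip_indexing(4))
```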

SonicParanoid can be executed from the command line by running the program sonicparanoid.
The command:

$ sonicparanoid --help

provides extra information on the command line parameters.

Execution example

SonicParanoid comes with a test input set composed of 4 bacterial proteomes. To verify that SonicParanoid has been successfully installed, type the following commands:

The last of the above commands infers the orthologous relationships among the species whose proteomes, in FASTA format, are stored in test_input, using 4 CPUs in the fast mode, and stores the output in the directory /test_output/runs/my_first_run.


Update an existing run

SonicParanoid can update a previously computed set of ortholog relations when proteome files are added to and/or removed from the original input set. Suppose that in the previous example we computed the ortholog relations among species A, B, C, and D, and that we now want to remove C from the analysis.
This is done by copying A, B, and D into a new directory my_new_input (or by removing C from the original input directory) and using the same output directory as follows:

The updated results will be stored in the /test_output/runs/three_species directory.
To add a new proteome to the analysis, simply copy it to the directory containing the input files and run SonicParanoid again as above.
SonicParanoid will re-use the previously computed alignments and pairwise ortholog tables to minimize the required computation. If some proteome files have been modified (e.g., sequences added or removed, or just the file name changed) and a complete new run is not wanted, the --update-input-names parameter can be set when running SonicParanoid. The database and input information will then be updated automatically.


Output

At each execution SonicParanoid stores the execution information and results in a directory named /output/runs/my_project/, where my_project can optionally be set using the --project-id parameter.
For example, given the following execution of SonicParanoid:

$ sonicparanoid -i ./my_input -o ./test_output --project-id my_first_run -t 4

The output directory resulting from the above run will have the following structure:

  • alignments
  • orthologs_db
  • seqs_dbs
  • runs
    • my_first_run
      • run_info.txt
      • ortholog_groups
        • ortholog_groups.tsv
        • flat.ortholog_groups.tsv
        • single_copy.ortholog_groups.tsv
        • not_assigned_genes.ortholog_groups.tsv
        • overall.stats.tsv
        • ortholog_counts_per_species.stats.tsv
        • species_coverages_in_groups.stats.tsv
      • pairwise_orthologs
      • species.txt
  • snapshot.tsv
The directory alignments contains the computed and processed alignment files (compressed by default using gzip), while orthologs_db contains the pairwise ortholog tables that can be re-used at each update run.
These directories should never be modified manually, because they are used for updating the ortholog tables.
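
For scripts that post-process the results, the layout listed above can be turned into paths with a few lines of Python. This is a sketch built only from the directory and file names shown in the listing; the function name run_paths and the dictionary keys are hypothetical, and none of it is provided by SonicParanoid itself.

```python
from pathlib import Path

# File names taken from the ortholog_groups entries in the listing above.
GROUP_FILES = [
    "ortholog_groups.tsv",
    "flat.ortholog_groups.tsv",
    "single_copy.ortholog_groups.tsv",
    "not_assigned_genes.ortholog_groups.tsv",
    "overall.stats.tsv",
]


def run_paths(output_dir, project_id):
    """Build the expected result paths for one run (no file-system access)."""
    run_dir = Path(output_dir) / "runs" / project_id
    groups_dir = run_dir / "ortholog_groups"
    return {
        "run_dir": run_dir,
        "run_info": run_dir / "run_info.txt",
        "pairwise_orthologs": run_dir / "pairwise_orthologs",
        "group_files": [groups_dir / name for name in GROUP_FILES],
    }


# Same output directory and project id as in the example command above.
paths = run_paths("./test_output", "my_first_run")
print(paths["run_info"])
```

The function only composes paths and does not check that they exist, so it can be used before a run has finished.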

Run directory

At each execution a main output directory is generated under /output/runs/ (my_first_run in our example). This directory contains information on the SonicParanoid execution settings (run_info.txt) and input files (species.tsv).

Ortholog groups

The orthologs shared among the input species are stored in the directory named ortholog_groups under the main run directory (my_first_run in our example).
The relevant output files related to the ortholog groups are the following:
  • ortholog_groups.tsv: Tab-separated table with the ortholog groups
  • flat.ortholog_groups.tsv: Simpler table with only the gene names for each group
  • single_copy.ortholog_groups.tsv: Ortholog groups with a single ortholog for each species in the group
  • not_assigned_genes.ortholog_groups.tsv: List of genes that could not be classified as orthologs
  • overall.stats.tsv: General statistics on the predicted ortholog groups

Pairwise ortholog tables

In addition to the ortholog groups, SonicParanoid provides an ortholog table for each pair of proteomes.
For example, given a run with N input proteomes, the directory pairwise_orthologs (under /output/runs/my_first_run) will contain an ortholog table for each of the N * (N - 1) / 2 possible proteome-proteome combinations.
These tables are useful to quickly see the orthologs shared between pairs of species rather than those shared among multiple species.
For example, if the input consists of proteomes 1, 2, and 3, the pairwise ortholog tables 1-2, 1-3, and 2-3 will be generated and stored under the pairwise_orthologs directory.
The tables are stored in sub-directories named after the leftmost species in the pair, as follows:
  • pairwise_orthologs
    • 1
      • 1-2
      • 1-3
    • 2
      • 2-3
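
The pairing scheme described above can be sketched in Python with itertools.combinations, which yields each unordered pair exactly once and therefore N * (N - 1) / 2 tables in total. The function name pairwise_tables is hypothetical; the path layout follows the listing above.

```python
from itertools import combinations


def pairwise_tables(species):
    """Return the relative path of each pairwise table, one per unordered pair."""
    return [
        f"pairwise_orthologs/{left}/{left}-{right}"
        for left, right in combinations(species, 2)
    ]


tables = pairwise_tables(["1", "2", "3"])
for table in tables:
    print(table)
# pairwise_orthologs/1/1-2
# pairwise_orthologs/1/1-3
# pairwise_orthologs/2/2-3
```

Note that the last species in the list never appears as the leftmost member of a pair, which is why the listing above has no sub-directory named 3.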

Command line parameters

You can list all the available parameters by typing:

$ sonicparanoid --help
The following is a list of SonicParanoid's parameters and their use: