SonicParanoid


SonicParanoid is a stand-alone software tool for the identification of orthologous relationships among multiple species.

For more details refer to the paper below:

From version 1.3.0 the execution time was substantially reduced when using high-sensitivity settings.

Fast

SonicParanoid, executed in the fast mode, predicted orthologous relationships for 40 eukaryotic proteomes in about 80 minutes, and for 26 prokaryotic proteomes in less than 6 minutes, using only 6 CPUs on a desktop computer. Moreover, it processed the InParanoid8 input dataset, composed of 273 proteomes (246 eukaryotes), in less than 2 days (44 hours).


Accurate

SonicParanoid was tested using a benchmark proteome dataset from the Quest for Orthologs consortium, and the correctness of its predictions was evaluated using a public Orthology Benchmarking service. Compared with 13 other orthology prediction tools, SonicParanoid showed a balanced trade-off between precision and recall, with an accuracy comparable to that of well-established inference methods.


Easy to use

SonicParanoid only requires the Python programming language, the MMseqs2 alignment tool, and a GNU GCC compiler to be installed on your laptop or server. The low hardware requirements make it possible to run SonicParanoid on modern laptop computers, while the "update" feature allows users to easily maintain collections of orthologs that can be updated by adding or removing species.





Execution times

The following real-life examples show how long an analysis takes on a high-performance computing (HPC) server and on a desktop computer.

Execution times in hours on an HPC server using 32 physical CPU cores.

Dataset | Mode | Ex. time (Ver. 1.2.6) | Ex. time (Ver. 1.3.0) | Saved hours | Time saved (%)
QfO 2018 | Fast | 1.25 | 1.14 | 0.11 | 8.72
QfO 2018 | Default | 2.70 | 2.07 | 0.63 | 23.40
QfO 2018 | Sensitive | 11.45 | 7.27 | 4.18 | 36.47
QfO 2018 | Most-sensitive | 31.10 | 18.12 | 12.98 | 41.73
78 Fungi | Fast | 0.70 | 0.69 | 0.01 | 0.73
78 Fungi | Default | 1.50 | 1.45 | 0.05 | 3.53
78 Fungi | Sensitive | 7.23 | 5.99 | 1.25 | 17.21
78 Fungi | Most-sensitive | 21.49 | 17.61 | 3.88 | 18.05

  • Computer: High-performance computing (HPC) server
  • CPU: 32 cores @ 2.6 GHz from Intel Xeon E5-4627 v3 sockets
  • Memory: 500 Gigabytes of shared memory
  • Storage: Intel P3700 solid-state disk
  • OS: RedHat Linux 4.8.5 server
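
The "Saved hours" and "Time saved (%)" columns of the HPC table above follow from the two execution times. A minimal sketch of that arithmetic, assuming the percentage is taken relative to the Ver. 1.2.6 time (the function name time_saved is illustrative and not part of SonicParanoid):

```python
def time_saved(old_hours: float, new_hours: float) -> tuple:
    """Return (saved hours, saved percentage) for two execution times."""
    saved = old_hours - new_hours
    # Percentage is expressed relative to the older (slower) execution time.
    return saved, 100.0 * saved / old_hours


# QfO 2018, most-sensitive mode on the HPC server (times taken from the table).
saved, pct = time_saved(31.10, 18.12)
print(f"saved {saved:.2f} h ({pct:.1f}%)")
```

Recomputing from the rounded table values can differ from the published percentages in the last digit, which suggests the published figures were derived from unrounded timings.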

Execution times in hours on an Apple Mac Mini using 6 physical CPU cores.

Dataset | Mode | Ex. time (Ver. 1.2.6) | Ex. time (Ver. 1.3.0) | Saved hours | Time saved (%)
QfO 2018 | Fast | 6.86 | 4.75 | 2.11 | 30.70
QfO 2018 | Default | ? | ? | ? | ?
QfO 2018 | Sensitive | 39.67 | 25.91 | 13.76 | 34.69
QfO 2018 | Most-sensitive | ? | ? | ? | ?
78 Fungi | Fast | ? | 3.83 | ? | ?
78 Fungi | Default | ? | 6.25 | ? | ?
78 Fungi | Sensitive | ? | ? | ? | ?
78 Fungi | Most-sensitive | ? | ? | ? | ?

NOTE: these results are currently being updated!

  • Computer: Apple Mac Mini
  • CPU: 6 cores @ 3.2 GHz from an Intel Core i7-8700B socket
  • Memory: 16 Gigabytes of shared memory
  • Storage: PCIe-based solid-state disk (SSD)
  • OS: macOS Catalina 10.15

Datasets

Dataset | Proteomes | Eukaryotes, Prokaryotes | Sequences (thousands) | Required alignments | Description
QfO 2018 | 78 | 50, 28 | 985.16 | 6084 | Curated proteomes from the Quest for Orthologs consortium
78 Fungi | 78 | 78, 0 | 592.43 | 6084 | From draft genomes
150 Bacteria | ? | ? | ? | ? | under construction...
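
The value 6084 in the "Required alignments" column equals 78 × 78, which is consistent with every proteome being aligned against every proteome, itself included. This reading is inferred from the numbers rather than stated on this page. Under that assumption the count can be computed as follows (the function name is illustrative):

```python
def required_alignments(n_proteomes: int) -> int:
    """All-vs-all alignment count, ASSUMING one alignment per ordered pair
    of proteomes with self-alignments included (n * n)."""
    return n_proteomes * n_proteomes


print(required_alignments(78))
```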

Installation

Hardware requirements

SonicParanoid requires a system with a 64-bit multi-core (at least 4) CPU and 8 Gigabytes of memory.
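
A quick way to compare a machine against these minimums is sketched below. It is not part of SonicParanoid; the function names are hypothetical, the memory probe relies on os.sysconf and therefore only works on Unix-like systems, and os.cpu_count reports logical rather than physical cores.

```python
import os
import struct

MIN_CORES = 4        # "multi-core (at least 4)"
MIN_MEMORY_GB = 8    # "8 Gigabytes of memory"


def meets_requirements(cores: int, memory_gb: float, pointer_bits: int) -> bool:
    """Check the three stated minimums against explicitly supplied values."""
    return pointer_bits == 64 and cores >= MIN_CORES and memory_gb >= MIN_MEMORY_GB


def detect_system():
    """Best-effort probe: (logical cores, memory in GiB or None, interpreter bits)."""
    cores = os.cpu_count() or 1
    # Pointer size of the running Python build, used as a proxy for a 64-bit system.
    pointer_bits = struct.calcsize("P") * 8
    try:
        memory_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3
    except (AttributeError, ValueError, OSError):
        memory_gb = None  # os.sysconf is unavailable or the names are unsupported
    return cores, memory_gb, pointer_bits


cores, memory_gb, bits = detect_system()
print(cores, memory_gb, bits)
```

The operating system usually reports slightly less memory than the nominal amount installed, so a machine sold with 8 Gigabytes may fall just under the threshold when the detected value is passed to meets_requirements unchanged.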


Supported operating systems


Software requirements

Before installing SonicParanoid, make sure that the following software is installed on your system:


Installation and test

Installation Methods


Usage

Input format

SonicParanoid input files must be valid FASTA-formatted files containing protein sequences. You do not need to worry about the formatting of the headers, as these are mapped internally by SonicParanoid.

Files and genes naming

Each FASTA file should have a unique name and must differ in content from all the others. If two files contain exactly the same sequences, the clustering will be biased and the results will probably be incorrect. It is good practice to keep the species names short (less than 10 letters if possible).
For example, given the file name Homo_sapiens.faa, it would be better to rename it to hsapiens. Doing this makes the final output tables much easier to read.
The above also applies to protein names. These should be short where possible: the final ortholog table contains each protein name multiple times, so very long names would quickly make the output unreadable.
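
The naming and uniqueness advice above can be checked before a run with a small script. The sketch below is not part of SonicParanoid and both function names are hypothetical. It shortens names following the Homo_sapiens.faa to hsapiens example, and flags files whose bytes are identical; files holding the same sequences under different headers would not be caught by this simple check.

```python
import hashlib
from pathlib import Path


def short_species_name(filename: str) -> str:
    """Turn e.g. 'Homo_sapiens.faa' into 'hsapiens' (genus initial + species)."""
    stem = Path(filename).stem
    parts = [p for p in stem.replace("-", "_").split("_") if p]
    if len(parts) < 2:
        return stem.lower()  # nothing to abbreviate
    return (parts[0][0] + parts[1]).lower()


def find_duplicate_contents(input_dir):
    """Return groups of file names in input_dir whose contents are byte-identical."""
    by_digest = {}
    for path in sorted(Path(input_dir).iterdir()):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_digest.setdefault(digest, []).append(path.name)
    return [names for names in by_digest.values() if len(names) > 1]


print(short_species_name("Homo_sapiens.faa"))
```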

Disk space requirements

To further speed up the computation of all-vs-all alignments, MMseqs2 generates index files for the input proteome files. These index files are relatively big (about 1 Gigabyte per input proteome), but are automatically removed by SonicParanoid after the execution is completed.
Nevertheless, when running SonicParanoid on a laptop computer the available storage might be an issue. SonicParanoid addresses this problem with an automatic check and with an optional parameter called --no-indexing:

  • SonicParanoid automatically skips the creation of the index files if the available storage is lower than that required to store them.
  • Use the --no-indexing option to prevent MMseqs2 from indexing the input files, at the cost of about 5-10% slower all-vs-all alignments.
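
To judge in advance whether --no-indexing is likely to be needed, the rule of thumb above (about 1 Gigabyte of index per proteome) can be compared with the free space on the target disk. This is only a rough sketch with hypothetical function names; the exact check that SonicParanoid performs internally is not described here.

```python
import shutil

# "about 1 Gigabyte per input proteome" (approximation taken from the text above)
INDEX_BYTES_PER_PROTEOME = 1024**3


def estimated_index_bytes(n_proteomes: int) -> int:
    """Rough upper estimate of the temporary MMseqs2 index size."""
    return n_proteomes * INDEX_BYTES_PER_PROTEOME


def indexing_fits(n_proteomes: int, free_bytes: int) -> bool:
    """True if the estimated index files fit in the given free space."""
    return free_bytes >= estimated_index_bytes(n_proteomes)


def free_bytes_at(path: str = ".") -> int:
    """Free space, in bytes, on the file system that contains path."""
    return shutil.disk_usage(path).free


print(indexing_fits(40, free_bytes_at(".")))
```

If the function returns False for the planned output location, passing --no-indexing explicitly trades some alignment speed for a much smaller temporary footprint.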

SonicParanoid can be executed through the command line by running the program sonicparanoid.
The command:

sonicparanoid --help
provides extra information on the command line parameters.

Execution example

SonicParanoid comes with a test input set composed of 4 bacterial proteomes. To test if SonicParanoid has been successfully installed type the following commands:

The last of the above commands infers the orthologous relationships among the species whose proteomes, in FASTA format, are stored in test_input, using 4 CPUs in the fast mode, and stores the output in the directory /test_output/runs/my_first_run.


Update an existing run

SonicParanoid allows the update of a previously computed set of ortholog relations by adding and/or removing proteome files from the original input set. Suppose in the previous example we computed the ortholog relations amongst species A, B, C, and D and that we now want to remove C from the analysis.
This is simply done by copying A, B, and D into a new directory my_new_input (or by removing C from the original input directory) and using the same output directory as follows:

The updated results will be stored in the /test_output/runs/three_species directory.
To add a new proteome to the analysis simply copy it to the directory containing the input files and run SonicParanoid again as above.
SonicParanoid will re-use the previously computed alignments and pairwise ortholog tables to minimize the required computation. If some proteome files need to be modified (e.g., to add or remove sequences, or just to change the file name) without performing a complete new run, the --update-input-names parameter can be set when running SonicParanoid. The database and input information will then be updated automatically.
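
Preparing the reduced input directory described above (copying A, B, and D but not C) can be scripted. The helper below is a hypothetical convenience, not part of SonicParanoid, and it only copies files; SonicParanoid itself still has to be run afterwards on the new input directory with the same output directory.

```python
import shutil
from pathlib import Path


def copy_input_without(src_dir, dst_dir, excluded_names):
    """Copy every file from src_dir into dst_dir, skipping the excluded names.

    Returns the sorted list of file names that were copied.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(src.iterdir()):
        if path.is_file() and path.name not in excluded_names:
            shutil.copy2(path, dst / path.name)  # keeps timestamps as well
            copied.append(path.name)
    return copied
```

For the example above, a call such as copy_input_without("test_input", "my_new_input", {"C"}) would leave only A, B, and D in my_new_input, assuming the proteome files are literally named A, B, C, and D.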


Output

At each execution SonicParanoid stores the execution information and results in a directory named /output/runs/my_project/, where my_project can optionally be set using the --project-id parameter.
For example, given the following execution of SonicParanoid

The output directory resulting from the above run will have the following structure:

  • alignments
  • orthologs_db
  • mmseqs_databases
  • runs
    • my_first_run
      • run_info.txt
      • ortholog_groups
        • ortholog_groups.tsv
        • flat.ortholog_groups.tsv
        • single_copy.ortholog_groups.tsv
        • not_assigned_genes.ortholog_groups.tsv
        • overall.stats.tsv
        • ortholog_counts_per_species.stats.tsv
        • species_coverages_in_groups.stats.tsv
      • pairwise_orthologs
      • species.txt
  • snapshot.tsv
The directory alignments contains the computed alignment files (compressed by default using gzip), while orthologs_db contains pairwise ortholog tables that could be re-used at each update run.
These directories should never be manually modified, since these are used for updating ortholog tables.
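
When post-processing results in scripts, it helps to build the paths of the main output files from the output directory and the run name. The sketch below mirrors the directory layout listed above; the function name and the dictionary keys are illustrative choices, not SonicParanoid identifiers.

```python
from pathlib import Path


def run_paths(output_dir, run_name):
    """Map short labels to the main result files of one SonicParanoid run."""
    run_dir = Path(output_dir) / "runs" / run_name
    groups_dir = run_dir / "ortholog_groups"
    return {
        "run_info": run_dir / "run_info.txt",
        "ortholog_groups": groups_dir / "ortholog_groups.tsv",
        "flat_groups": groups_dir / "flat.ortholog_groups.tsv",
        "single_copy": groups_dir / "single_copy.ortholog_groups.tsv",
        "not_assigned": groups_dir / "not_assigned_genes.ortholog_groups.tsv",
        "overall_stats": groups_dir / "overall.stats.tsv",
        "pairwise_dir": run_dir / "pairwise_orthologs",
    }


for label, path in run_paths("test_output", "my_first_run").items():
    print(label, path)
```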

Run directory

At each execution a main output directory is generated under /output/runs/ (my_first_run in our example). This directory contains information on the SonicParanoid execution settings (run_info.txt) and input files (species.tsv).

Ortholog groups

The orthologs shared among the input species are stored in the directory named ortholog_groups under the main run directory (my_first_run in our example).
Following are the relevant output files related to the ortholog groups:
  • ortholog_groups.tsv: Tab-separated table with the ortholog groups
  • flat.ortholog_groups.tsv: Simpler table with only the gene names for each group
  • single_copy.ortholog_groups.tsv: Ortholog groups with a single ortholog for each species in the group
  • not_assigned_genes.ortholog_groups.tsv: List of genes that could not be classified as orthologs
  • overall.stats.tsv: General statistics on the predicted ortholog groups
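
These tables can be loaded with Python's standard csv module. The sketch below assumes that the first line of a table is a header row, which is not stated on this page; column names therefore come from the file itself and none are hard-coded.

```python
import csv


def read_tsv(path):
    """Read a tab-separated table into a list of dicts keyed by the header row.

    ASSUMPTION: the first line of the file holds the column names.
    """
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```

For instance, len(read_tsv(path)) on single_copy.ortholog_groups.tsv would give the number of single-copy ortholog groups, provided the header assumption holds.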

Pairwise ortholog tables

In addition to the ortholog groups SonicParanoid provides an ortholog table for each pair of proteomes.
For example, given a run with N input proteomes, the directory pairwise_orthologs (under /output/runs/my_first_run) will contain an ortholog table for each of the N * (N - 1) / 2 possible proteome-proteome combinations.
These tables are useful to quickly see the orthologs shared between pairs of species rather than those shared among multiple species.
For example, if we give as input the proteomes 1, 2, and 3, the pairwise ortholog tables 1-2, 1-3, and 2-3 will be generated and stored under the pairwise_orthologs directory.
The tables are stored in sub-directories named after the leftmost species in the pair, as follows:
  • pairwise_orthologs
    • 1
      • 1-2
      • 1-3
    • 2
      • 2-3
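
The layout above can be enumerated programmatically, which is handy for checking that a run produced all N * (N - 1) / 2 tables. A minimal sketch (the function name is illustrative), assuming the species are given in the same order SonicParanoid uses to decide which name is leftmost:

```python
from itertools import combinations


def pairwise_table_paths(species):
    """Relative paths of the pairwise tables, grouped by the leftmost species."""
    return [
        f"pairwise_orthologs/{left}/{left}-{right}"
        for left, right in combinations(species, 2)
    ]


print(pairwise_table_paths(["1", "2", "3"]))
# ['pairwise_orthologs/1/1-2', 'pairwise_orthologs/1/1-3', 'pairwise_orthologs/2/2-3']
```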

Command line parameters

You can list all the available parameters by typing:

$ sonicparanoid --help
Following is a list of SonicParanoid's parameters and their use: