IWASAKI Lab. home-page

SonicParanoid

SonicParanoid is a stand-alone software tool for the de novo identification of orthologous relationships among multiple species.

For more details, and for citations, refer to the papers below:

Fast and Scalable

SonicParanoid can infer the orthologs for dozens of prokaryotes in minutes, or for eukaryotes in hours, using a desktop computer with 8 CPUs. Execution times are much shorter on HPC servers with dozens of CPUs (e.g. <1h for the QfO benchmark dataset). It is also highly scalable: it inferred the orthologs for 2000 MAGs in only 1 day using 128 CPUs.

Accurate

SonicParanoid was tested using a benchmark proteome dataset from the Quest for Orthologs consortium, and the correctness of its predictions was evaluated using the QfO Benchmarking service. SonicParanoid showed the highest accuracy in the aggregated rankings from the three accuracy classification methods in the 2020 QfO benchmark.

Easy to use

SonicParanoid only requires the Python programming language and a GNU GCC compiler to be installed on your laptop/server. The low hardware requirements make it possible to run SonicParanoid on modern laptop computers, while the "update" feature allows users to easily maintain collections of orthologs that can be updated by adding or removing species.

Execution times

The latest version of SonicParanoid uses machine learning to reduce the time required for all-versus-all alignments.
The following real-world examples show how execution times are reduced with the ML-based (essential) alignments compared with the normal (complete) alignments.
Tests were performed on different datasets on a high-performance computing (HPC) server and on a desktop computer.

Execution times in hours on an HPC server using 128 physical CPU cores

QfO 2020 Dataset

Sensitivity  Execution mode (default is complete)  Ex. time (hours)  Orthologous relationships (millions)  Non-default parameters
Fast         Complete                              0.56              14.64                                 --mode fast
Fast         Graph-only                            0.23              11.27                                 --mode fast -go
Default      Complete                              0.86              15.28
Default      Graph-only                            0.66              11.98                                 -go
Sensitive    Complete                              4.36              19.27                                 --mode sensitive
Sensitive    Graph-only                            4.17              15.65                                 --mode sensitive -go

2000 Microbial MAGs

Sensitivity  Execution mode (default is complete)  Ex. time (hours)  Orthologous relationships (millions)  Non-default parameters
Fast         Complete                              24.33             1,692.88                              --mode fast
Fast         Graph-only                            19.70             1,233.64                              --mode fast -go
Default      Complete                              37.99             1,817.39
Default      Graph-only                            34.41             1,390.97                              -go

Computer: High performance computing (HPC) server
CPU: 128 Cores @ 2.25~3.40 GHz from 2 AMD EPYC 7742 (Rome) sockets
Memory: 2 Terabytes of DDR4 shared memory
Storage: Intel D3-S4610 solid-state disk
OS: Ubuntu 20.04.3 LTS (Linux 5.11.0)

Execution times in hours on a desktop computer with 8 physical CPU cores

QfO 2020 Dataset

Sensitivity  Execution mode (default is complete)  Ex. time (hours)  Orthologous relationships (millions)  Non-default parameters
Fast         Complete                              3.08              14.64                                 --mode fast
Fast         Graph-only                            2.15              11.27                                 --mode fast -go
Default      Complete                              6.63              15.26
Default      Graph-only                            5.56              11.98                                 -go
Sensitive    Complete                              45.86             19.27                                 --mode sensitive
Sensitive    Graph-only                            45.49             15.65                                 --mode sensitive -go

NOTE: these results are currently being updated!

Computer: Desktop computer (from 2019)
CPU: 8 Cores Intel Core i9-9900K CPU @ 3.60 GHz
Memory: 32 Gigabytes of DDR4
Storage: SK Hynix PC601 NVMe (1TB)
OS: Manjaro (Sikaris 22.0.0) (Linux 6.0.8-1)
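
Comparing the two QfO 2020 tables gives a rough idea of how much the 128-core server gains over the 8-core desktop in complete mode. The Python sketch below simply divides the desktop times by the HPC times. The numbers are copied from the tables above (which are marked as being updated), and the helper name speedup is illustrative only.

```python
# Execution times (hours) for the QfO 2020 dataset, complete mode,
# copied from the HPC and desktop tables on this page.
hpc_hours = {"fast": 0.56, "default": 0.86, "sensitive": 4.36}
desktop_hours = {"fast": 3.08, "default": 6.63, "sensitive": 45.86}


def speedup(slow_hours: float, fast_hours: float) -> float:
    """How many times faster the second run is than the first."""
    if fast_hours <= 0:
        raise ValueError("fast_hours must be positive")
    return slow_hours / fast_hours


for mode in hpc_hours:
    ratio = speedup(desktop_hours[mode], hpc_hours[mode])
    print(f"{mode}: HPC is about {ratio:.1f}x faster than the desktop")
```

The ratio is only a coarse indicator, since the two machines differ in CPU model, memory, and storage as well as in core count.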

Datasets used to test SonicParanoid 2

Dataset              Proteomes  Eukaryotes; Prokaryotes  Sequences (thousands)  Required alignments  Description
QfO 2020             78         50; 28                   984.14                 6,084                Curated proteomes from the Quest for Orthologs consortium
2000 Microbial MAGs  2000       0; 2000                  5,091.98               4,000,000            Subset of high quality MAGs obtained from Nayfach et al. (2020)
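
The "Required alignments" figures are consistent with one alignment per ordered pair of proteomes, self-comparisons included, that is N × N for N proteomes. A minimal Python sketch of that arithmetic follows. The function name is hypothetical and not part of SonicParanoid.

```python
def required_alignments(n_proteomes: int) -> int:
    """All-vs-all alignment count, assuming one alignment per ordered
    pair of proteomes, self-comparisons included (N * N)."""
    if n_proteomes < 0:
        raise ValueError("n_proteomes must be non-negative")
    return n_proteomes * n_proteomes


for name, n in [("QfO 2020", 78), ("2000 Microbial MAGs", 2000)]:
    print(f"{name}: {required_alignments(n):,} alignments")
```

Because the count grows quadratically, doubling the number of input proteomes roughly quadruples the alignment work.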

Installation

Hardware requirements

SonicParanoid requires a system with a 64-bit multi-core CPU (at least 4 cores) and 16 Gigabytes of memory.

Software requirements

Before the installation, make sure that the following software is installed on your system:

  • Python, from version 3.8 up to version 3.10
  • GNU GCC compiler (version 5.0 or above)

Supported systems

Linux or macOS using Anaconda (no root privileges required)
Linux or macOS using Micromamba (no root privileges required)

Usage

Input format, proteomes and proteins naming

Valid input files must contain protein sequences in the FASTA format. Each FASTA file must have a unique name, and its content must differ from that of all the other files.
It is good practice to keep the species names short (fewer than 10 characters). For example, a proteome file named Homo_sapiens.faa would be better renamed to hsapiens. Doing this makes the final output tables much easier to read.
The same applies to protein names. These should be short where possible, since the final ortholog table contains each protein name multiple times and very long names would make the output difficult to read.
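
The renaming advice above can be automated. The following Python sketch derives a short name such as hsapiens from a file name such as Homo_sapiens.faa. The helper short_species_name and its "genus initial plus species" rule are illustrative assumptions, not something SonicParanoid provides.

```python
from pathlib import Path


def short_species_name(filename: str, max_len: int = 10) -> str:
    """Turn e.g. 'Homo_sapiens.faa' into 'hsapiens'.

    Hypothetical helper: first letter of the genus plus the species
    epithet, lower-cased and cut to fewer than max_len characters.
    """
    stem = Path(filename).stem                  # drop the extension
    parts = [p for p in stem.split("_") if p]   # split on underscores
    if not parts:
        raise ValueError(f"cannot derive a name from {filename!r}")
    if len(parts) == 1:
        name = parts[0]
    else:
        name = parts[0][0] + parts[1]
    return name.lower()[: max_len - 1]


print(short_species_name("Homo_sapiens.faa"))
```

Truncation can make two different species collapse to the same name, so the generated names should be checked for uniqueness before the files are renamed.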

Disk space requirements

SonicParanoid requires about 9 Gigabytes of storage for the installation, mainly due to the size of the PFamA profile DB (~9GBytes).
After the installation, and only during the first execution, SonicParanoid generates the profile DB from PFamA. This file requires about 9GB of disk space.
To further speed up the computation of all-vs-all alignments, MMseqs2 and Diamond can generate index files of the input proteome files. This can be done using the parameter --index-db. These index files are relatively big (about 1 Gigabyte per input proteome), but are automatically removed by SonicParanoid once the execution is completed.
SonicParanoid automatically skips the creation of the index files if the available storage is lower than the disk space required to store them.
Using the --index-db parameter can result in a 5~10% speed-up for all-vs-all alignments when using MMseqs2/Diamond.
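
To judge in advance whether indexing is likely to fit on disk, the rule of thumb above (about 1 Gigabyte per input proteome) can be compared with the free space. The Python sketch below does this with the standard library. The function names are hypothetical, and the check only approximates whatever SonicParanoid does internally.

```python
import shutil

GIB = 1024 ** 3  # bytes in one gibibyte


def index_space_needed(n_proteomes: int, gib_per_proteome: float = 1.0) -> int:
    """Rough index size in bytes, assuming about 1 Gigabyte per proteome."""
    return int(n_proteomes * gib_per_proteome * GIB)


def can_index(n_proteomes: int, free_bytes: int) -> bool:
    """True if the estimated index files fit in the available space."""
    return free_bytes >= index_space_needed(n_proteomes)


free = shutil.disk_usage(".").free  # free bytes on the current volume
print(f"Free space: {free / GIB:.1f} GiB")
print(f"Enough room to index 78 proteomes: {can_index(78, free)}")
```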

Execution example

Once installed, SonicParanoid can be executed from the command line by running the program sonicparanoid.
The command sonicparanoid --help provides extra information on the command line parameters.

SonicParanoid comes with a test input set composed of 4 bacterial proteomes. To verify that SonicParanoid has been successfully installed type the following commands:

The last of the above commands infers the orthologous relationships among the species whose proteomes in FASTA format are stored in test_input, using 8 CPUs in the fast mode, and stores the output in the directory /test_output/runs/my_first_run.
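
A run like the one just described (input in test_input, 8 CPUs, fast mode, results under test_output/runs/my_first_run) might be assembled as in the Python sketch below. The options --mode and --project-id appear elsewhere on this page. The short options -i (input directory), -o (output directory) and -t (number of CPUs) are assumptions here and should be checked against sonicparanoid --help. The helper build_command is hypothetical.

```python
import shlex


def build_command(input_dir, output_dir, project_id, threads=8, mode="fast"):
    """Assemble a sonicparanoid command line as a list of arguments.

    -i, -o and -t are ASSUMED option names; verify them with
    `sonicparanoid --help` before use.
    """
    return [
        "sonicparanoid",
        "-i", str(input_dir),        # assumed: input directory
        "-o", str(output_dir),       # assumed: output directory
        "--project-id", project_id,  # name of the run
        "-t", str(threads),          # assumed: number of CPUs
        "--mode", mode,              # fast, default or sensitive
    ]


cmd = build_command("test_input", "test_output", "my_first_run")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids quoting problems when directory names contain spaces.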

Update an existing run

SonicParanoid can update a previously computed set of ortholog relations when proteome files are added to and/or removed from the original input set. Suppose that in the previous example we computed the ortholog relations among species A, B, C, and D, and that we now want to remove C from the analysis.
This is done by copying A, B, and D into a new directory my_new_input (or by removing C from the original input directory) and using the same output directory as follows:

The updated results will be stored in the /test_output/runs/three_species directory.
To add a new proteome to the analysis, simply copy it to the directory containing the input files and run SonicParanoid again as above.
SonicParanoid will re-use the previously computed alignments and pairwise ortholog tables to minimize the required computation.
If some proteome files need to be modified (e.g., to add/remove sequences, or just to change the file name) but a complete new run is not wanted, the --update-input-names parameter can be set when running SonicParanoid. The database and input information will be automatically updated.
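
Preparing the input directory for an update run, as in the example where species C is dropped, comes down to copying the remaining proteome files. The Python sketch below shows one way to do it. The helper copy_proteomes is hypothetical, and the demo uses throwaway files named A to D in a temporary directory.

```python
import shutil
import tempfile
from pathlib import Path


def copy_proteomes(src_dir, dst_dir, exclude=()):
    """Copy every file from src_dir to dst_dir except the excluded names."""
    src, dst = Path(src_dir), Path(dst_dir)
    skip = set(exclude)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for fasta in sorted(src.iterdir()):
        if fasta.is_file() and fasta.name not in skip:
            shutil.copy2(fasta, dst / fasta.name)
            copied.append(fasta.name)
    return copied


# Demo with dummy proteomes A, B, C and D.
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "test_input"
    original.mkdir()
    for name in "ABCD":
        (original / name).write_text(f">{name}_p1\nMKV\n")
    kept = copy_proteomes(original, Path(tmp) / "my_new_input", exclude={"C"})
    print(kept)
```

SonicParanoid would then be run on my_new_input with the same output directory as the original run, so that the existing alignments and pairwise tables can be re-used.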


Output

At each execution SonicParanoid stores the execution information and results in a directory named /output/runs/my_project/, where my_project can optionally be set using the --project-id parameter.
For example, given the following execution of SonicParanoid

SonicParanoid will generate the ortholog pairs and ortholog groups from the input proteomes. The output directory structure as well as the relevant result files are explained in the tabs below.

Output directory structure
Main output directory

The output directory resulting from the above run will have the following structure:

  • alignments
  • arch_orthology
  • merged_tables
  • orthologs_db
  • seqs_dbs
  • runs
    • my_first_run
      • run_info.txt
      • ortholog_groups
        • ortholog_groups.tsv
        • flat.ortholog_groups.tsv
        • single_copy.ortholog_groups.tsv
        • not_assigned_genes.ortholog_groups.tsv
        • overall.stats.tsv
        • ortholog_counts_per_species.stats.tsv
        • species_coverages_in_groups.stats.tsv
      • pairwise_orthologs
      • species.txt
  • snapshot.tsv
The directory alignments contains the computed and processed alignment files (compressed by default using gzip), while orthologs_db contains pairwise ortholog tables that could be re-used at each update run.
These directories should never be manually modified, since they are used for updating ortholog tables.
The directory arch_orthology contains most of the files related to domain-aware orthology inference (including the trained artificial neural networks), while merged_tables contains the merged (graph-based and domain-based) pairwise ortholog tables that could be re-used at each update run.
Run Directory
At each execution a run directory is generated under /output/runs/ (my_first_run in our example). This directory contains information regarding the run settings (run_info.txt) and input files (species.tsv).
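
When post-processing results in a script, it helps to build the paths of the main result files from the output directory and the run name. The Python sketch below follows the directory layout listed above. The helper run_paths and its dictionary keys are hypothetical.

```python
from pathlib import Path


def run_paths(output_dir, project_id):
    """Locations of the main result files for one run, following the
    directory layout described on this page."""
    run_dir = Path(output_dir) / "runs" / project_id
    og_dir = run_dir / "ortholog_groups"
    return {
        "run_info": run_dir / "run_info.txt",
        "ortholog_groups": og_dir / "ortholog_groups.tsv",
        "flat_groups": og_dir / "flat.ortholog_groups.tsv",
        "single_copy": og_dir / "single_copy.ortholog_groups.tsv",
        "pairwise_dir": run_dir / "pairwise_orthologs",
    }


paths = run_paths("test_output", "my_first_run")
for label, path in paths.items():
    print(f"{label}: {path}")
```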

Ortholog Group (OG) files

The orthologs shared among the input species are stored in the directory named ortholog_groups under the main run directory (my_first_run in our example).
Following are the relevant output files related to the ortholog groups:

  • ortholog_groups.tsv Tab-separated table with the ortholog groups
  • flat.ortholog_groups.tsv Simpler table with only the gene names for each group
  • single_copy.ortholog_groups.tsv Ortholog groups with a single ortholog for each species in the group
  • not_assigned_genes.ortholog_groups.tsv List of genes that could not be classified as orthologs
  • overall.stats.tsv General statistics on the predicted ortholog groups

Pairwise Ortholog Tables

In addition to the ortholog groups, SonicParanoid provides an ortholog table for each pair of proteomes.
For example, given a run with N input proteomes, the directory pairwise_orthologs (under /output/runs/my_first_run) will contain an ortholog table for each of the N * (N - 1) / 2 possible proteome-proteome combinations.
These tables are useful to quickly see the orthologs shared between pairs of species rather than those shared among multiple species.
For example, if the input consists of the proteomes 1, 2, and 3, the pairwise ortholog tables 1-2, 1-3 and 2-3 will be generated and stored under the pairwise_orthologs directory.
The tables are stored in sub-directories named after the leftmost species in the pair, as follows:

  • pairwise_orthologs
    • 1
      • 1-2
      • 1-3
    • 2
      • 2-3
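
The layout above can be enumerated programmatically. The Python sketch below lists the expected table paths for a set of proteome names with itertools.combinations. It assumes that the "leftmost" species of a pair is the one that comes first in the given order, and the helper pairwise_tables is hypothetical.

```python
from itertools import combinations
from pathlib import PurePosixPath


def pairwise_tables(species):
    """Expected pairwise table paths, one per unordered pair of species.

    Assumes the leftmost species of each pair is the one listed first.
    """
    root = PurePosixPath("pairwise_orthologs")
    return [root / a / f"{a}-{b}" for a, b in combinations(species, 2)]


tables = pairwise_tables(["1", "2", "3"])
for table in tables:
    print(table)
```

For N proteomes the list has N * (N - 1) / 2 entries, which gives 3 tables for the example above.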


Command line parameters

You can list all the available parameters by typing sonicparanoid --help in the command line.
Following is a list of SonicParanoid's command line parameters and their usage:

Directory containing the proteomes (in FASTA format) of the species to be analyzed.

The directory in which the results will be stored.

Name of the project reflecting the name of the run. If not specified it will be automatically generated using the current date and time.

The directory in which the alignment files are stored. If not specified it will be created inside the main output directory.

Maximum number of CPUs to be used. Default=4

Skip the creation of multi-species ortholog groups.

Skip the compression of processed alignment files.

GZip compression level: integer values between 1 and 9, with 9 and 1 being the highest and lowest compression levels, respectively. Default=5

SonicParanoid execution mode. The default mode is suitable for most studies. Use sensitive if the input proteomes are not closely related.

Use Diamond with a custom sensitivity. This will bypass the -m (--mode) option.

Use MMseqs2 with a custom sensitivity. This will bypass the -m (--mode) option.

Use Blastp for all-vs-all alignments. This will bypass the -m (--mode) option.

Consider only alignments with bitscores above min-bitscore. Increasing this value can be a good idea when comparing very closely related species, and will reduce the number of paralogs (and orthologs) generated.
WARNING: use only if you are sure of what you are doing.
INFO: increasing min-bitscore can reduce the execution time for all-vs-all alignments. Default=40

Perform complete alignments (slower), rather than essential ones. Default=False

Maximum allowed length-difference-ratio between main orthologs and candidate in-paralogs.
Example: 0.5 means one of the sequences could be 2 times longer than the other; 0 means no length difference allowed; 1 means no restriction is applied on length differences. Default=0.75

Output a text file with all the pairwise orthologous relationships.

The directory in which the database files created by the selected local alignment tool will be stored. If not specified it will be created inside the main output directory.

Index the MMSeqs2/Diamond databases.
IMPORTANT: this will use more storage but will be slightly faster (5~10%) when processing many big proteomes with MMseqs2.

Affects the granularity of ortholog groups. This value should be between 1.2 (very coarse) and 5 (fine grained clustering). Default=1.5

Perform only graph-based orthology (skip architectures analysis).

When merging graph- and arch-based orthologs, consider only new orthologs with a protein coverage greater than or equal to this value. Default=0.75.
WARNING: generally this value should not be changed; use it only if you know what you are doing.

This will force the re-computation of the pairwise ortholog tables. Missing alignment files will be re-computed (if any).

Overwrite previous runs and execute it again. This can be useful to update a subset of the computed tables.

Remove alignments and pairwise ortholog tables related to species used in a previous run.
This option should be used when updating a run in which some input proteomes were modified or removed.

Remove alignments and pairwise ortholog tables for an input proteome used in a previous run whose file name conflicts with a newly added species.
This option should be used when updating a run in which some input proteomes or their file names were modified.
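
The maximum length-difference ratio listed among the parameters above is illustrated there with three values (0.5, 0 and 1). One reading consistent with all three is a ratio of (longer - shorter) / longer that must not exceed the threshold. The Python sketch below implements that reading. It is an interpretation of the examples, not the confirmed SonicParanoid implementation, and the function names are hypothetical.

```python
def length_difference_ratio(len_a: int, len_b: int) -> float:
    """(longer - shorter) / longer: 0 for equal lengths, approaching 1
    as one sequence becomes much longer than the other."""
    longer, shorter = max(len_a, len_b), min(len_a, len_b)
    if shorter <= 0:
        raise ValueError("sequence lengths must be positive")
    return (longer - shorter) / longer


def within_max_len_diff(len_a: int, len_b: int, max_ratio: float = 0.75) -> bool:
    """True if the pair passes the (assumed) length-difference filter."""
    return length_difference_ratio(len_a, len_b) <= max_ratio


# 0.5 allows one sequence to be up to 2 times longer than the other.
print(within_max_len_diff(200, 100, max_ratio=0.5))
print(within_max_len_diff(201, 100, max_ratio=0.5))
```

Under this reading the default of 0.75 would allow one sequence to be up to 4 times longer than the other.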


Contacts

