Commit 4f14d592 authored by Nikos Pappas's avatar Nikos Pappas

added procedures for building containers

parent dbfded74
# RaFaH
Available from `library://papanikos_182/default/rafah:0.1`
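If you only want the image locally, pulling it should also work (output filename is arbitrary):
```
$ singularity pull rafah.sif library://papanikos_182/default/rafah:0.1
```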
## Procedure
0. Create a new dir for the build context and `cd` into it
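For example (the directory name is arbitrary):
```
$ mkdir rafah-build && cd rafah-build
```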
1. Grab necessary dependencies from the
[RaFaH repo](https://sourceforge.net/projects/rafah/files/RaFAH_v0.1_Files/)
```
$ wget https://sourceforge.net/projects/rafah/files/RaFAH_v0.1_Files
$ tar -xzvf whatever_is_downloaded
```
> Downloading with wget is slow; I used a locally available copy instead
2. Edit the RaFaH script so it points to the appropriate locations in the container:
- Replace shebang (line 1) with `#!/usr/bin/env perl`
- Change line 37 to `my $valid_domains_file = "/opt/resources/HP_Ranger_Model_3_Valid_Cols.txt";`
- Change line 38 to `my $hmm_models_prefix = "/opt/resources/HP_Ranger_Model_3_Filtered_0.9_Valids.hmm";`
- Change line 39 to `my $r_script_file_name = "/src/Predict_Host_RF.R";`
- Change line 40 to `my $r_model_file_name = "/opt/resources/MMSeqs_Clusters_Ranger_Model_1+2+3_Clean.RData";`
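If the line numbers match your copy of the v0.1 release, these edits can also be scripted, e.g. with GNU sed (a sketch; double-check the line numbers against your copy first):
```
$ sed -i '1c #!/usr/bin/env perl' RaFAH_v0.1.pl
$ sed -i '37c my $valid_domains_file = "/opt/resources/HP_Ranger_Model_3_Valid_Cols.txt";' RaFAH_v0.1.pl
$ sed -i '38c my $hmm_models_prefix = "/opt/resources/HP_Ranger_Model_3_Filtered_0.9_Valids.hmm";' RaFAH_v0.1.pl
$ sed -i '39c my $r_script_file_name = "/src/Predict_Host_RF.R";' RaFAH_v0.1.pl
$ sed -i '40c my $r_model_file_name = "/opt/resources/MMSeqs_Clusters_Ranger_Model_1+2+3_Clean.RData";' RaFAH_v0.1.pl
```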
3. Make `RaFAH_v0.1.pl` executable
```
$ chmod +x RaFAH_v0.1.pl
```
4. Bundle the tables and models into a tar archive, excluding the Perl and R scripts and the
`HP_Ranger_Model_3_Valid_Cols.txt` file
```
$ tar -czvf rafah_resources.tar.gz ./*hmm* MMSeqs_Clusters_Ranger_Model_1+2+3_Clean.RData
```
5. Build the image with the definition file
```
$ sudo singularity build rafah.sif rafah.def
```
6. [Optional] Sign the image
```
$ singularity sign rafah.sif
```
7. Push it to the cloud
```
$ singularity push rafah.sif library://papanikos_182/default/rafah:0.1
```
## Usage
```
$ singularity run library://papanikos_182/default/rafah:0.1 RaFAH_v0.1.pl -h
```
Bootstrap: docker
From: continuumio/miniconda3
%labels
Author "Felipe Coutinho"
Maintainer papanikos_182
Version 0.1
Source https://sourceforge.net/projects/rafah/
Preprint https://www.biorxiv.org/content/10.1101/2020.09.25.313155v1
%files
RaFAH_v0.1.pl /opt/conda/bin/
Predict_Host_RF.R /src/
HP_Ranger_Model_3_Valid_Cols.txt /opt/HP_Ranger_Model_3_Valid_Cols.txt
rafah_resources.tar.gz /opt/rafah_resources.tar.gz
%environment
export PATH=/src:$PATH
%post
# Update OS
apt update && apt upgrade -y
# Set up resources dirs for running RaFAH
mkdir -p /opt/resources
mv /opt/HP_Ranger_Model_3_Valid_Cols.txt /opt/resources
tar -xzvf /opt/rafah_resources.tar.gz -C /opt/resources && rm /opt/rafah_resources.tar.gz
# Install dependencies
conda config --add channels conda-forge
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels r
conda update -y conda
conda install -y mamba
mamba install -y r=3.6 r-ranger perl-bioperl hmmer=3.1b2 prodigal=2.6
conda clean --all -y
%help
A container for RaFAH v0.1 [https://sourceforge.net/projects/rafah].
Main perl script is in /src .
Helper R script for models is in /src .
Required data dependencies are stored in /opt/resources.
To run the help menu of RaFAH from this container execute
$ singularity exec library://papanikos_182/default/rafah:0.1 perl RaFAH_v0.1.pl --help
To run an analysis for all genomes stored in /path/to/genomes/ (the last slash is required),
with all files ending in .fasta, and store the results under
/path/to/outdir/prefix (several files will be written to /path/to/outdir, each prefixed with prefix_):
$ singularity exec library://papanikos_182/default/rafah:0.1 \
perl RaFAH_v0.1.pl \
--genomes_dir /path/to/genomes/ \
--extension fasta \
--file_prefix /path/to/outdir/prefix
# VirHostMatcher-Net
Available from `library://papanikos_182/default/vhmnet:0.1`
* Note that data dependencies are not included in the container.
You need to get them with
```
wget -c http://www-rcf.usc.edu/~weiliw/VirHostMatcher-Net/data_VirHostMatcher-Net_both_modes.tar.gz
tar xf data_VirHostMatcher-Net_both_modes.tar.gz
```
Unpacked, the models and genomes take up 125G.
* [My fork](https://github.com/papanikos/VirHostMatcher-Net)
is used for grabbing the source. It mainly adds the option to define the directory
where the data are located.
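Since the data stay outside the image, bind-mount them into the container at runtime, for example (the mount point `/data` is arbitrary; pass it to the data-directory option the fork adds):
```
$ singularity run -B /path/to/data:/data \
    library://papanikos_182/default/vhmnet:0.1 \
    VirHostMatcher.py -h
```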
## Procedure
0. Create a new dir for the build context and `cd` into it
1. Create a conda env with the tool's requirements, verify it works, and export
it explicitly
```
$ conda create -n vhmnet numpy pandas biopython blast
...
...Test it runs...
...
$ conda list -n vhmnet --explicit > environment.txt
```
2. Build the image with the definition file
```
$ sudo singularity build vhmnet.sif vhmnet.def
```
3. [Optional] Sign the image
```
$ singularity sign vhmnet.sif
```
4. Push it to the cloud
```
$ singularity push vhmnet.sif library://papanikos_182/default/vhmnet:0.1
```
## Usage
```
$ singularity run library://papanikos_182/default/vhmnet:0.1 \
VirHostMatcher.py -h
```
# This file may be used to create an environment using:
# $ conda create --name <env> --file <this file>
# platform: linux-64
@EXPLICIT
https://conda.anaconda.org/conda-forge/linux-64/_libgcc_mutex-0.1-conda_forge.tar.bz2
https://repo.anaconda.com/pkgs/main/linux-64/ca-certificates-2020.12.8-h06a4308_0.conda
https://conda.anaconda.org/conda-forge/linux-64/ld_impl_linux-64-2.35.1-hed1e6ac_1.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/libgfortran4-7.5.0-hae1eefd_17.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/libstdcxx-ng-9.3.0-h2ae2ef3_17.tar.bz2
https://conda.anaconda.org/conda-forge/noarch/tzdata-2020f-he74cb21_0.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/libgfortran-ng-7.5.0-hae1eefd_17.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/libgomp-9.3.0-h5dbcf3e_17.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/_openmp_mutex-4.5-1_gnu.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/libgcc-ng-9.3.0-h5dbcf3e_17.tar.bz2
https://repo.anaconda.com/pkgs/main/linux-64/bzip2-1.0.8-h7b6447c_0.conda
https://repo.anaconda.com/pkgs/main/linux-64/expat-2.2.10-he6710b0_2.conda
https://repo.anaconda.com/pkgs/main/linux-64/libffi-3.3-he6710b0_2.conda
https://conda.anaconda.org/conda-forge/linux-64/libopenblas-0.3.12-pthreads_hb3c22a3_1.tar.bz2
https://repo.anaconda.com/pkgs/main/linux-64/ncurses-6.2-he6710b0_1.conda
https://repo.anaconda.com/pkgs/main/linux-64/openssl-1.1.1i-h27cfd23_0.conda
https://repo.anaconda.com/pkgs/main/linux-64/pcre-8.44-he6710b0_0.conda
https://repo.anaconda.com/pkgs/main/linux-64/perl-5.26.2-h14c3975_0.conda
https://repo.anaconda.com/pkgs/main/linux-64/xz-5.2.5-h7b6447c_0.conda
https://repo.anaconda.com/pkgs/main/linux-64/zlib-1.2.11-h7b6447c_3.conda
https://conda.anaconda.org/conda-forge/linux-64/libblas-3.9.0-6_openblas.tar.bz2
https://repo.anaconda.com/pkgs/main/linux-64/libedit-3.1.20191231-h14c3975_1.conda
https://repo.anaconda.com/pkgs/main/linux-64/libssh2-1.9.0-h1ba5d50_1.conda
https://conda.anaconda.org/bioconda/linux-64/perl-app-cpanminus-1.7044-pl526_1.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-base-2.23-pl526_1.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-common-sense-3.74-pl526_2.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-compress-raw-bzip2-2.087-pl526he1b5a44_0.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-compress-raw-zlib-2.087-pl526hc9558a2_0.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-constant-1.33-pl526_1.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-data-dumper-2.173-pl526_0.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-digest-hmac-1.03-pl526_3.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-digest-md5-2.55-pl526_0.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-exporter-5.72-pl526_1.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-exporter-tiny-1.002001-pl526_0.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-extutils-makemaker-7.36-pl526_1.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-html-tagset-3.20-pl526_3.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-io-html-1.001-pl526_2.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-io-zlib-1.10-pl526_2.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-mozilla-ca-20180117-pl526_1.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-parent-0.236-pl526_1.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-scalar-list-utils-1.52-pl526h516909a_0.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-socket-2.027-pl526_1.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-try-tiny-0.30-pl526_1.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/perl-xml-parser-2.44_01-pl526ha1d75be_1002.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-xml-sax-base-1.09-pl526_0.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-xsloader-0.24-pl526_0.tar.bz2
https://repo.anaconda.com/pkgs/main/linux-64/readline-8.0-h7b6447c_0.conda
https://repo.anaconda.com/pkgs/main/linux-64/tk-8.6.10-hbc83047_0.conda
https://repo.anaconda.com/pkgs/main/linux-64/krb5-1.18.2-h173b8e3_0.conda
https://conda.anaconda.org/conda-forge/linux-64/libcblas-3.9.0-6_openblas.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/liblapack-3.9.0-6_openblas.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-carp-1.38-pl526_3.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-encode-2.88-pl526_1.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-file-path-2.16-pl526_0.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-html-parser-3.72-pl526h6bb024c_5.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-io-compress-2.087-pl526he1b5a44_0.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-list-moreutils-xs-0.428-pl526_0.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-mime-base64-3.15-pl526_1.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-ntlm-1.09-pl526_4.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-storable-3.15-pl526h14c3975_0.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-test-requiresinternet-0.05-pl526_0.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-types-serialiser-1.0-pl526_2.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-xml-namespacesupport-1.12-pl526_0.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/sqlite-3.34.0-h74cdb3f_0.tar.bz2
https://repo.anaconda.com/pkgs/main/linux-64/libcurl-7.71.1-h20c2e04_1.conda
https://conda.anaconda.org/bioconda/linux-64/perl-business-isbn-data-20140910.003-pl526_0.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-encode-locale-1.05-pl526_6.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-file-temp-0.2304-pl526_2.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-html-tree-5.07-pl526_1.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-json-xs-2.34-pl526h6bb024c_3.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-list-moreutils-0.428-pl526_1.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-lwp-mediatypes-6.04-pl526_0.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-net-ssleay-1.88-pl526h90d6eec_0.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-pathtools-3.75-pl526h14c3975_1.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-time-local-1.28-pl526_1.tar.bz2
https://repo.anaconda.com/pkgs/main/linux-64/python-3.9.1-hdb3f193_2.conda
https://repo.anaconda.com/pkgs/main/linux-64/certifi-2020.12.5-py39h06a4308_0.conda
https://repo.anaconda.com/pkgs/main/linux-64/curl-7.71.1-hbc83047_1.conda
https://conda.anaconda.org/bioconda/linux-64/perl-archive-tar-2.32-pl526_0.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-business-isbn-3.004-pl526_0.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-http-date-6.02-pl526_3.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-io-socket-ssl-2.066-pl526_0.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-json-4.02-pl526_0.tar.bz2
https://conda.anaconda.org/bioconda/noarch/perl-xml-sax-1.02-pl526_0.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/python_abi-3.9-1_cp39.tar.bz2
https://repo.anaconda.com/pkgs/main/noarch/pytz-2020.5-pyhd3eb1b0_0.conda
https://repo.anaconda.com/pkgs/main/linux-64/six-1.15.0-py39h06a4308_0.conda
https://repo.anaconda.com/pkgs/main/noarch/wheel-0.36.2-pyhd3eb1b0_0.conda
https://conda.anaconda.org/conda-forge/linux-64/numpy-1.19.4-py39hdbf815f_2.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-file-listing-6.04-pl526_1.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-uri-1.76-pl526_0.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-xml-sax-expat-0.51-pl526_3.tar.bz2
https://repo.anaconda.com/pkgs/main/noarch/python-dateutil-2.8.1-py_0.conda
https://repo.anaconda.com/pkgs/main/linux-64/setuptools-51.0.0-py39h06a4308_2.conda
https://conda.anaconda.org/conda-forge/linux-64/biopython-1.78-py39hbd71b63_1.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/pandas-1.2.0-py39hde0f152_0.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-http-message-6.18-pl526_0.tar.bz2
https://conda.anaconda.org/bioconda/noarch/perl-net-http-6.19-pl526_0.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-www-robotrules-6.02-pl526_3.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-xml-simple-2.25-pl526_1.tar.bz2
https://repo.anaconda.com/pkgs/main/linux-64/pip-20.3.3-py39h06a4308_0.conda
https://conda.anaconda.org/bioconda/linux-64/perl-http-cookies-6.04-pl526_0.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-http-daemon-6.01-pl526_1.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-http-negotiate-6.01-pl526_3.tar.bz2
https://conda.anaconda.org/bioconda/noarch/perl-libwww-perl-6.39-pl526_0.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/perl-lwp-protocol-https-6.07-pl526_4.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/entrez-direct-13.9-pl526h375a9b1_0.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/blast-2.10.1-pl526he19e7b1_3.tar.bz2
# vHULK
Available from `library://papanikos_182/default/vhulk:0.1`
## Procedure
0. Create a new dir for the build context and `cd` into it
1. Grab source from the
[vHULK repo](https://github.com/LaboratorioBioinformatica/vHULK)
```
$ git clone https://github.com/LaboratorioBioinformatica/vHULK.git
```
2. Create a conda env as per
[their suggestions](https://github.com/LaboratorioBioinformatica/vHULK#dependencies)
and export it explicitly with
```
$ conda list -n vhulk --explicit > vhulk_explicit.txt
```
The definition file uses this explicit list to recreate the environment inside the container.
3. Grab its data dependencies with the auxiliary script they provide
```
$ conda activate vhulk
(vhulk) $ python ./download_and_set_models.py
```
4. Extract the models (for testing) and rebundle them as `vhulk_resources.tar.gz`. The
archive contains:
```
$ tar -tvf vhulk_resources.tar.gz
drwxr-xr-x nikos/binf 0 2020-12-15 12:21 models/
-rw-r----- nikos/binf 70143964 2020-09-15 00:52 models/model_species_total_fixed_relu_08mar_2020.h5
-rw-r----- nikos/binf 70165848 2020-09-15 00:52 models/model_genus_total_fixed_relu_08mar_2020.h5
-rw-r--r-- nikos/binf 175382277 2020-12-15 12:19 models/all_vogs_hmm_profiles_feb2018.hmm.h3f
-rw-r----- nikos/binf 70165848 2020-09-15 00:52 models/model_genus_total_fixed_softmax_01mar_2020.h5
-rw-r----- nikos/binf 70143964 2020-09-15 00:52 models/model_species_total_fixed_softmax_01mar_2020.h5
-rw-r--r-- nikos/binf 333265 2020-12-15 12:19 models/all_vogs_hmm_profiles_feb2018.hmm.h3i
-rw-r--r-- nikos/binf 374382899 2020-12-15 12:19 models/all_vogs_hmm_profiles_feb2018.hmm.h3p
-rw-r--r-- nikos/binf 767855419 2020-12-15 12:19 models/all_vogs_hmm_profiles_feb2018.hmm
-rw-r--r-- nikos/binf 317721714 2020-12-15 12:19 models/all_vogs_hmm_profiles_feb2018.hmm.h3m
```
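A minimal way to produce that archive from the downloaded `models/` directory (a sketch, assuming you are in the vHULK source dir):
```
$ tar -czvf vhulk_resources.tar.gz models/
```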
5. Modify `vHULK-v0.1.py` so that it points to the appropriate locations
in the container; the rest of the changes are cosmetic, applied with
[black](https://github.com/psf/black). Also make it executable.
The copy in here is the one used in the container.
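Making the script executable:
```
$ chmod +x vHULK-v0.1.py
```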
6. Build the image with the definition file
```
$ sudo singularity build vhulk.sif vhulk.def
```
7. [Optional] Sign the image
```
$ singularity sign vhulk.sif
```
8. Push it to the cloud
```
$ singularity push vhulk.sif library://papanikos_182/default/vhulk:0.1
```
## Usage
```
$ singularity run library://papanikos_182/default/vhulk:0.1 vHULK-v0.1.py -h
```
#!/usr/bin/env python
# coding: utf-8
# Edited May, 27th 2020
## This is vHULK: viral Host Unveiling Kit
# Developed by Deyvid Amgarten and Bruno Iha
# Creative commons
# Import required Python modules
import numpy as np
import pandas as pd
from Bio import SeqIO
import re
import sys
import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
import subprocess
import datetime
import argparse
import warnings
import csv
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.simplefilter(action="ignore", category=FutureWarning)
from time import gmtime, strftime
from tensorflow.keras.layers import Dense, Activation, LeakyReLU, ReLU
from tensorflow.keras.models import load_model
from scipy.special import entr
# Function declarations
# Run prokka
def run_prokka(binn, input_folder, threads):
# Check the fasta format
prefix = get_prefix(binn)
# Filehandle where the output of prokka will be saved
# output_prokka = open(str(prefix)+'prokka.output', mode='w')
# Full command line for prokka
command_line = (
"prokka --kingdom Viruses --centre X --compliant --gcode 11 --cpus "
+ threads
+ " --force --quiet --prefix prokka_results_"
+ str(prefix)
+ " --fast --norrna --notrna --outdir "
+ input_folder
+ "results/prokka/"
+ str(prefix)
+ " --cdsrnaolap --noanno "
+ input_folder
+ str(binn)
).split()
return_code = subprocess.call(command_line, stderr=subprocess.PIPE)
# Check whether prokka ran smoothly
if return_code == 1:
print("Prokka may not be correctly installed. Please check that.")
sys.exit(1)
# Get prefix from bins
def get_prefix(binn):
if re.search(".fasta", binn):
prefix = re.sub(".fasta", "", binn)
else:
prefix = re.sub(".fa", "", binn)
return prefix
# Extract Matrix
###
### Main code
###
# Set arguments
# Modification to use argparse
parser = argparse.ArgumentParser(
description="Predict phage draft genomes in metagenomic bins."
)
parser.add_argument(
"-i",
action="store",
required=True,
dest="input_folder",
help="Path to a folder containing metagenomic bins in .fa or .fasta format (required!)",
)
parser.add_argument(
"-t",
action="store",
dest="threads",
default="1",
help="Number of CPU threads to be used by Prokka and hmmscan (default=1)",
)
args = parser.parse_args()
# Greeting message
print("\n**Welcome v.HULK, a toolkit for phage host prediction!\n")
# Verify databases
if not os.path.isfile("/opt/vHULK/models/all_vogs_hmm_profiles_feb2018.hmm"):
print(
"**Your database and models are not set. Please, run: python download_and_set_models.py \n"
)
sys.exit(1)
# Create Filehandle for warnings
# warnings_handle = open('marvel-warnings.txt', 'w')
# Important variables
input_folder = args.input_folder
threads = args.threads
# Fix input folder path if missing '/'
if not re.search("/$", input_folder):
input_folder = input_folder + "/"
# Take the input folder and list all multifasta (bins) contained inside it
# print(input_folder)
list_bins_temp = os.listdir(input_folder)
list_bins = []
count_bins = 0
# Empty folder
if list_bins_temp == []:
print("**Input folder is empty. Exiting...\n")
sys.exit(1)
else:
for each_bin in list_bins_temp:
if re.search(".fasta$", each_bin, re.IGNORECASE):
list_bins.append(each_bin)
count_bins += 1
elif re.search(".fa$", each_bin, re.IGNORECASE):
list_bins.append(each_bin)
count_bins += 1
if count_bins == 0:
print(
"**There is no valid genome inside the input folder (%s).\n\
Genome or bins should be in '.fasta' or '.fa' format.\nExiting..."
% input_folder
)
sys.exit(1)
print(
"**Arguments are OK. Checked the input folder and found %d genomes.\n"
% count_bins
)
print("**" + str(datetime.datetime.now()))
# Create results folder
try:
os.stat(input_folder + "results/")
except:
os.mkdir(input_folder + "results/")
#####
# PROKKA
#####
# Run prokka on all the bin multifasta files in the input folder
# Perform a check on each bin, then call the run_prokka function individually
# It may take a while
count_prokka = 0
print("**Prokka has started, this may take awhile. Be patient.\n")
for binn in list_bins:
# Verify bin/Genome size
len_bin = 0
for record in SeqIO.parse(input_folder + binn, "fasta"):
len_bin += len(record.seq)
# If a bin/genome is too short, skip it
if len_bin < 5000:
print(
"**v.HULK has found a genome or bin, which is too short to code \
proteins (<5000pb). As CDSs are an important feature for v.HULK, \
we will be skipping this: "
+ binn
)
continue
run_prokka(binn, input_folder, threads)
count_prokka += 1
if count_prokka % 10 == 0:
print("**Done with %d genomes..." % count_prokka)
print("**Prokka tasks have finished!\n")
####
# HMM SEARCHES
####
print("**" + str(datetime.datetime.now()))
print("**Starting HMM scan, this may take awhile. Be patient.\n")
# print(str(datetime.datetime.now()))
# Create a new results folder for hmmscan output
try:
os.stat(input_folder + "results/hmmscan/")
except:
os.mkdir(input_folder + "results/hmmscan/")
# Call HMMscan to all genomes
dic_matrices_by_genome = {}
prop_hmms_hits = {}
count_hmm = 0
for binn in list_bins:
# Prefix for naming results
prefix = get_prefix(binn)
command_line_hmmscan = (
"hmmscan -o "
+ input_folder
+ "results/hmmscan/"
+ prefix
+ "_hmmscan.out --cpu "
+ threads
+ " --tblout "
+ input_folder
+ "results/hmmscan/"
+ prefix
+ "_hmmscan.tbl --noali /opt/vHULK/models/all_vogs_hmm_profiles_feb2018.hmm "
+ input_folder
+ "results/prokka/"
+ prefix
+ "/prokka_results_"
+ prefix
+ ".faa"
)
# print(command_line_hmmscan)
# Use -E 1 for next time running HMMscan or leave the fix down there
# In case hmmscan returns an error - Added only because it stopped in half
# if os.path.exists(input_folder + 'results/hmmscan/' + prefix + '_hmmscan.tbl'):
# continue
try:
subprocess.call(command_line_hmmscan, shell=True)
# Comment line above and uncomment line below in case you want to run v.HULK without running hmmscan all over again
# True
except:
print("**Error calling HMMscan:", command_line_hmmscan)
sys.exit(1)
count_hmm += 1
# Iteration control
print("**Done with %d bins HMM searches..." % count_hmm)
## Create dictionary as ref of columns - pVOGs
dic_vogs_headers = {}
with open("/opt/vHULK/files/VOGs_header.txt", "r") as file2:
for line2 in file2:
key = re.match("(.+)\n", line2).group(1)
dic_vogs_headers[key] = np.float32(0.0)
#
# Parse hmmscan results by gene
num_proteins_bin = 0
with open(
input_folder
+ "results/prokka/"
+ prefix
+ "/prokka_results_"
+ prefix
+ ".faa",
"r",
) as faa:
for line in faa:
if re.search("^>", line):
num_proteins_bin += 1
# Get gene name here
gene_name = re.search("^>(.*)", line).group(1)
dic_matches = {}
# Parse hmmout
with open(
input_folder + "results/hmmscan/" + prefix + "_hmmscan.tbl", "r"
) as hmmscan_out:
dic_genes_scores = {}
for line in hmmscan_out:
vog = ""
gene = ""
evalue = np.float32(0.0)
score = np.float32(0.0)
bias = np.float32(0.0)
if re.match("^VOG", line):
matches = re.match(
"^(VOG[\d\w]+)\s+-\s+([^\s]+)[^\d]+([^\s]+)\s+([^\s]+)\s+([^\s]+)",
line,
)
vog = matches[1]
gene = matches[2]
evalue = float(matches[3])
score = float(matches[4])
bias = float(matches[5])
if gene in dic_genes_scores:
dic_genes_scores[gene].append([vog, evalue, score, bias])
else:
dic_genes_scores[gene] = [[vog, evalue, score, bias]]
# Here goes the continuation
# Create a matrix by accession
dic_matrices_by_genome[prefix] = pd.DataFrame(
index=dic_genes_scores.keys(),
columns=dic_vogs_headers.keys(),
dtype=float,
)
dic_matrices_by_genome[prefix].fillna(value=np.float32(0.0), inplace=True)
# Fill in evalue values
for gene in dic_genes_scores:
for each_match in dic_genes_scores[gene]:
# print(each_match[1], gene)
# Fix for evalue values greater than 1
if each_match[1] > 1:
# print(each_match[1])
each_match[1] = 1
# print(each_match[1])
dic_matrices_by_genome[prefix][each_match[0]][gene] = np.float32(
1.0
) - np.float32(each_match[1])
print("\n**HMMscan has finished.")
# Condense matrices to array by summing up columns
list_condensed_matrices = []
list_file_names = []
for matrix in dic_matrices_by_genome:
temp = list(dic_matrices_by_genome[matrix].sum(axis=0, skipna=True))
list_file_names.append(matrix)
# Parse tag
# if re.search('^NC_.*', matrix):
# matrix = matrix.replace("NC_", "NC")
# [0]accession [1]genus [2]species
# tags = matrix.split("_")
# For Genus
# temp.append(tags[1])
# temp.append(tags[0])
# For Species
# temp.append(tag[1]+"_"+tag[2])
# temp.append(tag[0])
list_condensed_matrices.append(temp)
# Convert to array
# import numpy as np
array = np.array(list_condensed_matrices)
# print("ARRAY-SHAPE: ", len(array))
###
# Predictions
###
print("\n**Starting deeplearning predictions...")
# load models
model_genus_relu = load_model(
"/opt/vHULK/models/model_genus_total_fixed_relu_08mar_2020.h5",
custom_objects={"LeakyReLU": LeakyReLU, "ReLU": ReLU},
)
model_genus_sm = load_model(
"/opt/vHULK/models/model_genus_total_fixed_softmax_01mar_2020.h5",
custom_objects={"LeakyReLU": LeakyReLU, "ReLU": ReLU},
)
model_species_relu = load_model(
"/opt/vHULK/models/model_species_total_fixed_relu_08mar_2020.h5",
custom_objects={"LeakyReLU": LeakyReLU, "ReLU": ReLU},
)
model_species_sm = load_model(
"/opt/vHULK/models/model_species_total_fixed_softmax_01mar_2020.h5",
custom_objects={"LeakyReLU": LeakyReLU, "ReLU": ReLU},
)
with open(input_folder + "results/results.csv", "w") as file:
file.write(
"BIN/genome,pred_genus_relu,score_genus_relu,Pred_genus_softmax,score_genus_softmax,pred_species_relu,score_species_relu,pred_species_softmax,score_species_softmax,final_prediction,entropy\n"
)
for i in range(0, len(array)):
# Genus ReLu
# print(list_file_names[i])
pred_gen_relu = model_genus_relu.predict(np.array([array[i]]))
# print("Genus:ReLu")
# print(pred_gen_relu)
position_pred_gen_relu = np.argmax(pred_gen_relu)
if not pred_gen_relu.any():
name_pred_gen_relu = "None"
score_pred_gen_relu = "0"
else:
list_hosts_genus = [
line.rstrip("\n") for line in open("/opt/vHULK/files/list_hosts_genus.txt")
]
name_pred_gen_relu = list_hosts_genus[position_pred_gen_relu]
score_pred_gen_relu = str(pred_gen_relu[0][position_pred_gen_relu])
# print(list_hosts_genus[position_pred_gen_relu])
# print(position_pred_gen_relu, pred_gen_relu[0][position_pred_gen_relu])
# Genus softmax
pred_gen_sm = model_genus_sm.predict(np.array([array[i]]))
# print("Genus:Softmax")
# print(pred_gen_sm)
position_pred_gen_sm = np.argmax(pred_gen_sm)
list_hosts_genus = [
line.rstrip("\n") for line in open("/opt/vHULK/files/list_hosts_genus.txt")
]
name_pred_gen_sm = list_hosts_genus[position_pred_gen_sm]
score_pred_gen_sm = str(pred_gen_sm[0][position_pred_gen_sm])
# print(list_hosts_genus[position_pred_gen_sm])
# print(position_pred_gen_sm, pred_gen_sm[0][position_pred_gen_sm])
# Species Relu
pred_sp_relu = model_species_relu.predict(np.array([array[i]]))
# print("Species:ReLu")
# print(pred_sp_relu)
position_pred_sp_relu = np.argmax(pred_sp_relu)
if not pred_sp_relu.any():
name_pred_sp_relu = "None"
score_pred_sp_relu = "0"
else:
list_hosts_sp = [
line.rstrip("\n") for line in open("/opt/vHULK/files/list_hosts_species.txt")
]
# print(list_hosts_sp)
name_pred_sp_relu = list_hosts_sp[position_pred_sp_relu]
score_pred_sp_relu = str(pred_sp_relu[0][position_pred_sp_relu])
# print(list_hosts_sp[position_pred_sp_relu])
# print(position_pred_sp_relu, pred_sp_relu[0][position_pred_sp_relu])
# Species softmax
pred_sp_sm = model_species_sm.predict(np.array([array[i]]))
# print("Species:Softmax")
# print(pred_sp_sm)
position_pred_sp_sm = np.argmax(pred_sp_sm)
list_hosts_sp = [
line.rstrip("\n") for line in open("/opt/vHULK/files/list_hosts_species.txt")
]
# print(list_hosts_sp)
name_pred_sp_sm = list_hosts_sp[position_pred_sp_sm]
score_pred_sp_sm = str(pred_sp_sm[0][position_pred_sp_sm])
# print(list_hosts_sp[position_pred_sp_sm])
# print(position_pred_sp_sm, pred_sp_sm[0][position_pred_sp_sm])
##
# Calculate entropy
entropy_genus_sm = entr(pred_gen_sm).sum(axis=1)
# entropy_genus_sm = "{:.7f}".format(entr(pred_gen_sm).sum(axis=1))
#
# Apply decision tree
#
final_decision = "None"
# Relu sp
if float(score_pred_sp_relu) > 0.9:
final_decision = name_pred_sp_relu
# SM sp
if float(score_pred_sp_sm) > 0.6 and name_pred_sp_sm != final_decision:
final_decision = name_pred_sp_sm
# Couldn't predict species
if final_decision == "None":
# Put here sm sp
if float(score_pred_sp_sm) > 0.6:
final_decision = name_pred_sp_sm
# relu genus
if float(score_pred_gen_relu) >= 0.7:
final_decision = name_pred_gen_relu
# sm genus
if (
float(score_pred_gen_sm) >= 0.5
and name_pred_gen_sm != final_decision
):
final_decision = name_pred_gen_sm
else:
# relu genus
if float(score_pred_gen_relu) >= 0.9:
final_decision = name_pred_gen_relu
# sm genus
if (
float(score_pred_gen_sm) >= 0.4
and name_pred_gen_sm != final_decision
):
final_decision = name_pred_gen_sm
# Predicted species.
# Verify if genus is the same
else:
if re.search(name_pred_gen_relu, final_decision) or re.search(
name_pred_gen_sm, final_decision
):
pass
else:
# relu genus
if float(score_pred_gen_relu) >= 0.9:
final_decision = name_pred_gen_relu
# sm genus
if (
float(score_pred_gen_sm) >= 0.5
and name_pred_gen_sm != final_decision
):
final_decision = name_pred_gen_sm
# Print CSV
with open(input_folder + "results/results.csv", "a") as file:
file.write(
list_file_names[i]
+ ","
+ name_pred_gen_relu
+ ","
+ score_pred_gen_relu
+ ","
+ name_pred_gen_sm
+ ","
+ score_pred_gen_sm
+ ","
+ name_pred_sp_relu
+ ","
+ score_pred_sp_relu
+ ","
+ name_pred_sp_sm
+ ","
+ score_pred_sp_sm
+ ","
+ final_decision
+ ","
+ str(entropy_genus_sm[0])
+ "\n"
)
# print(list_file_names[i]+","+name_pred_gen_relu+":"+score_pred_gen_relu+","+name_pred_gen_sm+":"+score_pred_gen_sm+","+name_pred_sp_relu+":"+score_pred_sp_relu+","+name_pred_sp_sm+":"+score_pred_sp_sm+","+final_decision+","+str(entropy_genus_sm))
print(
'\n**Deep learning predictions have finished. Results are in file "results.csv" inside input_folder/results/.\n**Thank you for using v.HULK'
)
Bootstrap: docker
From: continuumio/miniconda3
%labels
Maintainer papanikos_182
Version 0.1
Source https://github.com/LaboratorioBioinformatica/vHULK
Preprint https://www.biorxiv.org/content/10.1101/2020.12.06.413476v1
%files
vhulk_explicit.txt /opt/vHULK/
vHULK /opt
vhulk_resources.tar.gz /opt/vHULK
%environment
export PATH=/opt/vHULK:$PATH
%post
apt update && apt upgrade -y
conda update -y conda
conda create -n vhulk --file=/opt/vHULK/vhulk_explicit.txt
conda clean -ya
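# Make the vhulk env available and auto-activated whenever the container is run/exec'd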
echo ". /opt/conda/etc/profile.d/conda.sh" >> $SINGULARITY_ENVIRONMENT
echo "conda activate vhulk" >> $SINGULARITY_ENVIRONMENT
tar -xvzf /opt/vHULK/vhulk_resources.tar.gz -C /opt/vHULK && rm /opt/vHULK/vhulk_resources.tar.gz
%help
A container for vHULK v0.1 ( https://github.com/LaboratorioBioinformatica/vHULK ).
Required source scripts, models and data are stored in /opt/vHULK.
The main vHULK-v0.1.py has been modified with
- Portable shebang
- All hardcoded paths now point to locations inside this container
To run the help menu for vHULK from this container execute
$ singularity exec library://papanikos_182/default/vhulk:0.1 python vHULK-v0.1.py --help
vHULK takes an input directory with one or more fasta files that are assumed to
be bins. It makes predictions based on its models for all bins separately.
That is, if you have assembled contigs you need to split them into separate files
in a directory and provide that as input.
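One way to do the splitting, e.g. with awk (a sketch; contigs.fasta is a placeholder, adapt to your record names):
$ awk '/^>/{f=substr($1,2)".fasta"} {print > f}' contigs.fasta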
It outputs a dir named `results`, stored within the input dir (nobody knows why..).
Its major output is a `results.csv` file with host predictions for each genome.
The results dir also contains hmmscan results and prokka annotations for all input
genomes/bins.
To run an analysis for all genomes stored in /path/to/genomes, with 8 threads
$ singularity exec library://papanikos_182/default/vhulk:0.1 \
python vHULK-v0.1.py -i /path/to/genomes -t 8
# WIsH
Available from `library://papanikos_182/default/wish:1.0`
* Note that data dependencies are not included in the container.
Models from [VirHostMatcher-Net](https://github.com/WeiliWw/VirHostMatcher-Net#downloading)
are used
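For example, the same archive used for VirHostMatcher-Net above also holds the host models (under `host_wish_model/`):
```
wget -c http://www-rcf.usc.edu/~weiliw/VirHostMatcher-Net/data_VirHostMatcher-Net_both_modes.tar.gz
tar xf data_VirHostMatcher-Net_both_modes.tar.gz
```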
## Procedure
1. Build the image with the definition file
```
$ sudo singularity build wish.sif wish.def
```
2. [Optional] Sign the image
```
$ singularity sign wish.sif
```
3. Push it to the cloud
```
$ singularity push wish.sif library://papanikos_182/default/wish:1.0
```
## Usage
```
$ singularity run library://papanikos_182/default/wish:1.0 \
WIsH -h
```
Bootstrap: docker
From: debian:latest
%labels
Maintainer papanikos_182
Version 0.1
Source https://github.com/soedinglab/WIsH
Publication https://academic.oup.com/bioinformatics/article/33/19/3113/3964377
%environment
export PATH=/opt/wish:$PATH
%post
# Update stuff
apt update && apt upgrade -y
# Install build tools for compiling WIsH
apt install -y build-essential cmake make git
# get source
git clone https://github.com/soedinglab/WIsH.git /opt/wish
# Get in there
cd /opt/wish
# Compile it
cmake . && make
%help
A container for WIsH.
Source: https://github.com/soedinglab/WIsH
Models for host genomes are provided from VirHostMatcher-Net.
Example:
# Probably you need to bind the path/to/data/host_wish_model
$ singularity exec -B /path/to/data/host_wish_model:/data \
WIsH -c predict -m /data \
-g /path/to/phage/genomes/fastas \
-r /path/to/results \
-b