# PVOGs functions interactions

---

## TL;DR

---

```
# Clone this repo
$ git clone this_repo pvogs_function

# Get in there
$ cd pvogs_function

# Optional: create and activate the provided environment
# (skip if conda and snakemake >= 5.14 are already available)
$ conda env create -n my_env --file=environment.yml
$ conda activate my_env

# Dry run to check that it works
(my_env)$ snakemake --use-conda -n
```

## Description

---

The main purpose of this repository is to host the code necessary for full reproducibility.

* The required raw data are hosted on the [zenodo sandbox](https://sandbox.zenodo.org/record/666719#.X1c5qoZS_J8). They are downloaded automatically when the workflow is executed, so there is no need to get them yourself.

Most of the steps are not easily configurable, unless you take a dive into all the rules and scripts. This is by choice.

## Requirements

---

* A working [conda](https://docs.conda.io/en/latest/) installation
* `snakemake >= 5.14` (any version with jupyter integration should do)
* Optional: `mamba == 0.5.1` (speeds up environment dependency resolution and creation)

You can create the same `conda` environment used during development with the provided `environment.yml`:

```
$ conda env create -n my_env --file=./environment.yml
```

Make sure you activate it before you launch snakemake:

```
$ conda activate my_env
(my_env)$ snakemake --version
5.23.0
```

## Configuration

---

There is no real configuration needed. The `config/config_default.yml` file is there mainly as a placeholder for things that are subject to change. These are mainly

- the zenodo DOIs: until the workflow gets published, I am using the zenodo sandbox for testing.

The only value that makes a difference is `negatives`. This determines how many negative datasets to create. For reproducibility, leave it at `10`. You can mess around with different values, but be advised: **this has not been tested, and the workflow will most likely break**.
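If you nevertheless want to experiment, one possible approach (shown only as a sketch, not part of the documented workflow; the copied file name is made up) is to edit a copy of the default config and pass it to snakemake with `--configfile`:

```
# Hypothetical example: work on a copy of the default config
$ cp config/config_default.yml config/my_config.yml
# ... edit config/my_config.yml, e.g. the `negatives` value (untested; may break the workflow) ...
$ snakemake --use-conda --configfile config/my_config.yml -n
```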
## Usage

---

This workflow was built and tested on a local machine with graphics enabled.

> If you run this on a remote machine, make sure that you (can) ssh with `ssh -X ...`.
> This is required for the `summarize_intact.py` script, which uses the `ete3` package
> to do some plotting.

The most resource-demanding rules are

* ANI calculation: `fastani`
* AAI calculation: `comparem_call_genes`, `comparem_similarity`, `comparem_aai`
* HMM searches: `hmmsearch`, `hmmsearch_transeq`
* Model search: `random_forest`

`threads` have been set manually so that these run in a reasonable amount of time while still allowing parallel execution of jobs, given my own local setup (`Ubuntu 16.04.1 x86_64` with 120 GB of RAM and 20 processors). You should adjust them according to your needs.

> TO DO
> Include thread definition in the config

### **Option 1. This repo**

---

`cd` into the root directory of this repo.

- Dry run: always a good idea before launching the whole workflow

```
$ snakemake --use-conda -j16 -np
```

If the dry run completes with no errors, run the workflow by removing the `-n` flag.

* Adjust the number of parallel jobs (`-j`) according to your setup.
* Remove the `-p` flag if you don't want the commands to be printed.

```
$ snakemake --use-conda -j16 -p
```

- Speed up environment creation with mamba

If `mamba` is available in your snakemake environment, or if you created a new environment with the `environment.yml` provided here:

```
$ snakemake --use-conda -j16 --conda-frontend mamba
```

- Jupyter integration

A central notebook is used for all visualization and machine learning (model search) purposes. Its main output is the `results/RF/best_model.pkl` file. If you want to fiddle around with it yourself:

```
$ snakemake --use-conda -j16 --conda-frontend mamba --edit-notebook results/RF/best_model.pkl
```

Once `results/RF/best_model.pkl` is written, you can save the changes and quit the server ([more info here](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#jupyter-notebook-integration); you can always [see this demo](https://snakemake.readthedocs.io/en/stable/_images/snakemake-notebook-demo.gif)). This will trigger the execution of the rest of the workflow. The resulting notebook will be saved as `results/logs/processed_notebook.py.ipynb`.

Note that, depending on the changes you make, the results you get may differ from those of the default, non-interactive run.

### **Option 2.** Archived workflow from zenodo (TO DO)

---

Something along the lines of the [guidelines from snakemake](https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#sustainable-and-reproducible-archiving).

## Output

---

The output of the whole workflow is produced and stored within a `results` directory. It looks like the listing below (several directories and files are omitted for legibility). The most prominent entries are marked with an asterisk and a short description:

```
# Skipping several thousands of intermediate files with the -I option
$ tree -n -I '*NC*.fasta|*_genes.*|*.gff|*.log' results
results
├── annotations.tsv
├── filtered_scores.tsv -------------------- * Table containing feature values for all interactions passing filtering
├── final_training_set.tsv
├── interaction_datasets
│   ├── 01_filter_intact
│   ├── 02_summarize_intact
│   ├── 03_uniprot
│   ├── 04_process_uniprot
│   ├── 05_genomes
│   ├── 05_interaction_datasets
│   ├── 06_map_proteins_to_pvogs
│   ├── N1 --------------------------------
....                                         | * Features, interactions, proteins, and pvogs are stored per dataset
│   └── positives --------------------------
│       ├── positives.features.tsv
│       ├── positives.interactions.tsv
│       ├── positives.proteins.faa
│       └── positives.pvogs_interactions.tsv
├── logs
├── predictions.tsv ------------------------- * Final predictions made
├── pre_process
│   ├── all_genomes
│   ├── comparem --------------------------- * Directory with the final AAI matrix used
...
│   ├── fastani ---------------------------- * Directory with the final ANI matrix used
│   ├── hmmsearch -------------------------- * HMMER search results for all pvogs profiles against the translated genomes
│   ├── reflist.txt
│   └── transeq
│       └── transeq.genomes.fasta
├── RF
│   ├── best_model_id.txt ------------------- * Contains the id of the negative dataset
│   ├── best_model.pkl ---------------------- * The best model obtained
│   ├── features_stats.tsv ------------------ * Mean, max, min, std for feature importances
│   ├── features.tsv ------------------------ * Exact values of feature importances for each combination of training/validation
│   ├── figures ----------------------------- * Figures used in the manuscript
│   │   ├── Figure_1a.svg
....
....
│   ├── metrics.pkl
│   ├── metrics.stats.tsv ------------------- * Mean, max, min, std across all models
│   ├── metrics.tsv ------------------------- * Exact values of metrics for each combination of training/validation
│   └── models
│       ├── N10.RF.pkl ---------------------- * Best model obtained when optimizing with each negative set
.....
.....
└── scores.tsv ----------------------------- * Master table with feature values for all possible pVOGs combinations
```
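As with any Snakemake workflow, you can also request a single file from the listing above instead of the full run by passing it as a target (shown here only as an example; adjust `-j` to your setup):

```
# Build only the final predictions table (and whatever it depends on)
$ snakemake --use-conda -j16 results/predictions.tsv
```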