    # PVOGs functions interactions
    ---
    
    ## TL;DR
    ---
    
    ```
    # Clone this repo
    $ git clone this_repo pvogs_function
    
    # Get in there
    $ cd pvogs_function
    
# Optional: skip if snakemake >= 5.14 and conda are already available
    $ conda env create -n my_env --file=environment.yml
    $ conda activate my_env
    
    
    # Dry run to check that it works
    (my_env)$ snakemake --use-conda -n
    ```
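
If the dry run completes without errors, drop the `-n` flag to launch the full workflow (covered in more detail under Usage below):
```
# Full run; adjust -j to the number of cores you can spare
(my_env)$ snakemake --use-conda -j16
```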
    
    ## Description
    ---
    The main purpose of this repository is to host the code necessary for full reproducibility.
    
* The required raw data are hosted on the [zenodo sandbox](https://sandbox.zenodo.org/record/666719#.X1c5qoZS_J8). They are downloaded
automatically when the workflow is executed, so there is no need to fetch them manually.
    
Most of the steps are not easily configurable unless you dive into the individual rules and scripts. This is by choice.
    
    ## Requirements
    ---
    
    * A working [conda](https://docs.conda.io/en/latest/) installation
    * `snakemake >= 5.14` (any version with jupyter integration should do)
      * Optional: `mamba == 0.5.1` (speeds up environment dependency resolution and creation)
    
    You can create the same `conda` environment used during development with the provided `environment.yml`.
    ```
    $ conda env create -n my_env --file=./environment.yml
    ```
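
If the optional `mamba` is not already available, one way to add it to the environment (assuming the `conda-forge` channel) is:
```
# Optional: faster dependency resolution when snakemake builds rule environments
$ conda install -n my_env -c conda-forge mamba=0.5.1
```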
    
Make sure you activate it before launching snakemake:
    ```
    $ conda activate my_env
    (my_env)$ snakemake --version
    5.23.0
    ```
    
    ## Configuration
    ---
    
No real configuration is needed. The `config/config_default.yml` file is there mainly as a placeholder for things that are
subject to change. These are mainly
  - the zenodo DOIs: until the workflow is published, the zenodo sandbox is used for testing.

The only value that makes a difference is `negatives`, which determines how many negative datasets to create.
For reproducibility, leave it at `10`. You can experiment with different values, but
be advised: **this has not been tested, and the workflow will most likely break**.
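
For orientation, a hypothetical sketch of what the file might look like (apart from `negatives`, the key names and values below are assumptions, not the actual contents):
```
$ cat config/config_default.yml
negatives: 10        # number of negative datasets; leave at 10
zenodo_doi: "..."    # sandbox DOI, subject to change before publication
```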
    
    ## Usage
    ---
This workflow was built and tested on a local machine with graphics enabled.
    
>If you run this on a remote machine, make sure that you (can) connect with `ssh -X ...`.
>This is required for the `summarize_intact.py` script, which uses the `ete3` package
>to do some plotting.
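
A quick way to verify that X forwarding works (`user@remote-host` is a placeholder):
```
# From your local machine
$ ssh -X user@remote-host

# On the remote machine: a non-empty DISPLAY means forwarding is active
$ echo $DISPLAY
localhost:10.0
```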
    
The most resource-demanding rules are:
    * ANI calculation: `fastani`
    * AAI calculation: `comparem_call_genes`, `comparem_similarity`, `comparem_aai`
    * HMM searches: `hmmsearch`, `hmmsearch_transeq`
    * Model search: `random_forest`
    
    
`threads` have been set manually so that these rules finish in a reasonable amount of time while still allowing parallel execution of jobs,
given my own local setup (`Ubuntu 16.04.1 x86_64` with 120 GB of RAM and 20 processors). You should adjust these according to your needs.
    
    >TO DO
    >Include thread definition in the config
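
Until then, note that snakemake releases newer than the one used here can override a rule's `threads` from the command line with `--set-threads`; a sketch (check that your version supports the flag):
```
# Override threads per rule without editing the code (newer snakemake only)
$ snakemake --use-conda -j16 --set-threads fastani=8 hmmsearch=8
```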
    
    
    ### **Option 1. This repo**
    ---
    `cd` into the root directory of this repo.
    
      - Dry run:
Always a good idea before launching the whole workflow:
    ```
    $ snakemake --use-conda -j16 -np
    ```
    
If the dry run completes with no errors, run the workflow by removing the `-n` flag.
  * Adjust the number of parallel jobs (`-j`) according to your setup
  * Remove the `-p` flag if you don't want the commands to be printed.
    ```
    $ snakemake --use-conda -j16 -p
    ```
      - Speed up environment creation with mamba
If `mamba` is available in your snakemake environment, or if you created a new environment from the `environment.yml`
provided here:
    ```
    $ snakemake --use-conda -j16 --conda-frontend mamba
    ```
    
      - Jupyter integration
    A central notebook is used for all visualization and machine learning (model search) purposes.
    Its main output is the `results/RF/best_model.pkl` file.
    
If you want to fiddle around with it yourself:
    ```
    $ snakemake --use-conda -j16 --conda-frontend mamba --edit-notebook results/RF/best_model.pkl
    ```
Once `results/RF/best_model.pkl` is written, you can save the changes and quit the server
([more info here](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#jupyter-notebook-integration); you can always
[see this demo](https://snakemake.readthedocs.io/en/stable/_images/snakemake-notebook-demo.gif)).
This will trigger the execution of the rest of the workflow.
    
    The resulting notebook will be saved as `results/logs/processed_notebook.py.ipynb`.
    
Note that, depending on the changes you make, your results may differ from those of the default, non-interactive run.
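
After a run, a minimal sanity check that the model deserializes (assuming, as the `RF` directory and the `random_forest` rule suggest, a pickled scikit-learn random forest; the printed class is an assumption):
```
(my_env)$ python -c "import pickle; m = pickle.load(open('results/RF/best_model.pkl', 'rb')); print(type(m))"
<class 'sklearn.ensemble._forest.RandomForestClassifier'>
```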
    
    
### **Option 2. Archived workflow from zenodo (TO DO)**
    ---
Something along the lines of the [snakemake guidelines on sustainable and reproducible archiving](https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#sustainable-and-reproducible-archiving).
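
Presumably this will use snakemake's built-in archiving, along these lines (the archive name is a placeholder):
```
# Bundle code, config, input files, and conda packages into one archive
$ snakemake --archive pvogs_function.tar.gz
```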
    
    
    ## Output
    ---
All output of the workflow is produced and stored within a `results` directory, shown below with several directories
and files omitted for legibility. The most prominent entries are marked with an asterisk and a short description:
    ```
    # Skipping several thousands of intermediate files with the -I option
    $ tree -n -I '*NC*.fasta|*_genes.*|*.gff|*.log' results
    
    results
    ├── annotations.tsv
    ├── filtered_scores.tsv -------------------- * Table containing feature values for all interactions passing filtering
    ├── final_training_set.tsv
    ├── interaction_datasets
    │   ├── 01_filter_intact
    │   ├── 02_summarize_intact
    │   ├── 03_uniprot
    │   ├── 04_process_uniprot
    │   ├── 05_genomes
    │   ├── 05_interaction_datasets
    │   ├── 06_map_proteins_to_pvogs
    │   ├── N1  --------------------------------  
    ....                                        | * Features, interactions, proteins, and pvogs are stored per dataset
    │   └── positives --------------------------  
    │       ├── positives.features.tsv
    │       ├── positives.interactions.tsv
    │       ├── positives.proteins.faa
    │       └── positives.pvogs_interactions.tsv
    ├── logs
    ├── predictions.tsv ------------------------- * Final predictions made
    ├── pre_process
    │   ├── all_genomes
    │   ├── comparem  --------------------------- * Directory with the final AAI matrix used
    ...
    │   ├── fastani  ---------------------------- * Directory with the final ANI matrix used
│   ├── hmmsearch  -------------------------- * HMMER search results for all pvogs profiles against the translated genomes
    │   ├── reflist.txt
    │   └── transeq
    │       └── transeq.genomes.fasta
    ├── RF
    │   ├── best_model_id.txt ------------------- * Contains the id of the negative dataset
    │   ├── best_model.pkl ---------------------- * The best model obtained.
│   ├── features_stats.tsv ------------------ * Mean, max, min, std for feature importances
│   ├── features.tsv ------------------------ * Exact values of feature importances for each combination of training/validation
    │   ├── figures ----------------------------- * Figures used in the manuscript.       
    │   │   ├── Figure_1a.svg
            ....
    ....
    │   ├── metrics.pkl
│   ├── metrics.stats.tsv ------------------- * Mean, max, min, std across all models
    │   ├── metrics.tsv ------------------------- * Exact values of metrics for each combination of training/validation
    │   └── models
    │       ├── N10.RF.pkl ---------------------- * Best model obtained when optimizing with each negative set
            .....
    .....		
    └── scores.tsv  ----------------------------- * Master table with feature values for all possible pVOGs combinations
    
    ```
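
To take a first look at the final predictions table (a generic peek; the column layout is not described here):
```
$ head -n 5 results/predictions.tsv | column -t
```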