# PVOGs functions interactions
---
## TL;DR
---
```
# Clone this repo
$ git clone this_repo pvogs_function
# Get in there
$ cd pvogs_function
# Optional: skip if you already have snakemake>=5.14; requires a conda installation
$ conda env create -n my_env --file=environment.yml
$ conda activate my_env
# Dry run to check that it works
(my_env)$ snakemake --use-conda -n
```
## Description
---
The main purpose of this repository is to host the code necessary for full reproducibility.
* The required raw data are hosted on the [zenodo sandbox](https://sandbox.zenodo.org/record/666719#.X1c5qoZS_J8). They are automatically
downloaded when the workflow is executed, so there is no need to fetch them manually.

Most of the steps are not easily configurable unless you dive into the individual rules and scripts; this is by choice.
## Requirements
---
* A working [conda](https://docs.conda.io/en/latest/) installation
* `snakemake >= 5.14` (any version with jupyter integration should do)
* Optional: `mamba == 0.5.1` (speeds up environment dependency resolution and creation)
You can create the same `conda` environment used during development with the provided `environment.yml`:
```
$ conda env create -n my_env --file=./environment.yml
```
Make sure you activate it before launching snakemake:
```
$ conda activate my_env
(my_env)$ snakemake --version
5.23.0
```
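If `mamba` is not already available, one way to add it to the environment (a sketch; conda-forge is the usual channel, and the pin matches the optional version above):
```
# Install the optional mamba frontend into the activated environment
(my_env)$ conda install -c conda-forge mamba=0.5.1
```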
## Configuration
---
There is no real configuration needed. The `config/config_default.yml` file is there mainly as a placeholder for things that are
subject to change. These are mainly:
- the zenodo DOIs: until the workflow gets published, the zenodo sandbox is used for testing.

The only value that makes a practical difference is `negatives`, which determines how many negative datasets to create.
For reproducibility, leave it at `10`. You can experiment with different values, but
be advised: **this has not been tested, and the workflow will most likely break**.
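For orientation, a hypothetical sketch of what `config/config_default.yml` might contain; the key names below are illustrative assumptions, so check the actual file for the real ones:
```
$ cat config/config_default.yml
# HYPOTHETICAL contents -- key names are assumptions, not verbatim
zenodo_doi: "10.5072/zenodo.666719"  # sandbox record; will change upon publication
negatives: 10                        # number of negative datasets; keep at 10 for reproducibility
```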
## Usage
---
This workflow was built and tested on a local machine with graphics enabled.
>If you run this on a remote machine, make sure that you (can) ssh with `ssh -X ...`.
>This is required for the `summarize_intact.py` script, which uses the `ete3` package
>to do some plotting.
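For example (the user and host names below are placeholders):
```
# Connect with X11 forwarding enabled so ete3 can render its plots
$ ssh -X user@remote-host
# Quick sanity check: DISPLAY should be set, e.g. localhost:10.0
$ echo $DISPLAY
```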
The most resource-demanding rules are:
* ANI calculation: `fastani`
* AAI calculation: `comparem_call_genes`, `comparem_similarity`, `comparem_aai`
* HMM searches: `hmmsearch`, `hmmsearch_transeq`
* Model search: `random_forest`
`threads` have been set manually so that these rules run in a reasonable amount of time while still allowing parallel execution of jobs,
given my own local setup (`Ubuntu 16.04.1 x86_64` with 120 GB of RAM and 20 processors). You should adjust these according to your needs (see the sketch after the note below).
>TO DO
>Include thread definition in the config
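Until threads land in the config, a hedged workaround: newer Snakemake releases (roughly >= 5.30, i.e. newer than the 5.23.0 shown above) support `--set-threads` to override per-rule thread counts at invocation time, without touching the rules themselves.
```
# Override thread counts for the heavy rules listed above.
# NOTE: --set-threads is not available in snakemake 5.23.0; this assumes
# a newer release. The values shown are arbitrary examples.
$ snakemake --use-conda -j16 \
    --set-threads fastani=8 comparem_aai=8 hmmsearch=4 random_forest=16
```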
### **Option 1. This repo**
---
`cd` into the root directory of this repo.
- Dry run:
Always a good idea before launching the whole workflow:
```
$ snakemake --use-conda -j16 -np
```
If the dry run completes with no errors, run the workflow by removing the `-n` flag.
* Adjust number of parallel jobs (`-j`) according to your setup
* Remove the `-p` flag if you don't want the commands to be printed.
```
$ snakemake --use-conda -j16 -p
```
- Speed up environment creation with mamba
If `mamba` is available in your snakemake environment, or if you created a new environment with the `environment.yml`
provided here:
```
$ snakemake --use-conda -j16 --conda-frontend mamba
```
- Jupyter integration
A central notebook is used for all visualization and machine learning (model search) purposes.
Its main output is the `results/RF/best_model.pkl` file.
If you want to fiddle around with it yourself:
```
$ snakemake --use-conda -j16 --conda-frontend mamba --edit-notebook results/RF/best_model.pkl
```
Once `results/RF/best_model.pkl` is written, you can save the changes and quit the server
(see [more info here](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#jupyter-notebook-integration) and
[this demo](https://snakemake.readthedocs.io/en/stable/_images/snakemake-notebook-demo.gif)).
This will trigger the execution of the rest of the workflow.
The resulting notebook will be saved as `results/logs/processed_notebook.py.ipynb`.
Note that, depending on the changes you make, your results may differ from those of the default, non-interactive run.
### **Option 2. Archived workflow from zenodo** (TO DO)
---
Something along the lines of the [snakemake guidelines for sustainable and reproducible archiving](https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#sustainable-and-reproducible-archiving).
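As a hedged sketch of what that could look like: snakemake's built-in `--archive` option (described in the linked docs) packs the workflow code, config, and conda packages into a single archive that could then be uploaded to zenodo.
```
# Create a self-contained, reproducible archive of the workflow.
# Requires the repository to be under git version control.
$ snakemake --use-conda --archive pvogs_function.tar.gz
```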
## Output
---
The output of the whole workflow is produced and stored within a `results` directory. Its layout looks like the following (several directories
and files are omitted for legibility); the most prominent entries are marked with an asterisk and a short description:
```
# Skipping several thousands of intermediate files with the -I option
$ tree -n -I '*NC*.fasta|*_genes.*|*.gff|*.log' results
results
├── annotations.tsv
├── filtered_scores.tsv -------------------- * Table containing feature values for all interactions passing filtering
├── final_training_set.tsv
├── interaction_datasets
│ ├── 01_filter_intact
│ ├── 02_summarize_intact
│ ├── 03_uniprot
│ ├── 04_process_uniprot
│ ├── 05_genomes
│ ├── 05_interaction_datasets
│ ├── 06_map_proteins_to_pvogs
│ ├── N1 --------------------------------
.... | * Features, interactions, proteins, and pvogs are stored per dataset
│ └── positives --------------------------
│ ├── positives.features.tsv
│ ├── positives.interactions.tsv
│ ├── positives.proteins.faa
│ └── positives.pvogs_interactions.tsv
├── logs
├── predictions.tsv ------------------------- * Final predictions made
├── pre_process
│ ├── all_genomes
│ ├── comparem --------------------------- * Directory with the final AAI matrix used
...
│ ├── fastani ---------------------------- * Directory with the final ANI matrix used
│   ├── hmmsearch -------------------------- * HMMER search results for all pvogs profiles against the translated genomes
│ ├── reflist.txt
│ └── transeq
│ └── transeq.genomes.fasta
├── RF
│   ├── best_model_id.txt ------------------- * Contains the id of the negative dataset that produced the best model
│ ├── best_model.pkl ---------------------- * The best model obtained.
│   ├── features_stats.tsv ------------------ * Mean, max, min, std for feature importances
│   ├── features.tsv ------------------------ * Exact values of feature importances for each combination of training/validation
│ ├── figures ----------------------------- * Figures used in the manuscript.
│ │ ├── Figure_1a.svg
....
....
│ ├── metrics.pkl
│   ├── metrics.stats.tsv ------------------- * Mean, max, min, std across all models
│ ├── metrics.tsv ------------------------- * Exact values of metrics for each combination of training/validation
│ └── models
│ ├── N10.RF.pkl ---------------------- * Best model obtained when optimizing with each negative set
.....
.....
└── scores.tsv ----------------------------- * Master table with feature values for all possible pVOGs combinations
```