# PVOGs functions interactions
---
## TL;DR
---
```
# Clone this repo
$ git clone this_repo pvogs_function
# Get in there
$ cd pvogs_function
# Optional: skip if you already have snakemake>=5.14; requires a conda installation
$ conda env create -n my_env --file=environment.yml
$ conda activate my_env
# Dry run to check that it works
(my_env)$ snakemake --use-conda -n
```
## Description
---
The main purpose of this repository is to host the code necessary for full reproducibility.
* The required raw data are hosted on the [zenodo sandbox](https://sandbox.zenodo.org/record/666719#.X1c5qoZS_J8). They are automatically
downloaded when the workflow is executed, so there is no need to fetch them manually.

Most of the steps are not easily configurable unless you dive into the individual rules and scripts; this is by choice.
## Requirements
---
* A working [conda](https://docs.conda.io/en/latest/) installation
* `snakemake >= 5.14` (any version with jupyter integration should do)
* Optional: `mamba == 0.5.1` (speeds up environment dependency resolution and creation)
You can create the same `conda` environment used during development with the provided `environment.yml`:
```
$ conda env create -n my_env --file=./environment.yml
```
Make sure you activate it before launching snakemake:
```
$ conda activate my_env
(my_env)$ snakemake --version
5.23.0
```
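If `mamba` is not already available, one way to add it to the environment (a sketch; conda-forge is the usual channel, and the pin matches the optional version above):
```
# Install the optional mamba frontend into the activated environment
(my_env)$ conda install -c conda-forge mamba=0.5.1
```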
## Configuration
---
There is no real configuration needed. The `config/config_default.yml` file is there mainly as a placeholder for things that are
subject to change. These are mainly:
- the zenodo DOIs: until the workflow gets published, the zenodo sandbox is used for testing.

The only value that makes a practical difference is `negatives`, which determines how many negative datasets to create.
For reproducibility, leave it at `10`. You can experiment with different values, but
be advised: **this has not been tested, and the workflow will most likely break**.
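For orientation, a hypothetical sketch of what `config/config_default.yml` might contain; the key names below are illustrative assumptions, so check the actual file for the real ones:
```
$ cat config/config_default.yml
# HYPOTHETICAL contents -- key names are assumptions, not verbatim
zenodo_doi: "10.5072/zenodo.666719"  # sandbox record; will change upon publication
negatives: 10                        # number of negative datasets; keep at 10 for reproducibility
```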
## Usage
---
This workflow was built and tested on a local machine with graphics enabled.
>If you run this on a remote machine, make sure that you (can) ssh with `ssh -X ...`.
>This is required for the `summarize_intact.py` script, which uses the `ete3` package
>to do some plotting.
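For example (the user and host names below are placeholders):
```
# Connect with X11 forwarding enabled so ete3 can render its plots
$ ssh -X user@remote-host
# Quick sanity check: DISPLAY should be set, e.g. localhost:10.0
$ echo $DISPLAY
```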
The most resource-demanding rules are:
* ANI calculation: `fastani`
* AAI calculation: `comparem_call_genes`, `comparem_similarity`, `comparem_aai`
* HMM searches: `hmmsearch`, `hmmsearch_transeq`
* Model search: `random_forest`
`threads` have been set manually so that these rules run in a reasonable amount of time while still allowing parallel execution of jobs,
given my own local setup (`Ubuntu 16.04.1 x86_64` with 120 GB of RAM and 20 processors). You should adjust these according to your needs (see the sketch after the note below).
>TO DO
>Include thread definition in the config
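Until threads land in the config, a hedged workaround: newer Snakemake releases (roughly >= 5.30, i.e. newer than the 5.23.0 shown above) support `--set-threads` to override per-rule thread counts at invocation time, without touching the rules themselves.
```
# Override thread counts for the heavy rules listed above.
# NOTE: --set-threads is not available in snakemake 5.23.0; this assumes
# a newer release. The values shown are arbitrary examples.
$ snakemake --use-conda -j16 \
    --set-threads fastani=8 comparem_aai=8 hmmsearch=4 random_forest=16
```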
### **Option 1. This repo**
---
`cd` into the root directory of this repo.
- Dry run:
Always a good idea before launching the whole workflow:
```
$ snakemake --use-conda -j16 -np
```
If the dry run completes with no errors, run the workflow by removing the `-n` flag.
* Adjust number of parallel jobs (`-j`) according to your setup
* Remove the `-p` flag if you don't want the commands to be printed.
```
$ snakemake --use-conda -j16 -p
```
- Speed up environment creation with mamba
If `mamba` is available in your snakemake environment, or if you created a new environment with the `environment.yml`
provided here:
```
$ snakemake --use-conda -j16 --conda-frontend mamba
```
- Jupyter integration
A central notebook is used for all visualization and machine learning (model search) purposes.
Its main output is the `results/RF/best_model.pkl` file.
If you want to fiddle around with it yourself:
```
$ snakemake --use-conda -j16 --conda-frontend mamba --edit-notebook results/RF/best_model.pkl
```
Once `results/RF/best_model.pkl` is written, you can save the changes and quit the server
(see [more info here](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#jupyter-notebook-integration) and
[this demo](https://snakemake.readthedocs.io/en/stable/_images/snakemake-notebook-demo.gif)).
This will trigger the execution of the rest of the workflow.
The resulting notebook will be saved as `results/logs/processed_notebook.py.ipynb`.
Note that, depending on the changes you make, your results may differ from those of the default, non-interactive run.
### **Option 2. Archived workflow from zenodo** (TO DO)
---
Something along the lines of the [snakemake guidelines for sustainable and reproducible archiving](https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#sustainable-and-reproducible-archiving).
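As a hedged sketch of what that could look like: snakemake's built-in `--archive` option (described in the linked docs) packs the workflow code, config, and conda packages into a single archive that could then be uploaded to zenodo.
```
# Create a self-contained, reproducible archive of the workflow.
# Requires the repository to be under git version control.
$ snakemake --use-conda --archive pvogs_function.tar.gz
```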
## Output
---
The output of the whole workflow is produced and stored within a `results` directory. Its layout looks like the following (several directories
and files are omitted for legibility); the most prominent entries are marked with an asterisk and a short description:
```
# Skipping several thousands of intermediate files with the -I option
$ tree -n -I '*NC*.fasta|*_genes.*|*.gff|*.log' results
results
├── annotations.tsv
├── filtered_scores.tsv -------------------- * Table containing feature values for all interactions passing filtering
├── final_training_set.tsv
├── interaction_datasets
│ ├── 01_filter_intact
│ ├── 02_summarize_intact
│ ├── 03_uniprot
│ ├── 04_process_uniprot
│ ├── 05_genomes
│ ├── 05_interaction_datasets
│ ├── 06_map_proteins_to_pvogs
│ ├── N1 --------------------------------
.... | * Features, interactions, proteins, and pvogs are stored per dataset
│ └── positives --------------------------
│ ├── positives.features.tsv
│ ├── positives.interactions.tsv
│ ├── positives.proteins.faa
│ └── positives.pvogs_interactions.tsv
├── logs
├── predictions.tsv ------------------------- * Final predictions made
├── pre_process
│ ├── all_genomes
│ ├── comparem --------------------------- * Directory with the final AAI matrix used
...
│ ├── fastani ---------------------------- * Directory with the final ANI matrix used
│   ├── hmmsearch -------------------------- * HMMER search results for all pvogs profiles against the translated genomes
│ ├── reflist.txt
│ └── transeq
│ └── transeq.genomes.fasta
├── RF
│   ├── best_model_id.txt ------------------- * Contains the id of the negative dataset that produced the best model
│ ├── best_model.pkl ---------------------- * The best model obtained.
│   ├── features_stats.tsv ------------------ * Mean, max, min, std for feature importances
│   ├── features.tsv ------------------------ * Exact values of feature importances for each combination of training/validation
│ ├── figures ----------------------------- * Figures used in the manuscript.
│ │ ├── Figure_1a.svg
....
....
│ ├── metrics.pkl
│   ├── metrics.stats.tsv ------------------- * Mean, max, min, std across all models
│ ├── metrics.tsv ------------------------- * Exact values of metrics for each combination of training/validation
│ └── models
│ ├── N10.RF.pkl ---------------------- * Best model obtained when optimizing with each negative set
.....
.....
└── scores.tsv ----------------------------- * Master table with feature values for all possible pVOGs combinations
```