    # PVOGs functions interactions
    ---
    
    ## TL;DR
    ---
    
    ```
    # Clone this repo
    $ git clone this_repo pvogs_function
    
    # Get in there
    $ cd pvogs_function
    
# Optional: skip if snakemake >= 5.14 and conda are already available
    $ conda env create -n my_env --file=environment.yml
    $ conda activate my_env
    
    
    # Dry run to check that it works
    (my_env)$ snakemake --use-conda -n
    ```
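
If the dry run completes without errors, drop the `-n` flag to launch the full workflow (covered in more detail under Usage below):
```
# Full run; adjust -j to the number of cores you can spare
(my_env)$ snakemake --use-conda -j16
```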
    
    ## Description
    ---
    The main purpose of this repository is to host the code necessary for full reproducibility.
    
* The required raw data are hosted on the [zenodo sandbox](https://sandbox.zenodo.org/record/666719#.X1c5qoZS_J8). They are downloaded
automatically when the workflow is executed, so there is no need to fetch them manually.
    
Most of the steps are not easily configurable unless you dive into the individual rules and scripts. This is by choice.
    
    ## Requirements
    ---
    
    * A working [conda](https://docs.conda.io/en/latest/) installation
    * `snakemake >= 5.14` (any version with jupyter integration should do)
      * Optional: `mamba == 0.5.1` (speeds up environment dependency resolution and creation)
    
    You can create the same `conda` environment used during development with the provided `environment.yml`.
    ```
    $ conda env create -n my_env --file=./environment.yml
    ```
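
If the optional `mamba` is not already available, one way to add it to the environment (assuming the `conda-forge` channel) is:
```
# Optional: faster dependency resolution when snakemake builds rule environments
$ conda install -n my_env -c conda-forge mamba=0.5.1
```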
    
Make sure you activate it before launching snakemake:
    ```
    $ conda activate my_env
    (my_env)$ snakemake --version
    5.23.0
    ```
    
    ## Configuration
    ---
    
No real configuration is needed. The `config/config_default.yml` file is there mainly as a placeholder for things that are
subject to change. These are mainly
  - the zenodo DOIs: until the workflow is published, the zenodo sandbox is used for testing.

The only value that makes a difference is `negatives`, which determines how many negative datasets to create.
For reproducibility, leave it at `10`. You can experiment with different values, but
be advised: **this has not been tested, and the workflow will most likely break**.
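
For orientation, a hypothetical sketch of what the file might look like (apart from `negatives`, the key names and values below are assumptions, not the actual contents):
```
$ cat config/config_default.yml
negatives: 10        # number of negative datasets; leave at 10
zenodo_doi: "..."    # sandbox DOI, subject to change before publication
```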
    
    ## Usage
    ---
This workflow was built and tested on a local machine with graphics enabled.
    
>If you run this on a remote machine, make sure that you (can) connect with `ssh -X ...`.
>This is required for the `summarize_intact.py` script, which uses the `ete3` package
>to do some plotting.
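
A quick way to verify that X forwarding works (`user@remote-host` is a placeholder):
```
# From your local machine
$ ssh -X user@remote-host

# On the remote machine: a non-empty DISPLAY means forwarding is active
$ echo $DISPLAY
localhost:10.0
```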
    
The most resource-demanding rules are:
    * ANI calculation: `fastani`
    * AAI calculation: `comparem_call_genes`, `comparem_similarity`, `comparem_aai`
    * HMM searches: `hmmsearch`, `hmmsearch_transeq`
    * Model search: `random_forest`
    
    
`threads` have been set manually so that these rules finish in a reasonable amount of time while still allowing parallel execution of jobs,
given my own local setup (`Ubuntu 16.04.1 x86_64` with 120 GB of RAM and 20 processors). You should adjust these according to your needs.
    
    >TO DO
    >Include thread definition in the config
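
Until then, note that snakemake releases newer than the one used here can override a rule's `threads` from the command line with `--set-threads`; a sketch (check that your version supports the flag):
```
# Override threads per rule without editing the code (newer snakemake only)
$ snakemake --use-conda -j16 --set-threads fastani=8 hmmsearch=8
```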
    
    
    ### **Option 1. This repo**
    ---
    `cd` into the root directory of this repo.
    
      - Dry run:
Always a good idea before launching the whole workflow:
    ```
    $ snakemake --use-conda -j16 -np
    ```
    
If the dry run completes with no errors, run the workflow by removing the `-n` flag.
  * Adjust the number of parallel jobs (`-j`) according to your setup
  * Remove the `-p` flag if you don't want the commands to be printed.
    ```
    $ snakemake --use-conda -j16 -p
    ```
      - Speed up environment creation with mamba
If `mamba` is available in your snakemake environment, or if you created a new environment from the `environment.yml`
provided here:
    ```
    $ snakemake --use-conda -j16 --conda-frontend mamba
    ```
    
      - Jupyter integration
    A central notebook is used for all visualization and machine learning (model search) purposes.
    Its main output is the `results/RF/best_model.pkl` file.
    
If you want to fiddle around with it yourself:
    ```
    $ snakemake --use-conda -j16 --conda-frontend mamba --edit-notebook results/RF/best_model.pkl
    ```
Once `results/RF/best_model.pkl` is written, you can save the changes and quit the server
([more info here](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#jupyter-notebook-integration); you can always
[see this demo](https://snakemake.readthedocs.io/en/stable/_images/snakemake-notebook-demo.gif)).
This will trigger the execution of the rest of the workflow.
    
    The resulting notebook will be saved as `results/logs/processed_notebook.py.ipynb`.
    
Note that, depending on the changes you make, your results may differ from those of the default, non-interactive run.
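
After a run, a minimal sanity check that the model deserializes (assuming, as the `RF` directory and the `random_forest` rule suggest, a pickled scikit-learn random forest; the printed class is an assumption):
```
(my_env)$ python -c "import pickle; m = pickle.load(open('results/RF/best_model.pkl', 'rb')); print(type(m))"
<class 'sklearn.ensemble._forest.RandomForestClassifier'>
```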
    
    
### **Option 2. Archived workflow from zenodo (TO DO)**
    ---
Something along the lines of the [snakemake guidelines on sustainable and reproducible archiving](https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#sustainable-and-reproducible-archiving).
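
Presumably this will use snakemake's built-in archiving, along these lines (the archive name is a placeholder):
```
# Bundle code, config, input files, and conda packages into one archive
$ snakemake --archive pvogs_function.tar.gz
```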
    
    
    ## Output
    ---
All output of the workflow is produced and stored within a `results` directory, shown below with several directories
and files omitted for legibility. The most prominent entries are marked with an asterisk and a short description:
    ```
    # Skipping several thousands of intermediate files with the -I option
    $ tree -n -I '*NC*.fasta|*_genes.*|*.gff|*.log' results
    
    results
    ├── annotations.tsv
    ├── filtered_scores.tsv -------------------- * Table containing feature values for all interactions passing filtering
    ├── final_training_set.tsv
    ├── interaction_datasets
    │   ├── 01_filter_intact
    │   ├── 02_summarize_intact
    │   ├── 03_uniprot
    │   ├── 04_process_uniprot
    │   ├── 05_genomes
    │   ├── 05_interaction_datasets
    │   ├── 06_map_proteins_to_pvogs
    │   ├── N1  --------------------------------  
    ....                                        | * Features, interactions, proteins, and pvogs are stored per dataset
    │   └── positives --------------------------  
    │       ├── positives.features.tsv
    │       ├── positives.interactions.tsv
    │       ├── positives.proteins.faa
    │       └── positives.pvogs_interactions.tsv
    ├── logs
    ├── predictions.tsv ------------------------- * Final predictions made
    ├── pre_process
    │   ├── all_genomes
    │   ├── comparem  --------------------------- * Directory with the final AAI matrix used
    ...
    │   ├── fastani  ---------------------------- * Directory with the final ANI matrix used
│   ├── hmmsearch  -------------------------- * HMMER search results for all pvogs profiles against the translated genomes
    │   ├── reflist.txt
    │   └── transeq
    │       └── transeq.genomes.fasta
    ├── RF
    │   ├── best_model_id.txt ------------------- * Contains the id of the negative dataset
    │   ├── best_model.pkl ---------------------- * The best model obtained.
│   ├── features_stats.tsv ------------------ * Mean, max, min, std for feature importances
│   ├── features.tsv ------------------------ * Exact values of feature importances for each combination of training/validation
    │   ├── figures ----------------------------- * Figures used in the manuscript.       
    │   │   ├── Figure_1a.svg
            ....
    ....
    │   ├── metrics.pkl
│   ├── metrics.stats.tsv ------------------- * Mean, max, min, std across all models
    │   ├── metrics.tsv ------------------------- * Exact values of metrics for each combination of training/validation
    │   └── models
    │       ├── N10.RF.pkl ---------------------- * Best model obtained when optimizing with each negative set
            .....
    .....		
    └── scores.tsv  ----------------------------- * Master table with feature values for all possible pVOGs combinations
    
    ```
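
To take a first look at the final predictions table (a generic peek; the column layout is not described here):
```
$ head -n 5 results/predictions.tsv | column -t
```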