From a06d66f8171a7fc0ac8e00782c88ed5df50dca82 Mon Sep 17 00:00:00 2001
From: nikos <n.pappas@uu.nl>
Date: Tue, 12 Jan 2021 11:19:04 +0100
Subject: [PATCH] fix tree output formatting

---
 README.md | 43 +++++++++++++++++++++++++++----------------
 1 file changed, 27 insertions(+), 16 deletions(-)

diff --git a/README.md b/README.md
index 2c5dfad..e510c3a 100644
--- a/README.md
+++ b/README.md
@@ -13,7 +13,8 @@ tou do not need to worry about getting models or any other external databases.
 
 |Tool (source) | Publication/Preprint |
 |:------|:------|
-[RaFAh](https://sourceforge.net/projects/rafah/)|[Coutinho F. H. et al. 2020](https://www.biorxiv.org/content/10.1101/2020.09.25.313155v1?rss=1)
+[HTP](https://github.com/wojciech-galan/viruses_classifier)|[Gałan W. et al., 2019](https://www.nature.com/articles/s41598-019-39847-2)
+[RaFAh](https://sourceforge.net/projects/rafah/)|[Coutinho F. H. et al., 2020](https://www.biorxiv.org/content/10.1101/2020.09.25.313155v1?rss=1)
 [vHuLK](https://github.com/LaboratorioBioinformatica/vHULK)|[Amgarten D. et al., 2020](https://www.biorxiv.org/content/10.1101/2020.12.06.413476v1)
 [VirHostMatcher-Net](https://github.com/WeiliWw/VirHostMatcher-Net)|[Wang W. et al., 2020](https://doi.org/10.1093/nargab/lqaa044])
 [WIsH](https://github.com/soedinglab/WIsH)|[Galiez G. et al., 2017](https://academic.oup.com/bioinformatics/article/33/19/3113/3964377)
@@ -36,7 +37,7 @@ The file `environment.txt` can be used to recreate the complete environment
 used during development.
 
 > The provided `environment.txt` contains an explicit list of all packages,
-> produced with `conda list -n hp --explicit > environment.txt` .
+> produced with `conda list -n phap --explicit > environment.txt` .
 > This ensures all packages are exactly the same versions/builds, so we
 > minimize the risk of running into dependencies issues
 
@@ -158,8 +159,12 @@ For each sample, results for each tool are stored in directories named after
 the tool. An example looks like this:
 
 ```
-results/A/
+$ tree -L2 results/A
+results/A
 ├── all_predictions.tsv
+├── htp
+│  ├── predictions.tsv
+│  └── raw.txt
 ├── rafah
 │  ├── A_CDS.faa
 │  ├── A_CDS.fna
@@ -183,47 +188,53 @@ results/A/
 │  └── results
 └── wish
     ├── llikelihood.matrix
-    ├── prediction.list
-    └── predictions.tsv
+    ├── prediction.list
+    └── predictions.tsv
 ```
 
 ### Per sample
 
 * `all_predictions.tsv`: Contains the best prediction per contig (rows) for
-each tool along with its confidence/p-value/whatever single value each tool
+each tool along with its confidence/p-value/whatever-single-value each tool
 uses to evaluate its confidence in the prediction.
 
 An example for three genomes:
 ```
-contig vhulk_pred vhulk_score rafah_pred rafah_score vhmnet_pred vhmnet_score wish_pred wish_score
-NC_005964.2 None 4.068828 Mycoplasma 0.461 Mycoplasma fermentans 0.9953 Bacteria;Tenericutes;Mollicutes;Mycoplasmatales;Mycoplasmataceae;Mycoplasma;Mycoplasma fermentans;Mycoplasma fermentans MF-I2 -1.2085700000000001
-NC_015271.1 Escherichia_coli 1.0301523 Salmonella 0.495 Muricauda pacifica 0.9968 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Raoultella;Raoultella sp. NCTC 9187;Raoultella sp. NCTC 9187 -1.3869200000000002
-NC_023719.1 Bacillus 0.0012575098 Bacillus 0.55 Clostridium sp. LS 1.0000 Bacteria;Firmicutes;Clostridia;Clostridiales;Clostridiaceae;Clostridium;Clostridium beijerinckii;Clostridium beijerinckii -1.29454
+contig htp_proba vhulk_pred vhulk_score rafah_pred rafah_score vhmnet_pred vhmnet_score wish_pred wish_score
+NC_005964.2 0.8464285626352002 None 4.068828 Mycoplasma 0.461 Mycoplasma fermentans 0.9953 Bacteria;Tenericutes;Mollicutes;Mycoplasmatales;Mycoplasmataceae;Mycoplasma;Mycoplasma fermentans;Mycoplasma fermentans MF-I2 -1.2085700000000001
+NC_015271.1 0.995161392517451 Escherichia_coli 1.0301523 Salmonella 0.495 Muricauda pacifica 0.9968 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Raoultella;Raoultella sp. NCTC 9187;Raoultella sp. NCTC 9187 -1.3869200000000002
+NC_023719.1 0.9999957241187084 Bacillus 0.0012575098 Bacillus 0.55 Clostridium sp. LS 1.0000 Bacteria;Firmicutes;Clostridia;Clostridiales;Clostridiaceae;Clostridium;Clostridium beijerinckii;Clostridium beijerinckii -1.29454
 ```
 
 * `tmp` directory
-  * Contains one fasta file per input genome, along with other intermediate
-files necessary for a smooth execution of the workflow.
+  * Directory `genomes`: Contains one fasta file per input genome
+  * File `reflist.txt`: An intermediate file that holds paths to all produced
+genome fastas (used as intermediate file to ensure smooth execution)
 
 ### Per tool
 
+* `htp`
+  * File `raw.txt`: The raw output of `htp` per contig
+  * File `predictions.tsv`: **Two**-column separated tsv with contig id and
+probability of host being a phage.
+
 * `rafah`
-  * All files prefixed with `<sample_id>_` are the rafah's raw output
+  * Files prefixed with `<sample_id>_` are the rafah's raw output
   * `predictions.tsv`: A selection of the 1st (`Contig`) , 6th (`Predicted_Host`) and
 7th (`Predicted_Host_Score`) columns from file `<sample_id>_Seq_Info.tsv`
 
 * `vhulk`
-  * `results.csv`: Copy of the `results/sample/tmp/genomes/results/results.csv`
-  * `predictions.tsv`: A selection of the 1st (`BIN/genome`), 10th (`final_prediction`)
+  * File `results.csv`: Copy of the `results/sample/tmp/genomes/results/results.csv`
+  * File `predictions.tsv`: A selection of the 1st (`BIN/genome`), 10th (`final_prediction`)
 11th (`entropy`) columns from file `results.csv`.
 
 * `vhmnet`
   * Directories `feature_values` and `predictions` are the raw output
   * Directory `tmp` is a temporary dir written by `VirHostMatcher-Net` for doing
 its magic.
-  * `predictions.tsv` contain contig, host taxonomy and scores.
+  * File `predictions.tsv` contains contig, host taxonomy and scores.
 
 * `wish`
   * Files `llikelihood.matrix` and `prediction.list` are the raw output
-- 
GitLab