PATHDET Manual

1. Data preparation
2. File uploading
3. Access to the results
4. Rapid report
5. Full report
Appendix A. Removal of the sequence reads from the host genome
Appendix B. Browser compatibility
Appendix C. Software used in PATHDET

1. Data preparation

Prepare a FastQ file of short or long reads obtained from cell-free DNA with a high-throughput sequencer. A compressed (gzipped) file is recommended. The following file extensions are permitted: .fastq.gz, .fq.gz, .fastq, and .fq.
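
The extension rule above can be checked locally before uploading. The following is a minimal sketch and not part of PATHDET: the function name is_permitted_fastq_name is hypothetical, the matching is assumed to be case-sensitive, and only the file name is inspected, not the file content.

```shell
# Hypothetical helper (not part of PATHDET): succeed only when the file name
# ends in one of the extensions listed in this manual.
is_permitted_fastq_name() {
  case "$1" in
    *.fastq.gz|*.fq.gz|*.fastq|*.fq) return 0 ;;  # permitted extensions
    *) return 1 ;;                                # anything else
  esac
}
```

For example, "is_permitted_fastq_name samples.fastq.gz" returns success, whereas a name ending in .fasta returns failure.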

2. File uploading

On the Analysis page, upload the file and fill in all the fields of the sample information. When uploading a FastQ file from which the host-genome reads have been removed, enter the number of sequence reads before the depletion. After the upload, the FastQ file format is validated, which takes several minutes. If the file passes the validation, the PATHDET pipeline is executed automatically. Each PATHDET run is assigned a Run ID, which is a random character string, and a URL that includes the Run ID is created for viewing the results. The URL initially redirects to a waiting screen, which is automatically reloaded every 60 minutes until the full report is displayed. The URL is also sent to the email address entered. If the uploaded file exceeds the sequence read number limit or fails the validation, an error page is returned. Please edit the sequence file according to the error message and resubmit it.
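
The number of sequence reads before the depletion can be obtained from the original FastQ file. The following is a minimal sketch, assuming the standard four-line FastQ record layout (sequence and quality lines are not wrapped). The helper name count_fastq_reads is hypothetical, and this is not necessarily how PATHDET itself counts reads.

```shell
# Hypothetical helper (not part of PATHDET): print the number of reads in a
# plain or gzipped FastQ file, assuming exactly four lines per read.
count_fastq_reads() {
  case "$1" in
    *.gz) gzip -dc -- "$1" ;;   # decompress gzipped input to stdout
    *)    cat -- "$1" ;;        # plain-text input
  esac | awk 'END { print int(NR / 4) }'   # total lines divided by four
}
```

For example, "count_fastq_reads samples.fastq.gz" prints the read count of the file before host-read removal, which is the value to enter on the Analysis page.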

3. Access to the results

The user can access the result page via the URL in the email from PATHDET or by entering the Run ID on the Result page. The Rapid report is usually displayed in the web browser within ten minutes, and the Full report is mostly returned within 30 minutes to several hours. PATHDET sends a notification email when all of the processes are completed. On the result page, Specimen Information is displayed as entered by the user on the Analysis page. The user can monitor the progress of the analyses and the number of reads that passed the filtering in the Progress Status section. The result page can be accessed for two weeks after the execution.

4. Rapid report

The Rapid report in the browser consists of two parts: the plug-in Krona chart and the ranking tables of microorganisms based on read abundance. The Krona chart is a multi-layer pie chart that visualizes the abundance of microorganisms in the biota in a hierarchical taxonomic manner. The pie-chart view can be operated through an interactive user interface. Three ranking tables are shown, one for each taxonomic rank (Family, Genus, and Species), and each table displays the three most abundant organisms or taxonomic groups. The abundance is represented by RPM, the number of reads per million sequencing reads, and RA, the relative abundance of microorganisms. For Long-read sequences, the analysis result of MASH Screen is added. Please check the following site for details of MASH Screen (https://genomeinformatics.github.io/mash-screen/). The analysis summary, the Krona chart, the ranking tables, and various reports are available as a zipped folder via the Download button.
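
The two abundance measures above can be illustrated with a short calculation. This is a minimal sketch under stated assumptions: the RPM denominator is taken to be the total number of sequencing reads, and RA is taken to be a taxon's percentage of all reads assigned to microorganisms. The manual does not spell out either denominator, and the function names rpm and relative_abundance are hypothetical.

```shell
# Hypothetical helpers (not part of PATHDET).

# rpm TAXON_READS TOTAL_READS
# Reads per million: taxon reads scaled to one million sequencing reads.
rpm() {
  awk -v n="$1" -v total="$2" \
    'BEGIN { if (total <= 0) exit 1; printf "%.2f\n", n / total * 1000000 }'
}

# relative_abundance TAXON_READS MICROBIAL_READS
# Assumed definition: taxon reads as a percentage of all microbial reads.
relative_abundance() {
  awk -v n="$1" -v microbial="$2" \
    'BEGIN { if (microbial <= 0) exit 1; printf "%.2f\n", n / microbial * 100 }'
}
```

For example, a taxon with 250 reads in a run of 2,000,000 sequencing reads gives "rpm 250 2000000", which prints 125.00. If 1,000 reads in total were assigned to microorganisms, "relative_abundance 250 1000" prints 25.00.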

5. Full report

The Full report, which is added to the Result page, comprises the pathogen prediction report as well as the Krona pie chart and the ranking tables. The abundance of sequence reads from all microorganisms (RPM) and a measure of species diversity in the microbiota (the Shannon Index; H) are useful metrics for examining pathogenicity. In short, pathogenic samples usually exhibit high RPM and low H, whereas benign samples display low RPM and high H. The plots illustrate a logistic regression analysis of RPM and H at each of the three taxonomic ranks (family, genus, and species). The decision boundary was created with in-house clinical samples. This is a pilot analysis, and for the final decision users should perform the same analysis with their own pathogenic and benign samples.

Additionally, the mapping metrics for the genome assembly of the representative pathogen candidate are displayed to help distinguish true positives from false positives. For a pathogenic sample, the coverage percentage should be much lower than 100 and the average depth of the mapped regions should be close to 1. If the coverage percentage is much lower than 100 and the average depth of the mapped regions is significantly greater than 1, the read abundance may be an artifact caused by abnormal amplification of particular regions.

The Krona chart and the ranking tables are updated with the results of NCBI BLASTN for the sequence reads that have no hits with Kraken2. The analysis summary, the Krona chart, and the ranking tables are available as a zipped folder via the Download button.
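
The Shannon Index and the mapping metrics described in this section can be reproduced on the user's own data. The following is a minimal sketch, not PATHDET's implementation. It assumes that H is computed with the natural logarithm from per-taxon read counts, that the coverage percentage is the share of genome positions covered by at least one read, and that the average depth is taken over covered positions only. The function names shannon_index and mapping_metrics are hypothetical. The second function expects three tab-separated columns (reference name, position, depth), which is the layout written by "samtools depth".

```shell
# Hypothetical helpers (not part of PATHDET).

# shannon_index: read one per-taxon read count per line from stdin and
# print H = -sum(p_i * ln(p_i)), where p_i is the taxon's share of all reads.
shannon_index() {
  awk '{ count[NR] = $1; total += $1 }
       END {
         h = 0
         for (i = 1; i <= NR; i++)
           if (count[i] > 0) { p = count[i] / total; h -= p * log(p) }
         printf "%.4f\n", h
       }'
}

# mapping_metrics GENOME_LENGTH: read "name<TAB>position<TAB>depth" lines
# from stdin and print the coverage percentage and the mean depth over the
# covered positions.
mapping_metrics() {
  awk -v len="$1" '
       $3 > 0 { covered++; bases += $3 }
       END {
         printf "coverage_pct=%.2f mean_depth=%.2f\n",
                covered / len * 100, (covered ? bases / covered : 0)
       }'
}
```

For example, four taxa with equal read counts give H = ln 4, so "printf '10\n10\n10\n10\n' | shannon_index" prints 1.3863, whereas a sample dominated by a single taxon gives an H close to 0. For the mapping metrics, a low coverage percentage combined with a mean depth near 1 matches the pattern described above for a pathogenic sample, while a mean depth well above 1 points to the amplification artifact.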

Appendix A. Removal of the sequence reads from the host genome

(1) Short-read Seq data

Materials

Example commands


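# Trim adapters and low-quality bases, discard reads shorter than 50 bases, and gzip the output
$ trim_galore --gzip --length 50 samples.fastq.gz

# Remove duplicate reads
$ cd-hit-dup -i samples_trimmed.fq.gz -o samples_trimmed.rdup.fastq

# Randomly subsample 1,000,000 reads
$ seqtk sample -s ${RANDOM} samples_trimmed.rdup.fastq 1000000 > samples_trimmed.rdup.1M.fastq

# Build the Bowtie2 index of the host genome (Bowtie2: http://bowtie-bio.sourceforge.net/bowtie2/index.shtml)
$ bowtie2-build -f genome.fa genome

# Align the reads to the host genome; reads that do not align are written to the --un file
$ bowtie2 --very-sensitive-local -x genome -U samples_trimmed.rdup.1M.fastq --un samples_trimmed.rdup.1M.genome.unmapped.fastq > trimmed.rdup.1M.genome.sam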

(2) Long-read Seq data

Materials

Software (the user can replace either of them with the compatible tools)

Example commands

https://nanoporetech.com/


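# Randomly subsample 1,000,000 reads
$ seqtk sample -s ${RANDOM} samples.fastq 1000000 > samples.1M.fastq

# Align the reads to the host genome with the Oxford Nanopore preset
$ minimap2 -ax map-ont genome.fa samples.1M.fastq > 1M.genome.sam

# Keep only the unmapped reads (SAM flag 4) and convert them to FastQ
$ samtools view -bS 1M.genome.sam | samtools sort | samtools view -f 4 | samtools fastq > samples_1M.genome.unmapped.fastq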

Appendix B. Browser compatibility

OS	Version	Chrome	Firefox	Microsoft Edge	Safari
Linux	CentOS 7	79.0.3945.79	71.0	n/a	n/a
MacOS	Mojave	79.0.3945.79	71.0	n/a	12.0.2
Windows	10	79.0.3945.79	71.0	44.18362.449.0	n/a

Appendix C. Software used in PATHDET

Software	Version	Category	URL
FastQValidator	v0.1.1	Common	https://genome.sph.umich.edu/wiki/FastQValidator
Kraken2	v2.0.7	Common	https://ccb.jhu.edu/software/kraken2/
BLAST+	v2.9.0+	Common	https://blast.ncbi.nlm.nih.gov/Blast.cgi
Krona	v2.7.1	Common	https://github.com/marbl/Krona/wiki
seqkit	v0.10.1	Common	https://bioinf.shenwei.me/seqkit/
taxonkit	v0.3.0	Common	https://bioinf.shenwei.me/taxonkit/
ncbi-genome-download	v0.2.9	Common	https://github.com/kblin/ncbi-genome-download
samtools	v1.9	Common	http://samtools.sourceforge.net/
FastQC	v0.11.8	Short-read Seq	http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
TrimGalore	v0.6.4	Short-read Seq	https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/
PRINSEQ lite	v0.20.4	Short-read Seq	http://prinseq.sourceforge.net/index.html
CD-HIT	v4.6.6	Short-read Seq	http://weizhongli-lab.org/cd-hit/
MultiQC	v1.4	Short-read Seq	https://multiqc.info/
Bowtie2	v2.3.4.3	Short-read Seq	http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
NanoPlot	v1.20.0	Long-read Seq	https://github.com/wdecoster/NanoPlot
minimap2	v2.17-r941	Long-read Seq	https://github.com/lh3/minimap2
MASH	v2.2	Long-read Seq	https://github.com/marbl/Mash

Version 2.0