PATHDET Manual
1. Data preparation
2. File uploading
3. Access to the results
4. Rapid report
5. Full report
Appendix A. Removal of the sequence reads from the host genome
Appendix B. Browser compatibility
Appendix C. Software used in PATHDET
1. Data preparation
- Prepare a FastQ file of short or long reads obtained by high-throughput sequencing of cell-free DNA. A compressed (gzipped) file is recommended. The following file extensions are permitted: .fastq.gz, .fq.gz, .fastq, and .fq. When analyzing personal samples, make sure that sequence reads harboring personally identifiable information have been discarded. PATHDET deletes all uploaded sequence reads immediately after the analyses are completed but is not responsible for any risks associated with the use and transfer of personal sequence data.
Please read the disclaimer and usage policy carefully. Appendix A lists example commands for removing personally identifiable sequence reads.
To return analysis results promptly, PATHDET limits the number of sequence reads as follows.
a) FastQ file from which the sequence reads from the host genome were discarded
・Short-read Seq: up to 50,000 reads (approx. 3-5 MB fastq.gz file)
・Long-read Seq: up to 1GB fastq.gz file
b) FastQ file that includes the host genome sequences
・Short-read Seq: up to 3,000,000 reads (approx. 200-300 MB fastq.gz file)
・Long-read Seq: up to 2.5GB fastq.gz file
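The short-read limits above can be checked locally before uploading. The following is a minimal sketch, assuming a well-formed FastQ file with four lines per record. The function names count_fastq_reads and check_short_read_limit and the mode labels "depleted" and "with_host" are hypothetical helpers for illustration and are not part of PATHDET.

```shell
# Hypothetical helper: count the reads in a FastQ file (plain or gzipped).
# Assumes the standard four-lines-per-record FastQ layout.
count_fastq_reads() {
    case "$1" in
        *.gz) gzip -dc -- "$1" ;;
        *)    cat -- "$1" ;;
    esac | awk 'END { print int(NR / 4) }'
}

# Hypothetical helper: compare a short-read file with the limits listed above.
# $1 = FastQ file
# $2 = "depleted"  (host reads removed, limit 50,000 reads) or
#      "with_host" (host reads included, limit 3,000,000 reads)
check_short_read_limit() {
    case "$2" in
        depleted)  limit=50000 ;;
        with_host) limit=3000000 ;;
        *) echo "unknown mode: $2" >&2; return 2 ;;
    esac
    n=$(count_fastq_reads "$1")
    if [ "$n" -le "$limit" ]; then
        echo "OK: $n reads (limit $limit)"
    else
        echo "TOO MANY: $n reads (limit $limit)"
        return 1
    fi
}
```

For example, check_short_read_limit samples.fastq.gz with_host reports whether the file stays within the 3,000,000-read limit. The long-read limits are stated as file sizes, so they are not covered by this sketch.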
2. File uploading
On the Analysis page, upload the file and fill in all fields of the sample information. When uploading a FastQ file from which the host genome reads have been removed, enter the number of sequence reads before the depletion. After the upload, the FastQ file format is validated, which takes several minutes. If the file passes the validation, the PATHDET pipeline is executed automatically. Each PATHDET run is assigned a Run ID consisting of a random character string, and a URL that includes the Run ID is created for viewing the results. The URL initially redirects to a waiting screen, which is automatically reloaded every 60 minutes until the full report is displayed. The URL is also sent to the email address entered. If the uploaded file exceeds the sequence read number limit or fails the validation, an error page is returned. Please edit the sequence file according to the error message and resubmit it.
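The rules applied by the PATHDET validation step are not described in this manual. As a local pre-check before uploading, the sketch below tests only the permitted file extensions listed in Section 1 and the generic four-line FastQ record layout (header starting with "@", separator starting with "+", equal sequence and quality lengths). The function names has_permitted_extension and validate_fastq are hypothetical, and passing this check does not guarantee that the file passes the PATHDET validation.

```shell
# Hypothetical helper: succeed only for the extensions listed in Section 1.
has_permitted_extension() {
    case "$1" in
        *.fastq.gz|*.fq.gz|*.fastq|*.fq) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: check the generic four-line FastQ record layout.
# Prints "valid", or the first problem found together with its line number.
validate_fastq() {
    case "$1" in
        *.gz) gzip -dc -- "$1" ;;
        *)    cat -- "$1" ;;
    esac | awk '
        NR % 4 == 1 && substr($0, 1, 1) != "@" { msg = "header does not start with @"; exit }
        NR % 4 == 2 { seqlen = length($0) }
        NR % 4 == 3 && substr($0, 1, 1) != "+" { msg = "separator does not start with +"; exit }
        NR % 4 == 0 && length($0) != seqlen    { msg = "sequence and quality lengths differ"; exit }
        END {
            if (msg == "" && NR == 0)     msg = "file is empty"
            if (msg == "" && NR % 4 != 0) msg = "last record is truncated"
            if (msg != "") { printf "invalid (line %d): %s\n", NR, msg; exit 1 }
            print "valid"
        }'
}
```

A typical call would be has_permitted_extension samples.fastq.gz && validate_fastq samples.fastq.gz.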
3. Access to the results
The user can access the result page via the URL in the email from PATHDET or by entering the Run ID on the Result page. The Rapid report is usually displayed in the web browser within ten minutes, and the Full report is typically returned within 30 minutes to several hours. PATHDET sends a notification email when all of the processes are completed. On the result page, Specimen Information is displayed as entered by the user on the Analysis page. The user can monitor the progress of the analyses and the number of reads that passed the filtering in the Progress Status section. The result page can be accessed for two weeks after the execution.
4. Rapid report
The Rapid report in the browser consists of two components: the plug-in Krona chart and the ranking tables of microorganisms based on read abundance. The Krona chart is a multi-layer pie chart that visualizes the abundance of microorganisms in the biota in a hierarchical taxonomic manner. The pie-chart view can be operated through an interactive user interface. Three ranking tables are shown, one for each taxonomic rank (Family, Genus, and Species), and each table displays the three most abundant organisms or taxonomic groups. The abundance is represented by RPM, the number of reads per million sequencing reads, and RA, the relative abundance of microorganisms. For Long-read sequences, the analysis result of MASH Screen is added. Please see the following site for details of MASH Screen (https://genomeinformatics.github.io/mash-screen/). The analysis summary, the Krona chart, the ranking tables, and various reports are available as a zipped folder by clicking the Download button.
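The two abundance measures can be illustrated with a short calculation. This is a minimal sketch under stated assumptions: RPM is computed against the total number of sequencing reads, and RA is assumed to be the percentage of a taxon's reads among all microbial reads. The manual does not give the exact formulas or denominators used by PATHDET, and the function names rpm and relative_abundance are hypothetical.

```shell
# rpm READS TOTAL_READS
# Reads per million sequencing reads: READS * 1,000,000 / TOTAL_READS.
rpm() {
    awk -v r="$1" -v t="$2" 'BEGIN {
        if (t <= 0) { print "total reads must be positive" > "/dev/stderr"; exit 1 }
        printf "%.2f\n", r * 1000000 / t
    }'
}

# relative_abundance READS MICROBIAL_READS
# Assumed definition: percentage of a taxon's reads among all microbial reads.
relative_abundance() {
    awk -v r="$1" -v m="$2" 'BEGIN {
        if (m <= 0) { print "microbial reads must be positive" > "/dev/stderr"; exit 1 }
        printf "%.2f\n", r * 100 / m
    }'
}
```

For instance, 120 reads assigned to a species in a run of 3,000,000 sequencing reads give rpm 120 3000000 = 40.00, and 250 reads out of 1,000 microbial reads give relative_abundance 250 1000 = 25.00.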
5. Full report
The Full report, which is added to the Result page, comprises the report of pathogen prediction as well as the Krona pie chart and the ranking tables. The abundance of sequence reads from all microorganisms (RPM) and a metric of species diversity in the microbiota (the Shannon Index; H) are useful metrics for examining pathogenicity. In short, pathogenic samples usually exhibit high RPM and low H, whereas benign samples display low RPM and high H. The plots illustrate the logistic regression analysis of RPM and H at each of the three taxonomic ranks (family, genus, and species). The decision boundary was created with in-house clinical samples. This is a pilot analysis, and for the final decision the user should perform the same analysis with the user’s own pathogenic and benign samples. Additionally, the mapping metrics for the genome assembly of the representative pathogen candidate are displayed to help distinguish true positives from false positives. For a pathogenic sample, the coverage percentage and the average depth of the mapped regions should be <<100 and close to 1, respectively. If the coverage percentage is <<100 and the average depth of the mapped regions is significantly more than 1, the read abundance may be an artifact caused by abnormal amplification of particular regions. The Krona chart and the ranking tables are updated by adding the results of NCBI BLASTN for the sequence reads that have no hits with Kraken2. The analysis summary, the Krona chart, and the ranking tables are available as a zipped folder by clicking the Download button.
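For users who want to reproduce these metrics on their own samples, the sketch below shows one way to compute them. It rests on several assumptions that the manual does not confirm: the Shannon Index is taken as H = -sum(p_i * ln(p_i)) with the natural logarithm, where p_i is the read fraction of taxon i; the coverage percentage is the share of reference positions with depth above zero; and the average depth is taken over those covered positions only. The function names shannon_index and mapping_metrics are hypothetical, and the second function expects the three-column per-position table (reference, position, depth) that samtools depth -a produces.

```shell
# shannon_index: read one read count per line (one line per taxon) from stdin
# and print H = -sum(p_i * ln(p_i)). Natural logarithm is an assumption.
shannon_index() {
    awk '$1 > 0 { n[++k] = $1; total += $1 }
         END {
             if (total == 0) { print "no reads" > "/dev/stderr"; exit 1 }
             for (i = 1; i <= k; i++) { p = n[i] / total; h -= p * log(p) }
             printf "%.4f\n", h
         }'
}

# mapping_metrics: read "reference<TAB>position<TAB>depth" lines from stdin
# (for example the output of: samtools depth -a mapped.sorted.bam) and print
# the coverage percentage and the mean depth over covered positions.
mapping_metrics() {
    awk '{ total++; if ($3 > 0) { covered++; sum += $3 } }
         END {
             if (total == 0) { print "no positions" > "/dev/stderr"; exit 1 }
             printf "coverage_pct=%.2f mean_depth_mapped=%.2f\n",
                    100 * covered / total, (covered ? sum / covered : 0)
         }'
}
```

Two taxa with equal read counts give H = ln 2, so printf '50\n50\n' | shannon_index prints 0.6931. A sample dominated by a single taxon moves H toward 0, which matches the low-H pattern described for pathogenic samples.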
Appendix A. Removal of the sequence reads from the host genome
(1) Short-read Seq data
Materials
・ A FastQ file of single reads or R1 of paired-end reads (samples.fastq.gz in the example commands)
・ A genome assembly (multi-FASTA file) of the host (e.g., reference genome of the human or particular ethnic group; genome.fa in the example commands)
Software (the user can replace any of them with compatible tools)
・ Trim Galore and cutadapt for read trimming
・ cd-hit-dup for removal of duplicate reads
・ bowtie2 for read mapping
・ seqtk for FastQ file processing
Example commands
Adapter removal, low-quality end trimming, and duplicate read removal
$ trim_galore --gzip --length 50 samples.fastq.gz
$ cd-hit-dup -i samples_trimmed.fq.gz -o samples_trimmed.rdup.fastq
Random sampling of 1 million reads (the variable $RANDOM can be replaced by any natural number)
$ seqtk sample -s ${RANDOM} samples_trimmed.rdup.fastq 1000000 > samples_trimmed.rdup.1M.fastq
Building an index for the genome assembly (alternatively, the user can download a pre-built bowtie2 index from the bowtie2 web page; http://bowtie-bio.sourceforge.net/bowtie2/index.shtml)
$ bowtie2-build -f genome.fa genome
Mapping the reads to the host genome and retrieving the unmapped reads
$ bowtie2 --very-sensitive-local -x genome -U samples_trimmed.rdup.1M.fastq --un samples_trimmed.rdup.1M.genome.unmapped.fastq > trimmed.rdup.1M.genome.sam
(2) Long-read Seq data
Materials
・ A FastQ file (samples.fastq.gz in the example commands)
・ A genome assembly (multi-FASTA file) of the host (e.g., reference genome of the human or particular ethnic group; genome.fa in the example commands)
Software (the user can replace any of them with compatible tools)
・ Minimap2 for read mapping
・ seqtk for FastQ file processing
・ samtools for SAM file processing
Example commands
For additional adapter removal and poor-quality end trimming, please visit the Oxford Nanopore Technologies website, https://nanoporetech.com/.
Random sampling of 1 million reads (the variable $RANDOM can be replaced by any natural number)
$ seqtk sample -s ${RANDOM} samples.fastq.gz 1000000 > samples.1M.fastq
Mapping the reads to the host genome and retrieving the unmapped reads
$ minimap2 -ax map-ont genome.fa samples.1M.fastq > 1M.genome.sam
$ samtools fastq -f 4 1M.genome.sam > samples_1M.genome.unmapped.fastq
The FastQ file of unmapped reads, samples_1M.genome.unmapped.fastq, is the file to submit to PATHDET. Additionally, enter 1000000 as the total number of reads, which equals the random sampling size of the reads from the original FastQ file.
Appendix B. Browser compatibility
OS | Version | Chrome | Firefox | Microsoft Edge | Safari |
Linux | CentOS 7 | 79.0.3945.79 | 71.0 | n/a | n/a |
MacOS | Mojave | 79.0.3945.79 | 71.0 | n/a | 12.0.2 |
Windows | 10 | 79.0.3945.79 | 71.0 | 44.18362.449.0 | n/a |