You can manually add rRNA sequences to the reference genome FASTA file and then rebuild the reference indices. With the rRNAremove switch on, SAW mapping filters out reads that map to the rRNA sequences. The rRNA filtering function was added in SAW v6.0.
Rule for adding rRNA sequences: include the rRNA sequences to be filtered out in the FASTA file, and append '_rRNA' to the end of each of their sequence names (on the header line starting with ">") so that the program can identify them.
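As a minimal sketch of this naming rule, the Python function below appends `_rRNA` to the sequence name on every header line of a FASTA text that contains only rRNA sequences. It assumes that the sequence name is the first whitespace-delimited token after `>`. The function name and the sample record are hypothetical and are not part of SAW.

```python
import re


def tag_rrna_headers(fasta_text, suffix="_rRNA"):
    """Append `suffix` to the sequence name on every FASTA header line.

    Assumption: the sequence name is the first whitespace-delimited token
    after '>'; any description that follows it is kept unchanged.
    """
    tagged_lines = []
    for line in fasta_text.splitlines():
        match = re.match(r">(\S+)(.*)", line)
        # Skip sequence lines and names that already carry the suffix.
        if match and not match.group(1).endswith(suffix):
            line = ">" + match.group(1) + suffix + match.group(2)
        tagged_lines.append(line)
    return "\n".join(tagged_lines)


# Hypothetical rRNA record, for illustration only.
rrna_fasta = ">rRNA_seq1 hypothetical description\nACGTACGT\nTTGA"
tagged = tag_rrna_headers(rrna_fasta)
print(tagged)
```

The tagged records would then be appended to the reference genome FASTA before the reference indices are rebuilt.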
Add a line containing "rRNAremove" to the bcPara file before running SAW mapping. An example is as follows:
    in=<mask>
    in1=<lane_read_1.fq.gz>
    in2=<lane_read_2.fq.gz>
    barcodeReadsCount=<lane.barcodeReadsCount.txt>
    barcodeStart=0
    barcodeLen=25
    umiStart=25
    umiLen=10
    umiRead=1
    mismatch=1
    bcNum=<CIDCount>
    polyAnum=15
    mismatchInPolyA=2
    rRNAremove
If a query read maps to an rRNA sequence, the 3rd column (RNAME) of its alignment record shows the corresponding sequence name with the "_rRNA" suffix, as named in the reference genome, and the optional fields, which start at the 12th column, have the XF:i tag set to 3. The rRNA ratio is computed from the XF tag records during the subsequent annotation step.
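To make the record layout concrete, here is a small Python sketch that flags SAM-format text records carrying `XF:i:3` and reports the flagged fraction. It is not how SAW computes its rRNA ratio: the denominator (all alignment records) is an assumption, the function names are hypothetical, and the two sample records are invented for illustration.

```python
def is_rrna_record(sam_line):
    """Return True if a SAM text record carries the optional tag XF:i:3."""
    fields = sam_line.rstrip("\n").split("\t")
    if len(fields) < 11:  # fewer than the 11 mandatory SAM columns
        return False
    optional_fields = fields[11:]  # column 12 onward
    return "XF:i:3" in optional_fields


def rrna_fraction(sam_lines):
    """Fraction of alignment records tagged XF:i:3 (header lines skipped).

    Assumption: every alignment record counts toward the denominator.
    """
    records = [l for l in sam_lines if l.strip() and not l.startswith("@")]
    if not records:
        return 0.0
    return sum(is_rrna_record(l) for l in records) / len(records)


# Hypothetical records: the second one maps to a sequence named with "_rRNA".
sample_sam = [
    "@HD\tVN:1.6",
    "read1\t0\tchr1\t100\t255\t10M\t*\t0\t0\tACGTACGTAC\tFFFFFFFFFF\tNH:i:1",
    "read2\t0\tseq1_rRNA\t5\t255\t10M\t*\t0\t0\tACGTACGTAC\tFFFFFFFFFF\tNH:i:1\tXF:i:3",
]
print(rrna_fraction(sample_sam))
```

For a BAM file, the text records could be obtained first with a tool such as `samtools view`.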
The checkGTF tool in the SAW SIF image was developed for this purpose. Run it as follows:
    ## export SINGULARITY_BIND="/path/to/input/dir,/path/to/output/dir"
    ## -i: GTF/GFF input file to be checked
    ## -o: [optional] output the revised GTF/GFF file. Be aware that this may
    ##     remove some genes which do not meet the requirements and cannot be fixed.
    singularity exec SAW.sif checkGTF \
        -i <input.gtf/gff> \
        -o <output.gtf/gff>
Gene annotation records that the program cannot fix are removed from the output but are written to the log file. Please correct those records and run the program again.
The input annotation file may not conform to the format requirements. Please double-check it against the GTF/GFF file format requirements.
Another possibility is that the strand symbols are not in the right format. Strand values in annotation files should be either "+" (forward) or "-" (reverse); do not confuse "-" (hyphen) with "_" (underscore).
One possibility is that the GTF/GFF annotation file is not fully consistent with the genome FASTA file in chromosome naming. Please keep the chromosome names identical between the two files.
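Both of these checks can be scripted before running the pipeline. The following Python sketch lists annotation sequence names that do not occur in the FASTA headers and strand values other than "+" or "-". It assumes tab-separated annotation lines and takes the first whitespace-delimited token of each FASTA header as the sequence name. All function names and sample lines are hypothetical.

```python
def fasta_sequence_names(fasta_lines):
    """Collect sequence names (first token after '>') from FASTA lines."""
    names = set()
    for line in fasta_lines:
        tokens = line[1:].split() if line.startswith(">") else []
        if tokens:
            names.add(tokens[0])
    return names


def annotation_columns(annotation_lines):
    """Yield the tab-separated columns of each non-comment annotation line."""
    for line in annotation_lines:
        if line.strip() and not line.startswith("#"):
            yield line.rstrip("\n").split("\t")


def unknown_seqnames(annotation_lines, fasta_lines):
    """Sequence names used in the annotation but absent from the FASTA."""
    known = fasta_sequence_names(fasta_lines)
    used = {cols[0] for cols in annotation_columns(annotation_lines)}
    return sorted(used - known)


def invalid_strands(annotation_lines):
    """Strand values (7th column) other than '+' or '-'."""
    return sorted({cols[6] for cols in annotation_columns(annotation_lines)
                   if len(cols) > 6 and cols[6] not in ("+", "-")})


# Hypothetical inputs: "2" does not match "chr2", and "_" is not a strand.
fasta = [">chr1 assembled", "ACGT", ">chr2", "GGCC"]
gtf = [
    "# comment",
    'chr1\tsrc\tgene\t1\t100\t.\t+\t.\tgene_id "G1"; gene_name "A";',
    '2\tsrc\tgene\t5\t90\t.\t_\t.\tgene_id "G2"; gene_name "B";',
]
print(unknown_seqnames(gtf, fasta), invalid_strands(gtf))
```

The strand check here is deliberately strict and follows the "+"/"-" rule stated above.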
The gene_name, gene_id, transcript_name, and transcript_id attributes are missing in GTF format (only gene_name and gene_id are needed for gene records)
The ID, Name, and Parent attributes are missing in GFF format (Parent is not needed for gene records)
Multiple gene IDs are assigned to the same gene, as printed by log "Multiple gene IDs for gene xxx: id1, id2..."
Both forward and reverse strands are assigned to the same gene, as printed by log "Strand disagreement for gene xxx - skipping"
No transcript_id for transcript/exon, as printed by log "Record does not have transcriptID for gene xxx"
Multiple transcripts of a gene share the same transcript_id / ID, as printed by log "Transcript appears more than once for xxx"
start > end for some exons, as printed by log "Exon has 0 or negative extent for xxx"
There is overlap between exons of the same transcript, as printed by log "Exons overlap for xxx"
A gene has no transcript present, as printed by log "No transcript for gene xxx"
Note: multiple genes on the same contig that share the same gene_name are merged into one.
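Three of the conditions listed above (strand disagreement, exon extent, and exon overlap) can be illustrated with a short Python sketch. This is not the checkGTF implementation: it works on pre-parsed tuples, it reuses the log texts quoted above only as labels, and the choice of substituting the transcript ID for "xxx" in the exon messages is an assumption. The function name and the sample records are hypothetical.

```python
from collections import defaultdict


def check_exons(exons):
    """Report problems for (gene, transcript_id, strand, start, end) tuples.

    Coordinates are treated as 1-based and inclusive, so two exons overlap
    when the next start is <= the previous end.
    """
    problems = []
    strands_by_gene = defaultdict(set)
    spans_by_transcript = defaultdict(list)
    for gene, transcript_id, strand, start, end in exons:
        strands_by_gene[gene].add(strand)
        if start > end:
            problems.append(f"Exon has 0 or negative extent for {transcript_id}")
        else:
            spans_by_transcript[transcript_id].append((start, end))
    for gene, strands in strands_by_gene.items():
        if len(strands) > 1:
            problems.append(f"Strand disagreement for gene {gene} - skipping")
    for transcript_id, spans in spans_by_transcript.items():
        spans.sort()
        for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
            if next_start <= prev_end:
                problems.append(f"Exons overlap for {transcript_id}")
                break  # one report per transcript is enough
    return problems


# Hypothetical exon records, for illustration only.
exons = [
    ("geneA", "tx1", "+", 100, 200),
    ("geneA", "tx1", "+", 150, 300),  # overlaps the previous exon
    ("geneB", "tx2", "+", 500, 400),  # start > end
    ("geneC", "tx3", "+", 10, 20),
    ("geneC", "tx3", "-", 30, 40),    # strand differs within geneC
]
for problem in check_exons(exons):
    print(problem)
```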
1. File format:
GFF or GTF files. The supported file suffixes are gtf/gtf.gz, gff/gff.gz, and gff3/gff3.gz.
2. GTF file format:
Comment lines begin with #
The main body has 9 columns, separated by 'tab': seqname source feature start end score strand frame attributes
feature: the annotation must contain gene, transcript, and exon records
start/end: both must be less than 2^31 (2147483648)
strand: + for the forward strand and - for the reverse strand
attributes: the 9th column, in the format tag "value", with different attributes separated by a space; the following four are required:
gene_name value
gene_id value: a unique ID for the genomic locus of the transcript. 'gene_id' and the value are separated by a space. An empty value means that no gene is associated with the record.
transcript_name value
transcript_id value: a unique ID that identifies the transcript. An empty value means that there is no transcript.
At present, the number of valid genes must be less than 2^20, that is, 1048576
Keep the records in order: the transcripts and exons of the same gene must be listed in order
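The attribute rules for GTF records can be checked with a few lines of Python. The sketch below parses `tag "value"` pairs from the 9th column and reports which of the four required attributes are missing or empty, requiring only gene_name and gene_id for gene records. It assumes quoted values (unquoted values are ignored), and the function names and sample lines are hypothetical.

```python
import re

# One attribute: a tag, whitespace, then a double-quoted value.
ATTRIBUTE_RE = re.compile(r'(\S+)\s+"([^"]*)"')
GENE_REQUIRED = ("gene_name", "gene_id")
ALL_REQUIRED = ("gene_name", "gene_id", "transcript_name", "transcript_id")


def parse_gtf_attributes(column9):
    """Return a dict of tag -> value for quoted GTF attributes."""
    return dict(ATTRIBUTE_RE.findall(column9))


def missing_attributes(gtf_line):
    """List required attributes that are absent or empty in one GTF record."""
    cols = gtf_line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("a GTF record needs exactly 9 tab-separated columns")
    attrs = parse_gtf_attributes(cols[8])
    required = GENE_REQUIRED if cols[2] == "gene" else ALL_REQUIRED
    return [tag for tag in required if not attrs.get(tag)]


# Hypothetical records, for illustration only.
gene_line = 'chr1\tsrc\tgene\t1\t100\t.\t+\t.\tgene_id "G1"; gene_name "A";'
exon_line = ('chr1\tsrc\texon\t1\t50\t.\t+\t.\t'
             'gene_id "G1"; gene_name "A"; transcript_id "T1";')
print(missing_attributes(gene_line), missing_attributes(exon_line))
```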
3. GFF file format:
Comment lines begin with #
The main body has 9 columns, separated by 'tab': seqid source type start end score strand phase attributes
type: the annotation must contain gene, mRNA, and exon records
start/end: both must be less than 2^31 (2147483648)
strand: "+" stands for forward strands, "-" stands for reverse strands, "." indicates there is no need to specify positive or negative strands, "?" means unknown
attributes as the 9th column, whose format is tag=value, with different attributes separated by semicolon
ID, Name, and Parent must be provided (Parent is not required for gene records)
For the naming rules of the 3rd column, please carefully check the "dendrachy" (tree-shaped hierarchy): do not list 'child' rows without their 'parent' rows!
At present, the number of valid genes must be less than 2^20, that is, 1048576
Although strict ordering is not required, a 'gene' record must still appear before its corresponding mRNA records, and an mRNA record must appear before its corresponding exon records.
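The parent-before-child rule lends itself to a simple one-pass check. The Python sketch below parses the `tag=value` attributes of each GFF record and reports records whose Parent has not appeared as an ID on an earlier line. It is only an illustration of the rule, not the checkGTF logic; the function names and the sample records are hypothetical.

```python
def parse_gff_attributes(column9):
    """Return a dict of tag -> value from semicolon-separated tag=value pairs."""
    attrs = {}
    for item in column9.strip().split(";"):
        if "=" in item:
            tag, value = item.split("=", 1)
            attrs[tag.strip()] = value.strip()
    return attrs


def orphan_records(gff_lines):
    """IDs of records whose Parent was not defined on an earlier line."""
    seen_ids = set()
    orphans = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # malformed line; column count is a separate check
        attrs = parse_gff_attributes(cols[8])
        parent = attrs.get("Parent")
        # GFF3 allows several comma-separated parents for one record.
        if parent is not None and any(p not in seen_ids for p in parent.split(",")):
            orphans.append(attrs.get("ID", "<no ID>"))
        if "ID" in attrs:
            seen_ids.add(attrs["ID"])
    return orphans


# Hypothetical records: exon2 refers to mrna2, which is never defined.
gff = [
    "chr1\tsrc\tgene\t1\t100\t.\t+\t.\tID=gene1;Name=G1",
    "chr1\tsrc\tmRNA\t1\t100\t.\t+\t.\tID=mrna1;Name=G1-1;Parent=gene1",
    "chr1\tsrc\texon\t1\t50\t.\t+\t.\tID=exon1;Name=e1;Parent=mrna1",
    "chr1\tsrc\texon\t60\t90\t.\t+\t.\tID=exon2;Name=e2;Parent=mrna2",
]
print(orphan_records(gff))
```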
4. Others to note:
gene/gene_name must not contain special symbols (spaces, any type of bracket, quotation marks, <>, %, etc.); common symbols such as "_" and "." are allowed
gene/gene_name must be shorter than 64 characters
Although most GFF files in use are version 3 (GFF3), please name them with the .gff suffix; likewise, please name GTF files with the .gtf suffix
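The naming rules for gene/gene_name can be screened with a small Python helper. The sketch below rejects only the characters explicitly named above (space, brackets, quotation marks, <>, %) and names of 64 characters or more. Since the list ends with "etc.", the forbidden set is an assumption and is probably incomplete. The function name is hypothetical.

```python
# Characters explicitly listed as not allowed; assumed to be incomplete.
FORBIDDEN_CHARS = set(" ()[]{}<>\"'%")
MAX_NAME_LENGTH = 64  # names must be shorter than this


def gene_name_problems(name):
    """Return a list of rule violations for one gene/gene_name value."""
    problems = []
    bad_chars = sorted(set(name) & FORBIDDEN_CHARS)
    if bad_chars:
        problems.append(f"forbidden characters: {bad_chars}")
    if len(name) >= MAX_NAME_LENGTH:
        problems.append(f"length {len(name)} is not shorter than {MAX_NAME_LENGTH}")
    return problems


for candidate in ["Gene_1.2", "Gene (A)", "x" * 64]:
    print(candidate[:12], gene_name_problems(candidate))
```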
| ImageQC | ImageQC description | SAW | SAW description |
|---|---|---|---|
| <= 1.0.8 | File format: .json + .tar.gz. Features: ssDNA image QC | <= 4.1.0 | Support ssDNA image registration and tissue segmentation |
| >= 1.1.0 | File format: .ipr + .tar.gz. Features: ssDNA image QC | >= 5.1.3 | Support cell segmentation on ssDNA image; enable analysis of FASTQ data in Q4 format |
| ImageStudio | ImageStudio description | SAW | SAW description | StereoMap | StereoMap description |
|---|---|---|---|---|---|
| 1.0.0 | File format: .ipr + .tar.gz. Features: ssDNA image QC and manual processing | >= 5.5.0 | Support cell segmentation on ssDNA image; enable analysis of FASTQ data in Q4 format | 1.0.0 | Support displaying spatial expression heatmap, co-visualization of gene distribution, and ssDNA image. Manual registration enabled |
| 2.0.0 | File format: .ipr + .tar.gz. Features: Image QC for ssDNA, DAPI, mIF stains and manual processing | >= 6.0.0 | Support mIF image registration; allow for rRNA filtering | 2.0.0 | Display of individual mIF images and the ones stacked with different image layers |
| 2.1 | File format: .ipr + .tar.gz. Features: Image QC for ssDNA, DAPI, mIF stains and their manual image processing; fully manual procedure for QC-failed images | >= 6.1 <7.0 | Support analysis of the manually processed image outputs from ImageStudio and StereoMap | 2.1 <3.0 | Support reading multiple gef files at a time, which will be displayed by individual tabs |
| 2.2 | File format: .ipr + .tar.gz. Features: Image QC for ssDNA, DAPI, mIF stains and their manual image processing; fully manual procedure for QC-failed images | >= 6.1 <7.0 | Support analysis with the results of the fully manual procedure done by ImageStudio | 2.1 <3.0 | Support reading multiple gef files at a time, which will be displayed by individual tabs |
| 3.0 | File format: .ipr + .tar.gz. Features: Image QC for ssDNA, DAPI, H&E, mIF stains and their manual image processing; fully manual procedure for QC-failed images | 7.0 | Reconstructed 'count' goes online; 'register' reconstructed with new tissue segmentation algorithm and new 'V03' cell segmentation algorithm; support H&E whole process; support cell correction using EDM algorithm based on mask file of cell segmentation result | 3.0 | Support reading h5ad files with different binsize/resolution; /codedCellBlock information is written into cgef file after the SAW cellChunk module; render cellbin heatmap while loading cgef files |
| ImageStudio is integrated into StereoMap | File format: .tar.gz (includes .ipr). Features: ssDNA, DAPI, H&E, mIF image QC and manual processing; fully manual processing for QC-failed images | 8.0 | ● The standard spatial transcriptomic analysis workflow is now integrated into one command line ● Support one-stop computational workflow for FFPE samples (including microorganism analysis) ● Output zipped report file ● Output zipped package for visualization | 4.0 | ● Visualization: support reading with .stereo manifest file; compatible with reading data of old versions ● Manual processing: process image data in a step-by-step manner |
| | | 8.1 | ● Support Stereo-seq T FF V1.3 and Stereo-CITE T FF data analysis | 4.1 | ● Visualization: support the display of gene expression heatmaps for cellbin analysis; support linked display for the protein & marker genes ● Manual processing: new registration method available (feature point registration) ● The output file supports user-defined directories |
A low valid CID ratio can be investigated in three directions.
1. Sequencing quality. Low sequencing quality can affect alignment results. In addition to Q30, consider the presence of unknown base calls, which can be examined by reviewing the base distribution in the sequencing report. If the proportion of N bases is high, sequencing problems have likely affected the valid CID ratio. We recommend checking this first.
2. The chip mask h5 file does not correspond to the FASTQ datasets. Because the CIDs recorded in the mask do not match the CIDs obtained by sequencing the sample, the valid CID ratio is low. If this situation occurs alone, the ratio is usually extremely low. If the next situation is also involved, the ratio can vary considerably and requires case-by-case analysis.
3. (Cross-)contamination. This occurs when other samples are mixed in during the experiment, library preparation, or sequencing, which lowers the valid CID ratio. In this case, two chips may both map to the sequencing data of the same library. If a lot of mixing has occurred, a distinct tissue pattern should be visible. If the proportion is extremely small, in some cases only a few local bright spots appear.
Prior information, such as the cell sizes of the specific tissue type, can guide the choice. We recommend trying several bin sizes iteratively based on the results of downstream analyses, across bin20, bin50, bin100, and bin200. Bin20 is about the size of a regular mammalian cell, bin50 and bin100 are both frequently used in analysis, and bin200 is generally used for quick visualization of SAW outputs.
Given that a typical mammalian cell is approximately 10 μm in diameter, it is comparable to a bin20 square of 10 μm x 10 μm or to a bin14 square with a diagonal of about 10 μm.
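The arithmetic behind these figures can be reproduced in a few lines of Python. The sketch assumes a spot pitch of 0.5 μm, which is inferred here from the statement that bin20 measures 10 μm per side rather than taken from a chip specification. The helper names are hypothetical.

```python
import math

# Inferred from "bin20 is 10 um x 10 um": 10 um / 20 spots = 0.5 um per spot.
SPOT_PITCH_UM = 10 / 20


def bin_side_um(bin_size):
    """Side length in micrometers of a square bin of bin_size x bin_size spots."""
    return bin_size * SPOT_PITCH_UM


def bin_diagonal_um(bin_size):
    """Diagonal in micrometers of the same square bin."""
    return bin_side_um(bin_size) * math.sqrt(2)


for size in (14, 20, 50, 100, 200):
    print(f"bin{size}: side {bin_side_um(size):.1f} um, "
          f"diagonal {bin_diagonal_um(size):.1f} um")
```

Under this assumption, bin14 has a side of 7 μm and a diagonal of about 9.9 μm, which matches the 10 μm figure quoted for a typical mammalian cell.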