FAQ - STOmics

Q How to interpret GEF-format files?

A

Option 1: use C++ compiled geftools:

https://github.com/STOmics/geftools

Option 2: use the Python package gefpy (e.g. 0.6.1):

https://pypi.org/project/gefpy/
https://gefpy.readthedocs.io/en/latest/index.html
```
pip install gefpy==0.6.1
```

Option 3: use an installed SAW SIF image (e.g. v5.1.3):

https://hub.docker.com/repository/docker/stomics/saw

singularity exec SAW_v5.1.3.sif cellCut

Please use Singularity version 3.8 or later.

Bash
export HDF5_USE_FILE_LOCKING=FALSE
## gef2gem using geftools
geftools view -i <SN>.gef -o <SN>.gem -s <SN>
# -i input square bin GEF, e.g. SN.raw.gef or SN.gef
# -o output GEM
# -s SN

## gef2gem using gefpy
python
>>> from gefpy.bgef_reader_cy import BgefR
>>> bgef=BgefR(filepath='<SN>.gef',bin_size=200,n_thread=4)
>>> bgef.to_gem('<SN>.bin200.gem')

## gef2gem using SAW sif
## export SINGULARITY_BIND="/path/to/input/dir,/path/to/output/dir"
singularity exec SAW_v5.1.3.sif cellCut view -i <SN>.gef -o <SN>.gem -s <SN>

## cgef2cgem
geftools view -i <SN>.cellbin.gef -o <SN>.cellbin.gem -d <SN>.raw.gef -s <SN>
# -i input cellbin GEF
# -o output cellbin GEM
# -d input square bin GEF, e.g. SN.raw.gef or SN.gef
# -s SN

## gem2gef
geftools bgef -i <SN>.gem -o <SN>.gef -b 1,20,50 -O Transcriptomics
# -i input square bin GEM
# -o output square bin GEF
# -b bin sizes separated by commas, default: 1,10,20,50,100,200,500
# -O omics name
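
To make the role of the bin size concrete, here is a minimal sketch in plain Python of how a bin1 GEM table could be aggregated into larger square bins. It assumes a tab-separated GEM with the columns geneID, x, y and MIDCount and metadata lines starting with "#"; the column names, the helper name `bin_gem`, and the rule of snapping coordinates to the bin origin are assumptions for illustration, not a description of what geftools or gefpy do internally.

```python
import csv
import io
from collections import defaultdict


def bin_gem(gem_text, bin_size):
    """Aggregate a bin1 GEM table into square bins of side `bin_size`.

    Assumes tab-separated columns geneID, x, y, MIDCount (an assumption;
    check the header of your own file). Lines starting with '#' are
    treated as metadata and skipped.
    """
    rows = [ln for ln in gem_text.splitlines() if ln and not ln.startswith("#")]
    reader = csv.DictReader(io.StringIO("\n".join(rows)), delimiter="\t")
    counts = defaultdict(int)
    for rec in reader:
        # Snap each coordinate to the origin of the bin that contains it.
        bx = int(rec["x"]) // bin_size * bin_size
        by = int(rec["y"]) // bin_size * bin_size
        counts[(rec["geneID"], bx, by)] += int(rec["MIDCount"])
    return dict(counts)


# Made-up toy table, not real data.
example = (
    "# example metadata line\n"
    "geneID\tx\ty\tMIDCount\n"
    "GeneA\t0\t0\t1\n"
    "GeneA\t150\t30\t2\n"
    "GeneA\t210\t30\t4\n"
    "GeneB\t10\t399\t1\n"
)
binned = bin_gem(example, bin_size=200)
for key in sorted(binned):
    print(key, binned[key])
```

With bin_size=200, the two GeneA records at x=0 and x=150 fall into the same bin and their counts are summed, while the record at x=210 lands in the neighbouring bin.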

Q What are the purposes of the three Saturation curves?

A

The first curve indicates whether sequencing is saturated. If the fitted curve reaches or approaches a plateau, the sample is close to saturation. Depending on the goal of the project, additional sequencing runs may still be worthwhile. For example, a project that aims to recover very lowly expressed transcripts, or that involves precious samples, may call for higher sequencing saturation. The recommended saturation of 80% is an empirical threshold, not a rigid value.

The second and third curves are plotted from statistics computed at the bin level, and they reach their plateaus later than the first curve. The first plot serves as the main indicator of the potential benefit of additional sequencing.
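
As a rough illustration of what such a curve measures, the sketch below computes saturation as 1 - unique reads / total reads at increasing subsampling fractions. This is a commonly used definition and is assumed here; the exact formula and sampling scheme used in the SAW report are not stated in this FAQ, and the function names and toy reads are made up.

```python
import random


def sequencing_saturation(reads):
    """Return 1 - unique/total for a list of hashable read keys."""
    if not reads:
        return 0.0
    return 1.0 - len(set(reads)) / len(reads)


def saturation_curve(reads, fractions, seed=0):
    """Saturation at increasing subsampling fractions of a shuffled read list."""
    shuffled = list(reads)
    random.Random(seed).shuffle(shuffled)
    curve = []
    for frac in fractions:
        n = max(1, int(len(shuffled) * frac))
        curve.append((frac, sequencing_saturation(shuffled[:n])))
    return curve


# Toy data: each read is a (CID, gene, MID) key; repeated keys are duplicates.
toy_reads = (
    [("c1", "g1", "m1")] * 6
    + [("c1", "g1", "m2")] * 3
    + [("c2", "g2", "m1")]
)
full = sequencing_saturation(toy_reads)
curve = saturation_curve(toy_reads, [0.25, 0.5, 0.75, 1.0])
for frac, sat in curve:
    print(f"{frac:.2f} -> {sat:.2f}")
```

A curve that flattens as the fraction approaches 1.0 suggests that additional reads would mostly be duplicates of molecules already observed.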

Q What is the difference between the two SAW registration modules, register and rapidRegister?

A

The SAW register pipeline includes a cell segmentation procedure, whereas rapidRegister does not.

Q How do I deal with an abnormal gene expression visualization that does not show any tissue morphology?

A

Step 1: Check whether the "Valid CID Reads" ratio in the HTML report is lower than 10%. If so, check whether the FASTQ files correspond to the chip SN.

Step 2: If the "Valid CID Reads" ratio is low, at around 10% - 30%, there are two possible causes:

The reference genome does not meet the format requirements: if the ratio of multi-mapped reads is high and the ratio of uniquely mapped reads is extremely low, run SAW checkGTF on the GFF/GTF file to verify that its format is valid for running the pipelines.

Contamination: troubleshoot the wet lab workflow.
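
The two steps can be summarized as a small lookup, sketched below in Python. The function name and the wording of the returned messages are hypothetical; only the 10% and 10% - 30% ranges come from the steps above, and the behaviour above 30% is an assumption, since that case is not covered here.

```python
def diagnose_valid_cid_ratio(ratio):
    """Map a "Valid CID Reads" ratio (0.0 to 1.0) to the checks listed above."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    if ratio < 0.10:
        return "Check that the FASTQ files correspond to the chip SN."
    if ratio <= 0.30:
        return (
            "Check the GFF/GTF format with SAW checkGTF and "
            "troubleshoot the wet lab workflow for contamination."
        )
    # Not covered by the steps above (assumption: nothing to flag here).
    return "Ratio is above the ranges discussed in this answer."


for value in (0.05, 0.20, 0.60):
    print(value, "->", diagnose_valid_cid_ratio(value))
```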

Q What are the major filtering steps for the sequencing data in SAW pipelines?

A

CID filtering: filters out reads whose CID cannot be matched to any CID recorded in the Stereo-seq Chip T mask file.

MID filtering: filters out reads whose MID contains an N base, reads whose MID has poly-A content, and reads with at least one base whose quality score is lower than 10.

Reads filtering: to filter out reads containing adapters and DNB sequences.
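
The MID checks can be sketched as a simple predicate. The code below is an illustration under stated assumptions, not the SAW implementation: it assumes Phred+33 quality encoding, applies the quality check to the MID bases only, and uses a hypothetical `polya_run` length to decide what counts as poly-A content.

```python
def phred33(qual):
    """Decode a Phred+33 quality string into integer scores."""
    return [ord(ch) - 33 for ch in qual]


def mid_passes(mid_seq, mid_qual, min_q=10, polya_run=8):
    """Return True if an MID passes the three checks described above.

    `polya_run` (length of an A run treated as poly-A) is a hypothetical
    value chosen for illustration.
    """
    seq = mid_seq.upper()
    if "N" in seq:
        return False  # contains an ambiguous base
    if "A" * polya_run in seq:
        return False  # poly-A content
    if any(q < min_q for q in phred33(mid_qual)):
        return False  # at least one base below the quality cutoff
    return True


print(mid_passes("ACGTACGTAC", "IIIIIIIIII"))  # clean MID
print(mid_passes("ACGTNCGTAC", "IIIIIIIIII"))  # contains N
print(mid_passes("ACGTACGTAC", "IIII#IIIII"))  # one low-quality base
```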

Q What factors affect cell segmentation results? How can segmentation be optimized?

A

The outcome of cell segmentation depends on multiple factors, such as the quality of microscope imaging and the segmentation algorithm used. Overexposure and blurring can interfere with the automatic identification of cell areas and lead to poor segmentation output. In dense areas that are also blurred, especially where cells overlap, it is particularly difficult for the algorithm to segment accurately. Segmentation mistakes also arise where brightness is locally uneven over the tissue area, or where background impurities or hangover from cell movement were introduced during the experiment (see examples below).

From the perspective of the algorithm itself, the automatic segmentation model was trained on specific datasets with manually assigned labels. Hence, the algorithm may perform poorly on rare cell morphologies that are not covered by those datasets.

If the automatic segmentation does not work well, users can manually adjust the results using ImageStudio, a desktop image processing software, or redo the segmentation with Stereopy or other algorithms. If the identified cells need to be enlarged, the cell correction algorithm in Stereopy can be used to increase cell diameters and obtain larger cell coverage.

[Figures: blur; overexposure; abnormal shapes, like fibers or clumps; hangover; bubble; background impurity; local uneven brightness; cells of special forms]

Q How do I extract the regions of the expression matrix that correspond to IF images?

A

The immunofluorescence (IF) signal shows the location of the targeted proteins on the tissue slice. High fluorescence intensity indicates that a large number of cells in that region actively express the target proteins.

In the SAW workflow, the register module uses an automatic global thresholding algorithm to compute the gray-level threshold that binarizes the IF image into foreground and background regions. The foreground region of the IF image is then used as the mask file in the tissueCut module to obtain the gene expression matrix of the corresponding region.
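
To illustrate what automatic global thresholding does, the sketch below implements Otsu's method in plain Python and uses the result to binarize a toy list of gray values. Otsu's method is swapped in as a well-known example; this FAQ does not state which algorithm the register module uses, and the function names and pixel values are made up.

```python
def otsu_threshold(gray_values, levels=256):
    """Return the gray level t that maximizes between-class variance.

    Pixels with value <= t count as background, pixels > t as foreground.
    """
    hist = [0] * levels
    for v in gray_values:
        hist[v] += 1
    total = len(gray_values)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(levels):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue  # no background pixels yet
        w_fg = total - w_bg
        if w_fg == 0:
            break  # every pixel is already background
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(gray_values, threshold):
    """Return 1 for foreground (value > threshold) and 0 for background."""
    return [1 if v > threshold else 0 for v in gray_values]


# Toy "image": four dark background pixels and four bright signal pixels.
pixels = [10, 12, 11, 13, 200, 210, 205, 198]
threshold = otsu_threshold(pixels)
mask = binarize(pixels, threshold)
print(threshold, mask)
```

The resulting 0/1 mask plays the role of the foreground region: only positions marked 1 would be kept when extracting the matching part of the expression matrix.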

If the segmentation result based on the automatically calculated gray level is not satisfactory, users can use the ImageStudio "Tissue Segmentation" module to manually adjust the grayscale threshold of the IF image and obtain a new tissue segmentation result.


Q What's the principle behind IF image QC?

A

Our current quality check strategy for IF images requires a paired DAPI image as input. The assessment covers track line recognition in the DAPI image, evaluation of microscope stitching for the DAPI/IF images, and calibration between the DAPI and IF images based on tissue morphology.

The track lines detected in the DAPI image during the QC step provide a fiducial reference frame for automatic registration of the image with the chip. Microscope stitching evaluation determines whether there are obvious stitching errors in the microscope-stitched global image, which safeguards the quality of subsequent tissue segmentation and alignment. Calibration evaluation ensures that the IF images can be processed in the same way as the DAPI image in terms of stitching, rotation, scaling, translation, and flipping, so that the IF images can finally be registered with the expression matrix.

However, the IF images may have tissue morphology that differs from that of the DAPI image, which can cause calibration QC to fail. In such cases, the ImageStudio "Calibration" module can be used to adjust each DAPI/IF pair.


If the DAPI image fails QC for track line recognition and microscope stitching, the related IF images cannot be processed further automatically.

Q How are immunofluorescence (IF) images mapped to the gene expression matrix?

A

The alignment between the IF image and the spatial gene expression matrix is achieved indirectly, using the DAPI image as a reference frame.

DAPI and IF images of the same tissue slice are captured back to back by switching channels. Because the chip stays fixed during imaging, the DAPI and IF images share the same stitching, scale, and angle parameters relative to the spatial gene expression map. The information used for DAPI image stitching, rotation, scaling, translation, and transformation can therefore be applied to the IF layer as well, including alignment with the expression matrix.
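
The idea of reusing the DAPI parameters for the IF layer can be sketched as one transform function applied to both images. Everything in the example is hypothetical: the function name, the parameter values, and the order of operations (flip, scale, rotation, translation) are assumptions for illustration and do not describe the actual SAW registration code.

```python
import math


def make_transform(scale, angle_deg, tx, ty, flip_x=False):
    """Build a function that maps image (x, y) to matrix coordinates.

    Order used here (an assumption): optional horizontal flip, scale,
    rotation about the origin, then translation.
    """
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)

    def apply(point):
        x, y = point
        if flip_x:
            x = -x
        x, y = x * scale, y * scale
        xr = x * cos_t - y * sin_t
        yr = x * sin_t + y * cos_t
        return (xr + tx, yr + ty)

    return apply


# Parameters estimated once from the DAPI image (values are made up).
dapi_to_matrix = make_transform(scale=2.0, angle_deg=90.0, tx=100.0, ty=50.0)

# Because DAPI and IF share the same frame, the same function is reused
# for a point taken from the IF image.
if_point = (10.0, 0.0)
mapped = dapi_to_matrix(if_point)
print(tuple(round(c, 6) for c in mapped))
```

No separate registration is estimated for the IF image; it inherits the mapping found for the DAPI image.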

Q How to remove rRNA alignments during analysis? Can rRNA sequences that need to be removed be specified manually?

A

Yes. rRNA sequences can be added manually to the reference genome FASTA file, after which the reference indices must be rebuilt. With the rRNAremove switch on, SAW mapping filters out reads that map to rRNA sequences. The rRNA filtering function was added in SAW v6.0.

Rules for adding rRNA sequences: include the rRNA sequences to be filtered out in the FASTA file, and append '_rRNA' to the end of each sequence name (on the header line starting with ">") so that the program can identify them.

[Figure: example FASTA headers with the '_rRNA' suffix]
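
The renaming rule can be applied with a few lines of Python before the rRNA records are appended to the genome FASTA. This is a minimal sketch: the helper name and the example record are hypothetical, and it assumes that the sequence name is the first whitespace-delimited token after ">", with any description after it kept unchanged.

```python
def tag_rrna_headers(fasta_text, suffix="_rRNA"):
    """Append `suffix` to the sequence name on every FASTA header line.

    The sequence name is taken to be the first whitespace-delimited token
    after '>' (an assumption); any description after it is kept.
    """
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            name, sep, desc = line[1:].partition(" ")
            if not name.endswith(suffix):
                name += suffix  # skip names that are already tagged
            line = ">" + name + sep + desc
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical rRNA record, not a real sequence.
rrna = ">seq18S example description\nACGTACGT\nTTGGCCAA\n"
tagged = tag_rrna_headers(rrna)
print(tagged, end="")
```

After the tagged records are added to the reference FASTA, the reference indices still need to be rebuilt as described above.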

Add a line containing "rRNAremove" to the bcPara file before running SAW mapping. An example is as follows:

Plain Text
in=<mask>
in1=<lane_read_1.fq.gz>
in2=<lane_read_2.fq.gz>
barcodeReadsCount=<lane.barcodeReadsCount.txt>
barcodeStart=0
barcodeLen=25
umiStart=25
umiLen=10
umiRead=1
mismatch=1
bcNum=<CIDCount>
polyAnum=15
mismatchInPolyA=2
rRNAremove

If a query read is mapped to an rRNA sequence, the 3rd column of the alignment record shows the corresponding RNAME with the "_rRNA" suffix, matching the sequence name in the reference genome, and the optional field in the 12th column has the XF:i tag set to 3. The rRNA ratio is computed from the XF tag records during the subsequent annotation step.
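
For a quick check on SAM-formatted text, the sketch below flags records whose RNAME ends in "_rRNA" and whose optional fields contain XF:i:3, then reports their share of all alignment records. The function names and toy records are made up, and the choice of denominator is an assumption; it may differ from how the SAW annotation step computes the reported ratio.

```python
def is_rrna_alignment(sam_line):
    """True if RNAME ends with '_rRNA' and the XF:i tag equals 3."""
    fields = sam_line.rstrip("\n").split("\t")
    if len(fields) < 11:
        return False  # not a complete alignment record
    rname = fields[2]    # 3rd column: reference sequence name
    tags = fields[11:]   # optional fields start at the 12th column
    return rname.endswith("_rRNA") and "XF:i:3" in tags


def rrna_ratio(sam_lines):
    """Fraction of alignment records (non-header lines) flagged as rRNA."""
    records = [ln for ln in sam_lines if ln.strip() and not ln.startswith("@")]
    if not records:
        return 0.0
    return sum(is_rrna_alignment(ln) for ln in records) / len(records)


# Made-up records for illustration only.
toy_sam = [
    "@HD\tVN:1.6",
    "r1\t0\tseq18S_rRNA\t1\t255\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\tXF:i:3",
    "r2\t0\tchr1\t100\t255\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\tNH:i:1",
]
ratio = rrna_ratio(toy_sam)
print(ratio)
```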
