Select Top Features

Technically, any observation-by-feature matrix is acceptable for the method we developed, and users are encouraged to explore its usability with other types of data, even outside a biological context. However, single-cell transcriptomics data, as provided here, is usually high-dimensional and contains both technical and biological noise. After testing different approaches to reducing dimensionality and noise, we recommend that users select a number of top differentially expressed genes (DEGs) for each cluster (or group of clusters) that a vertex represents.

We implemented a fast Wilcoxon rank-sum test, which can be invoked with the function select_top_features. Each test compares one group against all other groups. Here, we choose the top DEGs for Osteoblast cells (shortened as "OS"), Reticular cells ("RE") and Chondrocytes ("CH"), as also shown in the previously mentioned publication. The number of top DEGs for each cluster is set to 30 (n_top=30), so at most 90 unique genes are expected to be returned.

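import CytoSimplex as csx
import scanpy as sc

# Load the example dataset; it is downloaded from backup_url if 'test.h5ad' is not found locally
adata = sc.read(filename='test.h5ad',
                backup_url="https://figshare.com/ndownloader/files/41034857")
# Map each vertex label to the cluster it represents
vertices = {"OS": "Osteoblast_1",
            "RE": "Reticular_1",
            "CH": "Chondrocyte_1"}
# Select up to 30 top DEGs for each vertex
selected_genes = csx.select_top_features(adata, cluster_var="cluster", vertices=vertices, n_top=30)
# Show the first 10 selected genes
selected_genes[:10]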
['Steap1',
 'Smim5',
 'H2-DMa',
 'Zcchc5',
 'Lims2',
 'Fam89a',
 'Ninj2',
 'Scin',
 'Pygl',
 'Slc2a5']
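
To make the one-versus-rest design concrete, here is a minimal pure-Python sketch of a rank-sum test for a single gene. It is not the CytoSimplex implementation: the function name rank_sum_one_vs_rest and the toy expression values are hypothetical, and the p-value uses a normal approximation with tie correction and no continuity correction, which may differ from what select_top_features computes.

```python
import math

def rank_sum_one_vs_rest(values, labels, group):
    """Wilcoxon rank-sum (Mann-Whitney U) test of `group` versus all other cells."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        # Find the run of tied values starting at sorted position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        mid_rank = (i + j) / 2 + 1  # tied values share their average rank
        for k in range(i, j + 1):
            ranks[order[k]] = mid_rank
        t = j - i + 1
        tie_term += t**3 - t
        i = j + 1

    in_group = [lab == group for lab in labels]
    n_in = sum(in_group)
    n_out = n - n_in
    rank_sum = sum(r for r, g in zip(ranks, in_group) if g)
    u = rank_sum - n_in * (n_in + 1) / 2
    auc = u / (n_in * n_out)

    # Two-sided p-value from the normal approximation with tie correction.
    var = n_in * n_out / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var <= 0:
        pval = float("nan")  # every value is tied, so the test is undefined
    else:
        z = (u - n_in * n_out / 2) / math.sqrt(var)
        pval = math.erfc(abs(z) / math.sqrt(2))
    return u, auc, pval

# Made-up expression values for one gene across seven cells.
labels = ["CH", "CH", "CH", "OS", "OS", "RE", "RE"]
expr = [5.0, 4.0, 6.0, 0.0, 1.0, 0.0, 2.0]

u, auc, pval = rank_sum_one_vs_rest(expr, labels, "CH")
print(u, auc, pval)

# A gene that is zero in every cell gives auc = 0.5 and an undefined p-value.
flat = rank_sum_one_vs_rest([0.0] * 7, labels, "CH")
print(flat)
```

Repeating this call for every gene and every group would give one block of results per group, which is the one-versus-rest layout described above.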

Alternatively, users can set return_stats=True to obtain a table of the full Wilcoxon rank-sum test statistics, which includes the results for all clusters rather than only the selected vertices.

stats = csx.select_top_features(adata, cluster_var="cluster", vertices=vertices, return_stats=True)
stats
              group    avgExpr      logFC   ustat       auc          pval          padj      pct_in    pct_out           feature
     0           CH   0.000000   0.000000  3010.5  0.500000           NaN           NaN    0.000000   0.000000               Rp1
     1           CH   0.000000   0.000000  3010.5  0.500000           NaN           NaN    0.000000   0.000000             Sox17
     2           CH  10.610097   7.637897  4547.5  0.755273  4.666025e-08  4.881361e-07   81.481481  21.524664            Mrpl15
     3           CH   6.077630   3.874176  3826.0  0.635443  9.406324e-04  3.637987e-03   48.148148  16.143498            Lypla1
     4           CH   0.000000  -0.057590  2997.0  0.497758  7.669166e-01  1.000000e+00    0.000000   0.448430           Gm37988
   ...          ...        ...        ...     ...       ...           ...           ...         ...        ...               ...
242911  Reticular_2   1.582342  -0.496039  1048.0  0.483172  7.931468e-01  1.000000e+00   11.111111  14.937759              PISD
242912  Reticular_2   0.000000  -2.912214   846.0  0.390041  1.202520e-01  1.000000e+00    0.000000  21.991701             DHRSX
242913  Reticular_2   0.000000  -0.154048  1071.0  0.493776  7.746696e-01  1.000000e+00    0.000000   1.244813    CAAA01147332.1
242914  Reticular_2  16.142743   2.744758  1350.0  0.622407  2.151089e-01  1.000000e+00  100.000000  83.402490    tdT-WPRE_trans
242915  Reticular_2   0.000000   0.000000  1084.5  0.500000           NaN           NaN    0.000000   0.000000  CreER-WPRE_trans

242916 rows × 10 columns
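
The ustat and auc columns in the printed rows appear to follow the standard identity AUC = U / (n_in × n_out), where n_in and n_out are the numbers of cells inside and outside the tested group. The short check below uses values copied from the table. The group sizes are an assumption inferred from the percentages (for example, pct_in 81.481481 matches 100 × 22/27 and pct_out 21.524664 matches 100 × 48/223); they are not stated anywhere in the text.

```python
# Values copied from the printed table (index 2 and index 242911).
rows = [
    {"group": "CH", "feature": "Mrpl15", "ustat": 4547.5, "auc": 0.755273},
    {"group": "Reticular_2", "feature": "PISD", "ustat": 1048.0, "auc": 0.483172},
]

# ASSUMPTION: (cells in group, cells outside group), inferred from pct_in and
# pct_out in the printed rows; these counts are not given in the text.
group_sizes = {"CH": (27, 223), "Reticular_2": (9, 241)}

checks = []
for row in rows:
    n_in, n_out = group_sizes[row["group"]]
    auc_from_u = row["ustat"] / (n_in * n_out)
    checks.append((auc_from_u, row["auc"]))
    print(f'{row["feature"]}: U / (n_in * n_out) = {auc_from_u:.6f}, '
          f'printed auc = {row["auc"]:.6f}')
```

Under the same assumption, a gene with no expression anywhere (such as Rp1) has U equal to half of n_in × n_out, which is why its auc is exactly 0.5.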

The returned table can be considered the concatenation of the tables from all tests, with the column group indicating which cluster was tested against all others. For example, the third row (index 2) is the result of the test for gene “Mrpl15” in group “CH”, which represents the original cluster “Chondrocyte_1”, against all other clusters.
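
A table of this shape also allows custom feature selection. The sketch below is one hypothetical way to do it and is not necessarily what select_top_features does internally: the helper top_features_per_group, the padj cutoff, the positive-logFC filter, and ranking by logFC are all assumptions, and the toy records are made up. It works on plain dictionaries; if stats is a pandas DataFrame, as the printed output suggests, stats.to_dict("records") should give rows in this form.

```python
def top_features_per_group(records, groups, n_top, padj_max=0.05):
    """Pick up to n_top features per group and return their ordered union.

    Assumed criteria (illustration only): keep features with positive logFC
    and padj below padj_max, then rank by logFC in descending order.
    Rows with a NaN padj fail the comparison and are dropped.
    """
    selected = []
    for group in groups:
        candidates = [
            r for r in records
            if r["group"] == group and r["logFC"] > 0 and r["padj"] < padj_max
        ]
        candidates.sort(key=lambda r: r["logFC"], reverse=True)
        for r in candidates[:n_top]:
            # A gene can rank high in more than one group; keep it once.
            if r["feature"] not in selected:
                selected.append(r["feature"])
    return selected

# Made-up toy statistics, not taken from the dataset above.
toy_stats = [
    {"group": "OS", "feature": "geneA", "logFC": 3.0, "padj": 0.001},
    {"group": "OS", "feature": "geneB", "logFC": 2.0, "padj": 0.010},
    {"group": "OS", "feature": "geneC", "logFC": 1.0, "padj": 0.900},
    {"group": "RE", "feature": "geneB", "logFC": 2.5, "padj": 0.002},
    {"group": "RE", "feature": "geneD", "logFC": 1.5, "padj": 0.020},
    {"group": "CH", "feature": "geneE", "logFC": 4.0, "padj": 0.001},
    {"group": "CH", "feature": "geneA", "logFC": -1.0, "padj": 0.001},
]

genes = top_features_per_group(toy_stats, ["OS", "RE", "CH"], n_top=2)
print(genes)
```

Here three groups with n_top=2 yield four unique genes rather than six, because one gene is shared between two groups and one group has only a single qualifying gene. The same effect explains why n_top=30 with three vertices returns at most, not exactly, 90 genes.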