Select Top Features

Technically, any form of observation-by-feature matrix is acceptable for the method we developed, and users are encouraged to explore how well it works with other types of data, even outside a biological context. However, single-cell transcriptomics data, as provided here, is usually high-dimensional and contains both technical and biological noise. After testing different approaches to reducing dimensionality and noise, we recommend that users select a number of top differentially expressed genes (DEGs) for each cluster (or group of clusters) that a vertex represents.

We implemented a fast Wilcoxon rank-sum test, which can be invoked with the function select_top_features. Each test compares one group against all other groups combined. Here, we choose the top DEGs for osteoblasts (shortened as "OS"), reticular cells ("RE") and chondrocytes ("CH"), as also shown in the previously mentioned publication. The number of top DEGs for each cluster is set to 30 (n_top=30), so at most 90 unique genes are expected to be returned.

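import CytoSimplex as csx
import scanpy as sc

# Load the example dataset; it is downloaded from backup_url if test.h5ad is not found locally
adata = sc.read(filename='test.h5ad',
                backup_url="https://figshare.com/ndownloader/files/41034857")
# Map each vertex name to the cluster it represents
vertices = {"OS": "Osteoblast_1",
            "RE": "Reticular_1",
            "CH": "Chondrocyte_1"}
# Select the top 30 DEGs for each vertex
selected_genes = csx.select_top_features(adata, cluster_var="cluster", vertices=vertices, n_top=30)
# Preview the first 10 selected genes
selected_genes[:10]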

['Steap1',
 'Smim5',
 'H2-DMa',
 'Zcchc5',
 'Lims2',
 'Fam89a',
 'Ninj2',
 'Scin',
 'Pygl',
 'Slc2a5']
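To make the one-group-versus-all-others idea concrete, the following is a minimal sketch on synthetic data. It is not the implementation used by select_top_features. The helper one_vs_rest_top, the toy matrix, and the choice to rank features by AUC are all assumptions made for illustration, and it relies on SciPy rather than CytoSimplex. It also shows why the returned list can hold at most n_top genes per vertex: the per-group picks are merged and de-duplicated.

```python
import numpy as np
from scipy.stats import mannwhitneyu, rankdata


def one_vs_rest_top(X, labels, features, group, n_top=1):
    """Hypothetical helper: rank features for `group` against all other rows."""
    in_group = np.asarray(labels) == group
    n_in, n_out = in_group.sum(), (~in_group).sum()
    scored = []
    for j, name in enumerate(features):
        # AUC = U / (n_in * n_out), with U computed from the rank sum of the group.
        ranks = rankdata(X[:, j])
        u = ranks[in_group].sum() - n_in * (n_in + 1) / 2
        auc = u / (n_in * n_out)
        # Two-sided Wilcoxon rank-sum (Mann-Whitney U) p-value.
        pval = mannwhitneyu(X[in_group, j], X[~in_group, j],
                            alternative="two-sided").pvalue
        scored.append((name, auc, pval))
    # Keep the features most enriched in the group (highest AUC first).
    scored.sort(key=lambda t: t[1], reverse=True)
    return scored[:n_top]


# Toy data: 3 groups of 20 observations, 5 features, one marker per group.
rng = np.random.default_rng(0)
labels = np.repeat(["A", "B", "C"], 20)
features = ["g0", "g1", "g2", "g3", "g4"]
X = rng.normal(size=(60, 5))
for k, g in enumerate(["A", "B", "C"]):
    X[labels == g, k] += 5.0  # shift the marker feature upward in its group

top = {g: one_vs_rest_top(X, labels, features, g, n_top=1) for g in "ABC"}
# Union of the per-group picks, de-duplicated while preserving order.
selected = list(dict.fromkeys(name for g in "ABC" for name, _, _ in top[g]))
print(selected)
```

With three groups and n_top=1, the merged list can contain at most three features, in the same way that three vertices with n_top=30 yield at most 90 genes.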

Alternatively, users can call select_top_features with return_stats=True to obtain a table of the full Wilcoxon rank-sum test statistics, which includes the results for all clusters rather than only the selected vertices.

stats = csx.select_top_features(adata, cluster_var="cluster", vertices=vertices, return_stats=True)
stats

	group	avgExpr	logFC	ustat	auc	pval	padj	pct_in	pct_out	feature
0	CH	0.000000	0.000000	3010.5	0.500000	NaN	NaN	0.000000	0.000000	Rp1
1	CH	0.000000	0.000000	3010.5	0.500000	NaN	NaN	0.000000	0.000000	Sox17
2	CH	10.610097	7.637897	4547.5	0.755273	4.666025e-08	4.881361e-07	81.481481	21.524664	Mrpl15
3	CH	6.077630	3.874176	3826.0	0.635443	9.406324e-04	3.637987e-03	48.148148	16.143498	Lypla1
4	CH	0.000000	-0.057590	2997.0	0.497758	7.669166e-01	1.000000e+00	0.000000	0.448430	Gm37988
...	...	...	...	...	...	...	...	...	...	...
242911	Reticular_2	1.582342	-0.496039	1048.0	0.483172	7.931468e-01	1.000000e+00	11.111111	14.937759	PISD
242912	Reticular_2	0.000000	-2.912214	846.0	0.390041	1.202520e-01	1.000000e+00	0.000000	21.991701	DHRSX
242913	Reticular_2	0.000000	-0.154048	1071.0	0.493776	7.746696e-01	1.000000e+00	0.000000	1.244813	CAAA01147332.1
242914	Reticular_2	16.142743	2.744758	1350.0	0.622407	2.151089e-01	1.000000e+00	100.000000	83.402490	tdT-WPRE_trans
242915	Reticular_2	0.000000	0.000000	1084.5	0.500000	NaN	NaN	0.000000	0.000000	CreER-WPRE_trans

242916 rows × 10 columns

The returned table can be considered the concatenation of the tables from all tests, with the column group indicating which cluster was tested against all others. For example, the third row (index 2) is the result of the test for gene “Mrpl15” in group “CH”, which represents the original cluster “Chondrocyte_1”, against all other clusters.
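Because the statistics come back as one long table, a custom selection can be derived from it with ordinary pandas filtering. The sketch below rebuilds the five “CH” rows printed above (a subset of columns only) so that it runs on its own. The thresholds on padj and logFC are assumptions chosen for illustration and are not necessarily the criteria that select_top_features applies internally.

```python
import numpy as np
import pandas as pd

# Rows for group "CH" copied from the table shown above (subset of columns).
stats = pd.DataFrame({
    "group": ["CH"] * 5,
    "logFC": [0.0, 0.0, 7.637897, 3.874176, -0.057590],
    "padj": [np.nan, np.nan, 4.881361e-07, 3.637987e-03, 1.0],
    "feature": ["Rp1", "Sox17", "Mrpl15", "Lypla1", "Gm37988"],
})

# Assumed thresholds, chosen for illustration only.
max_padj, min_logfc = 0.05, 0.0

keep = (
    (stats["group"] == "CH")
    & (stats["padj"] < max_padj)
    & (stats["logFC"] > min_logfc)
)
# Rows with NaN padj drop out, because comparisons against NaN evaluate to False.
ch_genes = (
    stats.loc[keep]
    .sort_values("logFC", ascending=False)["feature"]
    .tolist()
)
print(ch_genes)  # ['Mrpl15', 'Lypla1']
```

On the real output, the same filter would be applied to the stats object returned with return_stats=True, once per group of interest.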