CytoSimplex.select_top_features
- CytoSimplex.select_top_features(x, cluster_var, vertices, n_top=30, processed=False, lfc_thresh=0.1, return_stats=False, feature_names=None)
Select the top features for each vertex based on the Wilcoxon rank-sum test.
- Parameters
x (Union[AnnData, ndarray, csr_matrix]) – The gene expression matrix, where each row is a cell and each column is a gene. Full-size raw counts are recommended (i.e. not log-transformed or normalized, and not restricted to highly variable genes). When given an anndata.AnnData, x.X will be used.
cluster_var (Union[str, list, Series]) – The cluster assignment of each single cell. If x is an anndata.AnnData, cluster_var can be a str that specifies the name of the cluster variable in x.obs. A list or pandas.Series is accepted in all cases, and its length must equal x.shape[0].
vertices (Union[list, dict]) – The terminal specifications. The number of vertices (n) should be chosen by the user according to the type of simplex to be visualized downstream, e.g. 3 elements for a 2-simplex (ternary simplex / triangle). Acceptable inputs are a list or a dict; in the example below, a dict maps each vertex name to a list of cluster names.
n_top (int) – The number of top features to select for each vertex.
processed (bool) – Whether the input matrix is already processed. If False, the input matrix will be log-transformed and row-normalized. If True, the input matrix will be used directly to calculate the rank-sum statistics, and logFC will be calculated assuming that the input matrix is log-transformed.
lfc_thresh (float) – The log fold change threshold for selecting up-regulated genes.
return_stats (bool) – Whether to return the full statistics of all clusters and all features, instead of only the selected top features (the default).
feature_names (Optional[str]) – The names of the features in the matrix. If None, the feature names will be the index of the matrix.
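As a rough illustration of how cluster_var and a dict-style vertices argument relate, the sketch below maps per-cell cluster labels to vertex names. The helper assign_vertex_groups is hypothetical and not part of CytoSimplex, and how the library performs this grouping internally is an assumption here; the cluster name "Stem_1" is made up to show a cell that belongs to no vertex.

```python
from typing import Dict, List, Optional


def assign_vertex_groups(
    cluster_var: List[str],
    vertices: Dict[str, List[str]],
) -> List[Optional[str]]:
    """Map each cell's cluster label to a vertex name, or None if unassigned.

    Hypothetical helper for illustration only; not a CytoSimplex function.
    """
    # Invert the vertices dict: cluster name -> vertex name.
    cluster_to_vertex: Dict[str, str] = {}
    for vertex, clusters in vertices.items():
        for cluster in clusters:
            if cluster in cluster_to_vertex:
                raise ValueError(f"Cluster {cluster!r} is listed under two vertices")
            cluster_to_vertex[cluster] = vertex
    # Cells whose cluster is not listed under any vertex get None.
    return [cluster_to_vertex.get(label) for label in cluster_var]


vertices = {
    "OS": ["Osteoblast_1", "Osteoblast_2", "Osteoblast_3"],
    "RE": ["Reticular_1", "Reticular_2"],
    "CH": ["Chondrocyte_1", "Chondrocyte_2", "Chondrocyte_3"],
}
# One label per cell; its length must equal the number of rows of x.
cluster_var = ["Osteoblast_2", "Reticular_1", "Chondrocyte_3", "Stem_1"]
groups = assign_vertex_groups(cluster_var, vertices)
print(groups)  # ['OS', 'RE', 'CH', None]
```

With three keys in vertices, as here, the downstream plot would be a ternary simplex.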
- Return type
list or pandas.DataFrame, depending on return_stats
- Returns
selected (list, when return_stats=False) – The list of selected features. Maximum length is n_top * len(vertices) when enough features can pass the threshold.
stats (pandas.DataFrame, when return_stats=True) – The statistics of the Wilcoxon test, with n_groups * n_features rows. Columns are 'group', 'avgExpr', 'logFC', 'ustat', 'auc', 'pval', 'padj', 'pct_in', 'pct_out' and 'feature'.
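The relationship between the two return values can be sketched as follows: starting from per-group statistics, keep features whose logFC exceeds lfc_thresh and take up to n_top per vertex. This is only an illustration under stated assumptions. The function pick_top_features is hypothetical, the ranking rule (ascending padj, then descending logFC) is an assumption rather than the library's documented behavior, and the rows are made-up toy values.

```python
import math
from typing import Dict, List, Union

Row = Dict[str, Union[str, float]]


def pick_top_features(
    stats: List[Row],
    groups: List[str],
    n_top: int = 30,
    lfc_thresh: float = 0.1,
) -> List[str]:
    """Select up to n_top up-regulated features per group (hypothetical sketch)."""
    selected: List[str] = []
    for group in groups:
        # Keep rows of this group that pass the logFC threshold
        # and have a defined adjusted p-value.
        candidates = [
            row for row in stats
            if row["group"] == group
            and row["logFC"] > lfc_thresh
            and not math.isnan(row["padj"])
        ]
        # Assumed ranking: smallest padj first, ties broken by larger logFC.
        candidates.sort(key=lambda row: (row["padj"], -row["logFC"]))
        for row in candidates[:n_top]:
            if row["feature"] not in selected:
                selected.append(row["feature"])
    return selected


# Made-up toy statistics for two groups.
toy_stats = [
    {"group": "A", "feature": "g1", "logFC": 2.0, "padj": 0.001},
    {"group": "A", "feature": "g2", "logFC": 0.05, "padj": 0.0001},  # below threshold
    {"group": "A", "feature": "g3", "logFC": 1.0, "padj": 0.01},
    {"group": "A", "feature": "g4", "logFC": 0.5, "padj": 0.02},
    {"group": "B", "feature": "g1", "logFC": -1.5, "padj": 0.001},   # down-regulated
    {"group": "B", "feature": "g4", "logFC": 3.0, "padj": 0.0005},
    {"group": "B", "feature": "g5", "logFC": 0.0, "padj": float("nan")},
]
top = pick_top_features(toy_stats, ["A", "B"], n_top=2, lfc_thresh=0.1)
print(top)  # ['g1', 'g3', 'g4']
```

Group B contributes only one feature because only one passes the threshold, which is why the result can be shorter than n_top * len(vertices).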
Examples
>>> import CytoSimplex as csx
>>> import scanpy as sc
>>> adata = sc.read(
...     filename="test.h5ad",
...     backup_url="https://figshare.com/ndownloader/files/41034857"
... )
>>> vertices = {'OS': ["Osteoblast_1", "Osteoblast_2", "Osteoblast_3"],
...             'RE': ['Reticular_1', 'Reticular_2'],
...             'CH': ['Chondrocyte_1', 'Chondrocyte_2', 'Chondrocyte_3']}
>>> gene = csx.select_top_features(adata, "cluster", vertices)
>>> gene[:8]
['Nrk', 'Eps8l2', 'Mfi2', 'Fam101a', 'Scin', 'Sox5', 'Fbln7', 'Edil3']
>>> stats = csx.select_top_features(adata, "cluster", vertices, return_stats=True)
>>> stats
       group    avgExpr     logFC   ustat     auc          pval          padj  pct_in  pct_out           feature
0         CH   0.000000  0.000000  5000.0  0.5000           NaN           NaN     0.0      0.0               Rp1
1         CH   0.000000  0.000000  5000.0  0.5000           NaN           NaN     0.0      0.0             Sox17
2         CH   8.413918  5.771032  6948.0  0.6948  7.674938e-08  9.017050e-07    64.0     19.0            Mrpl15
3         CH   4.627888  2.507528  5894.0  0.5894  4.888534e-03  1.529972e-02    36.0     15.5            Lypla1
4         CH   0.256851  0.256851  5100.0  0.5100  4.999579e-02  1.110755e-01     2.0      0.0           Gm37988
...      ...        ...       ...     ...     ...           ...           ...     ...      ...               ...
141696    RE   2.269271  0.260934  5105.0  0.5105  7.154050e-01  1.000000e+00    16.0     14.5              PISD
141697    RE   2.593425 -0.267436  4968.0  0.4968  9.268655e-01  1.000000e+00    18.0     22.0             DHRSX
141698    RE   0.000000 -0.185628  4925.0  0.4925  3.973749e-01  9.619204e-01     0.0      1.5    CAAA01147332.1
141699    RE  16.347951  3.563943  7373.0  0.7373  2.048452e-07  1.117704e-05   100.0     80.0    tdT-WPRE_trans
141700    RE   0.000000  0.000000  5000.0  0.5000           NaN           NaN     0.0      0.0  CreER-WPRE_trans
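In the output above, the auc column equals ustat divided by 10000 in every row shown (for example 6948.0 / 10000 = 0.6948), which is consistent with the standard relation AUC = U / (n_in * n_out) for the rank-sum test. The sketch below computes U directly from its pairwise definition in plain Python to show that relation. The function names are hypothetical, the input values are made up, and this is not the library's implementation, which would use a faster rank-based computation.

```python
from typing import Sequence


def mann_whitney_u(in_group: Sequence[float], out_group: Sequence[float]) -> float:
    """U statistic of in_group versus out_group; ties count as one half."""
    u = 0.0
    for a in in_group:
        for b in out_group:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


def auc_from_u(u: float, n_in: int, n_out: int) -> float:
    """AUC is U scaled by the number of (in, out) pairs."""
    return u / (n_in * n_out)


# Made-up expression values for one feature.
in_group = [0.0, 2.0, 3.0, 5.0]
out_group = [0.0, 0.0, 1.0, 4.0]
u = mann_whitney_u(in_group, out_group)
auc = auc_from_u(u, len(in_group), len(out_group))
print(u, auc)  # 11.0 0.6875

# A feature that is zero in every cell gives only ties, hence AUC 0.5.
u_zero = mann_whitney_u([0.0, 0.0], [0.0, 0.0, 0.0])
print(auc_from_u(u_zero, 2, 3))  # 0.5
```

The all-zero case mirrors rows such as Rp1 in the output, where ustat is 5000.0 and auc is 0.5000.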