pairot.tl.DatasetMap
- class pairot.tl.DatasetMap(adata1, adata2)
Align cell annotations between a query and a reference dataset using annotation-informed optimal transport.
Examples
>>> import scanpy as sc
>>> import pairot as pr
>>>
>>> # 1. Preprocess input data
>>> adata_query, adata_ref = pr.pp.preprocess_adatas(
...     sc.read_h5ad("path/to/query.h5ad"),
...     sc.read_h5ad("path/to/reference.h5ad"),
...     n_top_genes=750,
...     cell_type_column_adata1="cell_type_column_query",
...     cell_type_column_adata2="cell_type_column_ref",
...     sample_column_adata1="sequencing_sample_column_query",
...     sample_column_adata2="sequencing_sample_column_ref",
... )
>>>
>>> # 2. Initialize pairOT model
>>> dataset_map = pr.tl.DatasetMap(adata_query, adata_ref)
>>> dataset_map.init_geom(batch_size=512, epsilon=0.05)
>>> dataset_map.init_problem(tau_a=1.0, tau_b=1.0)
>>>
>>> # 3. Fit pairOT model
>>> dataset_map.solve()
>>> mapping = dataset_map.compute_mapping()
>>> distance = dataset_map.compute_distance()
>>>
>>> # 4. Visualize results
>>> pr.pl.mapping(mapping)  # similarity matrix
>>> distance = distance.loc[
...     mapping.max(axis=1).sort_values(ascending=False).index.tolist(),
...     mapping.max().sort_values(ascending=False).index.tolist(),
... ]  # order cluster distance matrix the same way as similarity matrix
>>> pr.pl.distance(distance)  # cluster distance matrix
Attributes table
DEGs_label_distance | Return the differentially expressed genes (DEGs) from which the label distance matrix is computed. |
label_distance | Return the label distance matrix between cell-type clusters. |
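Methods table
compute_distance | Compute the distance between cell-type clusters based on the optimal transport mappings. |
compute_mapping | Compute the mapping between cell-type clusters based on the aggregated transport matrix. |
init_geom | Initialize the geometry of the optimal transport problem. |
init_problem | Initialize the optimal transport problem. |
preprocess_adatas | Preprocess the input AnnData objects for use with pairot.tl.DatasetMap. |
select_similar_clusters | Select the n_top most similar cell-type clusters in the reference data for each cell-type cluster in the query data based on the aggregated transport matrix. |
solve | Solve the underlying optimal transport problem. |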
Attributes
- DatasetMap.DEGs_label_distance
Return the differentially expressed genes (DEGs) from which the label distance matrix is computed.
- DatasetMap.label_distance
Return the label distance matrix between cell-type clusters.
Methods
- DatasetMap.compute_distance(n_samples=10000)
Compute the distance between cell-type clusters based on the optimal transport mappings.
- DatasetMap.compute_mapping(aggregation_method='mean')
Compute the mapping between cell-type clusters based on the aggregated transport matrix.
Aggregation is done by cluster/cell-type.
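To illustrate what aggregating a transport matrix by cluster can look like, here is a minimal sketch using NumPy and pandas. It is not pairOT's implementation: the helper name `aggregate_transport`, the row normalization, and the order of the two aggregation steps are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def aggregate_transport(transport, labels_query, labels_ref, aggregation_method="mean"):
    """Collapse a cell-by-cell transport matrix to a cluster-by-cluster matrix.

    Hypothetical helper for illustration only, not part of pairOT.
    """
    transport = np.asarray(transport, dtype=float)
    # Assumption: normalize each query cell's outgoing mass to 1.
    row_normalized = transport / transport.sum(axis=1, keepdims=True)
    per_cell = pd.DataFrame(row_normalized, index=labels_query, columns=labels_ref)
    # Sum the mass each query cell sends to every reference cluster.
    per_cell = per_cell.T.groupby(level=0).sum().T
    # Aggregate the query cells of each cluster (mean by default).
    return per_cell.groupby(level=0).agg(aggregation_method)


# Toy transport matrix: 3 query cells (rows) x 3 reference cells (columns).
transport = np.array(
    [
        [0.20, 0.05, 0.00],
        [0.15, 0.10, 0.00],
        [0.00, 0.05, 0.45],
    ]
)
mapping = aggregate_transport(
    transport,
    labels_query=["T", "T", "B"],
    labels_ref=["T cell", "T cell", "B cell"],
)
print(mapping)
```

With this normalization, every row of the aggregated matrix sums to 1, so each entry can be read as the share of a query cluster's mass that lands in a given reference cluster.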
- DatasetMap.init_geom(epsilon=0.05, batch_size=1024, lambda_feature=0.5, lambda_label=1.5, n_genes_ova=10, n_genes_ava=3, overlap_threshold_ava=0.3, overlap_n_genes_ava=10, adj_p_val_threshold=0.05, auroc_threshold=0.6, logfc_threshold=1.0, gene_filtering=True, q_norm=0.33, embedding_layer=None, **kwargs)
Initialize the geometry of the optimal transport problem.
Function calls the constructor of ott.geometry.pointcloud.PointCloud.
- Parameters:
epsilon (float (default: 0.05)) – Regularization strength of the optimal transport problem.
batch_size (int | None (default: 1024)) – Batch size used to solve the optimal transport problem in an online fashion. Larger batch sizes improve GPU utilization but consume more GPU memory.
lambda_feature (float (default: 0.5)) – Weight of the distance in gene/feature space in the cell-to-cell transport cost.
lambda_label (float (default: 1.5)) – Weight of the distance in label space in the cell-to-cell transport cost.
n_genes_ova (int (default: 10)) – Number of top differentially expressed (DE) genes in adata1 used to calculate the rank distance between DE genes for the label distance. This setting applies to the one-vs-all (OVA) DE test results.
n_genes_ava (int (default: 3)) – Number of top differentially expressed (DE) genes in adata1 used to calculate the rank distance between DE genes for the label distance. This setting applies to the all-vs-all (AVA) DE test results.
overlap_threshold_ava (float (default: 0.3)) – Minimum overlap of the top overlap_n_genes_ava DE genes required to add the all-vs-all (AVA) DE test results for the corresponding cell-type label combination.
overlap_n_genes_ava (int (default: 10)) – Number of top DE genes used to calculate the overlap of DE genes when deciding which all-vs-all (AVA) DE results to include.
adj_p_val_threshold (float (default: 0.05)) – Maximum adjusted p-value for a gene to be considered differentially expressed.
auroc_threshold (float (default: 0.6)) – Minimum AUROC score for a gene to be considered differentially expressed.
logfc_threshold (float (default: 1.0)) – Minimum log fold change for a gene to be considered differentially expressed.
gene_filtering (bool (default: True)) – Whether to filter DE gene results. If True, mitochondrial, ribosomal, lncRNA, TCR, and BCR genes are removed from the DE results.
q_norm (float (default: 0.33)) – Quantile used to normalize the label distance matrix.
embedding_layer (str | None (default: None)) – Name of the embedding layer in adata1.obsm and adata2.obsm used to calculate the distance between two cells. If this parameter is provided, the distance between two cells is calculated as the cosine distance in embedding space instead of the Spearman correlation in the full gene space.
kwargs – Keyword arguments passed to ott.geometry.pointcloud.PointCloud.
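The parameters above describe a cell-to-cell cost built from a feature-space distance and a label-space distance, weighted by lambda_feature and lambda_label. How pairOT combines the two internally is not spelled out here, so the following NumPy sketch is only one plausible reading: it assumes a plain weighted sum and assumes that q_norm rescales the label distance matrix by dividing by its q_norm quantile. The function name `combined_cost` and the integer cluster codes are hypothetical.

```python
import numpy as np


def combined_cost(feature_dist, label_dist, codes_query, codes_ref,
                  lambda_feature=0.5, lambda_label=1.5, q_norm=0.33):
    """Weighted sum of a per-cell feature distance and a per-cluster label distance.

    Illustrative assumption, not pairOT's actual cost function.
    """
    label_dist = np.asarray(label_dist, dtype=float)
    # Assumption: normalize the cluster-level label distances by their q_norm quantile.
    normalized = label_dist / np.quantile(label_dist, q_norm)
    # Expand the cluster-by-cluster matrix to cell-by-cell using each cell's cluster code.
    per_cell_label = normalized[np.ix_(codes_query, codes_ref)]
    return lambda_feature * np.asarray(feature_dist, dtype=float) + lambda_label * per_cell_label


# Two query cells (clusters 0 and 1) and three reference cells (clusters 0, 1, 1).
feature_dist = np.full((2, 3), 0.5)
label_dist = np.array([[0.2, 1.0], [0.8, 0.4]])
cost = combined_cost(feature_dist, label_dist, [0, 1], [0, 1, 1], q_norm=0.5)
print(cost)
```

Under these assumptions, a larger lambda_label makes it more expensive to move mass between cells whose cluster labels are far apart, even when their expression profiles are similar.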
- DatasetMap.init_problem(tau_a=1.0, tau_b=1.0, marginals_distribution='balanced', **kwargs)
Initialize the optimal transport problem.
Function calls the constructor of ott.problems.linear.linear_problem.LinearProblem.
- Parameters:
tau_a (float (default: 1.0)) – If < 1.0, defines how unbalanced the problem is on the first marginal.
tau_b (float (default: 1.0)) – If < 1.0, defines how unbalanced the problem is on the second marginal.
marginals_distribution (Literal['uniform', 'balanced'] (default: 'balanced')) – Whether the marginals should be uniform or balanced by cell-type frequency. Use uniform for uniform marginals, meaning each cell contributes the same mass to the marginal distribution. Use balanced for marginals balanced by cell-type frequency, meaning each cell type contributes the same mass to the marginal distribution. This parameter is ignored if the marginals a or b are supplied via **kwargs.
kwargs – Keyword arguments passed to ott.problems.linear.linear_problem.LinearProblem.
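The difference between the two marginals_distribution options can be made concrete with a small NumPy sketch. The helper `make_marginal` is hypothetical and only mirrors the description above: with uniform, every cell gets the same mass; with balanced, every cell type gets the same total mass, split evenly among its cells.

```python
import numpy as np


def make_marginal(cell_types, marginals_distribution="balanced"):
    """Build a marginal over cells (hypothetical helper, not pairOT code)."""
    cell_types = np.asarray(cell_types)
    if marginals_distribution == "uniform":
        # Every cell contributes the same mass.
        weights = np.ones(len(cell_types))
    elif marginals_distribution == "balanced":
        # Every cell type contributes the same mass, shared equally by its cells.
        _, inverse, counts = np.unique(cell_types, return_inverse=True, return_counts=True)
        weights = 1.0 / counts[inverse]
    else:
        raise ValueError(f"Unknown marginals_distribution: {marginals_distribution!r}")
    # Normalize so the marginal sums to 1.
    return weights / weights.sum()


labels = ["T", "T", "T", "B"]
uniform = make_marginal(labels, "uniform")
balanced = make_marginal(labels, "balanced")
print(uniform)
print(balanced)
```

In the toy example the single B cell carries half of the total mass under balanced marginals, which keeps rare cell types from being dominated by abundant ones.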
- static DatasetMap.preprocess_adatas(adata1, adata2, cell_type_column_adata1='cell_type_author', cell_type_column_adata2='cell_type_author', sample_column_adata1='sample_id', sample_column_adata2='sample_id', n_top_genes=500, filter_genes=False, n_samples_auroc=10000, n_samples_hvg_selection=100000)
Preprocess the input AnnData objects for use with pairot.tl.DatasetMap.
The function applies the following preprocessing steps:
Subset gene space to genes that are expressed in both datasets.
Calculate differentially expressed (DE) genes for each cluster.
Subset to highly variable genes, which are used to calculate the Spearman correlation between two cells.
- Parameters:
adata1 (AnnData) – Query data.
adata2 (AnnData) – Reference data.
cell_type_column_adata1 (str (default: 'cell_type_author')) – Name of the column in adata.obs that contains the cell type labels for adata1.
cell_type_column_adata2 (str (default: 'cell_type_author')) – Name of the column in adata.obs that contains the cell type labels for adata2.
sample_column_adata1 (str (default: 'sample_id')) – Name of the column in adata.obs that contains the sequencing sample ids/labels for adata1.
sample_column_adata2 (str (default: 'sample_id')) – Name of the column in adata.obs that contains the sequencing sample ids/labels for adata2.
n_top_genes (int (default: 500)) – Number of highly variable genes used to calculate the Spearman correlation between two cells.
filter_genes (bool (default: False)) – Whether to remove uninformative genes. If True, mitochondrial, ribosomal, lncRNA, TCR, and BCR genes are removed.
n_samples_auroc (int (default: 10000)) – Maximum number of samples to use for AUROC calculation. If None, all samples are used. Subsampling can drastically reduce computation time for large datasets.
n_samples_hvg_selection (int (default: 100000)) – Number of samples to use for highly variable gene selection. If None, all samples are used. Subsampling can drastically reduce memory usage for large datasets.
- Return type:
- Returns:
Preprocessed AnnData objects.
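The first preprocessing step, restricting both datasets to genes expressed in both, can be sketched on plain count matrices. This is a simplified illustration rather than the library's code: the helper `shared_expressed_genes` is hypothetical, "expressed" is assumed to mean a non-zero total count, and real AnnData objects would typically hold sparse matrices with gene names in var_names.

```python
import numpy as np


def shared_expressed_genes(counts1, genes1, counts2, genes2):
    """Return genes with non-zero total counts in both datasets, in genes1 order."""

    def expressed(counts, genes):
        # Assumption: a gene is "expressed" if its summed count over all cells is > 0.
        totals = np.asarray(counts).sum(axis=0)
        return {gene for gene, total in zip(genes, totals) if total > 0}

    keep = expressed(counts1, genes1) & expressed(counts2, genes2)
    return [gene for gene in genes1 if gene in keep]


# Toy data: rows are cells, columns are genes.
counts_query = np.array([[1, 0, 2], [0, 0, 3]])
genes_query = ["A", "B", "C"]
counts_ref = np.array([[0, 5], [1, 0]])
genes_ref = ["C", "A"]

shared = shared_expressed_genes(counts_query, genes_query, counts_ref, genes_ref)
# Reorder both matrices to the same shared gene space.
query_subset = counts_query[:, [genes_query.index(g) for g in shared]]
ref_subset = counts_ref[:, [genes_ref.index(g) for g in shared]]
print(shared, query_subset.shape, ref_subset.shape)
```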
- static DatasetMap.select_similar_clusters(mapping, distance, threshold_mapping=0.25, threshold_distance=1.0, n_top=None)
Select the n_top most similar cell-type clusters in the reference data for each cell-type cluster in the query data based on the aggregated transport matrix.
- Parameters:
mapping (DataFrame) – The cluster mapping matrix / aggregated transport matrix. The output of DatasetMap.compute_mapping().
distance (DataFrame) – The cluster distance matrix. The output of DatasetMap.compute_distance().
threshold_mapping (float | None (default: 0.25)) – The minimum transported mass between two cell-type clusters for a cluster to be suggested as a most similar cell-type cluster. If set to None, no filtering is applied.
threshold_distance (float | None (default: 1.0)) – The maximum distance between two cell-type clusters for a cluster to be suggested as a most similar cell-type cluster. If set to None, no filtering is applied.
n_top (int | None (default: None)) – Maximum number of most similar clusters to return. If set to None, all cell-type clusters that pass the thresholds are returned.
- Return type:
- Returns:
A dictionary mapping each cell-type cluster in the query data to a list of most similar cell type clusters in the reference data.
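The selection rule described by these parameters can be written out in a few lines of pandas. The function below is a hypothetical re-implementation for illustration, not the library's code; in particular, ranking the candidates by transported mass is an assumption.

```python
import pandas as pd


def select_similar(mapping, distance, threshold_mapping=0.25,
                   threshold_distance=1.0, n_top=None):
    """Pick reference clusters per query cluster (illustrative sketch only)."""
    selected = {}
    for query_cluster, masses in mapping.iterrows():
        # Assumption: rank candidates by transported mass, largest first.
        candidates = masses.sort_values(ascending=False)
        if threshold_mapping is not None:
            candidates = candidates[candidates >= threshold_mapping]
        if threshold_distance is not None:
            dists = distance.loc[query_cluster, candidates.index]
            candidates = candidates[(dists <= threshold_distance).to_numpy()]
        if n_top is not None:
            candidates = candidates.iloc[:n_top]
        selected[query_cluster] = candidates.index.tolist()
    return selected


ref_clusters = ["T cell", "NK cell", "B cell"]
mapping = pd.DataFrame(
    [[0.60, 0.30, 0.10], [0.05, 0.05, 0.90]], index=["T", "B"], columns=ref_clusters
)
distance = pd.DataFrame(
    [[0.2, 1.4, 2.0], [2.1, 1.9, 0.3]], index=["T", "B"], columns=ref_clusters
)

strict = select_similar(mapping, distance)
mass_only = select_similar(mapping, distance, threshold_distance=None)
print(strict)
print(mass_only)
```

In the toy data, the NK cell cluster receives enough mass from the T cluster to pass threshold_mapping, but it is dropped once the distance threshold is applied.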
- DatasetMap.solve(**kwargs)
Solve the underlying optimal transport problem.
Function uses ott.solvers.linear.sinkhorn.Sinkhorn to solve the optimal transport problem.
- Parameters:
kwargs – Keyword arguments passed to ott.solvers.linear.sinkhorn.Sinkhorn.
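For readers unfamiliar with the Sinkhorn algorithm, the following is a textbook NumPy version for a small, balanced, entropy-regularized problem. It is a teaching sketch and not the OTT solver: it builds the full kernel matrix in memory, does not handle the unbalanced case (tau_a or tau_b below 1), and skips the log-domain stabilization that production solvers typically rely on. The function name `sinkhorn` and its arguments are chosen for the example.

```python
import numpy as np


def sinkhorn(cost, a, b, epsilon=0.05, n_iters=500):
    """Entropic optimal transport via plain Sinkhorn iterations (illustrative)."""
    cost = np.asarray(cost, dtype=float)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Gibbs kernel: smaller epsilon gives a sharper, less diffuse transport plan.
    kernel = np.exp(-cost / epsilon)
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)
    return u[:, None] * kernel * v[None, :]


# Three source points, two target points.
cost = np.array([[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]])
a = np.full(3, 1 / 3)
b = np.full(2, 1 / 2)
plan = sinkhorn(cost, a, b)
print(plan.round(3))
```

The rows of the resulting plan sum to the source marginal a and the columns to the target marginal b; the epsilon argument plays the same role as the epsilon passed to init_geom.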