pairot.tl.DatasetMap#

class pairot.tl.DatasetMap(adata1, adata2)#

Align cell annotations between query and reference dataset using annotation-informed optimal transport.

Examples

>>> import scanpy as sc
>>> import pairot as pr
>>>
>>> # 1. Preprocess input data
>>> adata_query, adata_ref = pr.pp.preprocess_adatas(
...     sc.read_h5ad("path/to/query.h5ad"),
...     sc.read_h5ad("path/to/reference.h5ad"),
...     n_top_genes=750,
...     cell_type_column_adata1="cell_type_column_query",
...     cell_type_column_adata2="cell_type_column_ref",
...     sample_column_adata1="sequencing_sample_column_query",
...     sample_column_adata2="sequencing_sample_column_ref",
... )
>>>
>>> # 2. Initialize pairOT model
>>> dataset_map = pr.tl.DatasetMap(adata_query, adata_ref)
>>> dataset_map.init_geom(batch_size=512, epsilon=0.05)
>>> dataset_map.init_problem(tau_a=1.0, tau_b=1.0)
>>>
>>> # 3. Fit pairOT model
>>> dataset_map.solve()
>>> mapping = dataset_map.compute_mapping()
>>> distance = dataset_map.compute_distance()
>>>
>>> # 4. Visualize results
>>> pr.pl.mapping(mapping)  # similarity matrix
>>> distance = distance.loc[
...     mapping.max(axis=1).sort_values(ascending=False).index.tolist(),
...     mapping.max().sort_values(ascending=False).index.tolist(),
... ]  # order cluster distance matrix the same way as similarity matrix
>>> pr.pl.distance(distance)  # cluster distance matrix

Attributes table#

DEGs_label_distance

Return the differentially expressed genes (DEGs) on which the label distance matrix is based.

label_distance

Return the label distance matrix between cell-type clusters.

Methods table#

compute_distance([n_samples])

Compute the distance between cell-type clusters based on the optimal transport mappings.

compute_mapping([aggregation_method])

Compute the mapping between cell-type clusters based on the aggregated transport matrix.

init_geom([epsilon, batch_size, ...])

Initialize the geometry of the optimal transport problem.

init_problem([tau_a, tau_b, ...])

Initialize the optimal transport problem.

preprocess_adatas(adata1, adata2[, ...])

Function for pre-processing the input AnnData objects for usage with pairot.tl.DatasetMap.

select_similar_clusters(mapping, distance[, ...])

Select the n_top most similar cell-type clusters in the reference data for each cell-type cluster in the query data based on the aggregated transport matrix.

solve(**kwargs)

Solve the underlying optimal transport problem.

Attributes#

DatasetMap.DEGs_label_distance#

Return the differentially expressed genes (DEGs) on which the label distance matrix is based.

DatasetMap.label_distance#

Return the label distance matrix between cell-type clusters.

Methods#

DatasetMap.compute_distance(n_samples=10000)#

Compute the distance between cell-type clusters based on the optimal transport mappings.

Parameters:

n_samples (int (default: 10000)) – Number of samples used to calculate the distance.

Return type:

DataFrame

Returns:

Distances between cell-type clusters of the query and reference datasets.

DatasetMap.compute_mapping(aggregation_method='mean')#

Compute the mapping between cell-type clusters based on the aggregated transport matrix.

Aggregation is done by cluster/cell-type.

Parameters:

aggregation_method (Optional[Literal['mean', 'jensen_shannon', 'transported_mass']] (default: 'mean')) – Method used to aggregate the transport map between cell-type clusters.

Return type:

DataFrame | dict[str, DataFrame]

Returns:

Mappings between cell-type clusters of the query and reference datasets.
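
The cluster-level aggregation described above can be pictured with a small, library-independent sketch. It assumes that 'mean' averages the cell-level transport mass over all cell pairs belonging to a (query cluster, reference cluster) pair; the exact normalization pairOT applies is not stated here, and the helper name aggregate_mean and the toy data are hypothetical.

```python
from collections import defaultdict


def aggregate_mean(transport, query_labels, ref_labels):
    """Average cell-level transport mass per (query cluster, reference cluster) pair.

    Illustrative only: assumes 'mean' is a plain average over cell pairs.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for i, q_label in enumerate(query_labels):
        for j, r_label in enumerate(ref_labels):
            sums[(q_label, r_label)] += transport[i][j]
            counts[(q_label, r_label)] += 1
    return {pair: sums[pair] / counts[pair] for pair in sums}


# Toy transport matrix: rows are query cells, columns are reference cells.
transport = [
    [0.20, 0.05, 0.00],
    [0.10, 0.15, 0.00],
    [0.00, 0.00, 0.50],
]
query_labels = ["T cell", "T cell", "B cell"]
ref_labels = ["CD4 T", "CD8 T", "B"]

cluster_mapping = aggregate_mean(transport, query_labels, ref_labels)
```

In this toy case the two query T cells send most of their mass to the reference T-cell clusters, and the single query B cell sends all of its mass to the reference B cluster.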

DatasetMap.init_geom(epsilon=0.05, batch_size=1024, lambda_feature=0.5, lambda_label=1.5, n_genes_ova=10, n_genes_ava=3, overlap_threshold_ava=0.3, overlap_n_genes_ava=10, adj_p_val_threshold=0.05, auroc_threshold=0.6, logfc_threshold=1.0, gene_filtering=True, q_norm=0.33, embedding_layer=None, **kwargs)#

Initialize the geometry of the optimal transport problem.

Function calls the constructor of ott.geometry.pointcloud.PointCloud.

Parameters:
  • epsilon (float (default: 0.05)) – Regularization strength of the optimal transport problem.

  • batch_size (int | None (default: 1024)) – Batch size used to solve the optimal transport problem in an online fashion. The bigger the batch size, the better the GPU utilization. However, bigger batch sizes lead to a higher GPU memory consumption.

  • lambda_feature (float (default: 0.5)) – Weight for the distance in gene/feature space for the cell to cell transport cost.

  • lambda_label (float (default: 1.5)) – Weight for the distance in label space for the cell to cell transport cost.

  • n_genes_ova (int (default: 10)) – Number of top differentially expressed (DE) genes in adata1 used to calculate the rank distance between DE genes for the label distance. This setting applies to the one-vs-all (OVA) DE test results.

  • n_genes_ava (int (default: 3)) – Number of top differentially expressed (DE) genes in adata1 used to calculate the rank distance between DE genes for the label distance. This setting applies to the all-vs-all (AVA) DE test results.

  • overlap_threshold_ava (float (default: 0.3)) – Minimum overlap of the top overlap_n_genes_ava DE genes to add the all-vs-all (AVA) DE test results for the corresponding cell type label combination.

  • overlap_n_genes_ava (int (default: 10)) – Number of top DE genes used to calculate the overlap of DE genes when deciding which all-vs-all (AVA) DE results to include.

  • adj_p_val_threshold (float (default: 0.05)) – Maximum adjusted p-value to consider a gene as differentially expressed.

  • auroc_threshold (float (default: 0.6)) – Minimum AUROC score to consider a gene as differentially expressed.

  • logfc_threshold (float (default: 1.0)) – Minimum log fold change to consider a gene as differentially expressed.

  • gene_filtering (bool (default: True)) – Whether to filter DE gene results. If True, mitochondrial, ribosomal, lncRNA, TCR and BCR genes are removed from the DE results.

  • q_norm (float (default: 0.33)) – Quantile used to normalize label distance matrix.

  • embedding_layer (str | None (default: None)) – Name of the embedding layer in adata1.obsm and adata2.obsm used to calculate the distance between two cells. If this parameter is provided, the distance between two cells is calculated as the cosine distance in embedding space instead of the Spearman correlation in the full gene space.

  • kwargs – Keyword arguments passed to ott.geometry.pointcloud.PointCloud.
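
As a rough intuition for how lambda_feature and lambda_label interact, the sketch below combines a feature-space distance and a label-space distance into one cell-to-cell cost. It assumes the two terms are combined as a weighted sum, which the parameter descriptions suggest but do not state; the helper names cosine_distance and transport_cost are hypothetical and not part of pairOT.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two embedding vectors (1 - cosine similarity)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def transport_cost(feature_distance, label_distance,
                   lambda_feature=0.5, lambda_label=1.5):
    """ASSUMPTION: cost is a weighted sum of feature and label distances."""
    return lambda_feature * feature_distance + lambda_label * label_distance


# Two cells with orthogonal embeddings whose labels are fairly similar.
d_feat = cosine_distance([1.0, 0.0], [0.0, 1.0])
cost = transport_cost(d_feat, label_distance=0.2)
```

Under this reading, raising lambda_label relative to lambda_feature makes the annotation-derived label distance dominate the cost, so cells are preferentially transported to clusters with similar marker genes.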

DatasetMap.init_problem(tau_a=1.0, tau_b=1.0, marginals_distribution='balanced', **kwargs)#

Initialize the optimal transport problem.

Function calls the constructor of ott.problems.linear.linear_problem.LinearProblem.

Parameters:
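  • tau_a (float (default: 1.0)) – Controls how unbalanced the problem is on the first marginal. A value of 1.0 enforces the marginal constraint exactly; values below 1.0 relax it.

  • tau_b (float (default: 1.0)) – Controls how unbalanced the problem is on the second marginal. A value of 1.0 enforces the marginal constraint exactly; values below 1.0 relax it.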

  • marginals_distribution (Literal['uniform', 'balanced'] (default: 'balanced')) – Whether the marginals are uniform or balanced by cell-type frequency. With 'uniform', each cell contributes the same mass to the marginal distribution. With 'balanced', each cell type contributes the same total mass to the marginal distribution. This parameter is ignored if the marginals a or b are supplied via **kwargs.

  • kwargs – Keyword arguments passed to ott.problems.linear.linear_problem.LinearProblem.
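
The difference between the two marginals_distribution options can be illustrated with a small standalone sketch. It assumes the marginal is normalized to a total mass of 1 and that, for 'balanced', each cell type's share is split evenly among its cells; the helper name marginal is hypothetical and this is not pairOT's implementation.

```python
from collections import Counter


def marginal(labels, distribution="balanced"):
    """Per-cell marginal weights summing to 1 (illustrative sketch)."""
    n_cells = len(labels)
    if distribution == "uniform":
        # Every cell carries the same mass.
        return [1.0 / n_cells] * n_cells
    if distribution == "balanced":
        # Every cell type carries the same total mass, shared by its cells.
        counts = Counter(labels)
        n_types = len(counts)
        return [1.0 / (n_types * counts[label]) for label in labels]
    raise ValueError(f"Unknown distribution: {distribution!r}")


labels = ["T cell", "T cell", "T cell", "B cell"]
uniform = marginal(labels, "uniform")
balanced = marginal(labels, "balanced")
```

With 'balanced', the single B cell carries as much mass as the three T cells combined, which keeps rare cell types from being overwhelmed by abundant ones.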

static DatasetMap.preprocess_adatas(adata1, adata2, cell_type_column_adata1='cell_type_author', cell_type_column_adata2='cell_type_author', sample_column_adata1='sample_id', sample_column_adata2='sample_id', n_top_genes=500, filter_genes=False, n_samples_auroc=10000, n_samples_hvg_selection=100000)#

Function for pre-processing the input AnnData objects for usage with pairot.tl.DatasetMap.

Function applies the following preprocessing steps:
  1. Subset gene space to genes that are expressed in both datasets.

  2. Calculate differentially-expressed (DE) genes for each cluster.

  3. Subset to highly variable genes which are used to calculate the Spearman correlation between two cells.

Parameters:
  • adata1 (AnnData) – Query data.

  • adata2 (AnnData) – Reference data.

  • cell_type_column_adata1 (str (default: 'cell_type_author')) – Name of the column in adata.obs that contains the cell type labels for adata1.

  • cell_type_column_adata2 (str (default: 'cell_type_author')) – Name of the column in adata.obs that contains the cell type labels for adata2.

  • sample_column_adata1 (str (default: 'sample_id')) – Name of the column in adata.obs that contains the sequencing sample ids/labels for adata1.

  • sample_column_adata2 (str (default: 'sample_id')) – Name of the column in adata.obs that contains the sequencing sample ids/labels for adata2.

  • n_top_genes (int (default: 500)) – Number of highly variable genes to use to calculate the Spearman correlation between two cells.

  • filter_genes (bool (default: False)) – Whether to remove uninformative genes. If True, mitochondrial, ribosomal, lncRNA, TCR and BCR genes are removed.

  • n_samples_auroc (int (default: 10000)) – Maximum number of samples to use for the AUROC calculation. If None, all samples are used. Subsampling can drastically reduce computation time for large datasets.

  • n_samples_hvg_selection (int (default: 100000)) – Number of samples to use for highly variable gene selection. If None, all samples are used. Subsampling can drastically reduce memory usage for large datasets.

Return type:

tuple[AnnData, AnnData]

Returns:

Preprocessed AnnData objects.
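
The first preprocessing step listed above, restricting the gene space to genes expressed in both datasets, might look roughly like the following on plain dictionaries. The sketch assumes that "expressed" means a non-zero total count and uses made-up per-gene totals; the helper name genes_expressed_in_both is hypothetical, and the real function operates on AnnData objects.

```python
def genes_expressed_in_both(counts_query, counts_ref):
    """Genes with a non-zero total count in both datasets, in query order.

    ASSUMPTION: 'expressed' is taken to mean total count > 0.
    """
    return [
        gene
        for gene, count in counts_query.items()
        if count > 0 and counts_ref.get(gene, 0) > 0
    ]


# Hypothetical per-gene total counts for each dataset.
counts_query = {"CD3E": 120, "MS4A1": 0, "NKG7": 45, "LYZ": 300}
counts_ref = {"LYZ": 80, "CD3E": 10, "GNLY": 5, "MS4A1": 60}

common_genes = genes_expressed_in_both(counts_query, counts_ref)
```

Here MS4A1 is dropped because it has no counts in the query data, and NKG7 and GNLY are dropped because each is missing from one of the two datasets.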

static DatasetMap.select_similar_clusters(mapping, distance, threshold_mapping=0.25, threshold_distance=1.0, n_top=None)#

Select the n_top most similar cell-type clusters in the reference data for each cell-type cluster in the query data based on the aggregated transport matrix.

Parameters:
  • mapping (DataFrame) – The cluster mapping matrix / aggregated transport matrix. The output of DatasetMap.compute_mapping().

  • distance (DataFrame) – The cluster distance matrix. The output of DatasetMap.compute_distance().

  • threshold_mapping (float | None (default: 0.25)) – The minimum transported mass between two cell-type clusters for a cluster to be suggested as a most similar cell-type cluster. If set to None, no filtering will be applied.

  • threshold_distance (float | None (default: 1.0)) – The maximum distance between two cell-type clusters for a cluster to be suggested as a most similar cell-type cluster. If set to None, no filtering will be applied.

  • n_top (int | None (default: None)) – Maximum number of most similar clusters to return. If set to None, all cell-type clusters will be returned.

Return type:

dict[str, list[str]]

Returns:

A dictionary mapping each cell-type cluster in the query data to a list of most similar cell type clusters in the reference data.
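
The selection rule described by the parameters above can be sketched without the library. The version below uses nested dictionaries keyed by query cluster in place of DataFrames, and it assumes that both thresholds are inclusive and that candidates are ranked by transported mass; none of these details are confirmed here. The helper name select_similar is hypothetical.

```python
def select_similar(mapping, distance, threshold_mapping=0.25,
                   threshold_distance=1.0, n_top=None):
    """Pick reference clusters per query cluster (illustrative sketch).

    ASSUMPTIONS: thresholds are inclusive; ranking is by transported mass.
    """
    selected = {}
    for query, row in mapping.items():
        ranked = sorted(row.items(), key=lambda item: item[1], reverse=True)
        candidates = [
            ref
            for ref, mass in ranked
            if (threshold_mapping is None or mass >= threshold_mapping)
            and (threshold_distance is None
                 or distance[query][ref] <= threshold_distance)
        ]
        selected[query] = candidates if n_top is None else candidates[:n_top]
    return selected


# Toy cluster-level matrices: outer key = query cluster, inner key = reference cluster.
mapping = {
    "T cell": {"CD4 T": 0.55, "CD8 T": 0.40, "B": 0.05},
    "B cell": {"CD4 T": 0.02, "CD8 T": 0.03, "B": 0.95},
}
distance = {
    "T cell": {"CD4 T": 0.3, "CD8 T": 1.4, "B": 2.0},
    "B cell": {"CD4 T": 2.1, "CD8 T": 2.2, "B": 0.1},
}

strict = select_similar(mapping, distance)
no_distance_filter = select_similar(mapping, distance, threshold_distance=None)
```

With the default thresholds, CD8 T is rejected for the query T cells because its distance exceeds threshold_distance even though it receives enough mass; disabling the distance filter brings it back.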

DatasetMap.solve(**kwargs)#

Solve the underlying optimal transport problem.

Function uses ott.solvers.linear.sinkhorn.Sinkhorn to solve the optimal transport problem.

Parameters:

kwargs – Keyword arguments passed to ott.solvers.linear.sinkhorn.Sinkhorn.
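
For readers unfamiliar with what the solver does, the following is a minimal textbook Sinkhorn iteration for the balanced case (tau_a = tau_b = 1) in plain Python. It is not the OTT implementation, which runs in JAX and also handles unbalanced problems and batched cost evaluation; the function name sinkhorn_plan and the toy inputs are hypothetical.

```python
import math


def sinkhorn_plan(cost, a, b, epsilon=0.5, n_iters=500):
    """Entropy-regularized OT plan via alternating marginal scaling."""
    n, m = len(a), len(b)
    # Gibbs kernel: small epsilon concentrates mass on low-cost pairs.
    kernel = [[math.exp(-cost[i][j] / epsilon) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Rescale rows to match marginal a, then columns to match marginal b.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


cost = [[0.0, 1.0], [1.0, 0.0]]
a = [0.7, 0.3]  # mass on the two query cells
b = [0.4, 0.6]  # mass on the two reference cells

plan = sinkhorn_plan(cost, a, b)
row_sums = [sum(row) for row in plan]
col_sums = [sum(plan[i][j] for i in range(2)) for j in range(2)]
```

After convergence the rows of the plan sum to a and the columns sum to b. The regularization strength epsilon plays the same role as the epsilon argument of init_geom, although the value used in this toy example is chosen for the toy cost scale rather than taken from the defaults above.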