pairot.pp.preprocess_adatas


pairot.pp.preprocess_adatas(adata1, adata2, cell_type_column_adata1, cell_type_column_adata2, sample_column_adata1, sample_column_adata2, n_top_genes=500, filter_genes=False, n_samples_auroc=10000, n_samples_hvg_selection=100000)

Pre-process the input AnnData objects for use with pairot.tl.DatasetMap.

The function applies the following preprocessing steps:
  1. Subset gene space to genes that are expressed in both datasets.

  2. Calculate differentially-expressed (DE) genes for each cluster.

  3. Subset to highly variable genes, which are used to calculate the Spearman correlation between two cells.
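Steps 1 and 3 can be illustrated with a minimal NumPy sketch. This is not pairot's implementation: the helper names `expressed_in_both` and `spearman` are hypothetical, and treating a gene as "expressed" when it has at least one non-zero count is an assumption made here for illustration only.

```python
import numpy as np


def expressed_in_both(counts1, genes1, counts2, genes2):
    """Return genes with a non-zero total count in both matrices.

    Matrices are cells x genes. The "non-zero total" criterion is an
    assumption for this sketch, not pairot's documented rule.
    """
    expressed1 = {g for g, total in zip(genes1, counts1.sum(axis=0)) if total > 0}
    expressed2 = {g for g, total in zip(genes2, counts2.sum(axis=0)) if total > 0}
    # Keep the gene order of the first dataset.
    return [g for g in genes1 if g in expressed1 and g in expressed2]


def average_ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        mask = values == v
        ranks[mask] = ranks[mask].mean()
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])


# Toy data: two cells per dataset, gene order differs between datasets.
counts_query = np.array([[1, 0, 3, 2], [0, 0, 5, 1]])
genes_query = ["G1", "G2", "G3", "G4"]
counts_ref = np.array([[2, 4, 0], [1, 1, 0]])
genes_ref = ["G4", "G1", "G3"]

shared = expressed_in_both(counts_query, genes_query, counts_ref, genes_ref)

# Align both matrices to the shared gene order before comparing cells.
idx_query = [genes_query.index(g) for g in shared]
idx_ref = [genes_ref.index(g) for g in shared]
rho = spearman(counts_query[0, idx_query], counts_ref[0, idx_ref])
```

In the actual function, the correlation is computed on the selected highly variable genes rather than on all shared genes; the sketch only shows why both datasets must first be restricted to a common, identically ordered gene set.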

Parameters:
  • adata1 (AnnData) – Query data.

  • adata2 (AnnData) – Reference data.

  • cell_type_column_adata1 (str) – Name of the column in adata1.obs that contains the cell type labels for adata1.

  • cell_type_column_adata2 (str) – Name of the column in adata2.obs that contains the cell type labels for adata2.

  • sample_column_adata1 (str) – Name of the column in adata1.obs that contains the sequencing sample ids/labels for adata1.

  • sample_column_adata2 (str) – Name of the column in adata2.obs that contains the sequencing sample ids/labels for adata2.

  • n_top_genes (int (default: 500)) – Number of highly variable genes to use to calculate the Spearman correlation between two cells.

  • filter_genes (bool (default: False)) – Whether to remove uninformative genes. If True, mitochondrial, ribosomal, lncRNA, TCR and BCR genes are removed.

  • n_samples_auroc (int (default: 10000)) – Maximum number of samples to use for AUROC calculation. If None, all samples are used. Limiting the number of samples can drastically reduce computation time for large datasets.

  • n_samples_hvg_selection (int (default: 100000)) – Maximum number of samples to use for highly variable gene selection. If None, all samples are used. Limiting the number of samples can drastically reduce memory usage for large datasets.
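The effect of the two `n_samples_*` parameters can be pictured as capping the number of cells before an expensive step. The sketch below is an assumption about the general technique (uniform random subsampling without replacement), not pairot's actual code; `subsample_indices` and its `seed` argument are hypothetical names.

```python
import numpy as np


def subsample_indices(n_cells, n_samples=None, seed=0):
    """Return sorted row indices of at most n_samples cells.

    n_samples=None (or a cap at least as large as the dataset) keeps
    every cell, mirroring the "If None, all samples are used" behaviour
    described above.
    """
    if n_samples is None or n_samples >= n_cells:
        return np.arange(n_cells)
    rng = np.random.default_rng(seed)
    return np.sort(rng.choice(n_cells, size=n_samples, replace=False))


# A cap of 10 on a 50-cell dataset selects 10 distinct cells.
idx_capped = subsample_indices(50, n_samples=10)
# No cap: all cells are kept.
idx_all = subsample_indices(50, n_samples=None)
```

A smaller cap trades some statistical precision for lower run time or memory use, which is why the defaults only take effect on datasets larger than 10,000 and 100,000 cells respectively.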

Return type:

tuple[AnnData, AnnData]

Examples

>>> import anndata as ad
>>> from pairot.pp import preprocess_adatas
>>>
>>> adata_query = ad.read_h5ad("path/to/query_data.h5ad")
>>> adata_ref = ad.read_h5ad("path/to/ref_data.h5ad")
>>> adata_query, adata_ref = preprocess_adatas(
...     adata_query,
...     adata_ref,
...     cell_type_column_adata1="cell_type_col_adata1",
...     cell_type_column_adata2="cell_type_col_adata2",
...     sample_column_adata1="sample_id_col_adata1",
...     sample_column_adata2="sample_id_col_adata2",
... )