pairot.pp.preprocess_adatas
- pairot.pp.preprocess_adatas(adata1, adata2, cell_type_column_adata1, cell_type_column_adata2, sample_column_adata1, sample_column_adata2, n_top_genes=500, filter_genes=False, n_samples_auroc=10000, n_samples_hvg_selection=100000)
Pre-process the input AnnData objects for use with pairot.tl.DatasetMap.
The function applies the following preprocessing steps:
- Subset the gene space to genes that are expressed in both datasets.
- Calculate differentially expressed (DE) genes for each cluster.
- Subset to highly variable genes, which are used to calculate the Spearman correlation between two cells.
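To make the first and last steps concrete, the following pure-Python sketch intersects two gene lists and computes the Spearman correlation between two cells over the shared genes. It is an illustration under stated assumptions, not pairot's implementation: the helper names (`shared_genes`, `average_ranks`, `spearman`), the gene symbols, and the expression values are all made up for this example, and the highly variable gene selection itself is not shown.

```python
def shared_genes(genes_a, genes_b):
    """Keep genes present in both lists, preserving the order of genes_a."""
    in_b = set(genes_b)
    return [g for g in genes_a if g in in_b]


def average_ranks(values):
    """Rank values from 1..n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Hypothetical gene lists for a query and a reference dataset.
query_genes = ["CD3E", "MS4A1", "LYZ", "NKG7", "GNLY"]
ref_genes = ["LYZ", "CD3E", "NKG7", "PPBP", "MS4A1"]
common = shared_genes(query_genes, ref_genes)

# Hypothetical expression values for one cell from each dataset.
query_cell = {"CD3E": 5.0, "MS4A1": 0.0, "LYZ": 1.0, "NKG7": 3.0, "GNLY": 2.0}
ref_cell = {"LYZ": 0.5, "CD3E": 4.0, "NKG7": 2.0, "PPBP": 0.0, "MS4A1": 0.1}

rho = spearman([query_cell[g] for g in common], [ref_cell[g] for g in common])
```

Because Spearman correlation depends only on ranks, the two cells are comparable even when their expression values sit on different scales, which is a plausible reason for using it across datasets.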
- Parameters:
  adata1 (AnnData) – Query data.
  adata2 (AnnData) – Reference data.
  cell_type_column_adata1 (str) – Name of the column in adata.obs that contains the cell type labels for adata1.
  cell_type_column_adata2 (str) – Name of the column in adata.obs that contains the cell type labels for adata2.
  sample_column_adata1 (str) – Name of the column in adata.obs that contains the sequencing sample ids/labels for adata1.
  sample_column_adata2 (str) – Name of the column in adata.obs that contains the sequencing sample ids/labels for adata2.
  n_top_genes (int (default: 500)) – Number of highly variable genes to use to calculate the Spearman correlation between two cells.
  filter_genes (bool (default: False)) – Whether to remove uninformative genes. If True, mitochondrial, ribosomal, lncRNA, TCR and BCR genes are removed.
  n_samples_auroc (int (default: 10000)) – Maximum number of samples to use for AUROC calculation. If None, all samples are used. Subsampling can drastically reduce computation time for large datasets.
  n_samples_hvg_selection (int (default: 100000)) – Number of samples to use for highly variable gene selection. If None, all samples are used. Subsampling can drastically reduce memory usage for large datasets.
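The two n_samples_* parameters act as caps on how many cells enter a computation. The sketch below shows one way such a cap could behave; it is an assumption for illustration only. The helper `subsample_indices` and its `seed` argument are hypothetical, and the documentation does not state how pairot actually draws the subsample (for example uniformly at random or stratified by cell type).

```python
import random


def subsample_indices(n_cells, n_samples=None, seed=0):
    """Return sorted cell indices, capped at n_samples (None keeps all cells)."""
    if n_samples is None or n_samples >= n_cells:
        # No cap requested, or the dataset is already small enough.
        return list(range(n_cells))
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    return sorted(rng.sample(range(n_cells), n_samples))


# Hypothetical dataset with 50,000 cells.
idx_all = subsample_indices(50_000, n_samples=None)
idx_capped = subsample_indices(50_000, n_samples=10_000)
```

With a cap in place, cost scales with the cap rather than with the full dataset size, which matches the stated motivation of reducing computation time and memory usage.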
- Return type:
Examples
>>> import anndata as ad
>>> from pairot.pp import preprocess_adatas
>>>
>>> adata_query = ad.read_h5ad("path/to/query_data.h5ad")
>>> adata_ref = ad.read_h5ad("path/to/ref_data.h5ad")
>>> adata_query, adata_ref = preprocess_adatas(
...     adata_query,
...     adata_ref,
...     cell_type_column_adata1="cell_type_col_adata1",
...     cell_type_column_adata2="cell_type_col_adata2",
...     sample_column_adata1="sample_id_col_adata1",
...     sample_column_adata2="sample_id_col_adata2",
... )