Tutorial 2: Vertical denoising and integration on DLPFC multi slices from 10x Visium

In this tutorial, we evaluate the capability of STADIM in handling vertically adjacent slices. We use the human Dorsolateral Prefrontal Cortex (DLPFC) slice #151673-#151676 generated by 10x Visium platform as a representative example, which is annotated by Maynard et al. into white matter (WM) and cortical layers (L1-L6) based on marker genes and cellular structure.

We download the data via http://spatial.libd.org/spatialLIBD/.

Quick view data

[1]:

import scanpy as sc
import matplotlib.pyplot as plt
import os

sample_ids = ['151673', '151674', '151675', '151676']
data_dir = '/data2/xiaost/SODA/Data/DLPFC/'

fig, axes = plt.subplots(1, 4, figsize=(16, 4))

for i, s_id in enumerate(sample_ids):
    adata = sc.read_h5ad(os.path.join(data_dir, f'dlpfc_{s_id}.h5ad'))
    print(f'{s_id}')
    print(adata)

    sc.pl.spatial(adata, color='Label', ax=axes[i], show=False, title=f'Sample {s_id}')

    axes[i].set_xlabel('')
    axes[i].set_ylabel('')

plt.tight_layout()
plt.show()

151673
AnnData object with n_obs × n_vars = 3611 × 33538
    obs: 'in_tissue', 'array_row', 'array_col', 'sample', 'Label'
    var: 'gene_ids', 'feature_types', 'genome'
    uns: 'spatial'
    obsm: 'spatial'
151674
AnnData object with n_obs × n_vars = 3635 × 33538
    obs: 'in_tissue', 'array_row', 'array_col', 'sample', 'Label'
    var: 'gene_ids', 'feature_types', 'genome'
    uns: 'spatial'
    obsm: 'spatial'
151675
AnnData object with n_obs × n_vars = 3566 × 33538
    obs: 'in_tissue', 'array_row', 'array_col', 'sample', 'Label'
    var: 'gene_ids', 'feature_types', 'genome'
    uns: 'spatial'
    obsm: 'spatial'
151676
AnnData object with n_obs × n_vars = 3431 × 33538
    obs: 'in_tissue', 'array_row', 'array_col', 'sample', 'Label'
    var: 'gene_ids', 'feature_types', 'genome'
    uns: 'spatial'
    obsm: 'spatial'

Run STADIM

You can call the STADIM command directly in your terminal:

nohup stadim --input "./dlpfc_151673.h5ad" "./dlpfc_151674.h5ad" "./dlpfc_151675.h5ad" "./dlpfc_151676.h5ad" --save_dir ./dlpfc_donor1_test --device "cuda:0" --monitor --seed 2026 --min_genes 0 --min_cells 10 --knn_c 6 --knn_e 10 --mnn_n 25 > ./dlpfc_donor1_test.log 2>&1 &

and load the results into your Python environment for downstream analysis. The denoised expression matrix is conveniently stored within the layers attribute:

import anndata as ad

adata = ad.read_h5ad('/dlpfc_donor1_test/sdata.h5ad')

## The denoised gene expressionis is stored in adata.layers['STADIM']
print(adata)

or you can call STADIM within a Jupyter Notebook as follows:

[2]:

import stadim

adata = stadim.run_STADIM(['/data2/xiaost/SODA/Data/DLPFC/dlpfc_151673.h5ad', '/data2/xiaost/SODA/Data/DLPFC/dlpfc_151674.h5ad',
                          '/data2/xiaost/SODA/Data/DLPFC/dlpfc_151675.h5ad', '/data2/xiaost/SODA/Data/DLPFC/dlpfc_151676.h5ad'],
                          save_preprocessed_h5ad=None, save_dir=None,
                          sample_names=None, batch_key='sample', device='cuda:1', seed=2026,
                          min_genes=0, min_cells=10, nhvgs=2000, dim=50, knn_c=6, knn_e=10, mnn_n=25,
                          batch_size=256, triplets_update_ratio=0.8, hard_triplets_ratio=0.7,
                          epochs=100, lr=1e-3, loss_mode='nb', beta_trip=0.1,
                          encoder_layers=[512, 256, 64], decoder_layers=[1000])

Results will be stored in adata.layers['STADIM']

=== 1. Begin read_data!
Detected multi-slice data, total 4 slices
  Slice 1: Using user-defined column 'sample' as sample
  Slice 1: Keeping existing sample label 151673
  Slice 2: Using user-defined column 'sample' as sample
  Slice 2: Keeping existing sample label 151674
  Slice 3: Using user-defined column 'sample' as sample
  Slice 3: Keeping existing sample label 151675
  Slice 4: Using user-defined column 'sample' as sample
  Slice 4: Keeping existing sample label 151676

Merging data...

Raw Merged Data: 14243 spots × 33538 genes
Unified Filtering (min_genes=0, min_cells=10)...
  → After Filtering: 14243 spots × 19326 genes
  ✓ Complete: Samples included: ['151673', '151674', '151675', '151676']
  ✓ Complete: AnnData object with n_obs × n_vars = 14243 × 19326
    obs: 'in_tissue', 'array_row', 'array_col', 'sample', 'Label'
    var: 'gene_ids', 'feature_types', 'genome', 'n_cells'
    uns: 'spatial'
    obsm: 'spatial'
    layers: 'X_raw'

=== 2. Begin MY_preprocess!
== Selected 2000 HVGs across 4 slices

=== 3. Begin all_ap find neighbors!

=== 4. Begin pre_dataset!
{0: '151673', 1: '151674', 2: '151675', 3: '151676'}

Preparing triplet: 100%|██████████| 14243/14243 [00:02<00:00, 6902.20it/s]


=== 5. Begin IterableTripletDataset!
Initializing triplets...
IterableTripletDataset Initialized: 14243 anchors, batch_size=256

=== 6. Begin calculate_recommended_margin!
Calculating recommended margin (averaging over 5 runs)...

========================================
Margins from 5 runs: [15.0, 15.0, 15.0, 15.0, 15.0]
Final Recommended Margin: 15.0000

=== 7. Starting training...
Begin training: device=cuda:1

Training: 100%|██████████| 100/100 [05:23<00:00,  3.24s/it, recon=0.306, triplet=1.794, total=0.486]


=== 8. Generating denoised expression...
   Data size (2.75e+08) is within limits. Allocating memory for denoised data.

All processes finished! Total time: 6.62 mins.

[3]:

print(adata)

AnnData object with n_obs × n_vars = 14243 × 19326
    obs: 'in_tissue', 'array_row', 'array_col', 'sample', 'Label', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'size_factor'
    var: 'gene_ids', 'feature_types', 'genome', 'n_cells', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'highly_variable'
    uns: 'spatial', 'log1p', 'top_hvgs'
    obsm: 'spatial', 'S_scale', 'X_hvg_scale', 'X_pca', 'cell_names', 'STADIM'
    layers: 'X_raw', 'X_norm', 'X_log', 'STADIM', 'STADIM_withbatch'

Analysis

[4]:

import anndata as ad
import pandas as pd
import scanpy as sc
import numpy as np
import os
import warnings
warnings.filterwarnings('ignore')

import seaborn as sns
from matplotlib.colors import LinearSegmentedColormap
gene_exp_colors = sns.blend_palette(["#C9E2FF", "#eae6cc", '#e31a1c'], n_colors=100)
gene_exp_palette = LinearSegmentedColormap.from_list("gene_exp", gene_exp_colors)
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D
import matplotlib.patches as mpatches
plt.rcParams.update({
    'font.size': 16,
    'axes.labelsize': 18,
    'axes.titlesize': 18,
    'xtick.labelsize': 16,
    'ytick.labelsize': 16,
    'legend.fontsize': 16,
})

styles = {
    'STADIM': {'color': '#d62728', 'marker': '*', 'ms': 10, 'lw': 2.5, 'zorder': 10},
    'X_norm': {'color': '#c7c7c7', 'marker': 'o', 'ms': 6, 'lw': 1.5, 'zorder': 5}
}

Gene expression visualization

[5]:

fig, axes = plt.subplots(1, 4, figsize=(18, 4))
for i, s_id in enumerate(sample_ids):
    sc.pl.spatial(adata[adata.obs['sample'] == s_id], layer='X_log', color='TRABD2A', cmap=gene_exp_palette, spot_size=150,
                  library_id=s_id, ax=axes[i], title=f"Sample {s_id}",show=False, vmax='p99')
    axes[i].set_xlabel('')
    axes[i].set_ylabel('')

plt.tight_layout()
plt.show()

[6]:

adata.X = adata.layers['STADIM'].copy()
sc.pp.log1p(adata)
adata.layers['STADIM_log'] = adata.X.copy()

fig, axes = plt.subplots(1, 4, figsize=(18, 4))
for i, s_id in enumerate(sample_ids):
    sc.pl.spatial(adata[adata.obs['sample'] == s_id], layer='STADIM_log', color='TRABD2A', cmap=gene_exp_palette, spot_size=150,
                  library_id=s_id, ax=axes[i], title=f"Sample {s_id}",show=False, vmax='p99')
    axes[i].set_xlabel('')
    axes[i].set_ylabel('')

plt.tight_layout()
plt.show()

WARNING: adata.X seems to be already log-transformed.

Batch Entropy

[7]:

%%time

res = stadim.calculate_batch_entropy(adata, test_layers=['X_norm', 'STADIM'], k_range=[10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
                                     group_key='Label', donor_id='DLPFC_Donor1', batch_key='sample', n_hvg=2000)

batch_entropy = pd.DataFrame(res)

display(batch_entropy)

	Method	Donor_ID	K	Avg_Entropy	Avg_Region_Ratio
0	X_norm	DLPFC_Donor1	10	0.561502	0.566145
1	X_norm	DLPFC_Donor1	20	0.639298	0.555410
2	X_norm	DLPFC_Donor1	30	0.675469	0.547668
3	X_norm	DLPFC_Donor1	40	0.697315	0.542279
4	X_norm	DLPFC_Donor1	50	0.712926	0.537930
5	X_norm	DLPFC_Donor1	60	0.725089	0.534229
6	X_norm	DLPFC_Donor1	70	0.735988	0.530475
7	X_norm	DLPFC_Donor1	80	0.744872	0.527497
8	X_norm	DLPFC_Donor1	90	0.752677	0.524724
9	X_norm	DLPFC_Donor1	100	0.759480	0.522301
10	STADIM	DLPFC_Donor1	10	0.818788	0.734438
11	STADIM	DLPFC_Donor1	20	0.893744	0.728073
12	STADIM	DLPFC_Donor1	30	0.919236	0.723916
13	STADIM	DLPFC_Donor1	40	0.932822	0.720812
14	STADIM	DLPFC_Donor1	50	0.941357	0.717912
15	STADIM	DLPFC_Donor1	60	0.947545	0.715151
16	STADIM	DLPFC_Donor1	70	0.952164	0.712708
17	STADIM	DLPFC_Donor1	80	0.955841	0.710375
18	STADIM	DLPFC_Donor1	90	0.958784	0.708112
19	STADIM	DLPFC_Donor1	100	0.961284	0.706041

CPU times: user 1min 29s, sys: 3.89 s, total: 1min 33s
Wall time: 40.6 s

[8]:

metrics = ['Avg_Entropy', 'Avg_Region_Ratio']
y_labels = ['Batch Entropy \u2191', 'Label Consistency \u2191']

plot_data = batch_entropy.groupby(['Method', 'K'])[metrics].mean().reset_index()

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True)

for i, metric in enumerate(metrics):
    ax = axes[i]
    for m in ['STADIM','X_norm']:
        data = plot_data[plot_data['Method'] == m].sort_values('K')
        if data.empty: continue

        ax.plot(data['K'], data[metric], label=m,
                color=styles[m]['color'], marker=styles[m]['marker'],
                markersize=styles[m]['ms'], linewidth=styles[m]['lw'],
                zorder=styles[m]['zorder'], alpha=0.9)

    ax.set_title(f"#DLPFC_Donor1")
    ax.set_xlabel('Number of Neighbors (K)')
    ax.set_ylabel(y_labels[i])
    ax.set_xticks([20, 40, 60, 80, 100])
    ax.grid(True, linestyle=':', alpha=0.5)
    sns.despine(ax=ax)

handles, labels = axes[0].get_legend_handles_labels()
fig.legend(handles, labels, loc='center left', bbox_to_anchor=(1.0, 0.5),
           title='Methods', frameon=False)

plt.tight_layout()
plt.show()

weighted Coefficient of Variation (wCV)

[9]:

benchmark_adata = stadim.create_shuffled_batches(adata, n_batches=5)
benchmark_adata.var = adata.var
print(f"\nbenchmark adata: {benchmark_adata}")


benchmark adata: AnnData object with n_obs × n_vars = 14243 × 19326
    obs: 'in_tissue', 'array_row', 'array_col', 'sample', 'Label', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'size_factor', 'sim_batch'
    var: 'gene_ids', 'feature_types', 'genome', 'n_cells', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'highly_variable'
    obsm: 'spatial', 'S_scale', 'X_hvg_scale', 'X_pca', 'STADIM'
    layers: 'X_raw', 'X_norm', 'X_log', 'STADIM', 'STADIM_withbatch', 'STADIM_log'

[10]:

%%time

wcv = {}
for gene_type in ["HVGs", "LVGs"]:
    print(f"\ngene_type: {gene_type}")

    # wCV before and after STADIM
    res_deno = stadim.calculate_cv(adata, batch_key='sample', label_key='Label', methods=['X_norm', 'STADIM'], gene_type=gene_type)

    # wCV of simulated benchmark no batch data, calculated with X_norm and shuffled labels
    res_bench = stadim.calculate_cv(benchmark_adata, batch_key='sim_batch', label_key='Label', methods=['X_norm'], gene_type=gene_type)
    res_bench = res_bench.rename(columns={'X_norm': 'Sim_NoBatch'})

    wcv[gene_type] = res_bench.join(res_deno, how='inner')

print(wcv["HVGs"].head())


gene_type: HVGs
--- Calculating CV on 2000 genes of type: HVGs ---
Methods successfully calculated: ['X_norm', 'STADIM']
--- Calculating CV on 2000 genes of type: HVGs ---
Methods successfully calculated: ['X_norm']

gene_type: LVGs
--- Calculating CV on 17326 genes of type: LVGs ---
Methods successfully calculated: ['X_norm', 'STADIM']
--- Calculating CV on 17326 genes of type: LVGs ---
Methods successfully calculated: ['X_norm']
            Sim_NoBatch    X_norm    STADIM
HES4           0.098084  0.125947  0.037362
VWA1           0.110024  0.116677  0.055306
AL645728.1     0.907844  0.635260  0.110994
FO704657.1     1.088555  0.903454  0.133141
AL109917.1     0.652083  0.713419  0.074418
CPU times: user 11.7 s, sys: 2.02 s, total: 13.7 s
Wall time: 13.7 s

[11]:

from mpl_toolkits.axes_grid1.inset_locator import inset_axes

methods = ['Sim_NoBatch', 'X_norm', 'STADIM']
colors = {'Sim_NoBatch': '#ff9896', 'X_norm': '#c7c7c7', 'STADIM': '#d62728'}
gene_types = ['HVGs', 'LVGs']


fig, axes = plt.subplots(1, 2, figsize=(12, 4))

for i, g_type in enumerate(gene_types):
    ax = axes[i]
    df = wcv[g_type]

    ax_ins = inset_axes(ax, width="70%", height="60%", loc=1, borderpad=1)

    for m in methods:
        if m not in df.columns: continue

        style = {
            'label': m, 'color': colors.get(m, '#333333'), 'fill': False,
            'linewidth': 2.5 if m != 'X_norm' else 1.5,
            'linestyle': '--' if m == 'Sim_NoBatch' else '-',
            'zorder': 10 if m == 'STADIM' else 5,
            'cut': 0
        }

        sns.kdeplot(df[m].dropna(), ax=ax, **style)
        sns.kdeplot(df[m].dropna(), ax=ax_ins, **style)

    ax.set_title(f"#DLPFC_Donor1 ({g_type})")
    ax.set_xlabel("Weighted Coefficient of Variation \u2193")
    ax_ins.set_xlim(0, 0.4)
    ax_ins.set_title("Zoom-in", fontsize=14)
    sns.despine(ax=ax)

handles, labels = axes[0].get_legend_handles_labels()
fig.legend(handles, labels, loc='center left', bbox_to_anchor=(1.0, 0.5),
           title='Methods', frameon=False)

plt.tight_layout()
plt.show()

Latent representation Umap

[12]:

%%time

sc.pp.neighbors(adata, use_rep='STADIM', key_added=f'neighbors_STADIM')
sc.tl.umap(adata, neighbors_key=f'neighbors_STADIM')
adata.obsm['X_umap_STADIM'] = adata.obsm['X_umap'].copy()
print(adata)

AnnData object with n_obs × n_vars = 14243 × 19326
    obs: 'in_tissue', 'array_row', 'array_col', 'sample', 'Label', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'size_factor'
    var: 'gene_ids', 'feature_types', 'genome', 'n_cells', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'highly_variable'
    uns: 'spatial', 'log1p', 'top_hvgs', 'neighbors_STADIM', 'umap'
    obsm: 'spatial', 'S_scale', 'X_hvg_scale', 'X_pca', 'STADIM', 'X_umap', 'X_umap_STADIM'
    layers: 'X_raw', 'X_norm', 'X_log', 'STADIM', 'STADIM_withbatch', 'STADIM_log'
    obsp: 'neighbors_STADIM_distances', 'neighbors_STADIM_connectivities'
CPU times: user 4min 28s, sys: 225 ms, total: 4min 28s
Wall time: 1min 3s

[13]:

fig, axes = plt.subplots(1, 2, figsize=(8, 3))

sc.pl.embedding(adata, basis='X_umap_STADIM', color='Label', ax=axes[0], show=False, legend_loc='on data', legend_fontweight='normal',
                title='STADIM (Label)', frameon=False, s=15)

sc.pl.embedding(adata[np.random.permutation(adata.n_obs)], basis='X_umap_STADIM', color='sample', ax=axes[1], show=False,
                title='STADIM (Sample)', frameon=False, s=15)

plt.tight_layout()
plt.show()

[ ]:

[ ]: