Locating TSE nets with ND energy Gaussians.

  • TSE nets are extracted from 10 transition-state ensembles (J. Mol. Biol. 2002; 324:151–163).
  • Approach 1: Kolmogorov-Smirnoff (KS) goodness-of-fit test.
    • TSE nets of a protein are located at the $Q$ whose ND energy Gaussian most resembles its ND energies ($E$s).
  • Approach 2: Bayesian approach with uniform priors.
    • TSE nets of a protein are located at the $Q$ whose ND energy Gaussian is most likely to generate its ND energies ($E$s).
    • Assumes a TSE's $E$ values are independent and identically distributed (i.i.d.), and is resonably described by a single Gaussian.

1. Compare distributions with KS goodness-of-fit test

  • Compare distributions with Kolmogorov-Smirnoff (KS) goodness-of-fit test.
  • Hypothesis test whether an empirical distribution function $F_n(x)$ fits a distribution function $F(x)$.
  • An empricial distribution function $F_n(x)$ is the fraction ($k/n$) of sample values that is $\le x$. With larger sample sizes ($n \to \infty$), $F_n(x)$ converges to the theoretical distribution function $F(x)$.
  • The Kolmogorov-Smirnoff (KS) test statistic: $D_n = sup_x \left[\vert F_n(x) - F_0(n)\vert\right]$
  • Null hypothesis: $F_n(x) = F_0(x)$. Reject if $D_n$ is too large.
  • https://online.stat.psu.edu/stat414/node/322/, http://www.mit.edu/~6.s085/notes/lecture5.pdf

2. Bayesian Inference for Hypothesis Testing

In [1]:
import numpy as np
import pandas as pd
import os
from scipy.stats import norm, pearsonr, shapiro, kstest
from scipy.special import logsumexp
import matplotlib.pyplot as plt
import NDutils as nd # my lib
In [2]:
# import matplotlib
# matplotlib.use('AGG')
# https://matplotlib.org/tutorials/introductory/usage.html?highlight=saving%20figures%20file

Paci

In [3]:
fn = os.path.join('P_ND3', 'P_frates.txt')
df = nd.load_frates(fn)

PIDs = df.pid.tolist(); num_pids = len(PIDs)
fold_rates = df.k_f.values
print(PIDs)
print(fold_rates)
print("Range of fold rates:", np.ptp(fold_rates))
['1imq', '1lmb4', '1bf4', '2ptl', '2ci2', '1aps', '1fmk', '1bk2', '1shf', '1ten']
[ 3.16   3.69   3.018  1.778  1.681 -0.638  1.756  1.881  1.973  0.462]
Range of fold rates: 4.328
In [4]:
# load stats for the three ND variants, dict of dfs
edv = 'a'
dirn = os.path.join("P_ND3", "edv_" + edv)
c_ndstats = nd.get_NDstats(dirn, PIDs, 'c', edv)
x_ndstats = nd.get_NDstats(dirn, PIDs, 'x', edv)
r_ndstats = nd.get_NDstats(dirn, PIDs, 'r', edv)
a_ndstats = nd.get_NDstats(dirn, PIDs, 'a', edv)
y_ndstats = nd.get_NDstats(dirn, PIDs, 'y', edv)
In [5]:
c_ndstats['1imq'].sc_mje # 100 runs per pid
Out[5]:
z
0.1     19.38
0.2     -3.99
0.3     21.58
0.4     18.67
0.5      1.94
        ...  
0.6    -47.29
0.7   -157.05
0.8   -258.75
0.9   -370.91
1.0   -424.52
Name: sc_mje, Length: 1000, dtype: float64
In [6]:
# ND peak energies and peak Qs
c_mje_params = nd.get_params(c_ndstats, 'sc_mje')
c_peakQs, c_peakEs = nd.get_peaks(c_mje_params, PIDs)
print("'c'", c_peakQs, np.mean(c_peakQs), np.std(c_peakQs))

x_mje_params = nd.get_params(x_ndstats, 'sc_mje')
x_peakQs, x_peakEs = nd.get_peaks(x_mje_params, PIDs)
print("'x'", x_peakQs, np.mean(x_peakQs), np.std(x_peakQs))

r_mje_params = nd.get_params(r_ndstats, 'sc_mje')
r_peakQs, r_peakEs = nd.get_peaks(r_mje_params, PIDs)
print("'r'", r_peakQs, np.mean(r_peakQs), np.std(r_peakQs))

a_mje_params = nd.get_params(a_ndstats, 'sc_mje')
a_peakQs, a_peakEs = nd.get_peaks(a_mje_params, PIDs)
print("'a'", np.mean(a_peakQs), np.std(a_peakQs))

y_mje_params = nd.get_params(y_ndstats, 'sc_mje')
y_peakQs, y_peakEs = nd.get_peaks(y_mje_params, PIDs)
print("'y'", np.mean(y_peakQs), np.std(y_peakQs))
'c' [0.2, 0.3, 0.3, 0.3, 0.3, 0.3, 0.2, 0.2, 0.2, 0.3] 0.26 0.04898979485566354
'x' [0.2, 0.3, 0.3, 0.3, 0.3, 0.3, 0.2, 0.2, 0.2, 0.2] 0.25000000000000006 0.04999999999999999
'r' [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.1, 0.2, 0.2] 0.19 0.03
'a' 0.25000000000000006 0.04999999999999999
'y' 0.15000000000000005 0.05000000000000001

Load $E$s of TSE nets, and test their normality.

  • Normality of model ND energy distributions has been established in another notebook.
In [7]:
TSE_stats = {}
def get_TSEstats():
    for pid in PIDs:        
        fn = os.path.join("P_pts", pid + '_TSstats.out')
        df = pd.read_table(fn, sep=' ')[['ts_pid', 'scMJE']]
        TSE_stats[pid] = df
    return

TSE_mje_params = {}
def get_mje_params():
    for pid in PIDs:         
        mjes = TSE_stats[pid].scMJE.values
        TSE_mje_params[pid] = (mjes.mean(), mjes.std())
    return

# normaltest requires sample >= 20, using shapiro-wilk
def check_normality(alpha=0.005):    
    for pid in PIDs:
        mjes = TSE_stats[pid].scMJE.values
        sw = shapiro(mjes)
        print(pid, sw)
        if sw[1] < alpha:
            print('\t not normal')        
    return
In [8]:
# load TSE stats
get_TSEstats()
# Number of structures in TSEs
print(PIDs)
num_ts = [len(TSE_stats[pid]) for pid in PIDs]
print(num_ts)
['1imq', '1lmb4', '1bf4', '2ptl', '2ci2', '1aps', '1fmk', '1bk2', '1shf', '1ten']
[16, 127, 74, 126, 184, 29, 147, 31, 33, 90]
  • Unlike model ND energy Gaussians which are based on a sample of 100 observations each, TSE sample sizes vary from as small as 16 to a maximum of 184.
In [9]:
# get TSE mje params (mean, stdev)
get_mje_params() 

# shapiro-wilk, if p-value <= 0.01, reject null hypothesis(that sample is normally distributed)
check_normality(0.01)
1imq ShapiroResult(statistic=0.9378026723861694, pvalue=0.32296502590179443)
1lmb4 ShapiroResult(statistic=0.9273229837417603, pvalue=3.7553995753114577e-06)
	 not normal
1bf4 ShapiroResult(statistic=0.9608604907989502, pvalue=0.021861882880330086)
2ptl ShapiroResult(statistic=0.994161069393158, pvalue=0.8853545784950256)
2ci2 ShapiroResult(statistic=0.9955769181251526, pvalue=0.868311882019043)
1aps ShapiroResult(statistic=0.9715569019317627, pvalue=0.6026254296302795)
1fmk ShapiroResult(statistic=0.9916707277297974, pvalue=0.5438514947891235)
1bk2 ShapiroResult(statistic=0.9130065441131592, pvalue=0.015462859533727169)
1shf ShapiroResult(statistic=0.9699060916900635, pvalue=0.47760269045829773)
1ten ShapiroResult(statistic=0.9897717237472534, pvalue=0.7142994999885559)
  • The null hypothesis for the Shapiro-Wilk test cannot be rejected at $\alpha$=0.01, for all but '1lmb4'. The working hypothesis in this notebook is that both model and TSE ND energy distributions are normal.
In [10]:
colors = ['tab:blue', 'tab:orange', 'tab:green', 'tab:red', 'tab:purple', \
          'tab:brown', 'tab:pink', 'tab:gray', 'tab:olive', 'tab:cyan']

def plot_TSE(ax, pid):
    mjes = TSE_stats[pid].scMJE.values 
    ax.hist(mjes, density=True, alpha=0.5)
    mu, stdev = TSE_mje_params[pid]
    x_axis = np.linspace(mu - 3*stdev, mu + 3*stdev, 1000)
    ax.plot(x_axis, norm.pdf(x_axis, mu, stdev), colors[9], lw=2, label="TSE") 
    ax.set_title(pid + ' ' + str(num_ts[i]))
    ax.set_ylabel("Density"); ax.set_xlabel("ND energy (E)")

def plot_ND(ax, pid, q, nd_mje_params:dict , nd_type): 
    mu, stdev = nd_mje_params[pid, q]
    x_axis = np.linspace(mu - 3*stdev, mu + 3*stdev, 1000)    
    ax.plot(x_axis, norm.pdf(x_axis, mu, stdev), 
            colors[(int)(q*10-1)], lw=2, label= nd_type + ' ' + str(q))
In [11]:
fig, axes = plt.subplots(3, 4, figsize=(16, 12), constrained_layout=True)
plt.suptitle('Distribution of TSE E values.')
plt.delaxes(axes[2, 2])
plt.delaxes(axes[2, 3])

for i, pid in enumerate(PIDs):
    r = i // 4; c = i % 4; ax = axes[r,c]    
    plot_TSE(ax, pid)    
    
# plt.savefig("TSE_energydist.png")
In [12]:
# Plot a TSE ND energy histogram and Gaussian
pid = '1aps'
tse_Es = TSE_stats[pid].scMJE.values
tse_mu, tse_stdev = TSE_mje_params[pid]
plt.hist(tse_Es, density=True, alpha=0.5)
x_axis = np.linspace(tse_mu - 3*tse_stdev, tse_mu + 3*tse_stdev, 1000)
plt.plot(x_axis, norm.pdf(x_axis, tse_mu, tse_stdev), colors[9], lw=2, label='TSE')

# Plot 'c' ND energy Gaussians
for i, q in enumerate(nd.Qs[:-1]):
    mu, stdev = c_mje_params[pid, q]
    x_axis = np.linspace(mu - 3*stdev, mu + 3*stdev, 1000)
    plt.plot(x_axis, norm.pdf(x_axis, mu, stdev), colors[i], lw=2, label=str(q))

plt.title("Locating a TSE within the 'c' ND energy Gaussians: " + pid); 
plt.xlabel("ND energy (E)"); plt.ylabel('Density')
_= plt.legend(loc="upper left", bbox_to_anchor=(1.0,1.0))
In [13]:
# Plot a TSE ND energy histogram and Gaussian
pid = '1aps'
tse_Es = TSE_stats[pid].scMJE.values
tse_mu, tse_stdev = TSE_mje_params[pid]
plt.hist(tse_Es, density=True, alpha=0.5)
x_axis = np.linspace(tse_mu - 3*tse_stdev, tse_mu + 3*tse_stdev, 1000)
plt.plot(x_axis, norm.pdf(x_axis, tse_mu, tse_stdev), colors[9], lw=2, label='TSE')

# Plot 'a' ND energy Gaussians
for i, q in enumerate(nd.Qs[:-1]):
    mu, stdev = a_mje_params[pid, q]
    x_axis = np.linspace(mu - 3*stdev, mu + 3*stdev, 1000)
    plt.plot(x_axis, norm.pdf(x_axis, mu, stdev), colors[i], lw=2, label=str(q))

plt.title("Locating a TSE within the 'a' ND energy Gaussians: " + pid); 
plt.xlabel("ND energy (E)"); plt.ylabel('Density')
_= plt.legend(loc="upper left", bbox_to_anchor=(1.0,1.0))

Approach 1

  • Perform KS goodness-of-fit test between TSE $E$ (empirical) distribution, and model $E$ (cumulative) distribution for each $Q$.
  • Choose the $Q$ that produces the smallest test statistic (largest p-value, cannot reject null hypothesis).
In [14]:
pid = '1aps'
tse_Es = TSE_stats[pid].scMJE.values
for q in nd.Qs:
    print(q, kstest(tse_Es, 'norm', args=(a_mje_params[pid, q])))
0.1 KstestResult(statistic=0.4134300685536506, pvalue=5.3200199204297425e-05)
0.2 KstestResult(statistic=0.6852396340344085, pvalue=4.579262206400296e-14)
0.3 KstestResult(statistic=0.7320961308989387, pvalue=2.1033294900407774e-16)
0.4 KstestResult(statistic=0.4687469618210624, pvalue=2.303427950124474e-06)
0.5 KstestResult(statistic=0.469103837709136, pvalue=2.2537031303038013e-06)
0.6 KstestResult(statistic=0.9926084110515484, pvalue=3.1219533687675758e-62)
0.7 KstestResult(statistic=0.9999999999997276, pvalue=0.0)
0.8 KstestResult(statistic=1.0, pvalue=0.0)
0.9 KstestResult(statistic=1.0, pvalue=0.0)
1.0 KstestResult(statistic=1.0, pvalue=0.0)
  • For '1aps' TSE, the best-fit 'a' ND energy Gaussian according to the KS test is at $Q=0.1$.
In [15]:
def locate_TSE_E1(nd_mje_params: dict):
    n = len(PIDs)
    TSE_Qs = np.zeros(n); max_pvals = np.zeros(n)
    for i, pid in enumerate(PIDs):
        mjes = TSE_stats[pid].scMJE.values
        ks = [ kstest(mjes, 'norm', args=(nd_mje_params[pid, q])) for q in nd.Qs ]
        d, pval = zip(* ks)
        # print(d, np.min(d))
        # print(pval, np.max(pval))
        max_pvals[i] = np.max(pval)
        n = np.argmax(pval); assert np.argmin(d) == n, "argmin(d) is not equal argmax(pval)"
        TSE_Qs[i] = nd.Qs[n]
    return TSE_Qs, max_pvals
In [16]:
c_TSE_Qs_E1, c_pvals_E1 = locate_TSE_E1(c_mje_params.copy())
x_TSE_Qs_E1, x_pvals_E1 = locate_TSE_E1(x_mje_params.copy())
r_TSE_Qs_E1, r_pvals_E1 = locate_TSE_E1(r_mje_params.copy())
a_TSE_Qs_E1, a_pvals_E1 = locate_TSE_E1(a_mje_params.copy())
y_TSE_Qs_E1, y_pvals_E1 = locate_TSE_E1(y_mje_params.copy())
In [17]:
print(PIDs)
print('c', c_TSE_Qs_E1, np.mean(c_TSE_Qs_E1), np.std(c_TSE_Qs_E1)) 
print('x', x_TSE_Qs_E1, np.mean(x_TSE_Qs_E1), np.std(x_TSE_Qs_E1)) 
print('r', r_TSE_Qs_E1, np.mean(r_TSE_Qs_E1), np.std(r_TSE_Qs_E1))
print('a', a_TSE_Qs_E1, np.mean(a_TSE_Qs_E1), np.std(a_TSE_Qs_E1))
print('y', y_TSE_Qs_E1, np.mean(y_TSE_Qs_E1), np.std(y_TSE_Qs_E1))
['1imq', '1lmb4', '1bf4', '2ptl', '2ci2', '1aps', '1fmk', '1bk2', '1shf', '1ten']
c [0.3 0.3 0.7 0.2 0.2 0.1 0.5 0.4 0.6 0.5] 0.38 0.18330302779823363
x [0.3 0.4 0.7 0.3 0.3 0.1 0.5 0.4 0.6 0.5] 0.41 0.16401219466856723
r [0.4 0.2 0.6 0.4 0.2 0.4 0.4 0.4 0.5 0.4] 0.39 0.11357816691600546
a [0.3 0.3 0.7 0.3 0.2 0.1 0.5 0.4 0.6 0.5] 0.39 0.17578395831246946
y [0.4 0.3 0.4 0.4 0.1 0.4 0.3 0.3 0.4 0.4] 0.33999999999999997 0.09165151389911681
In [18]:
# Mean Absolute Difference: how close are the TSE Qs to ND peak Qs
print('c:', nd.mad(c_peakQs, c_TSE_Qs_E1))
print('x:', nd.mad(x_peakQs, x_TSE_Qs_E1))
print('r:', nd.mad(r_peakQs, r_TSE_Qs_E1))
print('a:', nd.mad(a_peakQs, a_TSE_Qs_E1))
print('y:', nd.mad(y_peakQs, y_TSE_Qs_E1))
c: 0.19999999999999998
x: 0.19999999999999998
r: 0.2
a: 0.19999999999999998
y: 0.21000000000000002
In [19]:
fig, axes = plt.subplots(4, 3, figsize=(12, 16), constrained_layout=True)
plt.suptitle('Best KS-fit ND energy Gaussian for TSE, Paci dataset.')
plt.delaxes(axes[3, 1])
plt.delaxes(axes[3, 2])

for i, pid in enumerate(PIDs):
    r = i//3; c = i%3; ax = axes[r, c]    
    plot_TSE(ax, pid)
    a_q = a_TSE_Qs_E1[i]
    plot_ND(ax, pid, a_q, a_mje_params.copy(), 'a')
    c_q = c_TSE_Qs_E1[i]
    plot_ND(ax, pid, c_q, c_mje_params.copy(), 'c')
    x_q = x_TSE_Qs_E1[i]
    plot_ND(ax, pid, x_q, x_mje_params.copy(), 'x')
    r_q = r_TSE_Qs_E1[i]
    plot_ND(ax, pid, r_q, r_mje_params.copy(), 'r') 
    y_q = y_TSE_Qs_E1[i]
    plot_ND(ax, pid, y_q, y_mje_params.copy(), 'y')     
    ax.legend()    

plt.savefig("TSE_energylocation_KS.png")

Approach 2

In [20]:
pid = '1aps'
mjes = TSE_stats[pid].scMJE.values

E_params = [ a_mje_params[pid, q] for q in nd.Qs ]
llhs = np.zeros(10) # log likelihoods, one for each ND energy Gaussian
for q in np.arange(10):
    m, s = E_params[q]
    dens = [ norm.pdf(x, m, s) for x in mjes ]
    llhs[q] = logsumexp(dens) 
    print(nd.Qs[q], np.round(llhs[q], 4))
    # can ignore priors: uniform, each Q has the same prob.        
    # denom = np.sum(llhs) # can ignore normalizing factor
    # post_probs = llhs/denom # np.sum(post_probs) approx. 1.0
    # posteriors[pid] = post_probs        

q = nd.Qs[np.argmax(llhs)] # MAP q
print(pid, "MAP Q:", q)
0.1 3.3743
0.2 3.371
0.3 3.3705
0.4 3.3732
0.5 3.3732
0.6 3.3673
0.7 3.3673
0.8 3.3673
0.9 3.3673
1.0 3.3673
1aps MAP Q: 0.1
In [21]:
# joint prob/density of the sample of i.i.d. observations
def locate_TSE_E2(nd_mje_params: dict):
    n = len(PIDs); TSE_Qs = np.zeros(n)
    for i, pid in enumerate(PIDs):    
        mjes = TSE_stats[pid].scMJE.values
        E_params = [ nd_mje_params[pid, q] for q in nd.Qs ]
        llhs = np.zeros(10) # log likelihoods, one for each ND energy Gaussian
        for q in np.arange(10):
            m, s = E_params[q]
            dens = [ norm.pdf(x, m, s) for x in mjes ]
            llhs[q] = logsumexp(dens) 
        q = nd.Qs[np.argmax(llhs)] # maximum likelihood
        TSE_Qs[i] = q
    return TSE_Qs
In [22]:
c_TSE_Qs_E2 = locate_TSE_E2(c_mje_params.copy())
x_TSE_Qs_E2 = locate_TSE_E2(x_mje_params.copy())
r_TSE_Qs_E2 = locate_TSE_E2(r_mje_params.copy())
a_TSE_Qs_E2 = locate_TSE_E2(a_mje_params.copy())
y_TSE_Qs_E2 = locate_TSE_E2(y_mje_params.copy())

print(PIDs)
print('c', c_TSE_Qs_E2, np.mean(c_TSE_Qs_E2), np.std(c_TSE_Qs_E2)) 
print('x', x_TSE_Qs_E2, np.mean(x_TSE_Qs_E2), np.std(x_TSE_Qs_E2))
print('r', r_TSE_Qs_E2, np.mean(r_TSE_Qs_E2), np.std(r_TSE_Qs_E2)) 
print('a', a_TSE_Qs_E2, np.mean(a_TSE_Qs_E2), np.std(a_TSE_Qs_E2)) 
print('y', y_TSE_Qs_E2, np.mean(y_TSE_Qs_E2), np.std(y_TSE_Qs_E2)) 
['1imq', '1lmb4', '1bf4', '2ptl', '2ci2', '1aps', '1fmk', '1bk2', '1shf', '1ten']
c [0.1 0.3 0.7 0.2 0.1 0.1 0.5 0.4 0.6 0.5] 0.35 0.21095023109728986
x [0.1 0.4 0.7 0.2 0.3 0.1 0.5 0.5 0.6 0.5] 0.39 0.19723082923316018
r [0.4 0.2 0.6 0.4 0.1 0.4 0.4 0.4 0.5 0.4] 0.38 0.132664991614216
a [0.1 0.3 0.7 0.1 0.1 0.1 0.5 0.4 0.6 0.5] 0.33999999999999997 0.21999999999999997
y [0.4 0.3 0.5 0.4 0.1 0.4 0.3 0.3 0.4 0.4] 0.35 0.10246950765959598
In [23]:
# Mean Absolute Difference: how close are the estimated Qs for PTS nets to ND peak Qs
print('c:', nd.mad(c_peakQs, c_TSE_Qs_E2))
print('x:', nd.mad(x_peakQs, x_TSE_Qs_E2))
print('r:', nd.mad(r_peakQs, r_TSE_Qs_E2))
print('a:', nd.mad(a_peakQs, a_TSE_Qs_E2))
print('y:', nd.mad(y_peakQs, y_TSE_Qs_E2))
c: 0.21000000000000002
x: 0.21999999999999997
r: 0.21000000000000002
a: 0.22999999999999998
y: 0.22000000000000003
In [24]:
fig, axes = plt.subplots(3, 4, figsize=(16, 12), constrained_layout=True)
plt.suptitle('Most likely ND energy Gaussian for TSE, Paci dataset.')
plt.delaxes(axes[2, 2])
plt.delaxes(axes[2, 3])

for i, pid in enumerate(PIDs):
    r = i//4; c = i%4; ax = axes[r, c]    
    plot_TSE(ax, pid)
    c_q = c_TSE_Qs_E2[i]
    plot_ND(ax, pid, c_q, c_mje_params.copy(), 'c')
    x_q = x_TSE_Qs_E2[i]
    plot_ND(ax, pid, x_q, x_mje_params.copy(), 'x')
    r_q = r_TSE_Qs_E2[i]
    plot_ND(ax, pid, r_q, r_mje_params.copy(), 'r') 
    a_q = a_TSE_Qs_E2[i]
    plot_ND(ax, pid, a_q, a_mje_params.copy(), 'a') 
    y_q = y_TSE_Qs_E2[i]
    plot_ND(ax, pid, y_q, y_mje_params.copy(), 'y')     
    ax.legend()    

plt.savefig("TSE_energylocation_MAP.png")

Compare TSE Qs selected by Approach 1 and 2.

In [25]:
# plot ND_peak_Qs, TSE_Qs_E1, TSE_Qs_E2
fig, axes = plt.subplots(2, 3, figsize=(12, 8), constrained_layout=True, sharey=True)
fig.suptitle("Location of TSE structures with ND energy.")
plt.delaxes(axes[1][2])

ax = axes[0][0]
ax.scatter(PIDs, a_peakQs, marker='^', s=80, color='tab:red', label='peak')
ax.scatter(PIDs, a_TSE_Qs_E1, marker='s', s=80, edgecolors='black', facecolors='none', 
           label='KS')
ax.scatter(PIDs, a_TSE_Qs_E2, marker='x', s=80, color='red', label='MAP')
ax.set_title("'a' ND variant")
ax.set_ylabel('Q', fontsize=14)

ax = axes[0][1]
ax.scatter(PIDs, c_peakQs, marker='^', s=80, color='tab:blue', label='peak')
ax.scatter(PIDs, c_TSE_Qs_E1, marker='s', s=80, edgecolors='black', facecolors='none', 
           label='KS')
ax.scatter(PIDs, c_TSE_Qs_E2, marker='x', s=80, color='red', label='MAP')
ax.set_title("'c' ND variant")

ax = axes[0][2]
ax.scatter(PIDs, x_peakQs, marker='^', s=80, color='tab:orange', label='peak')
ax.scatter(PIDs, x_TSE_Qs_E1, marker='s', s=80, edgecolors='black', facecolors='none', 
           label='KS')
ax.scatter(PIDs, x_TSE_Qs_E2, marker='x', s=80, color='red', label='MAP')
ax.set_title("'x' ND variant")

ax = axes[1][0]
ax.scatter(PIDs, r_peakQs, marker='^', s=80, color='tab:green', label='peak')
ax.scatter(PIDs, r_TSE_Qs_E1, marker='s', s=80, edgecolors='black', facecolors='none', 
           label='KS')
ax.scatter(PIDs, r_TSE_Qs_E2, marker='x', s=80, color='red', label='MAP')
ax.set_title("'r' ND variant")
ax.set_ylabel('Q', fontsize=14)

ax = axes[1][1]
ax.scatter(PIDs, y_peakQs, marker='^', s=80, color='tab:purple', label='peak')
ax.scatter(PIDs, y_TSE_Qs_E1, marker='s', s=80, edgecolors='black', facecolors='none', 
           label='KS')
ax.scatter(PIDs, y_TSE_Qs_E2, marker='x', s=80, color='red', label='MAP')
ax.set_title("'y' ND variant")

for ax in axes.flatten()[:-1]:
    ax.set_xticks(PIDs); ax.set_xticklabels(PIDs, rotation=90)
    ax.legend(loc="upper center"); ax.grid()
  • Approaches 1 and 2 pick the same $Q$ for most TSEs.
In [26]:
# boxplot ND_peak_Qs, TSE_Qs_E1, TSE_Qs_E2
fig, axes = plt.subplots(1, 3, figsize=(12, 4), constrained_layout=True)
fig.suptitle("Location of TSE structures with ND energy.")

ax = axes[0]
ax.boxplot([a_peakQs, c_peakQs, x_peakQs, r_peakQs, y_peakQs], showmeans=True)
ax.set_title("ND Peak")
ax.set_xlabel("ND variant"); ax.set_xticklabels(['a', 'c', 'x', 'r', 'y'], fontsize=14)
ax.set_ylabel('Q', fontsize=14); ax.set_yticks(nd.Qs); ax.grid()

ax = axes[1]
ax.boxplot([a_TSE_Qs_E1, c_TSE_Qs_E1, x_TSE_Qs_E1, r_TSE_Qs_E1, y_TSE_Qs_E1], showmeans=True)
ax.set_title("Approach 1: KS")

ax.set_xlabel("ND variant"); ax.set_xticklabels(['a', 'c', 'x', 'r', 'y'], fontsize=14)
ax.set_yticks(nd.Qs); ax.grid()

ax = axes[2]
ax.boxplot([a_TSE_Qs_E2, c_TSE_Qs_E2, x_TSE_Qs_E2, r_TSE_Qs_E2, y_TSE_Qs_E2], showmeans=True)
ax.set_title("Approach 2: MAP")

ax.set_xlabel("ND variant"); ax.set_xticklabels(['a', 'c', 'x', 'r', 'y'], fontsize=14)
ax.set_yticks(nd.Qs); ax.grid()

plt.savefig("TSE_energylocation.png")

End of notebook

SKLC September 2020