Calculated phi-values from ND PRNs.

  • ND PRNs generated with EDS variant = 'a'.
  • ND node statistics normalized to [0.0, 1.0] by dividing by the maximum over a sequence (see the sketch below).
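
A minimal sketch of this max-normalization, with hypothetical values (not from the dataset):

import numpy as np
stat = np.array([2.0, 5.0, 10.0])   # e.g. average node degrees along a sequence
stat_norm = stat / stat.max()       # -> array([0.2, 0.5, 1. ]); values lie in [0.0, 1.0]
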
In [1]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
from scipy.stats import spearmanr
from scipy.stats import ttest_rel # paired t-test on related samples
from scipy.stats import ttest_ind_from_stats as tts # independent two-sample t-test from summary stats (Welch's when equal_var=False)
# check with R, use CI
In [2]:
Qs =[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
dirname = "Paci_ND3"
PIDs = ["1imq", "1lmb4", "1bf4", "2ptl", "2ci2", "1aps", "1fmk", "1bk2", "1shf", "1ten"]
In [3]:
EDVs = ['a'] # EDVs = ['a', 'b', 's', 'l', 'v']
NDVs = ['c', 'x', 'r', 'a', 'y']
ndv_colors = ['tab:blue', 'tab:orange', 'tab:green', 'tab:red', 'tab:purple']
In [4]:
# computed previously
c_peak_Qs = np.array([0.2, 0.3, 0.3, 0.3, 0.3, 0.3, 0.2, 0.2, 0.2, 0.3])
x_peak_Qs = np.array([0.2, 0.3, 0.3, 0.3, 0.3, 0.3, 0.2, 0.2, 0.2, 0.2])
r_peak_Qs = np.array([0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.1, 0.2, 0.2])
a_peak_Qs = np.array([0.2, 0.3, 0.3, 0.3, 0.3, 0.3, 0.2, 0.2, 0.2, 0.2])
y_peak_Qs = np.array([0.2, 0.1, 0.2, 0.2, 0.2, 0.2, 0.1, 0.1, 0.1, 0.1])

# column order: c, x, r, a, y ("cxray")
peak_Qs = np.column_stack([c_peak_Qs, x_peak_Qs, r_peak_Qs, a_peak_Qs, y_peak_Qs])
peak_Qs
Out[4]:
array([[0.2, 0.2, 0.2, 0.2, 0.2],
       [0.3, 0.3, 0.2, 0.3, 0.1],
       [0.3, 0.3, 0.2, 0.3, 0.2],
       [0.3, 0.3, 0.2, 0.3, 0.2],
       [0.3, 0.3, 0.2, 0.3, 0.2],
       [0.3, 0.3, 0.2, 0.3, 0.2],
       [0.2, 0.2, 0.2, 0.2, 0.1],
       [0.2, 0.2, 0.1, 0.2, 0.1],
       [0.2, 0.2, 0.2, 0.2, 0.1],
       [0.3, 0.2, 0.2, 0.2, 0.1]])

Mean Absolute Difference/Error (MAD) between experimental $\phi$ values and calculated $\phi$ values.
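
Concretely, matching the computation in the next cell: for a protein at a given $Q$, with $n$ the number of residues that have an experimental $\phi$ value, $\mathrm{MAD} = \frac{1}{n} \sum_{i=1}^{n} \left| \hat{\phi}_i - \phi^{exp}_i \right|$, where $\hat{\phi}_i$ is the calculated $\phi$ value of residue $i$ (normalized node degree or normalized node centrality).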

In [5]:
def mad_from_phi(ndv, edv):        
    dirn = os.path.join(dirname, "edv_" + edv) # ND data for this EDS variant
    ofn = os.path.join(dirname, ndv + "_phi_mad.out")
    fout = open(ofn, "w") # 'w' truncates any existing file; subsequent writes go to the fresh file
    fout.write("pid Q mad_deg_avg mad_cent_avg\n")
    # fout.write("pid Q mad_deg_avg mad_deg_avg2 mad_cent_avg mad_cent_avg2\n")
    for pid in PIDs:
        fn0 = os.path.join(dirname, pid + "_nodestats.out") # NS
        df0 = pd.read_table(fn0, sep="\s+") # nid exp_phi deg0 cent0
        N = len(df0)
        fn1 = os.path.join(dirn, pid + "_" + ndv + "_nodestats.out") # ND
        df1 = pd.read_table(fn1, sep="\s+") # Q nid deg_avg cent_avg
        for q in Qs:
            dq = df1.query("Q == @q")                        
            m = dq.deg_avg.max() # normalize then select phi residues
            dq = dq.assign(deg_avg_norm = dq.deg_avg.values/m)
            # prop. of native degree
            # dq = dq.assign(deg_avg_norm2 = dq.deg_avg.values/df0.deg0.values) 
            m = dq.cent_avg.max()
            dq = dq.assign(cent_avg_norm = dq.cent_avg.values/m)
            # fraction of all paths
            # dq = dq.assign(cent_avg_norm2 = dq.cent_avg.values/(N*N-N))
            dq = dq.assign(exp_phi = df0.exp_phi.values)
            de = dq.query("exp_phi > -1") 
            a = np.round(np.mean(np.abs(de.deg_avg_norm.values - de.exp_phi.values)), 3)
            b = np.round(np.mean(np.abs(de.cent_avg_norm.values - de.exp_phi.values)), 3)
            fout.write("{0} {1} {2} {3}\n".format(pid, q, a, b))
            # a2 = np.round(np.mean(np.abs(de.deg_avg_norm2.values - de.exp_phi.values)), 3)
            # b2 = np.round(np.mean(np.abs(de.cent_avg_norm2.values - de.exp_phi.values)), 3)
            # fout.write("{0} {1} {2} {3} {4} {5}\n".format(pid, q, a, a2, b, b2))
    fout.close() 
    df = pd.read_table(ofn, sep=' ')
    # in general a2 > a
    # count = np.sum(df.mad_deg_avg.values > df.mad_deg_avg2.values) # 21, 22, 19, 21, 14 
    # in general b2 > b
    # count = np.sum(df.mad_cent_avg.values > df.mad_cent_avg2.values) # 16, 16, 56, 20, 64
    # in general a > b, desire to minimize mad, use b
    count = np.sum(df.mad_deg_avg.values > df.mad_cent_avg.values) # 90, 91, 73, 79, 62    
    # print(ndv, edv, count, len(df))
    return count
In [6]:
# this cell takes some time to run, only need to do this once to create ndv_phi_mad.out files
counts = [] # deg > cent mad across all Qs for each pid
for edv in EDVs:
    row = []
    for ndv in NDVs:
        count = mad_from_phi(ndv, edv)
        row.append(count)
    counts.append(row)

counts 
Out[6]:
[[90, 91, 73, 79, 62]]
In [7]:
# format counts as a table and add row/column sums
df = pd.DataFrame(counts).assign(EDV=EDVs).set_index("EDV")
df.columns = NDVs
s = df.sum(axis=0) # col sum
s.name = "sum"
df = df.append(s) # appends the column-sum row; note: DataFrame.append was removed in pandas 2.0 (use pd.concat([df, s.to_frame().T]) there)
df = df.assign(sum = df.sum(axis=1)) # row sum
df
Out[7]:
      c   x   r   a   y  sum
EDV
a    90  91  73  79  62  395
sum  90  91  73  79  62  395
  • The maximum per cell in the table is 100: 10 pids with 10 Qs each = 100 trials per (NDV, EDV) pair. Smaller MAD is better.
  • At least 60% of the time, calculating phi-values with (normalized) node degree produces a larger MAD than doing so with (normalized) node centrality (see the sketch after this list).
    • ND phi-values calculated with node centrality correspond better with experimental phi-values than ND phi-values calculated with node degree.
    • The counts suggest calculating ND phi-values from node centrality.
  • Another consideration besides MAD from exp. $\phi$: is the $Q$ where MAD is minimum inside the ND-TS region?
    • $\phi$ values are a property of TS structures.
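
A quick check of the 60% figure, as a sketch using the counts in Out[6] above (each count is out of 100 trials):

frac = np.array(counts[0]) / 100.0   # per NDV (c, x, r, a, y): fraction of trials where deg MAD > cent MAD
print(frac)                          # [0.9  0.91 0.73 0.79 0.62]; the smallest fraction is 0.62
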
In [8]:
# load the ndv_phi_mad.out files
edv = 'a' 
cxray_dfs = []
for ndv in NDVs:
    fn = os.path.join(dirname, ndv + "_phi_mad.out")
    df = pd.read_table(fn, sep=' ')
    cxray_dfs.append(df)
In [9]:
import random
random.seed()
# 95% CI of the mean via resampling (bootstrap)
def bootstrap(vals):
    means = np.zeros(10000)
    for i in np.arange(10000):
        resample = random.choices(vals, k=len(vals)) # resample with replacement, same size as the original sample
        means[i] = np.mean(resample)
    a, z = np.percentile(means, 2.5), np.percentile(means, 97.5) # 2.5 and 97.5 percentiles of the bootstrap means
    moe = (z-a)/2.0
    return moe, a, z

# calc the 95% CI of the mean from the sample stdev (normal approximation; strictly assumes n >= 30)
def mean_95CI(sample_mean, sample_stdev, sample_size):
    moe = (1.96 * sample_stdev)/np.sqrt(sample_size) # margin of error
    return moe, sample_mean - moe, sample_mean + moe

def summary_stats(vals):
    mu = np.mean(vals); snu = np.std(vals, ddof=1)
    gmoe, ga, gz = mean_95CI(mu, snu, len(vals)) # Gaussian estimate
    bmoe, ba, bz = bootstrap(vals) # bootstrap
    return np.round([mu, snu, np.ptp(vals), gmoe, ga, gz, bmoe, ba, bz], 4)
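
A hypothetical usage example of summary_stats on toy values (not part of the analysis), to show the order of the returned statistics:

demo_vals = np.array([0.30, 0.25, 0.35, 0.28, 0.32, 0.27, 0.31, 0.29, 0.33, 0.26])
# order: [mean, sample stdev, range, gMOE, gCI_left, gCI_right, bMOE, bCI_left, bCI_right]
print(summary_stats(demo_vals))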

Node centrality

In [10]:
# compare mads of ND variants
fig, axes = plt.subplots(4, 3, figsize=(12, 16), constrained_layout=True, sharey=True)
plt.suptitle("Mean absolute difference between phi-values from experiments and \n \
phi-values calculated with normalized node centrality from ND nets, EDS variant " + edv)
plt.delaxes(axes[3][1])
plt.delaxes(axes[3][2])

min_mads, min_Qs = [], []
for i, pid in enumerate(PIDs):
    r = i//3; c = i%3; ax = axes[r][c]    
    res = []
    for j, ndv in enumerate(NDVs):
        dp = cxray_dfs[j].query("pid == @pid")
        mad = dp.mad_cent_avg.values 
        # mad = dp.mad_cent_avg2.values 
        ax.plot(Qs, mad, 'o--', label=ndv)    
        res.append(mad)            
    ax.set_title(pid)
    ax.set_xlabel('Q'); ax.set_xticks(np.arange(0, 1.1, 0.1))
    ax.legend(); ax.grid()        
    min_mad = np.round(np.min(res, axis=1), 3) # row-wise per ND variant, over Qs for a pid
    min_q = (np.argmin(res, axis=1) + 1)/10 # convert the index of the minimum over Qs to its Q value (Qs = 0.1..1.0)
    min_mads.append(min_mad)
    min_Qs.append(min_q)
    print(pid, min_mad, min_q)

# ofn = os.path.join(dirname, "ND_centphi_" + edv + ".png")
# print(ofn)
# plt.savefig(ofn)
1imq [0.394 0.343 0.466 0.274 0.417] [0.1 0.1 0.1 0.1 0.1]
1lmb4 [0.286 0.237 0.296 0.372 0.306] [0.2 0.3 0.1 1.  1. ]
1bf4 [0.36  0.333 0.341 0.301 0.368] [0.3 0.1 0.1 0.1 0.1]
2ptl [0.327 0.329 0.321 0.287 0.313] [0.4 0.1 0.2 0.4 1. ]
2ci2 [0.26  0.314 0.296 0.251 0.291] [0.1 0.1 0.1 0.1 0.7]
1aps [0.26  0.312 0.346 0.242 0.332] [0.6 0.9 1.  0.5 0.1]
1fmk [0.304 0.313 0.416 0.302 0.41 ] [0.2 0.3 0.1 0.1 0.1]
1bk2 [0.23  0.258 0.339 0.204 0.289] [0.5 0.7 0.1 0.4 0.1]
1shf [0.391 0.375 0.457 0.399 0.45 ] [0.1 0.2 1.  0.1 1. ]
1ten [0.252 0.3   0.341 0.313 0.328] [0.1 0.1 0.1 0.1 0.1]
In [11]:
# can skip if file already created
cent_minmad_df = pd.DataFrame(min_mads)
cent_minmad_df.columns = NDVs
cent_minmad_df = cent_minmad_df.assign(pid=PIDs).set_index('pid')
ofn = os.path.join(dirname, "ND_centphi_minmad_" + edv + ".out")
cent_minmad_df.to_csv(ofn, sep=' ')

cent_MADstats_df = pd.DataFrame({ "MAD_stats": ["sample_avg", "sample_stdev", "range", \
                                  "gMOE", "95meanCI_gleft", "95meanCI_gright",
                                  "bMOE", "95meanCI_bleft", "95meanCI_bright",],
                "c": summary_stats(cent_minmad_df['c'].values), 
                "x": summary_stats(cent_minmad_df['x'].values),
                "r": summary_stats(cent_minmad_df['r'].values), 
                "a": summary_stats(cent_minmad_df['a'].values),
                "y": summary_stats(cent_minmad_df['y'].values) })
cent_MADstats_df = cent_MADstats_df.set_index("MAD_stats")

# ofn = os.path.join(dirname, "ND_cent_MADstats_" + edv + ".out")
# cent_MADstats_df.to_csv(ofn, sep=' ')
In [12]:
# can skip if file already created
cent_minQ_df = pd.DataFrame(min_Qs)
cent_minQ_df.columns = NDVs
cent_minQ_df = cent_minQ_df.assign(pid=PIDs).set_index('pid')
ofn = os.path.join(dirname, "ND_centphi_minQ_" + edv + ".out")
cent_minQ_df.to_csv(ofn, sep=' ')

cent_minQstats_df = pd.DataFrame({ "minQ_stats": ["sample_avg", "sample_stdev", "range", \
                                "gMOE", "95meanCI_gleft", "95meanCI_gright",
                                "bMOE", "95meanCI_bleft", "95meanCI_bright"],
                "c": summary_stats(cent_minQ_df['c'].values), 
                "x": summary_stats(cent_minQ_df['x'].values),
                "r": summary_stats(cent_minQ_df['r'].values), 
                "a": summary_stats(cent_minQ_df['a'].values),
                "y": summary_stats(cent_minQ_df['y'].values)})
cent_minQstats_df = cent_minQstats_df.set_index("minQ_stats")

# ofn = os.path.join(dirname, "ND_cent_minQstats_" + edv + ".out")
# cent_minQstats_df.to_csv(ofn, sep=' ')

Which ND variant produces the best calculated $\phi$ values with node centrality?

  • Two considerations:
    • minimum MAD from exp_phi; and
    • whether the $Q$ where MAD from exp_phi is minimum lies within the ND-TS region (see the sketch after this list).
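
As a quick check of the second consideration (a sketch, taking the ND-TS region to be $Q < 0.5$ as used later in this notebook):

# per ND variant: fraction of proteins whose min-MAD Q lies inside the ND-TS region (Q < 0.5)
print((cent_minQ_df < 0.5).mean())
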
In [13]:
# edv = 'a'
# fn = os.path.join(dirname, "ND_centphi_minmad_" + edv + ".out")
# cent_minmad_df = pd.read_table(fn, sep=' ').set_index('pid')
print(cent_minmad_df.describe()) # sample std
cent_minmad_df
               c          x          r          a          y
count  10.000000  10.000000  10.000000  10.000000  10.000000
mean    0.306400   0.311400   0.361900   0.294500   0.350400
std     0.059304   0.039925   0.062198   0.058388   0.057471
min     0.230000   0.237000   0.296000   0.204000   0.289000
25%     0.260000   0.303000   0.325500   0.256750   0.307750
50%     0.295000   0.313500   0.341000   0.294000   0.330000
75%     0.351750   0.332000   0.398500   0.310250   0.399500
max     0.394000   0.375000   0.466000   0.399000   0.450000
Out[13]:
           c      x      r      a      y
pid
1imq   0.394  0.343  0.466  0.274  0.417
1lmb4  0.286  0.237  0.296  0.372  0.306
1bf4   0.360  0.333  0.341  0.301  0.368
2ptl   0.327  0.329  0.321  0.287  0.313
2ci2   0.260  0.314  0.296  0.251  0.291
1aps   0.260  0.312  0.346  0.242  0.332
1fmk   0.304  0.313  0.416  0.302  0.410
1bk2   0.230  0.258  0.339  0.204  0.289
1shf   0.391  0.375  0.457  0.399  0.450
1ten   0.252  0.300  0.341  0.313  0.328
In [14]:
print(np.mean(cent_minmad_df['x'].values))
print(np.std(cent_minmad_df['x'].values))
print(np.std(cent_minmad_df['x'].values, ddof=1))
0.3114
0.03787664187860376
0.03992548615163558
In [15]:
# fn = os.path.join(dirname, "ND_centphi_minQ_" + edv + ".out")
# cent_minQ_df = pd.read_table(fn, sep=' ').set_index('pid')
print(cent_minQ_df.describe())
cent_minQ_df
               c          x        r          a          y
count  10.000000  10.000000  10.0000  10.000000  10.000000
mean    0.260000   0.290000   0.2900   0.290000   0.430000
std     0.183787   0.284605   0.3755   0.296086   0.434741
min     0.100000   0.100000   0.1000   0.100000   0.100000
25%     0.100000   0.100000   0.1000   0.100000   0.100000
50%     0.200000   0.150000   0.1000   0.100000   0.100000
75%     0.375000   0.300000   0.1750   0.400000   0.925000
max     0.600000   0.900000   1.0000   1.000000   1.000000
Out[15]:
           c      x      r      a      y
pid
1imq     0.1    0.1    0.1    0.1    0.1
1lmb4    0.2    0.3    0.1    1.0    1.0
1bf4     0.3    0.1    0.1    0.1    0.1
2ptl     0.4    0.1    0.2    0.4    1.0
2ci2     0.1    0.1    0.1    0.1    0.7
1aps     0.6    0.9    1.0    0.5    0.1
1fmk     0.2    0.3    0.1    0.1    0.1
1bk2     0.5    0.7    0.1    0.4    0.1
1shf     0.1    0.2    1.0    0.1    1.0
1ten     0.1    0.1    0.1    0.1    0.1
In [16]:
fig, axes = plt.subplots(1, 2, figsize=(8, 4), constrained_layout=True)
plt.suptitle("Calculating $\phi$ values with normalized node centrality from ND nets.", 
             fontsize=14)

ax0 = axes[0]
cent_minmad_df.boxplot(ax = ax0, showmeans=True)
ax0.set_title('Minimum MAD from exp_phi')
ax0.set_xticklabels(NDVs, fontsize=14); ax0.set_xlabel('ND variant')
# ax0.set_yticks(np.arange(0.2, 0.55, 0.05))

ax1 = axes[1]
cent_minQ_df.boxplot(ax = ax1, showmeans=True)
ax1.set_title('$Q$ with min MAD');
ax1.set_xticklabels(NDVs, fontsize=14); ax1.set_xlabel('ND variant')
_ = ax1.set_yticks(np.arange(0, 1.1, 0.1))
In [17]:
# MAD between min_Qs (Q where MAD from exp_phi is minimum) and peak Qs for cxray NDVs
np.mean(np.abs(cent_minQ_df.values - peak_Qs), axis=0)
Out[17]:
array([0.14, 0.2 , 0.22, 0.2 , 0.34])
In [18]:
print("MAD between experimental and calculated phi-values")
cent_MADstats_df
MAD between experimental and calculated phi-values
Out[18]:
                      c       x       r       a       y
MAD_stats
sample_avg       0.3064  0.3114  0.3619  0.2945  0.3504
sample_stdev     0.0593  0.0399  0.0622  0.0584  0.0575
range            0.1640  0.1380  0.1700  0.1950  0.1610
gMOE             0.0368  0.0247  0.0386  0.0362  0.0356
95meanCI_gleft   0.2696  0.2867  0.3233  0.2583  0.3148
95meanCI_gright  0.3432  0.3361  0.4005  0.3307  0.3860
bMOE             0.0345  0.0234  0.0363  0.0342  0.0339
95meanCI_bleft   0.2726  0.2871  0.3282  0.2612  0.3184
95meanCI_bright  0.3416  0.3339  0.4007  0.3295  0.3862
In [19]:
print("Q with minimum MAD between experimental and calculated phi-values")
cent_minQstats_df
Q with minimum MAD between experimental and calculated phi-values
Out[19]:
                      c       x       r       a       y
minQ_stats
sample_avg       0.2600  0.2900  0.2900  0.2900  0.4300
sample_stdev     0.1838  0.2846  0.3755  0.2961  0.4347
range            0.5000  0.8000  0.9000  0.9000  0.9000
gMOE             0.1139  0.1764  0.2327  0.1835  0.2695
95meanCI_gleft   0.1461  0.1136  0.0573  0.1065  0.1605
95meanCI_gright  0.3739  0.4664  0.5227  0.4735  0.6995
bMOE             0.1050  0.1700  0.2250  0.1750  0.2550
95meanCI_bleft   0.1600  0.1400  0.1000  0.1300  0.1900
95meanCI_bright  0.3700  0.4800  0.5500  0.4800  0.7000
In [20]:
methods = cent_MADstats_df.columns.values; n = len(methods)
plt_colors = plt.get_cmap("tab20")
gcolors = plt_colors(list(range(1, n*2, 2))) # odd
bcolors = plt_colors(list(range(0, n*2, 2))) # even

fig, axes = plt.subplots(2, 1, figsize=(6, 8), constrained_layout=True)
plt.suptitle("Calculating $\phi$ values with normalized ND node centrality.", fontsize=14)

ax = axes[0]
mu = cent_MADstats_df.loc['sample_avg'].values
# both ways produce similar CIs
ga = cent_MADstats_df.loc['95meanCI_gleft'].values
gz = cent_MADstats_df.loc['95meanCI_gright'].values
# gmoe = cent_MADstats_df.loc['gMOE'].values
ba = cent_MADstats_df.loc['95meanCI_bleft'].values
bz = cent_MADstats_df.loc['95meanCI_bright'].values
# bmoe = cent_MADstats_df.loc['bMOE'].values
for i, p in enumerate(zip(ga, gz, ba, bz)):
    ax.plot([i, i], [p[0], p[1]], marker='o', color=gcolors[i])
    ax.plot([i+0.2, i+0.2], [p[2], p[3]], marker='o', color=bcolors[i], label=str(p[3])) 
    ax.scatter(i+0.2, mu[i], marker='^')
ax.legend(title="97.5 percentile", loc="upper left", bbox_to_anchor=(1.0, 1.0))
ax.set_title("average MAD from exp_phi", fontsize=14)
ax.set_ylabel("95% confidence interval", fontsize=14)
ax.set_xticks(list(range(0, n, 1))); ax.set_xticklabels(methods, fontsize=14)
# ax.set_xlabel("Method to calculate phi-values", fontsize=12)
ax.grid()

ax = axes[1] # methods = cent_minQstats_df.columns.values
mu = cent_minQstats_df.loc['sample_avg'].values
ga = cent_minQstats_df.loc['95meanCI_gleft'].values
gz = cent_minQstats_df.loc['95meanCI_gright'].values
# gmoe = cent_minQstats_df.loc['gMOE'].values
ba = cent_minQstats_df.loc['95meanCI_bleft'].values
bz = cent_minQstats_df.loc['95meanCI_bright'].values
# bmoe = cent_minQstats_df.loc['bMOE'].values
for i, p in enumerate(zip(ga, gz, ba, bz)):
    ax.plot([i, i], [p[0], p[1]], marker='o', color=gcolors[i])
    ax.plot([i+0.2, i+0.2], [p[2], p[3]], marker='o', color=bcolors[i], label=str(p[3])) 
    ax.scatter(i+0.2, mu[i], marker='^')
ax.legend(title="97.5 percentile", loc="upper left", bbox_to_anchor=(1.0, 1.0))
ax.set_title("average minQ", fontsize=14)
ax.set_ylabel("95% confidence interval", fontsize=14)
ax.set_xticks(list(range(0, n, 1))); ax.set_xticklabels(methods, fontsize=14)
# ax.set_xlabel("ND node centrality", fontsize=14)
ax.grid()
  • A Confidence Interval (CI) as above gives a range estimate for a method's average MAD (min$Q$): intervals constructed this way contain the true average MAD (min$Q$) about 95% of the time.
  • Prefer a method whose worst-case average MAD (the upper 97.5 percentile CI bound) is smallest, and whose worst-case average min$Q$ stays within the ND-TS region ($Q < 0.5$). These best-worst-case criteria are judged with the 97.5 percentile estimates of the sample statistics (see the sketch after this list).
  • There is no significant difference between the average MADs of the biased ND variants ('a', 'c', 'x'), but their average MADs are significantly smaller than that of 'r'.
  • ND variant 'a' produces the best calculated $\phi$ values with node centrality for this dataset:

    • It has the smallest worst-case average MAD from exp_phi.
    • Its worst-case average min$Q$ is within the ND-TS region.
    • Its MAD between min$Q$s and peak $Q$s is the second smallest (0.2).
  • ND variant 'c' yields the smallest MAD between min$Q$s and peak $Q$s (0.14).

    • This is meaningful since the ND energies at the peak $Q$ are used to estimate folding rates; the smaller this MAD, the better these microscopic (residue-level) results agree with the macroscopic-level results.
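
A minimal sketch of this best-worst-case comparison, using the upper (97.5 percentile) CI bounds computed above:

upper_mad = cent_MADstats_df.loc['95meanCI_gright']    # Gaussian upper CI bound of the mean MAD, per ND variant
print(upper_mad.idxmin(), upper_mad.min())             # 'a' has the smallest worst-case average MAD
upper_minQ = cent_minQstats_df.loc['95meanCI_gright']
print(upper_minQ < 0.5)                                # True where the worst-case mean minQ stays inside the ND-TS region
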
In [21]:
ndc_cent = cent_minmad_df['c'].values
ndx_cent = cent_minmad_df['x'].values
nda_cent = cent_minmad_df['a'].values
ndr_cent = cent_minmad_df['r'].values
In [22]:
print("c a")
print(ttest_rel(ndc_cent, nda_cent))
print(tts(ndc_cent.mean(), ndc_cent.std(ddof=1), 10, 
          nda_cent.mean(), nda_cent.std(ddof=1), 10, equal_var=False))
print("\nx a")
print(ttest_rel(ndx_cent, nda_cent))
print(tts(ndx_cent.mean(), ndx_cent.std(ddof=1), 10, 
          nda_cent.mean(), nda_cent.std(ddof=1), 10, equal_var=False))
print("\nx c")
print(ttest_rel(ndx_cent, ndc_cent))
print(tts(ndx_cent.mean(), ndx_cent.std(ddof=1), 10, 
          ndc_cent.mean(), ndc_cent.std(ddof=1), 10, equal_var=False))
print("\nr c")
print(ttest_rel(ndr_cent, ndc_cent))
print(tts(ndr_cent.mean(), ndr_cent.std(ddof=1), 10, 
          ndc_cent.mean(), ndc_cent.std(ddof=1), 10, equal_var=False))
print("\nr x")
print(ttest_rel(ndr_cent, ndx_cent))
print(tts(ndr_cent.mean(), ndr_cent.std(ddof=1), 10, 
          ndx_cent.mean(), ndx_cent.std(ddof=1), 10, equal_var=False))
print("\nr a")
print(ttest_rel(ndr_cent, nda_cent))
print(tts(ndr_cent.mean(), ndr_cent.std(ddof=1), 10, 
          nda_cent.mean(), nda_cent.std(ddof=1), 10, equal_var=False))
c a
Ttest_relResult(statistic=0.648224688230753, pvalue=0.533026887103344)
Ttest_indResult(statistic=0.45217087209059836, pvalue=0.6565489395088739)

x a
Ttest_relResult(statistic=0.8501053358302337, pvalue=0.4173074186919915)
Ttest_indResult(statistic=0.7555484003492737, pvalue=0.4609626942864189)

x c
Ttest_relResult(statistic=0.3935124271000536, pvalue=0.7031007776326037)
Ttest_indResult(statistic=0.22116581902271235, pvalue=0.8278018714046645)

r c
Ttest_relResult(statistic=3.697946155822872, pvalue=0.0049362382105273125)
Ttest_indResult(statistic=2.0422254558170283, pvalue=0.05609447686866125)

r x
Ttest_relResult(statistic=3.363305041598036, pvalue=0.008345232358759343)
Ttest_indResult(statistic=2.1606882967954486, pvalue=0.046923631978165785)

r a
Ttest_relResult(statistic=2.9130971259841085, pvalue=0.017224208877324253)
Ttest_indResult(statistic=2.49840298673565, pvalue=0.022425410752482097)

Plot the best results for node centrality with ND PRNs.

In [23]:
minMad_c = cent_minmad_df['c'].values
minMad_a = cent_minmad_df['a'].values
print('minMad c', minMad_c)
print('minMad a', minMad_a)
minQ_c = cent_minQ_df['c'].values
minQ_a = cent_minQ_df['a'].values
print('minQ c', minQ_c)
print('minQ a', minQ_a)
minMad c [0.394 0.286 0.36  0.327 0.26  0.26  0.304 0.23  0.391 0.252]
minMad a [0.274 0.372 0.301 0.287 0.251 0.242 0.302 0.204 0.399 0.313]
minQ c [0.1 0.2 0.3 0.4 0.1 0.6 0.2 0.5 0.1 0.1]
minQ a [0.1 1.  0.1 0.4 0.1 0.5 0.1 0.4 0.1 0.1]
In [24]:
fig, axes = plt.subplots(4, 3, figsize=(12, 18), constrained_layout=True)
plt.suptitle("Calculated $\phi$ values from normalized node centrality, \
ND variants 'c' and 'a' with EDS variant 'a'.", fontsize=14)
plt.delaxes(axes[3, 1])
plt.delaxes(axes[3, 2])
dirn = os.path.join(dirname, "edv_a") # ND data

for i, pid in enumerate(PIDs):    
    r = i//3; c = i%3; ax = axes[r][c]
    fn0 = os.path.join(dirname, pid + "_nodestats.out")
    df0 = pd.read_table(fn0, sep="\s+") # nid exp_phi deg0 cent0        
    fn1 = os.path.join(dirn, pid + "_c_nodestats.out")
    df1 = pd.read_table(fn1, sep="\s+")    
    q_c = minQ_c[i]
    dq = df1.query("Q == @q_c")
    dq = dq.assign(cent_avg_norm = dq.cent_avg.values/dq.cent_avg.max())    
    dq = dq.assign(exp_phi = df0.exp_phi.values)
    de = dq.query("exp_phi > -1")    
    mad_c = np.round(np.mean(np.abs(de.cent_avg_norm.values - de.exp_phi.values)), 2)
    # print(pid, q, mad_c, spearmanr(de.cent_avg_norm.values, de.exp_phi.values))
    ax.plot(dq.nid.values+1, dq.cent_avg_norm.values, color="tab:blue", alpha=0.5, 
            label="'c' cent " + str(mad_c))
        
    fn1 = os.path.join(dirn, pid + "_a_nodestats.out")
    df1 = pd.read_table(fn1, sep="\s+")
    q_a = minQ_a[i]
    dq = df1.query("Q == @q_a")
    dq = dq.assign(cent_avg_norm = dq.cent_avg.values/dq.cent_avg.max())    
    dq = dq.assign(exp_phi = df0.exp_phi.values)
    de = dq.query("exp_phi > -1")    
    mad_a = np.round(np.mean(np.abs(de.cent_avg_norm.values - de.exp_phi.values)), 2)
    # print(pid, q, mad_a, spearmanr(de.cent_avg_norm.values, de.exp_phi.values))    
    ax.plot(dq.nid.values+1, dq.cent_avg_norm.values, color="tab:red", alpha=0.5, 
            label="'a' cent " + str(mad_a))
    
    ax.plot(de.nid.values+1, de.exp_phi.values, "o--g", alpha=0.5, label='exp_phi')
    ax.set_title(pid + " " + str(q_c) + " " + str(q_a), loc='left')
    ax.set_xlabel('Residue'); ax.set_yticks(np.arange(0, 1.1, 0.1))
    ax.legend(loc="lower left", bbox_to_anchor=(0.5, 1.0)) 

# ofn = os.path.join(dirname, "ND_centphi_ca_" + edv + ".png")
# plt.savefig(ofn)
  • Figure above: Calculated phi-values as the normalized node centrality (cent) of 'c' and 'a' ND-generated PRNs at the $Q$ where the mean absolute difference (MAD) between experimental phi-values and cent is minimum. The numbers in the title of each subplot are this $Q$ for 'c' and 'a' respectively. The numbers in the legend are the MADs associated with these $Q$ values. ND node centrality at a $Q$ is the average over 100 independent ND-generated PRNs at that $Q$.

Node degree

In [25]:
# compare mads of ND variants
fig, axes = plt.subplots(4, 3, figsize=(12, 16), constrained_layout=True, sharey=True)
plt.suptitle("Mean absolute difference between phi-values from experiments and \n \
phi-values calculated with normalized node degree from ND nets, EDS variant " + edv)
plt.delaxes(axes[3][1])
plt.delaxes(axes[3][2])

min_mads, min_Qs = [], []
for i, pid in enumerate(PIDs):
    r = i//3; c = i%3; ax = axes[r][c]
    res = []
    for j, ndv in enumerate(NDVs):
        dp = cxray_dfs[j].query("pid == @pid")
        mad = dp.mad_deg_avg.values        
        ax.plot(Qs, mad, 'o--', label=ndv)    
        res.append(mad)            
    ax.set_title(pid)
    ax.set_xlabel('Q'); ax.set_xticks(np.arange(0, 1.1, 0.1))
    ax.legend(); ax.grid()        
    min_mad = np.round(np.min(res, axis=1), 3) # row-wise per ND variant, over Qs for a pid
    min_q = (np.argmin(res, axis=1) + 1)/10 # convert the index of the minimum over Qs to its Q value (Qs = 0.1..1.0)
    min_mads.append(min_mad)
    min_Qs.append(min_q)
    print(pid, min_mad, min_q)

# ofn = os.path.join(dirname, "ND_degphi_" + edv + ".png")
# plt.savefig(ofn)
1imq [0.449 0.472 0.48  0.33  0.467] [0.8 0.9 0.6 0.1 0.1]
1lmb4 [0.295 0.256 0.31  0.315 0.355] [0.8 0.1 0.1 0.8 1. ]
1bf4 [0.375 0.467 0.397 0.402 0.37 ] [0.1 0.1 0.1 0.1 0.1]
2ptl [0.37  0.371 0.328 0.318 0.322] [0.8 0.1 0.3 0.1 0.1]
2ci2 [0.32  0.378 0.341 0.271 0.32 ] [0.1 0.9 0.4 0.1 0.5]
1aps [0.336 0.388 0.366 0.295 0.333] [0.1 0.8 0.1 0.1 0.1]
1fmk [0.377 0.404 0.457 0.355 0.432] [0.2 0.6 1.  0.1 0.1]
1bk2 [0.305 0.313 0.296 0.268 0.28 ] [0.1 0.7 0.1 0.1 0.1]
1shf [0.442 0.381 0.483 0.444 0.517] [0.6 0.1 0.1 0.6 0.5]
1ten [0.332 0.334 0.41  0.299 0.286] [0.6 0.8 0.9 0.1 0.1]
In [26]:
# can skip if file already exists
deg_minmad_df = pd.DataFrame(min_mads)
deg_minmad_df.columns = NDVs
deg_minmad_df = deg_minmad_df.assign(pid=PIDs).set_index('pid')
ofn = os.path.join(dirname, "ND_degphi_minmad_" + edv + ".out")
deg_minmad_df.to_csv(ofn, sep=' ')

deg_MADstats_df = pd.DataFrame({ "MAD_stats": ["sample_avg", "sample_stdev", "range", \
                                  "gMOE", "95meanCI_gleft", "95meanCI_gright",
                                  "bMOE", "95meanCI_bleft", "95meanCI_bright",],
                "c": summary_stats(deg_minmad_df['c'].values), 
                "x": summary_stats(deg_minmad_df['x'].values),
                "r": summary_stats(deg_minmad_df['r'].values), 
                "a": summary_stats(deg_minmad_df['a'].values),
                "y": summary_stats(deg_minmad_df['y'].values) })
deg_MADstats_df = deg_MADstats_df.set_index("MAD_stats")

# ofn = os.path.join(dirname, "ND_deg_MADstats_" + edv + ".out")
# deg_MADstats_df.to_csv(ofn, sep=' ')
In [27]:
# can skip if file already exists
deg_minQ_df = pd.DataFrame(min_Qs)
deg_minQ_df.columns = NDVs
deg_minQ_df = deg_minQ_df.assign(pid=PIDs).set_index('pid')
ofn = os.path.join(dirname, "ND_degphi_minQ_" + edv + ".out")
deg_minQ_df.to_csv(ofn, sep=' ')

deg_minQstats_df = pd.DataFrame({ "minQ_stats": ["sample_avg", "sample_stdev", "range", \
                                "gMOE", "95meanCI_gleft", "95meanCI_gright",
                                "bMOE", "95meanCI_bleft", "95meanCI_bright"],
                "c": summary_stats(deg_minQ_df['c'].values), 
                "x": summary_stats(deg_minQ_df['x'].values),
                "r": summary_stats(deg_minQ_df['r'].values), 
                "a": summary_stats(deg_minQ_df['a'].values),
                "y": summary_stats(deg_minQ_df['y'].values)})
deg_minQstats_df = deg_minQstats_df.set_index("minQ_stats")

# ofn = os.path.join(dirname, "ND_deg_minQstats_" + edv + ".out")
# deg_minQstats_df.to_csv(ofn, sep=' ')
In [28]:
# fn = os.path.join(dirname, "ND_degphi_minmad_" + edv + ".out")
# deg_minmad_df = pd.read_table(fn, sep=' ').set_index('pid')
print(deg_minmad_df.describe())
deg_minmad_df
               c          x          r          a          y
count  10.000000  10.000000  10.000000  10.000000  10.000000
mean    0.360100   0.376400   0.386800   0.329700   0.368200
std     0.053197   0.065498   0.069635   0.056455   0.079151
min     0.295000   0.256000   0.296000   0.268000   0.280000
25%     0.323000   0.343250   0.331250   0.296000   0.320500
50%     0.353000   0.379500   0.381500   0.316500   0.344000
75%     0.376500   0.400000   0.445250   0.348750   0.416500
max     0.449000   0.472000   0.483000   0.444000   0.517000
Out[28]:
           c      x      r      a      y
pid
1imq   0.449  0.472  0.480  0.330  0.467
1lmb4  0.295  0.256  0.310  0.315  0.355
1bf4   0.375  0.467  0.397  0.402  0.370
2ptl   0.370  0.371  0.328  0.318  0.322
2ci2   0.320  0.378  0.341  0.271  0.320
1aps   0.336  0.388  0.366  0.295  0.333
1fmk   0.377  0.404  0.457  0.355  0.432
1bk2   0.305  0.313  0.296  0.268  0.280
1shf   0.442  0.381  0.483  0.444  0.517
1ten   0.332  0.334  0.410  0.299  0.286
In [29]:
# fn = os.path.join(dirname, "ND_degphi_minQ_" + edv + ".out")
# deg_minQ_df = pd.read_table(fn, sep=' ').set_index('pid')
print(deg_minQ_df.describe())
deg_minQ_df
               c          x          r          a          y
count  10.000000  10.000000  10.000000  10.000000  10.000000
mean    0.420000   0.510000   0.370000   0.220000   0.270000
std     0.325918   0.363471   0.349762   0.257337   0.305687
min     0.100000   0.100000   0.100000   0.100000   0.100000
25%     0.100000   0.100000   0.100000   0.100000   0.100000
50%     0.400000   0.650000   0.200000   0.100000   0.100000
75%     0.750000   0.800000   0.550000   0.100000   0.400000
max     0.800000   0.900000   1.000000   0.800000   1.000000
Out[29]:
           c      x      r      a      y
pid
1imq     0.8    0.9    0.6    0.1    0.1
1lmb4    0.8    0.1    0.1    0.8    1.0
1bf4     0.1    0.1    0.1    0.1    0.1
2ptl     0.8    0.1    0.3    0.1    0.1
2ci2     0.1    0.9    0.4    0.1    0.5
1aps     0.1    0.8    0.1    0.1    0.1
1fmk     0.2    0.6    1.0    0.1    0.1
1bk2     0.1    0.7    0.1    0.1    0.1
1shf     0.6    0.1    0.1    0.6    0.5
1ten     0.6    0.8    0.9    0.1    0.1
In [30]:
fig, axes = plt.subplots(1, 2, figsize=(8, 4), constrained_layout=True)
plt.suptitle("Calculating $\phi$ values with normalized node degree from ND nets.", 
             fontsize=14)

ax0 = axes[0]
deg_minmad_df.boxplot(ax = ax0, showmeans=True)
ax0.set_title('Minimum MAD from exp_phi')
ax0.set_xticklabels(NDVs, fontsize=14); ax0.set_xlabel('ND variant')
# ax0.set_yticks(np.arange(0.2, 0.55, 0.05))

ax1 = axes[1]
deg_minQ_df.boxplot(ax = ax1, showmeans=True)
ax1.set_title('$Q$ with min MAD');
ax1.set_xticklabels(NDVs, fontsize=14); ax1.set_xlabel('ND variant')
_ = ax1.set_yticks(np.arange(0, 1.1, 0.1))
In [31]:
# MAD between min_Qs (Q where MAD from exp_phi is minimum) and peak Qs for cxray NDVs
np.mean(np.abs(deg_minQ_df.values - peak_Qs), axis=0)
Out[31]:
array([0.3 , 0.4 , 0.26, 0.21, 0.2 ])
In [32]:
print("MAD between experimental and calculated phi-values")
deg_MADstats_df
MAD between experimental and calculated phi-values
Out[32]:
                      c       x       r       a       y
MAD_stats
sample_avg       0.3601  0.3764  0.3868  0.3297  0.3682
sample_stdev     0.0532  0.0655  0.0696  0.0565  0.0792
range            0.1540  0.2160  0.1870  0.1760  0.2370
gMOE             0.0330  0.0406  0.0432  0.0350  0.0491
95meanCI_gleft   0.3271  0.3358  0.3436  0.2947  0.3191
95meanCI_gright  0.3931  0.4170  0.4300  0.3647  0.4173
bMOE             0.0307  0.0387  0.0402  0.0328  0.0469
95meanCI_bleft   0.3305  0.3370  0.3470  0.2990  0.3246
95meanCI_bright  0.3918  0.4143  0.4273  0.3647  0.4183
In [33]:
print("Q with minimum MAD between experimental and calculated phi-values")
deg_minQstats_df
Q with minimum MAD between experimental and calculated phi-values
Out[33]:
                      c       x       r       a       y
minQ_stats
sample_avg       0.4200  0.5100  0.3700  0.2200  0.2700
sample_stdev     0.3259  0.3635  0.3498  0.2573  0.3057
range            0.7000  0.8000  0.9000  0.7000  0.9000
gMOE             0.2020  0.2253  0.2168  0.1595  0.1895
95meanCI_gleft   0.2180  0.2847  0.1532  0.0605  0.0805
95meanCI_gright  0.6220  0.7353  0.5868  0.3795  0.4595
bMOE             0.1900  0.2150  0.2050  0.1450  0.1900
95meanCI_bleft   0.2300  0.2900  0.1800  0.1000  0.1000
95meanCI_bright  0.6100  0.7200  0.5900  0.3900  0.4800
In [34]:
methods = deg_MADstats_df.columns.values; n = len(methods)
plt_colors = plt.get_cmap("tab20")
gcolors = plt_colors(list(range(1, n*2, 2))) # odd
bcolors = plt_colors(list(range(0, n*2, 2))) # even

fig, axes = plt.subplots(2, 1, figsize=(6, 8), constrained_layout=True)
plt.suptitle("Calculating $\phi$ values with normalized ND node degree.", fontsize=14)

ax = axes[0]
mu = deg_MADstats_df.loc['sample_avg'].values
# both ways produce similar CIs
ga = deg_MADstats_df.loc['95meanCI_gleft'].values
gz = deg_MADstats_df.loc['95meanCI_gright'].values
# gmoe = deg_MADstats_df.loc['gMOE'].values
ba = deg_MADstats_df.loc['95meanCI_bleft'].values
bz = deg_MADstats_df.loc['95meanCI_bright'].values
# bmoe = deg_MADstats_df.loc['bMOE'].values
for i, p in enumerate(zip(ga, gz, ba, bz)):
    ax.plot([i, i], [p[0], p[1]], marker='o', color=gcolors[i])
    ax.plot([i+0.2, i+0.2], [p[2], p[3]], marker='o', color=bcolors[i], label=str(p[3])) 
    ax.scatter(i+0.2, mu[i], marker='^')
ax.set_title("mean MAD from exp_phi", fontsize=14)
ax.set_xticks(list(range(0, n, 1))); ax.set_xticklabels(methods, fontsize=14)
# ax.set_xlabel("Method to calculate phi-values", fontsize=12)
ax.set_ylabel("95% confidence interval", fontsize=12)
ax.legend(title="97.5 percentile", loc="upper left", bbox_to_anchor=(1.0, 1.0))
ax.grid()

ax = axes[1] # methods = deg_minQstats_df.columns.values
mu = deg_minQstats_df.loc['sample_avg'].values
ga = deg_minQstats_df.loc['95meanCI_gleft'].values
gz = deg_minQstats_df.loc['95meanCI_gright'].values
# gmoe = deg_minQstats_df.loc['gMOE'].values
ba = deg_minQstats_df.loc['95meanCI_bleft'].values
bz = deg_minQstats_df.loc['95meanCI_bright'].values
# bmoe = deg_minQstats_df.loc['bMOE'].values
for i, p in enumerate(zip(ga, gz, ba, bz)):
    ax.plot([i, i], [p[0], p[1]], marker='o', color=gcolors[i])
    ax.plot([i+0.2, i+0.2], [p[2], p[3]], marker='o', color=bcolors[i], label=str(p[3])) 
    ax.scatter(i+0.2, mu[i], marker='^')
ax.set_title("mean minQ", fontsize=14)
ax.set_xticks(list(range(0, n, 1))); ax.set_xticklabels(methods, fontsize=14)
ax.set_ylabel("95% confidence interval", fontsize=12)
# ax.set_xlabel("ND node degree", fontsize=12)
ax.legend(title="97.5 percentile", loc="upper left", bbox_to_anchor=(1.0, 1.0))
ax.grid()
  • A Confidence Interval (CI) as above gives a range estimate for a method's average MAD (min$Q$): intervals constructed this way contain the true average MAD (min$Q$) about 95% of the time.
  • Prefer a method whose worst-case average MAD (the upper 97.5 percentile CI bound) is smallest, and whose worst-case average min$Q$ stays within the ND-TS region ($Q < 0.5$). These best-worst-case criteria are judged with the 97.5 percentile estimates of the sample statistics.
  • ND variant 'a' produces the best calculated $\phi$ values with node degree for this dataset (see the sketch after this list):
    • It has the smallest worst-case average MAD from exp_phi.
    • Its worst-case average min$Q$ is within the ND-TS region; 'a' is the only biased ND variant that satisfies this criterion.
    • Its MAD between min$Q$s and peak $Q$s is the second smallest (0.21).
  • The average MAD of ND variant 'a' is significantly smaller than that of the unbiased ND variant 'r'.
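
The same best-worst-case check for node degree, as a sketch using the tables above:

upper_mad = deg_MADstats_df.loc['95meanCI_gright']
print(upper_mad.idxmin(), upper_mad.min())              # 'a' again has the smallest worst-case average MAD
print(deg_minQstats_df.loc['95meanCI_gright'] < 0.5)    # only 'a' and 'y' stay inside the ND-TS region
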
In [35]:
ndc_deg = deg_minmad_df['c'].values
ndx_deg = deg_minmad_df['x'].values
nda_deg = deg_minmad_df['a'].values
ndr_deg = deg_minmad_df['r'].values
In [36]:
# pvalues are two-sided
print("c a")
print(ttest_rel(ndc_deg, nda_deg))
print(tts(ndc_deg.mean(), ndc_deg.std(ddof=1), 10, 
          nda_deg.mean(), nda_deg.std(ddof=1), 10, equal_var=False))
print("\nx a")
print(ttest_rel(ndx_deg, nda_deg))
print(tts(ndx_deg.mean(), ndx_deg.std(ddof=1), 10, 
          nda_deg.mean(), nda_deg.std(ddof=1), 10, equal_var=False))
print("\nr c")
print(ttest_rel(ndr_deg, ndc_deg))
print(tts(ndr_deg.mean(), ndr_deg.std(ddof=1), 10, 
          ndc_deg.mean(), ndc_deg.std(ddof=1), 10, equal_var=False))
print("\nr a")
print(ttest_rel(ndr_deg, nda_deg))
print(tts(ndr_deg.mean(), ndr_deg.std(ddof=1), 10, 
          nda_deg.mean(), nda_deg.std(ddof=1), 10, equal_var=False))
c a
Ttest_relResult(statistic=2.295809425696055, pvalue=0.047322752386071545)
Ttest_indResult(statistic=1.2393203410584936, pvalue=0.2311986092612509)

x a
Ttest_relResult(statistic=2.2553369627684665, pvalue=0.05056042615758677)
Ttest_indResult(statistic=1.707844596388886, pvalue=0.10522611028794224)

r c
Ttest_relResult(statistic=2.3227541240079685, pvalue=0.04528117474937724)
Ttest_indResult(statistic=0.9635194780700006, pvalue=0.3489232601692108)

r a
Ttest_relResult(statistic=3.424842598769379, pvalue=0.007569724516548448)
Ttest_indResult(statistic=2.0142391834783093, pvalue=0.059838649317197574)

Plot the best results for node degree with ND PRNs.

In [37]:
minMad_a = deg_minmad_df['a'].values
print('minMad a', minMad_a)
minQ_a = deg_minQ_df['a'].values
print('minQ a', minQ_a)
minMad a [0.33  0.315 0.402 0.318 0.271 0.295 0.355 0.268 0.444 0.299]
minQ a [0.1 0.8 0.1 0.1 0.1 0.1 0.1 0.1 0.6 0.1]
In [38]:
fig, axes = plt.subplots(4, 3, figsize=(12, 18), constrained_layout=True)
plt.suptitle("Calculated $\phi$ values from normalized node degree, \
ND variant 'a' with EDS variant 'a'.", fontsize=14)
plt.delaxes(axes[3, 1])
plt.delaxes(axes[3, 2])
dirn = os.path.join(dirname, "edv_a") # ND data

for i, pid in enumerate(PIDs):    
    r = i//3; c = i%3; ax = axes[r][c]
    fn0 = os.path.join(dirname, pid + "_nodestats.out")
    df0 = pd.read_table(fn0, sep="\s+") # nid exp_phi deg0 cent0        
        
    fn1 = os.path.join(dirn, pid + "_a_nodestats.out")
    df1 = pd.read_table(fn1, sep="\s+")
    q_a = minQ_a[i]
    dq = df1.query("Q == @q_a")
    dq = dq.assign(deg_avg_norm = dq.deg_avg.values/dq.deg_avg.max())    
    dq = dq.assign(exp_phi = df0.exp_phi.values)
    de = dq.query("exp_phi > -1")    
    mad_a = np.round(np.mean(np.abs(de.deg_avg_norm.values - de.exp_phi.values)), 2)    
    ax.plot(dq.nid.values+1, dq.deg_avg_norm.values, color="tab:red", alpha=0.5, 
            label="'a' deg " + str(mad_a))
    
    ax.plot(de.nid.values+1, de.exp_phi.values, "o--g", alpha=0.5, label='exp_phi')
    ax.set_title(pid + " " + str(q_a), loc='left')
    ax.set_xlabel('Residue'); ax.set_yticks(np.arange(0, 1.1, 0.1))
    ax.legend(loc="lower left", bbox_to_anchor=(0.5, 1.0)) 

# ofn = os.path.join(dirname, "ND_degphi_a_" + edv + ".png")
# plt.savefig(ofn)
  • Figure above: Calculated phi-values as the normalized node degree (deg) of 'a' ND-generated PRNs at the $Q$ where the mean absolute difference (MAD) between experimental phi-values and deg is minimum. The number in the title of each subplot is this $Q$ for 'a'. The number in the legend is the MAD associated with this $Q$. ND node degree at a $Q$ is the average over 100 independent ND-generated PRNs at that $Q$.

Summary

In [39]:
fig, axes = plt.subplots(1, 2, figsize=(10, 5), constrained_layout=True, sharey=True)
plt.suptitle("Calculating $\phi$ values with ND PRNs, EDS variant " + edv, fontsize=14)

ax = axes[0]
ax.set_title("Node centrality", fontsize=14)
for ndv in NDVs:
    vals = cent_minmad_df[ndv].values
    ax.plot(PIDs, vals, 'o-', label=ndv + ' ' + str(np.round(np.mean(vals), 3)))
ax.set_xticks(PIDs); ax.set_xticklabels(PIDs, rotation=90, fontsize=14)
ax.set_ylabel("Minimum MAD from exp_phi", fontsize=14)
ax.legend(); ax.grid()

ax = axes[1]
ax.set_title("Node degree", fontsize=14)
for ndv in NDVs:
    vals = deg_minmad_df[ndv].values
    ax.plot(PIDs, vals, 'o--', label=ndv + ' ' + str(np.round(np.mean(vals), 3)))
ax.set_xticks(PIDs); ax.set_xticklabels(PIDs, rotation=90, fontsize=14)
ax.legend(); ax.grid()
# ax.set_ylabel("Minimum MAD ", fontsize=14)

# ofn = os.path.join(dirname, "ND_minMadphi_" + edv + ".png")
# plt.savefig(ofn) 
In [40]:
fig, axes = plt.subplots(2, 2, figsize=(8, 8), constrained_layout=True)
plt.suptitle("Calculating $\phi$ values with ND PRNs, EDS variant " + edv, fontsize=14)

ax0 = axes[0][0]
cent_minmad_df.boxplot(ax = ax0, showmeans=True)
ax0.set_title('Node centrality', fontsize=14)
ax0.set_xticklabels(NDVs, fontsize=14)
ax0.set_ylabel('Minimum MAD from exp_phi', fontsize=14)
ax0.set_yticks(np.arange(0.15, 0.6, 0.05))

ax1 = axes[0][1]
deg_minmad_df.boxplot(ax = ax1, showmeans=True)
ax1.set_title('Node degree', fontsize=14)
ax1.set_xticklabels(NDVs, fontsize=14)
ax1.set_yticks(np.arange(0.15, 0.6, 0.05))

ax2 = axes[1][0]
cent_minQ_df.boxplot(ax = ax2, showmeans=True)
ax2.set_xticklabels(NDVs, fontsize=14); ax2.set_xlabel('ND variant', fontsize=14)
ax2.set_ylabel('$Q$ with minimum MAD', fontsize=14)
ax2.set_yticks(np.arange(0, 1.1, 0.1))

ax3 = axes[1][1]
deg_minQ_df.boxplot(ax = ax3, showmeans=True)
ax3.set_xticklabels(NDVs, fontsize=14); ax3.set_xlabel('ND variant', fontsize=14)
_ = ax3.set_yticks(np.arange(0, 1.1, 0.1))

# ofn = os.path.join(dirname, "ND_phi_" + edv + ".png")
# plt.savefig(ofn)
  • Figure above: Summary of results for calculating phi-values from ND PRNs as normalized node centrality (Left) and normalized node degree (Right). It is desirable for an ND variant to yield small minimum MAD scores within the ND-TS region.

  • Interpretation of box plots: The green triangle marks the mean of the data. The box extends from the lower (Q1) to the upper (Q3) quartile of the data, with a line at the median (Q2). The whiskers extend to the most extreme data points that lie within 1.5 times the inter-quartile range (Q3 - Q1) beyond the box. Points beyond the whiskers are deemed outliers.

  • Top: Distribution of minimum MAD scores (between calculated and experimental phi-values) for the 10 proteins in the Paci dataset. Notably, for the biased ND variants ('a', 'c', 'x'), the averages tend to be smaller for node centrality than for node degree.

  • Bottom: Distribution of $Q$ values where the minimum MAD scores are found. Notably, for the biased ND variants ('a', 'c', 'x'), the $Q$ values are more concentrated within the ND-TS region ($Q$ < 0.5) with node centrality than with node degree (see the sketch after this list).
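
A sketch quantifying the concentration noted above: the fraction of proteins, per ND variant, whose min-MAD $Q$ lies inside the ND-TS region ($Q < 0.5$), for node centrality versus node degree:

print((cent_minQ_df < 0.5).mean())   # node centrality
print((deg_minQ_df < 0.5).mean())    # node degree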

In [41]:
methods = cent_MADstats_df.columns.values; n = len(methods)
plt_colors = plt.get_cmap("tab20")
gcolors = plt_colors(list(range(1, n*2, 2))) # odd
bcolors = plt_colors(list(range(0, n*2, 2))) # even

fig, axes = plt.subplots(2, 2, figsize=(8, 8), constrained_layout=True)
plt.suptitle("Calculating $\phi$ values from ND nets.", fontsize=14)

ax = axes[0][0]
mu = cent_MADstats_df.loc['sample_avg'].values
ga = cent_MADstats_df.loc['95meanCI_gleft'].values
gz = cent_MADstats_df.loc['95meanCI_gright'].values
ba = cent_MADstats_df.loc['95meanCI_bleft'].values
bz = cent_MADstats_df.loc['95meanCI_bright'].values
for i, p in enumerate(zip(ga, gz, ba, bz)):
    ax.plot([i, i], [p[0], p[1]], marker='o', color=gcolors[i])
    ax.plot([i+0.2, i+0.2], [p[2], p[3]], marker='o', color=bcolors[i], label=str(p[3])) 
    ax.scatter(i+0.2, mu[i], marker='^')
ax.set_title("Node centrality", fontsize=14)
ax.set_yticks(np.arange(0.24, 0.46, 0.02))
ax.set_ylabel("95% CI mean MAD from exp_phi", fontsize=14)
ax.set_xticks(list(range(0, n, 1))); ax.set_xticklabels(methods, fontsize=14)
ax.legend(); ax.grid()

ax = axes[1][0]
mu = cent_minQstats_df.loc['sample_avg'].values
ga = cent_minQstats_df.loc['95meanCI_gleft'].values
gz = cent_minQstats_df.loc['95meanCI_gright'].values
ba = cent_minQstats_df.loc['95meanCI_bleft'].values
bz = cent_minQstats_df.loc['95meanCI_bright'].values
for i, p in enumerate(zip(ga, gz, ba, bz)):
    ax.plot([i, i], [p[0], p[1]], marker='o', color=gcolors[i])
    ax.plot([i+0.2, i+0.2], [p[2], p[3]], marker='o', color=bcolors[i], label=str(p[3])) 
    ax.scatter(i+0.2, mu[i], marker='^')
ax.set_yticks(np.arange(0.0, 0.9, 0.1))
ax.set_ylabel("95% CI mean minQ", fontsize=14)
ax.set_xticks(list(range(0, n, 1))); ax.set_xticklabels(methods, fontsize=14)
ax.set_xlabel("ND variant", fontsize=14)
ax.legend(); ax.grid()

ax = axes[0][1]
mu = deg_MADstats_df.loc['sample_avg'].values
ga = deg_MADstats_df.loc['95meanCI_gleft'].values
gz = deg_MADstats_df.loc['95meanCI_gright'].values
ba = deg_MADstats_df.loc['95meanCI_bleft'].values
bz = deg_MADstats_df.loc['95meanCI_bright'].values
for i, p in enumerate(zip(ga, gz, ba, bz)):
    ax.plot([i, i], [p[0], p[1]], marker='o', color=gcolors[i])
    ax.plot([i+0.2, i+0.2], [p[2], p[3]], marker='o', color=bcolors[i], label=str(p[3])) 
    ax.scatter(i+0.2, mu[i], marker='^')
ax.set_title("Node degree", fontsize=14)
ax.set_yticks(np.arange(0.24, 0.46, 0.02))
# ax.set_ylabel("95% confidence interval mean MAD from exp_phi", fontsize=14)
ax.set_xticks(list(range(0, n, 1))); ax.set_xticklabels(methods, fontsize=14)
ax.legend(); ax.grid()

ax = axes[1][1]
mu = deg_minQstats_df.loc['sample_avg'].values
ga = deg_minQstats_df.loc['95meanCI_gleft'].values
gz = deg_minQstats_df.loc['95meanCI_gright'].values
ba = deg_minQstats_df.loc['95meanCI_bleft'].values
bz = deg_minQstats_df.loc['95meanCI_bright'].values
for i, p in enumerate(zip(ga, gz, ba, bz)):
    ax.plot([i, i], [p[0], p[1]], marker='o', color=gcolors[i])
    ax.plot([i+0.2, i+0.2], [p[2], p[3]], marker='o', color=bcolors[i], label=str(p[3])) 
    ax.scatter(i+0.2, mu[i], marker='^')
ax.set_yticks(np.arange(0.0, 0.9, 0.1))
ax.set_xticks(list(range(0, n, 1))); ax.set_xticklabels(methods, fontsize=14)
ax.set_xlabel("ND variant", fontsize=14)
ax.legend(); ax.grid()

# ofn = os.path.join(dirname, "ND_95CIstats_" + edv + ".png")
# print(ofn)
# plt.savefig(ofn)

Main takeaways:

  • Based on the analysis so far, when calculating $\phi$ values from ND nets it is better to use node centrality than node degree.
  • With either node centrality or node degree, ND variant 'a' generates the best node statistics for calculating $\phi$ values.
  • In general, the biased ND variants ('a', 'c', 'x') generate better node statistics for calculating $\phi$ values than the unbiased ('r') or improbable ('y') ND variants (as it should be!).

How do calculated $\phi$ values from ND PRNs fare against TSE PRNs?

In [42]:
# MAD values computed previously
ns_deg = np.array([0.48, 0.34, 0.49, 0.35, 0.34, 0.37, 0.45, 0.35, 0.51, 0.41])
ns_cent = np.array([0.44, 0.3, 0.41, 0.29, 0.32, 0.3, 0.43, 0.35, 0.44, 0.36])
ns_locent = np.array([0.33, 0.5, 0.2, 0.19, 0.24, 0.22, 0.3, 0.15, 0.24, 0.16])

ts_deg = np.array([0.46, 0.24, 0.35, 0.29, 0.28, 0.26, 0.3, 0.28, 0.3, 0.3])
ts_deg2 = np.array([0.25, 0.21, 0.4, 0.29, 0.36, 0.35, 0.28, 0.31, 0.22, 0.32])
ts_cent = np.array([0.41, 0.34, 0.37, 0.34, 0.29, 0.26, 0.42, 0.26, 0.43, 0.4])

ndc_cent = np.array([0.394, 0.286, 0.36, 0.327, 0.26, 0.26, 0.304, 0.23, 0.391, 0.252])
nda_cent = np.array([0.274, 0.372, 0.301, 0.287, 0.251, 0.242, 0.302, 0.204, 0.399, 0.313])
nda_deg = np.array([0.33, 0.315, 0.402, 0.318, 0.271, 0.295, 0.355, 0.268, 0.444, 0.299])
In [43]:
print("nda_deg nda_cent")
print(ttest_rel(nda_deg, nda_cent))
print(tts(nda_deg.mean(), nda_deg.std(ddof=1), 10,
          nda_cent.mean(), nda_cent.std(ddof=1), 10, equal_var=False))

print("\nts_deg2 nda_cent")
print(ttest_rel(ts_deg2, nda_cent))
print(tts(ts_deg2.mean(), ts_deg2.std(ddof=1), 10,
          nda_cent.mean(), nda_cent.std(ddof=1), 10, equal_var=False))

print("\nns_locent nda_cent")
print(ttest_rel(ns_locent, nda_cent))
print(tts(ns_locent.mean(), ns_locent.std(ddof=1), 10,
          nda_cent.mean(), nda_cent.std(ddof=1), 10, equal_var=False))
nda_deg nda_cent
Ttest_relResult(statistic=2.5249288785805075, pvalue=0.03250470927124305)
Ttest_indResult(statistic=1.3705435170621822, pvalue=0.18738682517163457)

ts_deg2 nda_cent
Ttest_relResult(statistic=0.1325170332372161, pvalue=0.8974908023891056)
Ttest_indResult(statistic=0.16774069521547652, pvalue=0.8686626539626287)

ns_locent nda_cent
Ttest_relResult(statistic=-1.446086667459005, pvalue=0.18206196842547162)
Ttest_indResult(statistic=-1.1034888351254046, pvalue=0.28817383425307563)
In [44]:
# sample size is too small for the CLT which the 95% mean CI assumes (should actually do resampling?)
print("MAD between experimental and calculated phi-values")
df = pd.DataFrame({ "MAD_stats": ["sample_avg", "sample_stdev", "range", \
                                  "gMOE", "95meanCI_gleft", "95meanCI_gright", 
                                  "bMOE", "95meanCI_bleft", "95meanCI_bright"],
                "NS_deg": summary_stats(ns_deg), "NS_cent": summary_stats(ns_cent),
                "NS_locent": summary_stats(ns_locent), "TSE_deg": summary_stats(ts_deg),
                "TSE_deg2": summary_stats(ts_deg2), "TSE_cent": summary_stats(ts_cent),
                "NDc_cent": summary_stats(ndc_cent), "NDa_cent": summary_stats(nda_cent),
                "NDa_deg": summary_stats(nda_deg)})
df = df.set_index("MAD_stats")
df
MAD between experimental and calculated phi-values
Out[44]:
                 NS_deg  NS_cent  NS_locent  TSE_deg  TSE_deg2  TSE_cent  NDc_cent  NDa_cent  NDa_deg
MAD_stats
sample_avg       0.4090   0.3640     0.2530   0.3060    0.2990    0.3520    0.3064    0.2945   0.3297
sample_stdev     0.0679   0.0613     0.1036   0.0613    0.0615    0.0648    0.0593    0.0584   0.0565
range            0.1700   0.1500     0.3500   0.2200    0.1900    0.1700    0.1640    0.1950   0.1760
gMOE             0.0421   0.0380     0.0642   0.0380    0.0381    0.0401    0.0368    0.0362   0.0350
95meanCI_gleft   0.3669   0.3260     0.1888   0.2680    0.2609    0.3119    0.2696    0.2583   0.2947
95meanCI_gright  0.4511   0.4020     0.3172   0.3440    0.3371    0.3921    0.3432    0.3307   0.3647
bMOE             0.0395   0.0360     0.0590   0.0345    0.0350    0.0370    0.0344    0.0344   0.0330
95meanCI_bleft   0.3710   0.3290     0.2020   0.2760    0.2640    0.3140    0.2733    0.2608   0.2989
95meanCI_bright  0.4500   0.4010     0.3200   0.3450    0.3340    0.3880    0.3422    0.3296   0.3648
In [45]:
# edv = 'a'
plt.figure(figsize=(8, 6))
methods = df.columns.values; n = len(methods)
plt_colors = plt.get_cmap("tab20")
gcolors = plt_colors(list(range(1, n*2, 2))) # odd
bcolors = plt_colors(list(range(0, n*2, 2))) # even

mu = df.loc['sample_avg'].values
ga = df.loc['95meanCI_gleft'].values; gz = df.loc['95meanCI_gright'].values
# gmoe = df.loc['gMOE'].values
ba = df.loc['95meanCI_bleft'].values; bz = df.loc['95meanCI_bright'].values
bmoe = df.loc['bMOE'].values

for i, p in enumerate(zip(ga, gz, ba, bz)):
    plt.plot([i, i], [p[0], p[1]], color=gcolors[i], marker='o')
    plt.scatter(i+0.2, mu[i], marker='^')
    plt.plot([i+0.2, i+0.2], [p[2], p[3]], color=bcolors[i], marker='o', label=str(p[3]))
    
plt.title("Calculating $\phi$ values", fontsize=14)
plt.xticks(list(range(0, n, 1)), methods, rotation=45, fontsize=14)
plt.xlabel("Method: PRN type and node statistic combination", fontsize=14)
plt.ylabel("95% confidence interval around mean MAD", fontsize=14)
plt.legend(title="97.5 percentile", loc="upper left", bbox_to_anchor=(1.0, 1.0)); plt.grid()

# ofn = os.path.join(dirname, "NSTSEND_95CIstats_" + edv + ".png")
# print(ofn)
# plt.savefig(ofn)
  • Note that these MAD from exp_phi scores assume we know which residues have phi-values!
  • If TSE PRNs are available, calculating phi-values with node degree (TSE_deg or TSE_deg2) yields a smaller worst-case average MAD than doing so with node centrality (TSE_cent).
  • There is no significant difference between TSE_deg and TSE_deg2, which use different normalization methods.
  • It is desirable to produce results as good as or better than TSE_deg(2); this sets the benchmark for the other methods (see the sketch after this list).
  • If calculating phi-values from NS PRNs, it is better to use normalized local node centrality (NS_locent) than (global) node centrality (NS_cent) or node degree (NS_deg). But all-helix bundles could be problematic (1lmb4).
  • If calculating phi-values from ND PRNs, it is better to use normalized node centrality with a biased ND variant, notably 'a' (NDa_cent). Its average MAD is not significantly different from that of TSE_deg2.
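
A sketch ranking all nine methods by their worst-case (upper 97.5 percentile Gaussian CI bound) average MAD, using the stats table df built above:

print(df.loc['95meanCI_gright'].sort_values())   # NS_locent and NDa_cent have the smallest upper bounds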

The ND folding model captures some salient aspects of protein folding observed experimentally at the residue level of protein chains.

End of notebook

SKLC December 2020