Examine ND results by chain SStype

  • Protein chains are partitioned into three classes by secondary structure content.
  • Correlations with exp. fold rate vary by SStype and ND variant.
  • Fold rate corrs should at least outperform N for the task of fold rate prediction, and outperform h for the purpose of estimating a resonable edge ordering.
  • Slightly improved correlation for UZ when peak Es from ND variants are mixed by SStype and $p_e$ is adjusted by SStype.
  • K correlation largely a factor of AB chains, which dominates the dataset.
  • Results point to a need for a more subtle contact formation strategy in ND: context-sensitive at level of nodes or edges.
In [1]:
import os
import numpy as np
import pandas as pd
from scipy.stats import pearsonr
from sklearn import linear_model
import matplotlib.pyplot as plt
import NDutils as nd # my lib
In [2]:
NDVs = ['y', 'r', 'a', 'c', 'x', 'h']
  • ND variants are arranged in LE/SE priority order in the sense that models towards the right respect the tenent of forming SE over LE more strictly, while models towards the left give more chances for LE to form before SE, with 'y' adopting this inversion of priority explicitly in its $p_e$.
  • 'h' is distinct from the other ND variants in that edge ordering is strictly determined by sequence distance (increasing monotonically), while the other ND variants are stochastic processes.

Uzunoglu

In [3]:
dirname = "Uzunoglu"
fn = os.path.join(dirname, "UZ_foldtype.txt")
DF = pd.read_table(fn, sep=' ')
DF.head() # reserve DF for this purpose!
Out[3]:
pid N SRIM LRIM k_f SStype
0 1afi 72 -7.50 -6.25 0.26 ab
1 1aps 98 -9.41 -6.97 -0.64 ab
2 1ayi 86 -5.87 -5.75 3.13 a
3 1bf4 63 -5.77 -5.43 3.02 ab
4 1c9o 66 -7.54 -6.07 3.14 b
In [4]:
UZ_PIDs = DF.pid.values
UZ_frates = DF.k_f.values
print(len(UZ_PIDs), np.ptp(UZ_frates))
UZ_SStypes = DF.SStype.values
52 5.96

Defining SStype for chains

  • The three chain fold types or chain SStypes are:
    • A for chains with predominantly $\alpha$ residues
      • number of $\alpha$ residues (nH) > 3, number of $\beta$ residues (nS) < 11, and nH/nS $\approx$ nH or >> 1;
    • B for chains with predominantly $\beta$ residues
      • nS > 10, nH < 4, and nH/nS $\approx$ 0.0;
    • AB otherwise.
In [5]:
df = nd.count_HSTs("Uzunoglu", UZ_PIDs, UZ_SStypes)
print(df.dtypes)
df.sample(5)
H           int64
S           int64
T           int64
hsr       float64
SStype     object
dtype: object
Out[5]:
H S T hsr SStype
2qjl 29 33 37 0.878788 ab
1ayi 50 0 36 50.000000 a
1e0m 0 13 24 0.000000 b
1div_c 34 37 22 0.918919 ab
1rlq 0 22 34 0.000000 b
In [6]:
da = df.query("SStype == 'a'")
print(len(da))
da
15
Out[6]:
H S T hsr SStype
1ayi 50 0 36 50.000000 a
1enh 38 0 16 38.000000 a
1fex 31 0 28 31.000000 a
1hrc 43 2 59 21.500000 a
1idy 33 0 21 33.000000 a
1imq 46 0 40 46.000000 a
1lmb 62 0 25 62.000000 a
1ryk 45 0 24 45.000000 a
1ss1 43 0 17 43.000000 a
1w4e 19 4 22 4.750000 a
1w4j 23 3 25 7.666667 a
1yza 70 0 36 70.000000 a
2abd 52 0 34 52.000000 a
2bth 22 2 21 11.000000 a
2pdd 19 0 23 19.000000 a
In [7]:
# reclassified 1o6x as 'ab', nH > 3
db = df.query("SStype == 'b'")
print(len(db))
db
17
Out[7]:
H S T hsr SStype
1c9o 3 41 22 0.073171 b
1csp 3 37 27 0.081081 b
1e0l 0 12 25 0.000000 b
1e0m 0 13 24 0.000000 b
1fnf_9 0 49 39 0.000000 b
1g6p 0 31 35 0.000000 b
1jo8 3 26 29 0.115385 b
1k9q 0 13 27 0.000000 b
1m9s 3 17 55 0.176471 b
1mjc 3 34 32 0.088235 b
1psf 0 21 48 0.000000 b
1rlq 0 22 34 0.000000 b
1shf 3 27 29 0.111111 b
1shg 3 29 25 0.103448 b
1ten 0 48 41 0.000000 b
1tit 0 33 56 0.000000 b
3ait 0 38 36 0.000000 b
In [8]:
# reclassified 1g6p, 1m9s 'b'
dab = df.query("SStype == 'ab'")
print(len(dab))
dab
20
Out[8]:
H S T hsr SStype
1afi 21 21 30 1.000000 ab
1aps 18 36 44 0.500000 ab
1bf4 15 33 15 0.454545 ab
1div_c 34 37 22 0.918919 ab
1div_n 26 13 18 2.000000 ab
1fkb 11 43 53 0.255814 ab
1n88 22 27 47 0.814815 ab
1o6x 25 16 40 1.562500 ab
1pgb 14 24 18 0.583333 ab
1poh 29 23 33 1.260870 ab
1rfa 15 21 42 0.714286 ab
1ris 28 47 22 0.595745 ab
1spr 18 31 54 0.580645 ab
1ubq 18 26 32 0.692308 ab
1urn 31 24 41 1.291667 ab
2acy 24 41 33 0.585366 ab
2ci2 14 18 33 0.777778 ab
2ptl 12 25 25 0.480000 ab
2qjl 29 33 37 0.878788 ab
2vik 27 30 69 0.900000 ab

Initial exploration of UZ DF

  • There should be a sensible relationship between exp. fold rate with:
    • chain length N: longer chains (larger N) generally take longer to fold (smaller fold rate value). Scaled N is also examined: 1/2, 2/3.
    • chain SStype: A chains generally fold faster (larger fold rate value) than B chains.
In [9]:
n_dr = nd.frate_corrs_with_N_by_SStype(DF, "Uzunoglu (52)")
n_dr
Pearson correlation with exp. fold rate: chain-length N 1
All: [-0.4355  0.0013]
A: [-0.6721  0.0061]
AB: [-0.185   0.4349]
B: [-7.425e-01  6.000e-04]

Pearson correlation with exp. fold rate: chain-length N 0.5
All: [-0.4572  0.0007]
A: [-0.6842  0.0049]
AB: [-0.1988  0.4009]
B: [-7.502e-01  5.000e-04]

Pearson correlation with exp. fold rate: chain-length N 0.6666666666666666
All: [-0.4504  0.0008]
A: [-0.6807  0.0052]
AB: [-0.1944  0.4115]
B: [-7.479e-01  6.000e-04]

Out[9]:
n_rs n_ps n12_rs n12_ps n23_rs n23_ps
SStype
Z -0.435466 0.001253 -0.457243 0.000655 -0.450355 0.000808
A -0.672115 0.006056 -0.684221 0.004902 -0.680697 0.005218
AB -0.185007 0.434873 -0.198762 0.400861 -0.194397 0.411493
B -0.742475 0.000641 -0.750195 0.000523 -0.747886 0.000556

ND-hard (ndh)

  • PRN0 edges are batch recreated in monotonically increasing sequence distance order.
In [10]:
dirn = os.path.join("UZ_NDH", "hard")
In [11]:
# example of a datafile
pid = "1aps"
fn = os.path.join(dirn, pid + "_bsd_hstats_h.out")
dh = pd.read_table(fn, sep=' ')
dh.head()
Out[11]:
bid fe fsc sc_mje num_sc num_scle num_nnsc num_nnscle
0 2 0.137536 0.000000 0.00 0 0 0 0
1 3 0.237822 0.371257 71.70 62 0 96 0
2 4 0.295129 0.479042 101.00 80 0 119 0
3 5 0.340974 0.485030 180.40 81 0 151 0
4 6 0.368195 0.491018 226.07 82 0 168 0
In [12]:
maxEs, argmaxEs, maxE_sds = nd.get_ndh_maxEs(dirn, UZ_PIDs)
h_df = DF.assign(maxE=maxEs, maxE_sd=maxE_sds)
h_df.head()
Out[12]:
pid N SRIM LRIM k_f SStype maxE maxE_sd
0 1afi 72 -7.50 -6.25 0.26 ab 372.23 13
1 1aps 98 -9.41 -6.97 -0.64 ab 462.15 27
2 1ayi 86 -5.87 -5.75 3.13 a 213.28 10
3 1bf4 63 -5.77 -5.43 3.02 ab 95.47 9
4 1c9o 66 -7.54 -6.07 3.14 b 109.78 10
In [13]:
print("maxE_sds:", np.mean(maxE_sds), np.std(maxE_sds, ddof=1), 
      np.min(maxE_sds), np.max(maxE_sds))
# correlation maxEs with exp. fold rate
print("maxE Pearson corr. with exp. fold rate:", pearsonr(maxEs, UZ_frates))
# check for outliers
_ = plt.scatter(maxEs, UZ_frates)
maxE_sds: 12.73076923076923 5.299335454510013 4 31
maxE Pearson corr. with exp. fold rate: (-0.4445872304072675, 0.0009598432602786288)
In [14]:
h_df.query("maxE > 700") # outlier
Out[14]:
pid N SRIM LRIM k_f SStype maxE maxE_sd
50 2vik 126 -7.44 -6.25 2.95 ab 762.08 21
In [15]:
h_rs, h_ps = nd.analyze_maxE_by_SStype(h_df)
maxE_sd mean stdev
All: [12.7308  5.2993]
A: [10.6667  2.3503]
AB: [16.1    5.821]
B: [10.5882  4.6241]

Pearson correlation with exp. fold rate: maxE
All: [-0.4446  0.001 ]
A: [-0.5552  0.0317]
AB: [-0.326   0.1607]
B: [-0.555   0.0207]

  • Compared with A and B chains, AB chains have larger maxE_sd on average.
    • maxE_sd marks the point where ND energy profile changes direction (increase to decrease).
    • A hint that perhaps LE need to appear earlier.
    • Corr. for AB is not significant (p > 0.05) because of the outlier.
In [16]:
fig, ax1 = plt.subplots()
ax = nd.plot_maxE_by_SStype(h_df, ax1, h_rs)
ax.set_title("Uzunoglo h " + str(np.round(h_rs[0], 3)), fontsize=14)
_ = ax.legend(title="SStype", fontsize=12)
In [17]:
# excluding outlier
h1_df = h_df.query("pid != '2vik'") # ab
len(h1_df)
Out[17]:
51
In [18]:
h1_rs, h1_ps = nd.analyze_maxE_by_SStype(h1_df)
maxE_sd mean stdev
All: [12.5686  5.2202]
A: [10.6667  2.3503]
AB: [15.8421  5.862 ]
B: [10.5882  4.6241]

Pearson correlation with exp. fold rate: maxE
All: [-0.5417  0.    ]
A: [-0.5552  0.0317]
AB: [-0.6196  0.0047]
B: [-0.555   0.0207]

ND variants (ndv): craxy

In [19]:
dirn = os.path.join("UZ_ND3", "edv_a")
In [20]:
peakEs, peakQs = nd.get_ndv_peakEs(dirn, UZ_PIDs, "stats", ndv='a')
a_df = DF.assign(peakE = peakEs)
a_rs, a_ps = nd.analyze_peakE_by_SStype(a_df, 'a')

peakEs, peakQs = nd.get_ndv_peakEs(dirn, UZ_PIDs, "stats", ndv='c')
c_df = DF.assign(peakE = peakEs)
c_rs, c_ps = nd.analyze_peakE_by_SStype(c_df, 'c')

peakEs, peakQs = nd.get_ndv_peakEs(dirn, UZ_PIDs, "stats", ndv='x')
x_df = DF.assign(peakE = peakEs)
x_rs, x_ps = nd.analyze_peakE_by_SStype(x_df, 'x')

peakEs, peakQs = nd.get_ndv_peakEs(dirn, UZ_PIDs, "stats", ndv='r')
r_df = DF.assign(peakE = peakEs)
r_rs, r_ps = nd.analyze_peakE_by_SStype(r_df, 'r')

peakEs, peakQs = nd.get_ndv_peakEs(dirn, UZ_PIDs, "stats", ndv='y')
y_df = DF.assign(peakE = peakEs)
y_rs, y_ps = nd.analyze_peakE_by_SStype(y_df, 'y')
Pearson correlation with exp. fold rate: a
All: [-0.7267  0.    ]
A: [-0.4121  0.1269]
AB: [-0.6481  0.002 ]
B: [-0.5744  0.0159]

Pearson correlation with exp. fold rate: c
All: [-0.7106  0.    ]
A: [-0.437   0.1033]
AB: [-0.5412  0.0137]
B: [-0.6408  0.0056]

Pearson correlation with exp. fold rate: x
All: [-0.6995  0.    ]
A: [-0.47    0.0771]
AB: [-0.4938  0.0269]
B: [-0.7052  0.0016]

Pearson correlation with exp. fold rate: r
All: [-0.2145  0.1268]
A: [-0.559   0.0303]
AB: [-0.1903  0.4217]
B: [-0.3294  0.1967]

Pearson correlation with exp. fold rate: y
All: [0.0664 0.6402]
A: [-0.6905  0.0044]
AB: [-0.1091  0.6471]
B: [-0.3274  0.1995]

Summary of results for analysis of UZ by SStype

In [21]:
# df = eval('a' + "_df")
# np.all(df.peakE.values == a_df.peakE.values) # True

fig, axes = plt.subplots(2, 3, figsize=(12, 7), constrained_layout=True)
plt.suptitle("Uzunoglo (52)", fontsize=14)
for i, ndv in enumerate(NDVs[:-1]):
    r = i//3; c = i%3; ax1 = axes[r][c]
    df = eval(ndv + "_df"); rs = eval(ndv + "_rs")
    nd.plot_peakE_by_SStype(df, ax1, rs)
    ax1.set_title(ndv + ' ' + str(np.round(rs[0], 3)) , fontsize=14)
    ax1.legend(title="SStype")

ax1 = axes[1][2]
nd.plot_maxE_by_SStype(h_df, ax1, h_rs)
ax1.set_title('h ' + str(np.round(h_rs[0], 3)), fontsize=14)
_ = ax1.legend(title="SStype")
In [22]:
# Z == All
dr = pd.DataFrame({'SStype': ['Z', 'A', 'AB', 'B'], 
                'y_rs': y_rs, 'y_ps': y_ps, 'r_rs': r_rs, 'r_ps': r_ps,
                'a_rs': a_rs, 'a_ps': a_ps, 'c_rs': c_rs, 'c_ps': c_ps, 
                'x_rs': x_rs, 'x_ps': x_ps, 'h_rs': h_rs, 'h_ps': h_ps,}).set_index('SStype')
UZcorrs_dr = dr.join(n_dr)
In [23]:
# save to file
UZcorrs_dr.to_csv("UZcorrs_dr.txt", sep=' ')
In [24]:
# desire corrs with ND variants to be stronger than with 'h' or 'n',
# since generating peak Es involve more work
df = UZcorrs_dr.filter(regex='_rs').T
df.round(4)
Out[24]:
SStype Z A AB B
y_rs 0.0664 -0.6905 -0.1091 -0.3274
r_rs -0.2145 -0.5590 -0.1903 -0.3294
a_rs -0.7267 -0.4121 -0.6481 -0.5744
c_rs -0.7106 -0.4370 -0.5412 -0.6408
x_rs -0.6995 -0.4700 -0.4938 -0.7052
h_rs -0.4446 -0.5552 -0.3260 -0.5550
n_rs -0.4355 -0.6721 -0.1850 -0.7425
n12_rs -0.4572 -0.6842 -0.1988 -0.7502
n23_rs -0.4504 -0.6807 -0.1944 -0.7479
In [25]:
# smaller (more negative) is better
Xs = NDVs[:]; Xs.extend(['N', 'N1/2', 'N2/3'])
fig, ax = plt.subplots()
ax.scatter(Xs, df.A.values, marker='^', edgecolor='green', facecolor='none', label='A')
ax.scatter(Xs, df.AB.values, edgecolor='orange', lw=2, facecolor='none', label='AB')
ax.scatter(Xs, df.B.values, marker='s', edgecolor='magenta', facecolor='none', label='B')
ax.set_xticks(Xs); ax.set_xticklabels(Xs, fontsize=14)
ax.set_xlabel("LE <--- priority ---> SE", fontsize=14)
ax.set_title("Uzunoglo (52): correlation with exp. fold rate by SS type.", fontsize=14)
ax.legend(title="SStype", fontsize=12, loc="upper left", bbox_to_anchor=(1.0, 1.0))
ax.grid()
  • If the task at hand is simply predicting fold-rate, N is competitive with ND, and simpler than ND, for structurally pure SStype chains, A and B.

  • ND variants are also better at modelling structurally pure chains:

    • For A, ND variants that "prioritize" LE over SE ('y' and 'r') do better.
    • For B, ND variants that "prioritize" SE over LE ('x' and 'c') do better.
  • The problematic class is structurally mixed SStype chains (AB), which seem to thrive in the "Goldilocks region" created by ND variant 'a' which strikes a good balance between LE and SE.

  • Goal is to design a ND variant that can handle all kinds of chains well.

    • An idea that presents itself is to combine the best of these ND variants.

Problem of trying to mix peak $E$s from different ND variants

  • ND energy ranges differ.
In [26]:
alpha_df = y_df.query("SStype == 'a'")
alphabeta_df = a_df.query("SStype == 'ab'")
beta_df = x_df.query("SStype == 'b'")
yax_df = alpha_df.append(beta_df).append(alphabeta_df)
print(len(alpha_df), len(alphabeta_df), len(beta_df), len(yax_df))
yax_rs, yax_ps = nd.analyze_peakE_by_SStype(yax_df, ndv='y-a-x')
15 20 17 52
Pearson correlation with exp. fold rate: y-a-x
All: [0.1927 0.1711]
A: [-0.6905  0.0044]
AB: [-0.6481  0.002 ]
B: [-0.7052  0.0016]

In [27]:
alpha_df = r_df.query("SStype == 'a'")
alphabeta_df = a_df.query("SStype == 'ab'")
beta_df = x_df.query("SStype == 'b'")
rax_df = alpha_df.append(beta_df).append(alphabeta_df)
print(len(alpha_df), len(alphabeta_df), len(beta_df), len(rax_df))
rax_rs, rax_ps = nd.analyze_peakE_by_SStype(rax_df, ndv='r-a-x')
15 20 17 52
Pearson correlation with exp. fold rate: r-a-x
All: [-0.3896  0.0043]
A: [-0.559   0.0303]
AB: [-0.6481  0.002 ]
B: [-0.7052  0.0016]

In [28]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4), constrained_layout=True)
plt.suptitle("Uzunoglo (52): combined peak Es.", fontsize=14)
ax1.set_title("y-a-x " + str(np.round(yax_rs[0], 3)), fontsize=14)
nd.plot_peakE_by_SStype(yax_df, ax1, yax_rs)
ax1.legend(title="SStype")
ax2.set_title("r-a-x " + str(np.round(rax_rs[0], 3)), fontsize=14)
nd.plot_peakE_by_SStype(rax_df, ax2, rax_rs)
_ = ax2.legend(title="SStype")

A simple first solution with unscaled $p_e$

  • The effect of scaling on $p_e$ for longer range contacts was reported previously, and now put to use.
In [29]:
dirn = os.path.join("UZ_ND3", "notscaled")
In [30]:
peakEs, peakQs = nd.get_ndv_peakEs(dirn, UZ_PIDs, "stats", ndv='x_notscaled')
ux_df = DF.assign(peakE = peakEs)
ux_rs, ux_ps = nd.analyze_peakE_by_SStype(ux_df, ndv='ux')

peakEs, peakQs = nd.get_ndv_peakEs(dirn, UZ_PIDs, "stats", ndv='c_notscaled')
uc_df = DF.assign(peakE = peakEs)
uc_rs, uc_ps = nd.analyze_peakE_by_SStype(uc_df, ndv='uc')

peakEs, peakQs = nd.get_ndv_peakEs(dirn, UZ_PIDs, "stats", ndv='a_notscaled')
ua_df = DF.assign(peakE = peakEs)
ua_rs, ua_ps = nd.analyze_peakE_by_SStype(ua_df, ndv='ua')
Pearson correlation with exp. fold rate: ux
All: [-0.5418  0.    ]
A: [-0.5723  0.0258]
AB: [-0.3542  0.1254]
B: [-0.4549  0.0666]

Pearson correlation with exp. fold rate: uc
All: [-0.5533  0.    ]
A: [-0.5264  0.0438]
AB: [-0.4392  0.0527]
B: [-0.4379  0.0788]

Pearson correlation with exp. fold rate: ua
All: [-0.4299  0.0015]
A: [-0.2204  0.4299]
AB: [-0.3907  0.0886]
B: [-0.2014  0.4382]

In [31]:
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12, 4), constrained_layout=True)
plt.suptitle("Uzunoglo: unscaled ND variants.", fontsize=14)

ax1.set_title("ux " + str(np.round(ux_rs[0], 3)), fontsize=14)
nd.plot_peakE_by_SStype(ux_df, ax1, ux_rs)
ax1.legend(title="SStype")

ax2.set_title("uc " + str(np.round(uc_rs[0], 3)), fontsize=14)
nd.plot_peakE_by_SStype(uc_df, ax2, uc_rs)
ax2.legend(title="SStype")

ax3.set_title("ua " + str(np.round(ua_rs[0], 3)), fontsize=14)
nd.plot_peakE_by_SStype(ua_df, ax3, ua_rs)
_ = ax3.legend(title="SStype")
In [32]:
unscaled_dr = pd.DataFrame({'SStype': ['Z', 'A', 'AB', 'B'],
                            'ua_rs': ua_rs, 'ua_ps': ua_ps, 'uc_rs': uc_rs, 'uc_ps': uc_ps,
                            'ux_rs': ux_rs, 'ux_ps': ux_ps}).set_index('SStype')
In [33]:
UZcorrs_dr = UZcorrs_dr.join(unscaled_dr)
UZcorrs_dr.to_csv("UZcorrs_dr.txt", sep=' ')
In [34]:
CDF = UZcorrs_dr.filter(regex='_rs').T
CDF.round(4)
Out[34]:
SStype Z A AB B
y_rs 0.0664 -0.6905 -0.1091 -0.3274
r_rs -0.2145 -0.5590 -0.1903 -0.3294
a_rs -0.7267 -0.4121 -0.6481 -0.5744
c_rs -0.7106 -0.4370 -0.5412 -0.6408
x_rs -0.6995 -0.4700 -0.4938 -0.7052
h_rs -0.4446 -0.5552 -0.3260 -0.5550
n_rs -0.4355 -0.6721 -0.1850 -0.7425
n12_rs -0.4572 -0.6842 -0.1988 -0.7502
n23_rs -0.4504 -0.6807 -0.1944 -0.7479
ua_rs -0.4299 -0.2204 -0.3907 -0.2014
uc_rs -0.5533 -0.5264 -0.4392 -0.4379
ux_rs -0.5418 -0.5723 -0.3542 -0.4549
In [35]:
CDF.A.sort_values()
Out[35]:
y_rs     -0.690474
n12_rs   -0.684221
n23_rs   -0.680697
n_rs     -0.672115
ux_rs    -0.572340
r_rs     -0.559032
h_rs     -0.555163
uc_rs    -0.526444
x_rs     -0.470023
c_rs     -0.437038
a_rs     -0.412113
ua_rs    -0.220431
Name: A, dtype: float64
In [36]:
CDF.AB.sort_values()
Out[36]:
a_rs     -0.648068
c_rs     -0.541155
x_rs     -0.493830
uc_rs    -0.439166
ua_rs    -0.390667
ux_rs    -0.354226
h_rs     -0.325991
n12_rs   -0.198762
n23_rs   -0.194397
r_rs     -0.190254
n_rs     -0.185007
y_rs     -0.109097
Name: AB, dtype: float64
In [37]:
CDF.B.sort_values()
Out[37]:
n12_rs   -0.750195
n23_rs   -0.747886
n_rs     -0.742475
x_rs     -0.705247
c_rs     -0.640761
a_rs     -0.574378
h_rs     -0.555005
ux_rs    -0.454861
uc_rs    -0.437884
r_rs     -0.329368
y_rs     -0.327428
ua_rs    -0.201439
Name: B, dtype: float64
In [38]:
# plot to compare these ND variants
Xs = ['y', 'r', 'ua', 'a', 'uc', 'c', 'ux', 'x', 'h', 'n12']
hdf = CDF.copy()
hdf = hdf.reindex([l+"_rs" for l in Xs])
hdf
Out[38]:
SStype Z A AB B
y_rs 0.066354 -0.690474 -0.109097 -0.327428
r_rs -0.214498 -0.559032 -0.190254 -0.329368
ua_rs -0.429877 -0.220431 -0.390667 -0.201439
a_rs -0.726708 -0.412113 -0.648068 -0.574378
uc_rs -0.553336 -0.526444 -0.439166 -0.437884
c_rs -0.710594 -0.437038 -0.541155 -0.640761
ux_rs -0.541776 -0.572340 -0.354226 -0.454861
x_rs -0.699528 -0.470023 -0.493830 -0.705247
h_rs -0.444587 -0.555163 -0.325991 -0.555005
n12_rs -0.457243 -0.684221 -0.198762 -0.750195
In [39]:
# smaller (more negative) is better
NDVs = ['y', 'r', 'ua', 'a', 'uc', 'c', 'ux', 'x', 'h', 'N1/2']
fig, ax = plt.subplots()
ax.scatter(NDVs, hdf.A.values, marker='^',s=60, edgecolor='green',facecolor='none', label='A')
ax.scatter(NDVs, hdf.AB.values, edgecolor='orange', lw=2, s=60, facecolor='none', label='AB')
ax.scatter(NDVs, hdf.B.values, marker='s', edgecolor='magenta', facecolor='none', label='B')
ax.axhline(y=-0.5723, color="tab:green", lw=1, ls='--')
ax.annotate("", xy=('ux', -0.5723), xycoords='data', xytext=('x', -0.47), 
        arrowprops=dict(arrowstyle='-|>', connectionstyle="arc3", color='tab:green'))
ax.annotate("", xy=('uc', -0.5264), xycoords='data', xytext=('c', -0.437), 
        arrowprops=dict(arrowstyle='-|>', connectionstyle="arc3", color='tab:green'))
ax.annotate("", xy=('ua', -0.2204), xycoords='data', xytext=('a', -0.4121), 
        arrowprops=dict(arrowstyle='-|>', connectionstyle="arc3", color='tab:green'))
ax.set_xticks(NDVs); ax.set_xticklabels(NDVs, fontsize=14)
ax.set_title("Uzunoglo (52): correlation with exp. fold rate by SStype.", fontsize=14)
ax.legend(title="SStype", fontsize=12, loc="upper left", bbox_to_anchor=(1.0, 1.0))
ax.grid()
  • Unscaled $p_e$ with ND variants 'x' and 'c':
    • peak$E$ correlation with exp. fold rate strengthens for A, but weakens for both B and AB.
    • unscaled 'x' ('ux') corr. for A is competitive with 'r'.
  • This is expected behavior, since scaling reduces $p_e$ for LE, and findings so far indicate that A will benefit but B will suffer from an increase in $p_e$ of LE.

  • However, all correlations weaken with unscaled 'a' ('ua'). This could be the lingering of effect of restricting node selection to a local neighborhood (crony).

    • Node selection is more "explorative" with 'c' (selection is by remaining node degree over all nodes) and 'x' (uniform node selection, like 'r').

Combine unscaled and scaled ND variants

In [40]:
# mixed unscaled and scaled x, with a
alpha_df = ux_df.query("SStype == 'a'")
alphabeta_df = a_df.query("SStype == 'ab'")
beta_df = x_df.query("SStype == 'b'")
uxax_df = alpha_df.append(beta_df).append(alphabeta_df)
print(len(alpha_df), len(alphabeta_df), len(beta_df), len(uxax_df))
uxax_rs, uxax_ps = nd.analyze_peakE_by_SStype(uxax_df, ndv='ux-a-x') 
print(nd.mjes_range(uxax_df.peakE.values))
15 20 17 52
Pearson correlation with exp. fold rate: ux-a-x
All: [-0.7666  0.    ]
A: [-0.5723  0.0258]
AB: [-0.6481  0.002 ]
B: [-0.7052  0.0016]

4.165713579469035
In [41]:
fig, ax1 = plt.subplots()
ax1.set_title("Uzunoglo(52) ux-a-x " + str(np.round(uxax_rs[0], 3)))
nd.plot_peakE_by_SStype(uxax_df, ax1, uxax_rs) 
ax1.set_xlabel("Combined peak Es from ND variants")
_ = ax1.legend(title="SStype")
  • Best mixed ND-soft model found so far is 'ux-a-x':
    • achieves overall correlation of -0.7666, and correlations by SStype are all significant.
    • uses unscaled 'x' for a type chains (-0.5723, 0.0258).
    • uses scaled 'a' for ab type chains (-0.6481, 0.002).
    • uses scaled 'x' for b type chains (-0.7052, 0.0016).

Kamagata dataset.

  • ND results are much weaker for this dataset, which is dominated by AB chains.
  • ND 'a' corr. is not much better than with chain length N, but 'a' gives insight into the folding process by suggesting an edge ordering, while N does not and cannot.
In [42]:
dirname = "Kamagata"
fn = os.path.join(dirname, "K_foldtype.txt")
df = pd.read_table(fn, sep='\t')
# remove outlier '1ael' for ND results
DF = df.query("pid != '1ael'") # reserve DF for this purpose!
DF.head()
Out[42]:
pid N k_f SStype
0 1pgb 56 2.78 ab
1 2cro 64 1.52 a
2 1cei 85 2.49 a
3 2abd 86 2.81 a
4 1hng 86 0.78 b
In [43]:
K_PIDs = DF.pid.values
print(len(K_PIDs), np.ptp(DF.k_f.values), DF.N.max())
K_SStypes = DF.SStype.values
20 4.55 175
In [44]:
df = nd.count_HSTs("Kamagata", K_PIDs, K_SStypes)
df.sample(5)
Out[44]:
H S T hsr SStype
1adw 23 45 55 0.511111 ab
1bni 24 25 59 0.960000 ab
2a5e 66 0 90 66.000000 a
1bta 37 16 36 2.312500 ab
2rn2 55 47 53 1.170213 ab
In [45]:
# nH > 3, nS < 10
da = df.query("SStype == 'a'")
print(len(da))
da
5
Out[45]:
H S T hsr SStype
2cro 40 0 24 40.0 a
1cei 49 2 34 24.5 a
2abd 52 0 34 52.0 a
1dwr 115 0 37 115.0 a
2a5e 66 0 90 66.0 a
In [46]:
# nH < 3, nS > 10
db = df.query("SStype == 'b'")
print(len(db))
db
3
Out[46]:
H S T hsr SStype
1hng 0 41 45 0.0 b
1tit 0 33 56 0.0 b
1ttg 0 43 51 0.0 b
In [47]:
# nH > 3, nS > 10
dab = df.query("SStype == 'ab'")
print(len(dab))
dab
12
Out[47]:
H S T hsr SStype
1pgb 14 24 18 0.583333 ab
1gxt 21 36 31 0.583333 ab
1bta 37 16 36 2.312500 ab
1bni 24 25 59 0.960000 ab
1adw 23 45 55 0.511111 ab
3chy 58 22 48 2.636364 ab
1joo 35 37 77 0.945946 ab
1i1b 8 72 71 0.111111 ab
2rn2 55 47 53 1.170213 ab
1ra9 40 50 69 0.800000 ab
1l63 107 15 40 7.133333 ab
1php 69 27 79 2.555556 ab
In [48]:
n_dr = nd.frate_corrs_with_N_by_SStype(DF, "Kamagata(20)")
n_dr
Pearson correlation with exp. fold rate: chain-length N 1
All: [-0.5623  0.0099]
A: [-0.5364  0.3513]
AB: [-0.5564  0.0603]
B: [0.996 0.057]

Pearson correlation with exp. fold rate: chain-length N 0.5
All: [-0.5719  0.0084]
A: [-0.4994  0.3917]
AB: [-0.5823  0.047 ]
B: [0.9965 0.0533]

Pearson correlation with exp. fold rate: chain-length N 0.6666666666666666
All: [-0.5691  0.0088]
A: [-0.5124  0.3774]
AB: [-0.5741  0.0509]
B: [0.9963 0.0545]

Out[48]:
n_rs n_ps n12_rs n12_ps n23_rs n23_ps
SStype
Z -0.562293 0.009862 -0.571902 0.008422 -0.569059 0.008829
A -0.536434 0.351332 -0.499379 0.391687 -0.512422 0.377363
AB -0.556397 0.060275 -0.582329 0.046954 -0.574065 0.050948
B 0.995988 0.057047 0.996501 0.053271 0.996334 0.054532

ND-hard (ndh)

In [49]:
dirn = os.path.join("K_NDH", "hard")
pid = "2abd"
fn = os.path.join(dirn, pid + "_bsd_hstats_h.out")
dh = pd.read_table(fn, sep=' ')
dh.head()
Out[49]:
bid fe fsc sc_mje num_sc num_scle num_nnsc num_nnscle
0 2 0.109661 0.000000 0.00 0 0 0 0
1 3 0.210183 0.476821 24.92 72 0 83 0
2 4 0.297650 0.768212 -46.39 116 0 107 0
3 5 0.379896 0.814570 81.04 123 0 161 0
4 6 0.447781 0.821192 206.50 124 0 208 0
In [50]:
maxEs, argmaxEs, maxE_sds = nd.get_ndh_maxEs(dirn, K_PIDs)
h_df = DF.assign(maxE=maxEs, maxE_sd=maxE_sds)
h_df.head()
Out[50]:
pid N k_f SStype maxE maxE_sd
0 1pgb 56 2.78 ab 91.88 11
1 2cro 64 1.52 a 49.89 13
2 1cei 85 2.49 a 154.90 10
3 2abd 86 2.81 a 359.52 12
4 1hng 86 0.78 b 161.08 17
In [51]:
print("maxE_sds:", np.mean(maxE_sds), np.std(maxE_sds, ddof=1), 
      np.min(maxE_sds), np.max(maxE_sds))
# correlation maxEs with exp. fold rate
print("maxE Pearson corr. with exp. fold rate:", pearsonr(maxEs, DF.k_f.values))
# outlier?
_ = plt.scatter(maxEs, DF.k_f.values)
maxE_sds: 20.65 9.166730462297942 10 38
maxE Pearson corr. with exp. fold rate: (-0.350480953363002, 0.12977119399473655)
In [52]:
h_rs, h_ps = nd.analyze_maxE_by_SStype(h_df)
maxE_sd mean stdev
All: [20.65    9.1667]
A: [17.6     9.7622]
AB: [22.9167  9.6244]
B: [16.6667  4.5092]

Pearson correlation with exp. fold rate: maxE
All: [-0.3505  0.1298]
A: [-0.3426  0.5725]
AB: [-0.3071  0.3316]
B: [0.1789 0.8855]

  • K maxE_sd is much larger than UZ.
  • All corrs. are insignificant.
In [53]:
fig, ax1 = plt.subplots()
ax1.set_title("Kamagata h " + str(np.round(h_rs[0] ,3)), fontsize=14)
nd.plot_maxE_by_SStype(h_df, ax1, h_rs)
_ = ax1.legend(title="SStype", fontsize=12, loc="upper left", bbox_to_anchor=(1.0, 1.0))

ND variants: craxy

In [54]:
dirn = os.path.join("K_ND3", "edv_a")
In [55]:
peakEs, peakQs = nd.get_ndv_peakEs(dirn, K_PIDs, "stats", ndv='a')
a_df = DF.assign(peakE = peakEs)
a_rs, a_ps = nd.analyze_peakE_by_SStype(a_df, 'a')

peakEs, peakQs = nd.get_ndv_peakEs(dirn, K_PIDs, "stats", ndv='c')
c_df = DF.assign(peakE = peakEs)
c_rs, c_ps = nd.analyze_peakE_by_SStype(c_df, 'c')

peakEs, peakQs = nd.get_ndv_peakEs(dirn, K_PIDs, "stats", ndv='x')
x_df = DF.assign(peakE = peakEs)
x_rs, x_ps = nd.analyze_peakE_by_SStype(x_df, 'x')

peakEs, peakQs = nd.get_ndv_peakEs(dirn, K_PIDs, "stats", ndv='r')
r_df = DF.assign(peakE = peakEs)
r_rs, r_ps = nd.analyze_peakE_by_SStype(r_df, 'r')

peakEs, peakQs = nd.get_ndv_peakEs(dirn, K_PIDs, "stats", ndv='y')
y_df = DF.assign(peakE = peakEs)
y_rs, y_ps = nd.analyze_peakE_by_SStype(y_df, 'y')
Pearson correlation with exp. fold rate: a
All: [-0.5618  0.0099]
A: [-0.0612  0.9221]
AB: [-0.6306  0.0279]
B: [-0.0617  0.9607]

Pearson correlation with exp. fold rate: c
All: [-0.5525  0.0115]
A: [-0.2831  0.6444]
AB: [-0.522   0.0817]
B: [-0.0771  0.9509]

Pearson correlation with exp. fold rate: x
All: [-0.5543  0.0112]
A: [-0.1763  0.7767]
AB: [-0.5111  0.0894]
B: [-0.0014  0.9991]

Pearson correlation with exp. fold rate: r
All: [-0.278   0.2354]
A: [-0.3437  0.5712]
AB: [-0.3263  0.3005]
B: [0.0228 0.9855]

Pearson correlation with exp. fold rate: y
All: [-0.0909  0.7031]
A: [-0.4993  0.3918]
AB: [-0.0982  0.7614]
B: [0.0273 0.9826]

In [56]:
dirn = os.path.join("K_ND3", "notscaled")

peakEs, peakQs = nd.get_ndv_peakEs(dirn, K_PIDs, "stats", ndv='a_notscaled')
ua_df = DF.assign(peakE = peakEs)
ua_rs, ua_ps = nd.analyze_peakE_by_SStype(ua_df, 'ua')

peakEs, peakQs = nd.get_ndv_peakEs(dirn, K_PIDs, "stats", ndv='c_notscaled')
uc_df = DF.assign(peakE = peakEs)
uc_rs, uc_ps = nd.analyze_peakE_by_SStype(uc_df, 'uc')

peakEs, peakQs = nd.get_ndv_peakEs(dirn, K_PIDs, "stats", ndv='x_notscaled')
ux_df = DF.assign(peakE = peakEs)
ux_rs, ux_ps = nd.analyze_peakE_by_SStype(ux_df, 'ux')
Pearson correlation with exp. fold rate: ua
All: [-0.3502  0.1301]
A: [0.0087 0.989 ]
AB: [-0.4751  0.1186]
B: [-0.0931  0.9407]

Pearson correlation with exp. fold rate: uc
All: [-0.4824  0.0312]
A: [-0.2302  0.7095]
AB: [-0.6005  0.039 ]
B: [-0.0823  0.9475]

Pearson correlation with exp. fold rate: ux
All: [-0.4338  0.056 ]
A: [-0.2483  0.6871]
AB: [-0.4988  0.0988]
B: [-0.0094  0.994 ]

Summary of results for analysis of K by SS type:

In [57]:
Kcorrs_dr = pd.DataFrame({'SStype': ['Z', 'A', 'AB', 'B'], 
                'y_rs': y_rs, 'y_ps': y_ps, 'r_rs': r_rs, 'r_ps': r_ps,
                'ua_rs': ua_rs, 'ua_ps': ua_ps, 'a_rs': a_rs, 'a_ps': a_ps, 
                'uc_rs': uc_rs, 'uc_ps': uc_ps, 'c_rs': c_rs, 'c_ps': c_ps, 
                'ux_rs': ux_rs, 'ux_ps': ux_ps, 'x_rs': x_rs, 'x_ps': x_ps, 
                'h_rs': h_rs, 'h_ps': h_ps, }).set_index('SStype')
Kcorrs_dr = Kcorrs_dr.join(n_dr)
In [58]:
Kcorrs_dr.to_csv("KCorrs_dr.txt", sep=' ')
In [59]:
CDF = Kcorrs_dr.filter(regex="_rs").T
CDF.round(4)
Out[59]:
SStype Z A AB B
y_rs -0.0909 -0.4993 -0.0982 0.0273
r_rs -0.2780 -0.3437 -0.3263 0.0228
ua_rs -0.3502 0.0087 -0.4751 -0.0931
a_rs -0.5618 -0.0612 -0.6306 -0.0617
uc_rs -0.4824 -0.2302 -0.6005 -0.0823
c_rs -0.5525 -0.2831 -0.5220 -0.0771
ux_rs -0.4338 -0.2483 -0.4988 -0.0094
x_rs -0.5543 -0.1763 -0.5111 -0.0014
h_rs -0.3505 -0.3426 -0.3071 0.1789
n_rs -0.5623 -0.5364 -0.5564 0.9960
n12_rs -0.5719 -0.4994 -0.5823 0.9965
n23_rs -0.5691 -0.5124 -0.5741 0.9963
In [60]:
CDF.Z.sort_values()
Out[60]:
n12_rs   -0.571902
n23_rs   -0.569059
n_rs     -0.562293
a_rs     -0.561783
x_rs     -0.554254
c_rs     -0.552456
uc_rs    -0.482393
ux_rs    -0.433768
h_rs     -0.350481
ua_rs    -0.350213
r_rs     -0.277979
y_rs     -0.090915
Name: Z, dtype: float64
In [61]:
CDF.A.sort_values()
Out[61]:
n_rs     -0.536434
n23_rs   -0.512422
n12_rs   -0.499379
y_rs     -0.499269
r_rs     -0.343654
h_rs     -0.342557
c_rs     -0.283099
ux_rs    -0.248295
uc_rs    -0.230229
x_rs     -0.176308
a_rs     -0.061210
ua_rs     0.008665
Name: A, dtype: float64
In [62]:
CDF.AB.sort_values()
Out[62]:
a_rs     -0.630587
uc_rs    -0.600451
n12_rs   -0.582329
n23_rs   -0.574065
n_rs     -0.556397
c_rs     -0.521978
x_rs     -0.511124
ux_rs    -0.498820
ua_rs    -0.475065
r_rs     -0.326349
h_rs     -0.307109
y_rs     -0.098194
Name: AB, dtype: float64
In [63]:
CDF.B.sort_values()
Out[63]:
ua_rs    -0.093051
uc_rs    -0.082309
c_rs     -0.077122
a_rs     -0.061738
ux_rs    -0.009396
x_rs     -0.001361
r_rs      0.022767
y_rs      0.027334
h_rs      0.178924
n_rs      0.995988
n23_rs    0.996334
n12_rs    0.996501
Name: B, dtype: float64
In [64]:
NDVs = ['y', 'r', 'ua', 'a', 'uc', 'c', 'ux', 'x', 'h', 'N', 'N1/2', 'N2/3']
# smaller (more negative) is better
fig, ax = plt.subplots()
ax.scatter(NDVs, CDF.A.values, marker='^', edgecolor='green', facecolor='none', label='A')
ax.scatter(NDVs, CDF.AB.values, edgecolor='orange', lw=2, s=60, facecolor='none', label='AB')
ax.scatter(NDVs, CDF.B.values, marker='s', edgecolor='magenta', facecolor='none', label='B')
ax.set_xticks(NDVs); ax.set_xticklabels(NDVs, fontsize=14)
ax.set_xlabel("LE <--- priority ---> SE", fontsize=14)
ax.set_title("Kamagata (20): correlation with exp. fold rate by SS type.", fontsize=14)
ax.legend(title="SStype", fontsize=12, loc="upper left", bbox_to_anchor=(1.0, 1.0))
ax.grid()
In [65]:
dp = Kcorrs_dr.filter(regex="_ps").T
db = dp.apply(lambda x: x < 0.05)
db
Out[65]:
SStype Z A AB B
y_ps False False False False
r_ps False False False False
ua_ps False False False False
a_ps True False True False
uc_ps True False True False
c_ps True False False False
ux_ps False False False False
x_ps True False False False
h_ps False False False False
n_ps True False False False
n12_ps True False True False
n23_ps True False False False
  • As observed previously with UZ, 'y' and 'r' perform better (but insig.) for A chains, and 'a' produces strongest (and sig.) corr. for AB. However, no ND variant produces a sig. corr. for B.
  • Except for ND variants 'a' an 'uc' on AB chains, the corrs. by chain SStype are insignificant (p-value < 0.05).
  • Data problem: A and B classes have fewer than 10 chains each.
In [66]:
# combine y-a-ua
alpha_df = y_df.query("SStype == 'a'")
alphabeta_df = a_df.query("SStype == 'ab'")
beta_df = ua_df.query("SStype == 'b'")
yaua_df = alpha_df.append(beta_df).append(alphabeta_df)
print(len(alpha_df), len(alphabeta_df), len(beta_df), len(yaua_df))
yaua_rs, yaua_ps = nd.analyze_peakE_by_SStype(yaua_df, ndv='y-a-ua') 
print(nd.mjes_range(yaua_df.peakE.values))
5 12 3 20
Pearson correlation with exp. fold rate: y-a-ua
All: [0.1176 0.6215]
A: [-0.4993  0.3918]
AB: [-0.6306  0.0279]
B: [-0.0931  0.9407]

3.5322812572893088
In [67]:
# combine ux-a-ua
alpha_df = ux_df.query("SStype == 'a'")
alphabeta_df = a_df.query("SStype == 'ab'")
beta_df = ua_df.query("SStype == 'b'")
uxaua_df = alpha_df.append(beta_df).append(alphabeta_df)
print(len(alpha_df), len(alphabeta_df), len(beta_df), len(uxaua_df))
uxaua_rs, uxaua_ps = nd.analyze_peakE_by_SStype(uxaua_df, ndv='ux-a-ua') 
print(nd.mjes_range(uxaua_df.peakE.values))
5 12 3 20
Pearson correlation with exp. fold rate: ux-a-ua
All: [-0.5784  0.0076]
A: [-0.2483  0.6871]
AB: [-0.6306  0.0279]
B: [-0.0931  0.9407]

3.0378335189840704
In [68]:
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12, 4), constrained_layout=True)
plt.suptitle("Kamagata(20): combined peak Es", fontsize=14)

ax1.set_title("y-a-ua " + str(np.round(yaua_rs[0], 3)))
nd.plot_peakE_by_SStype(yaua_df, ax1, yaua_rs) 
ax1.legend(title="SStype")

ax2.set_title("ux-a-ua " + str(np.round(uxaua_rs[0], 3)))
nd.plot_peakE_by_SStype(uxaua_df, ax2, uxaua_rs) 
ax2.legend(title="SStype")

ax3.set_title("a " + str(np.round(a_rs[0], 3)))
nd.plot_peakE_by_SStype(a_df, ax3, a_rs) 
_ = ax3.legend(title="SStype")

End of notebook

SKLC January 2021