Examine ND results by chain SStype¶

Protein chains are partitioned into three classes by secondary structure content.
Correlations with exp. fold rate vary by SStype and ND variant.
Fold rate corrs should at least outperform N for the task of fold rate prediction, and outperform h for the purpose of estimating a resonable edge ordering.

Slightly improved correlation for UZ when peak Es from ND variants are mixed by SStype and $p_e$ is adjusted by SStype.

K correlation largely a factor of AB chains, which dominates the dataset.

Results point to a need for a more subtle contact formation strategy in ND: context-sensitive at level of nodes or edges.

import os
import numpy as np
import pandas as pd
from scipy.stats import pearsonr
from sklearn import linear_model
import matplotlib.pyplot as plt
import NDutils as nd # my lib

NDVs = ['y', 'r', 'a', 'c', 'x', 'h']

ND variants are arranged in LE/SE priority order in the sense that models towards the right respect the tenent of forming SE over LE more strictly, while models towards the left give more chances for LE to form before SE, with 'y' adopting this inversion of priority explicitly in its $p_e$.
'h' is distinct from the other ND variants in that edge ordering is strictly determined by sequence distance (increasing monotonically), while the other ND variants are stochastic processes.

Uzunoglu¶

dirname = "Uzunoglu"
fn = os.path.join(dirname, "UZ_foldtype.txt")
DF = pd.read_table(fn, sep=' ')
DF.head() # reserve DF for this purpose!

UZ_PIDs = DF.pid.values
UZ_frates = DF.k_f.values
print(len(UZ_PIDs), np.ptp(UZ_frates))
UZ_SStypes = DF.SStype.values

52 5.96

Defining SStype for chains¶

The three chain fold types or chain SStypes are:
- A for chains with predominantly $\alpha$ residues
  - number of $\alpha$ residues (nH) > 3, number of $\beta$ residues (nS) < 11, and nH/nS $\approx$ nH or >> 1;
- B for chains with predominantly $\beta$ residues
  - nS > 10, nH < 4, and nH/nS $\approx$ 0.0;
- AB otherwise.

df = nd.count_HSTs("Uzunoglu", UZ_PIDs, UZ_SStypes)
print(df.dtypes)
df.sample(5)

H           int64
S           int64
T           int64
hsr       float64
SStype     object
dtype: object

da = df.query("SStype == 'a'")
print(len(da))
da

15

# reclassified 1o6x as 'ab', nH > 3
db = df.query("SStype == 'b'")
print(len(db))
db

17

# reclassified 1g6p, 1m9s 'b'
dab = df.query("SStype == 'ab'")
print(len(dab))
dab

20

Initial exploration of UZ DF¶

There should be a sensible relationship between exp. fold rate with:
- chain length N: longer chains (larger N) generally take longer to fold (smaller fold rate value). Scaled N is also examined: 1/2, 2/3.
- chain SStype: A chains generally fold faster (larger fold rate value) than B chains.

n_dr = nd.frate_corrs_with_N_by_SStype(DF, "Uzunoglu (52)")
n_dr

Pearson correlation with exp. fold rate: chain-length N 1
All: [-0.4355  0.0013]
A: [-0.6721  0.0061]
AB: [-0.185   0.4349]
B: [-7.425e-01  6.000e-04]

Pearson correlation with exp. fold rate: chain-length N 0.5
All: [-0.4572  0.0007]
A: [-0.6842  0.0049]
AB: [-0.1988  0.4009]
B: [-7.502e-01  5.000e-04]

Pearson correlation with exp. fold rate: chain-length N 0.6666666666666666
All: [-0.4504  0.0008]
A: [-0.6807  0.0052]
AB: [-0.1944  0.4115]
B: [-7.479e-01  6.000e-04]

ND-hard (ndh)¶

PRN0 edges are batch recreated in monotonically increasing sequence distance order.

dirn = os.path.join("UZ_NDH", "hard")

# example of a datafile
pid = "1aps"
fn = os.path.join(dirn, pid + "_bsd_hstats_h.out")
dh = pd.read_table(fn, sep=' ')
dh.head()

maxEs, argmaxEs, maxE_sds = nd.get_ndh_maxEs(dirn, UZ_PIDs)
h_df = DF.assign(maxE=maxEs, maxE_sd=maxE_sds)
h_df.head()

print("maxE_sds:", np.mean(maxE_sds), np.std(maxE_sds, ddof=1), 
      np.min(maxE_sds), np.max(maxE_sds))
# correlation maxEs with exp. fold rate
print("maxE Pearson corr. with exp. fold rate:", pearsonr(maxEs, UZ_frates))
# check for outliers
_ = plt.scatter(maxEs, UZ_frates)

maxE_sds: 12.73076923076923 5.299335454510013 4 31
maxE Pearson corr. with exp. fold rate: (-0.4445872304072675, 0.0009598432602786288)

h_df.query("maxE > 700") # outlier

h_rs, h_ps = nd.analyze_maxE_by_SStype(h_df)

maxE_sd mean stdev
All: [12.7308  5.2993]
A: [10.6667  2.3503]
AB: [16.1    5.821]
B: [10.5882  4.6241]

Pearson correlation with exp. fold rate: maxE
All: [-0.4446  0.001 ]
A: [-0.5552  0.0317]
AB: [-0.326   0.1607]
B: [-0.555   0.0207]

Compared with A and B chains, AB chains have larger maxE_sd on average.
- maxE_sd marks the point where ND energy profile changes direction (increase to decrease).
- A hint that perhaps LE need to appear earlier.
- Corr. for AB is not significant (p > 0.05) because of the outlier.

fig, ax1 = plt.subplots()
ax = nd.plot_maxE_by_SStype(h_df, ax1, h_rs)
ax.set_title("Uzunoglo h " + str(np.round(h_rs[0], 3)), fontsize=14)
_ = ax.legend(title="SStype", fontsize=12)

# excluding outlier
h1_df = h_df.query("pid != '2vik'") # ab
len(h1_df)

51

h1_rs, h1_ps = nd.analyze_maxE_by_SStype(h1_df)

maxE_sd mean stdev
All: [12.5686  5.2202]
A: [10.6667  2.3503]
AB: [15.8421  5.862 ]
B: [10.5882  4.6241]

Pearson correlation with exp. fold rate: maxE
All: [-0.5417  0.    ]
A: [-0.5552  0.0317]
AB: [-0.6196  0.0047]
B: [-0.555   0.0207]

ND variants (ndv): craxy¶

dirn = os.path.join("UZ_ND3", "edv_a")

peakEs, peakQs = nd.get_ndv_peakEs(dirn, UZ_PIDs, "stats", ndv='a')
a_df = DF.assign(peakE = peakEs)
a_rs, a_ps = nd.analyze_peakE_by_SStype(a_df, 'a')

peakEs, peakQs = nd.get_ndv_peakEs(dirn, UZ_PIDs, "stats", ndv='c')
c_df = DF.assign(peakE = peakEs)
c_rs, c_ps = nd.analyze_peakE_by_SStype(c_df, 'c')

peakEs, peakQs = nd.get_ndv_peakEs(dirn, UZ_PIDs, "stats", ndv='x')
x_df = DF.assign(peakE = peakEs)
x_rs, x_ps = nd.analyze_peakE_by_SStype(x_df, 'x')

peakEs, peakQs = nd.get_ndv_peakEs(dirn, UZ_PIDs, "stats", ndv='r')
r_df = DF.assign(peakE = peakEs)
r_rs, r_ps = nd.analyze_peakE_by_SStype(r_df, 'r')

peakEs, peakQs = nd.get_ndv_peakEs(dirn, UZ_PIDs, "stats", ndv='y')
y_df = DF.assign(peakE = peakEs)
y_rs, y_ps = nd.analyze_peakE_by_SStype(y_df, 'y')

Pearson correlation with exp. fold rate: a
All: [-0.7267  0.    ]
A: [-0.4121  0.1269]
AB: [-0.6481  0.002 ]
B: [-0.5744  0.0159]

Pearson correlation with exp. fold rate: c
All: [-0.7106  0.    ]
A: [-0.437   0.1033]
AB: [-0.5412  0.0137]
B: [-0.6408  0.0056]

Pearson correlation with exp. fold rate: x
All: [-0.6995  0.    ]
A: [-0.47    0.0771]
AB: [-0.4938  0.0269]
B: [-0.7052  0.0016]

Pearson correlation with exp. fold rate: r
All: [-0.2145  0.1268]
A: [-0.559   0.0303]
AB: [-0.1903  0.4217]
B: [-0.3294  0.1967]

Pearson correlation with exp. fold rate: y
All: [0.0664 0.6402]
A: [-0.6905  0.0044]
AB: [-0.1091  0.6471]
B: [-0.3274  0.1995]

Summary of results for analysis of UZ by SStype¶

# df = eval('a' + "_df")
# np.all(df.peakE.values == a_df.peakE.values) # True

fig, axes = plt.subplots(2, 3, figsize=(12, 7), constrained_layout=True)
plt.suptitle("Uzunoglo (52)", fontsize=14)
for i, ndv in enumerate(NDVs[:-1]):
    r = i//3; c = i%3; ax1 = axes[r][c]
    df = eval(ndv + "_df"); rs = eval(ndv + "_rs")
    nd.plot_peakE_by_SStype(df, ax1, rs)
    ax1.set_title(ndv + ' ' + str(np.round(rs[0], 3)) , fontsize=14)
    ax1.legend(title="SStype")

ax1 = axes[1][2]
nd.plot_maxE_by_SStype(h_df, ax1, h_rs)
ax1.set_title('h ' + str(np.round(h_rs[0], 3)), fontsize=14)
_ = ax1.legend(title="SStype")

# Z == All
dr = pd.DataFrame({'SStype': ['Z', 'A', 'AB', 'B'], 
                'y_rs': y_rs, 'y_ps': y_ps, 'r_rs': r_rs, 'r_ps': r_ps,
                'a_rs': a_rs, 'a_ps': a_ps, 'c_rs': c_rs, 'c_ps': c_ps, 
                'x_rs': x_rs, 'x_ps': x_ps, 'h_rs': h_rs, 'h_ps': h_ps,}).set_index('SStype')
UZcorrs_dr = dr.join(n_dr)

# save to file
UZcorrs_dr.to_csv("UZcorrs_dr.txt", sep=' ')

# desire corrs with ND variants to be stronger than with 'h' or 'n',
# since generating peak Es involve more work
df = UZcorrs_dr.filter(regex='_rs').T
df.round(4)

# smaller (more negative) is better
Xs = NDVs[:]; Xs.extend(['N', 'N1/2', 'N2/3'])
fig, ax = plt.subplots()
ax.scatter(Xs, df.A.values, marker='^', edgecolor='green', facecolor='none', label='A')
ax.scatter(Xs, df.AB.values, edgecolor='orange', lw=2, facecolor='none', label='AB')
ax.scatter(Xs, df.B.values, marker='s', edgecolor='magenta', facecolor='none', label='B')
ax.set_xticks(Xs); ax.set_xticklabels(Xs, fontsize=14)
ax.set_xlabel("LE <--- priority ---> SE", fontsize=14)
ax.set_title("Uzunoglo (52): correlation with exp. fold rate by SS type.", fontsize=14)
ax.legend(title="SStype", fontsize=12, loc="upper left", bbox_to_anchor=(1.0, 1.0))
ax.grid()

If the task at hand is simply predicting fold-rate, N is competitive with ND, and simpler than ND, for structurally pure SStype chains, A and B.
ND variants are also better at modelling structurally pure chains:
- For A, ND variants that "prioritize" LE over SE ('y' and 'r') do better.
- For B, ND variants that "prioritize" SE over LE ('x' and 'c') do better.
The problematic class is structurally mixed SStype chains (AB), which seem to thrive in the "Goldilocks region" created by ND variant 'a' which strikes a good balance between LE and SE.
Goal is to design a ND variant that can handle all kinds of chains well.
- An idea that presents itself is to combine the best of these ND variants.

Problem of trying to mix peak $E$s from different ND variants¶

ND energy ranges differ.

alpha_df = y_df.query("SStype == 'a'")
alphabeta_df = a_df.query("SStype == 'ab'")
beta_df = x_df.query("SStype == 'b'")
yax_df = alpha_df.append(beta_df).append(alphabeta_df)
print(len(alpha_df), len(alphabeta_df), len(beta_df), len(yax_df))
yax_rs, yax_ps = nd.analyze_peakE_by_SStype(yax_df, ndv='y-a-x')

15 20 17 52
Pearson correlation with exp. fold rate: y-a-x
All: [0.1927 0.1711]
A: [-0.6905  0.0044]
AB: [-0.6481  0.002 ]
B: [-0.7052  0.0016]

alpha_df = r_df.query("SStype == 'a'")
alphabeta_df = a_df.query("SStype == 'ab'")
beta_df = x_df.query("SStype == 'b'")
rax_df = alpha_df.append(beta_df).append(alphabeta_df)
print(len(alpha_df), len(alphabeta_df), len(beta_df), len(rax_df))
rax_rs, rax_ps = nd.analyze_peakE_by_SStype(rax_df, ndv='r-a-x')

15 20 17 52
Pearson correlation with exp. fold rate: r-a-x
All: [-0.3896  0.0043]
A: [-0.559   0.0303]
AB: [-0.6481  0.002 ]
B: [-0.7052  0.0016]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4), constrained_layout=True)
plt.suptitle("Uzunoglo (52): combined peak Es.", fontsize=14)
ax1.set_title("y-a-x " + str(np.round(yax_rs[0], 3)), fontsize=14)
nd.plot_peakE_by_SStype(yax_df, ax1, yax_rs)
ax1.legend(title="SStype")
ax2.set_title("r-a-x " + str(np.round(rax_rs[0], 3)), fontsize=14)
nd.plot_peakE_by_SStype(rax_df, ax2, rax_rs)
_ = ax2.legend(title="SStype")

A simple first solution with unscaled $p_e$¶

The effect of scaling on $p_e$ for longer range contacts was reported previously, and now put to use.

dirn = os.path.join("UZ_ND3", "notscaled")

peakEs, peakQs = nd.get_ndv_peakEs(dirn, UZ_PIDs, "stats", ndv='x_notscaled')
ux_df = DF.assign(peakE = peakEs)
ux_rs, ux_ps = nd.analyze_peakE_by_SStype(ux_df, ndv='ux')

peakEs, peakQs = nd.get_ndv_peakEs(dirn, UZ_PIDs, "stats", ndv='c_notscaled')
uc_df = DF.assign(peakE = peakEs)
uc_rs, uc_ps = nd.analyze_peakE_by_SStype(uc_df, ndv='uc')

peakEs, peakQs = nd.get_ndv_peakEs(dirn, UZ_PIDs, "stats", ndv='a_notscaled')
ua_df = DF.assign(peakE = peakEs)
ua_rs, ua_ps = nd.analyze_peakE_by_SStype(ua_df, ndv='ua')

Pearson correlation with exp. fold rate: ux
All: [-0.5418  0.    ]
A: [-0.5723  0.0258]
AB: [-0.3542  0.1254]
B: [-0.4549  0.0666]

Pearson correlation with exp. fold rate: uc
All: [-0.5533  0.    ]
A: [-0.5264  0.0438]
AB: [-0.4392  0.0527]
B: [-0.4379  0.0788]

Pearson correlation with exp. fold rate: ua
All: [-0.4299  0.0015]
A: [-0.2204  0.4299]
AB: [-0.3907  0.0886]
B: [-0.2014  0.4382]

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12, 4), constrained_layout=True)
plt.suptitle("Uzunoglo: unscaled ND variants.", fontsize=14)

ax1.set_title("ux " + str(np.round(ux_rs[0], 3)), fontsize=14)
nd.plot_peakE_by_SStype(ux_df, ax1, ux_rs)
ax1.legend(title="SStype")

ax2.set_title("uc " + str(np.round(uc_rs[0], 3)), fontsize=14)
nd.plot_peakE_by_SStype(uc_df, ax2, uc_rs)
ax2.legend(title="SStype")

ax3.set_title("ua " + str(np.round(ua_rs[0], 3)), fontsize=14)
nd.plot_peakE_by_SStype(ua_df, ax3, ua_rs)
_ = ax3.legend(title="SStype")

unscaled_dr = pd.DataFrame({'SStype': ['Z', 'A', 'AB', 'B'],
                            'ua_rs': ua_rs, 'ua_ps': ua_ps, 'uc_rs': uc_rs, 'uc_ps': uc_ps,
                            'ux_rs': ux_rs, 'ux_ps': ux_ps}).set_index('SStype')

UZcorrs_dr = UZcorrs_dr.join(unscaled_dr)
UZcorrs_dr.to_csv("UZcorrs_dr.txt", sep=' ')

CDF = UZcorrs_dr.filter(regex='_rs').T
CDF.round(4)

CDF.A.sort_values()

y_rs     -0.690474
n12_rs   -0.684221
n23_rs   -0.680697
n_rs     -0.672115
ux_rs    -0.572340
r_rs     -0.559032
h_rs     -0.555163
uc_rs    -0.526444
x_rs     -0.470023
c_rs     -0.437038
a_rs     -0.412113
ua_rs    -0.220431
Name: A, dtype: float64

CDF.AB.sort_values()

a_rs     -0.648068
c_rs     -0.541155
x_rs     -0.493830
uc_rs    -0.439166
ua_rs    -0.390667
ux_rs    -0.354226
h_rs     -0.325991
n12_rs   -0.198762
n23_rs   -0.194397
r_rs     -0.190254
n_rs     -0.185007
y_rs     -0.109097
Name: AB, dtype: float64

CDF.B.sort_values()

n12_rs   -0.750195
n23_rs   -0.747886
n_rs     -0.742475
x_rs     -0.705247
c_rs     -0.640761
a_rs     -0.574378
h_rs     -0.555005
ux_rs    -0.454861
uc_rs    -0.437884
r_rs     -0.329368
y_rs     -0.327428
ua_rs    -0.201439
Name: B, dtype: float64

# plot to compare these ND variants
Xs = ['y', 'r', 'ua', 'a', 'uc', 'c', 'ux', 'x', 'h', 'n12']
hdf = CDF.copy()
hdf = hdf.reindex([l+"_rs" for l in Xs])
hdf

# smaller (more negative) is better
NDVs = ['y', 'r', 'ua', 'a', 'uc', 'c', 'ux', 'x', 'h', 'N1/2']
fig, ax = plt.subplots()
ax.scatter(NDVs, hdf.A.values, marker='^',s=60, edgecolor='green',facecolor='none', label='A')
ax.scatter(NDVs, hdf.AB.values, edgecolor='orange', lw=2, s=60, facecolor='none', label='AB')
ax.scatter(NDVs, hdf.B.values, marker='s', edgecolor='magenta', facecolor='none', label='B')
ax.axhline(y=-0.5723, color="tab:green", lw=1, ls='--')
ax.annotate("", xy=('ux', -0.5723), xycoords='data', xytext=('x', -0.47), 
        arrowprops=dict(arrowstyle='-|>', connectionstyle="arc3", color='tab:green'))
ax.annotate("", xy=('uc', -0.5264), xycoords='data', xytext=('c', -0.437), 
        arrowprops=dict(arrowstyle='-|>', connectionstyle="arc3", color='tab:green'))
ax.annotate("", xy=('ua', -0.2204), xycoords='data', xytext=('a', -0.4121), 
        arrowprops=dict(arrowstyle='-|>', connectionstyle="arc3", color='tab:green'))
ax.set_xticks(NDVs); ax.set_xticklabels(NDVs, fontsize=14)
ax.set_title("Uzunoglo (52): correlation with exp. fold rate by SStype.", fontsize=14)
ax.legend(title="SStype", fontsize=12, loc="upper left", bbox_to_anchor=(1.0, 1.0))
ax.grid()

Unscaled $p_e$ with ND variants 'x' and 'c':
- peak$E$ correlation with exp. fold rate strengthens for A, but weakens for both B and AB.
- unscaled 'x' ('ux') corr. for A is competitive with 'r'.
This is expected behavior, since scaling reduces $p_e$ for LE, and findings so far indicate that A will benefit but B will suffer from an increase in $p_e$ of LE.
However, all correlations weaken with unscaled 'a' ('ua'). This could be the lingering of effect of restricting node selection to a local neighborhood (crony).
- Node selection is more "explorative" with 'c' (selection is by remaining node degree over all nodes) and 'x' (uniform node selection, like 'r').

Combine unscaled and scaled ND variants¶

# mixed unscaled and scaled x, with a
alpha_df = ux_df.query("SStype == 'a'")
alphabeta_df = a_df.query("SStype == 'ab'")
beta_df = x_df.query("SStype == 'b'")
uxax_df = alpha_df.append(beta_df).append(alphabeta_df)
print(len(alpha_df), len(alphabeta_df), len(beta_df), len(uxax_df))
uxax_rs, uxax_ps = nd.analyze_peakE_by_SStype(uxax_df, ndv='ux-a-x') 
print(nd.mjes_range(uxax_df.peakE.values))

15 20 17 52
Pearson correlation with exp. fold rate: ux-a-x
All: [-0.7666  0.    ]
A: [-0.5723  0.0258]
AB: [-0.6481  0.002 ]
B: [-0.7052  0.0016]

4.165713579469035

fig, ax1 = plt.subplots()
ax1.set_title("Uzunoglo(52) ux-a-x " + str(np.round(uxax_rs[0], 3)))
nd.plot_peakE_by_SStype(uxax_df, ax1, uxax_rs) 
ax1.set_xlabel("Combined peak Es from ND variants")
_ = ax1.legend(title="SStype")

Best mixed ND-soft model found so far is 'ux-a-x':
- achieves overall correlation of -0.7666, and correlations by SStype are all significant.
- uses unscaled 'x' for a type chains (-0.5723, 0.0258).
- uses scaled 'a' for ab type chains (-0.6481, 0.002).
- uses scaled 'x' for b type chains (-0.7052, 0.0016).

Kamagata dataset.¶

ND results are much weaker for this dataset, which is dominated by AB chains.
ND 'a' corr. is not much better than with chain length N, but 'a' gives insight into the folding process by suggesting an edge ordering, while N does not and cannot.

dirname = "Kamagata"
fn = os.path.join(dirname, "K_foldtype.txt")
df = pd.read_table(fn, sep='\t')
# remove outlier '1ael' for ND results
DF = df.query("pid != '1ael'") # reserve DF for this purpose!
DF.head()

K_PIDs = DF.pid.values
print(len(K_PIDs), np.ptp(DF.k_f.values), DF.N.max())
K_SStypes = DF.SStype.values

20 4.55 175

df = nd.count_HSTs("Kamagata", K_PIDs, K_SStypes)
df.sample(5)

# nH > 3, nS < 10
da = df.query("SStype == 'a'")
print(len(da))
da

5

# nH < 3, nS > 10
db = df.query("SStype == 'b'")
print(len(db))
db

3

# nH > 3, nS > 10
dab = df.query("SStype == 'ab'")
print(len(dab))
dab

12

n_dr = nd.frate_corrs_with_N_by_SStype(DF, "Kamagata(20)")
n_dr

Pearson correlation with exp. fold rate: chain-length N 1
All: [-0.5623  0.0099]
A: [-0.5364  0.3513]
AB: [-0.5564  0.0603]
B: [0.996 0.057]

Pearson correlation with exp. fold rate: chain-length N 0.5
All: [-0.5719  0.0084]
A: [-0.4994  0.3917]
AB: [-0.5823  0.047 ]
B: [0.9965 0.0533]

Pearson correlation with exp. fold rate: chain-length N 0.6666666666666666
All: [-0.5691  0.0088]
A: [-0.5124  0.3774]
AB: [-0.5741  0.0509]
B: [0.9963 0.0545]

ND-hard (ndh)¶

dirn = os.path.join("K_NDH", "hard")
pid = "2abd"
fn = os.path.join(dirn, pid + "_bsd_hstats_h.out")
dh = pd.read_table(fn, sep=' ')
dh.head()

maxEs, argmaxEs, maxE_sds = nd.get_ndh_maxEs(dirn, K_PIDs)
h_df = DF.assign(maxE=maxEs, maxE_sd=maxE_sds)
h_df.head()

print("maxE_sds:", np.mean(maxE_sds), np.std(maxE_sds, ddof=1), 
      np.min(maxE_sds), np.max(maxE_sds))
# correlation maxEs with exp. fold rate
print("maxE Pearson corr. with exp. fold rate:", pearsonr(maxEs, DF.k_f.values))
# outlier?
_ = plt.scatter(maxEs, DF.k_f.values)

maxE_sds: 20.65 9.166730462297942 10 38
maxE Pearson corr. with exp. fold rate: (-0.350480953363002, 0.12977119399473655)

h_rs, h_ps = nd.analyze_maxE_by_SStype(h_df)

maxE_sd mean stdev
All: [20.65    9.1667]
A: [17.6     9.7622]
AB: [22.9167  9.6244]
B: [16.6667  4.5092]

Pearson correlation with exp. fold rate: maxE
All: [-0.3505  0.1298]
A: [-0.3426  0.5725]
AB: [-0.3071  0.3316]
B: [0.1789 0.8855]

K maxE_sd is much larger than UZ.
All corrs. are insignificant.

fig, ax1 = plt.subplots()
ax1.set_title("Kamagata h " + str(np.round(h_rs[0] ,3)), fontsize=14)
nd.plot_maxE_by_SStype(h_df, ax1, h_rs)
_ = ax1.legend(title="SStype", fontsize=12, loc="upper left", bbox_to_anchor=(1.0, 1.0))

ND variants: craxy¶

dirn = os.path.join("K_ND3", "edv_a")

peakEs, peakQs = nd.get_ndv_peakEs(dirn, K_PIDs, "stats", ndv='a')
a_df = DF.assign(peakE = peakEs)
a_rs, a_ps = nd.analyze_peakE_by_SStype(a_df, 'a')

peakEs, peakQs = nd.get_ndv_peakEs(dirn, K_PIDs, "stats", ndv='c')
c_df = DF.assign(peakE = peakEs)
c_rs, c_ps = nd.analyze_peakE_by_SStype(c_df, 'c')

peakEs, peakQs = nd.get_ndv_peakEs(dirn, K_PIDs, "stats", ndv='x')
x_df = DF.assign(peakE = peakEs)
x_rs, x_ps = nd.analyze_peakE_by_SStype(x_df, 'x')

peakEs, peakQs = nd.get_ndv_peakEs(dirn, K_PIDs, "stats", ndv='r')
r_df = DF.assign(peakE = peakEs)
r_rs, r_ps = nd.analyze_peakE_by_SStype(r_df, 'r')

peakEs, peakQs = nd.get_ndv_peakEs(dirn, K_PIDs, "stats", ndv='y')
y_df = DF.assign(peakE = peakEs)
y_rs, y_ps = nd.analyze_peakE_by_SStype(y_df, 'y')

Pearson correlation with exp. fold rate: a
All: [-0.5618  0.0099]
A: [-0.0612  0.9221]
AB: [-0.6306  0.0279]
B: [-0.0617  0.9607]

Pearson correlation with exp. fold rate: c
All: [-0.5525  0.0115]
A: [-0.2831  0.6444]
AB: [-0.522   0.0817]
B: [-0.0771  0.9509]

Pearson correlation with exp. fold rate: x
All: [-0.5543  0.0112]
A: [-0.1763  0.7767]
AB: [-0.5111  0.0894]
B: [-0.0014  0.9991]

Pearson correlation with exp. fold rate: r
All: [-0.278   0.2354]
A: [-0.3437  0.5712]
AB: [-0.3263  0.3005]
B: [0.0228 0.9855]

Pearson correlation with exp. fold rate: y
All: [-0.0909  0.7031]
A: [-0.4993  0.3918]
AB: [-0.0982  0.7614]
B: [0.0273 0.9826]

dirn = os.path.join("K_ND3", "notscaled")

peakEs, peakQs = nd.get_ndv_peakEs(dirn, K_PIDs, "stats", ndv='a_notscaled')
ua_df = DF.assign(peakE = peakEs)
ua_rs, ua_ps = nd.analyze_peakE_by_SStype(ua_df, 'ua')

peakEs, peakQs = nd.get_ndv_peakEs(dirn, K_PIDs, "stats", ndv='c_notscaled')
uc_df = DF.assign(peakE = peakEs)
uc_rs, uc_ps = nd.analyze_peakE_by_SStype(uc_df, 'uc')

peakEs, peakQs = nd.get_ndv_peakEs(dirn, K_PIDs, "stats", ndv='x_notscaled')
ux_df = DF.assign(peakE = peakEs)
ux_rs, ux_ps = nd.analyze_peakE_by_SStype(ux_df, 'ux')

Pearson correlation with exp. fold rate: ua
All: [-0.3502  0.1301]
A: [0.0087 0.989 ]
AB: [-0.4751  0.1186]
B: [-0.0931  0.9407]

Pearson correlation with exp. fold rate: uc
All: [-0.4824  0.0312]
A: [-0.2302  0.7095]
AB: [-0.6005  0.039 ]
B: [-0.0823  0.9475]

Pearson correlation with exp. fold rate: ux
All: [-0.4338  0.056 ]
A: [-0.2483  0.6871]
AB: [-0.4988  0.0988]
B: [-0.0094  0.994 ]

Summary of results for analysis of K by SS type:¶

Kcorrs_dr = pd.DataFrame({'SStype': ['Z', 'A', 'AB', 'B'], 
                'y_rs': y_rs, 'y_ps': y_ps, 'r_rs': r_rs, 'r_ps': r_ps,
                'ua_rs': ua_rs, 'ua_ps': ua_ps, 'a_rs': a_rs, 'a_ps': a_ps, 
                'uc_rs': uc_rs, 'uc_ps': uc_ps, 'c_rs': c_rs, 'c_ps': c_ps, 
                'ux_rs': ux_rs, 'ux_ps': ux_ps, 'x_rs': x_rs, 'x_ps': x_ps, 
                'h_rs': h_rs, 'h_ps': h_ps, }).set_index('SStype')
Kcorrs_dr = Kcorrs_dr.join(n_dr)

Kcorrs_dr.to_csv("KCorrs_dr.txt", sep=' ')

CDF = Kcorrs_dr.filter(regex="_rs").T
CDF.round(4)

CDF.Z.sort_values()

n12_rs   -0.571902
n23_rs   -0.569059
n_rs     -0.562293
a_rs     -0.561783
x_rs     -0.554254
c_rs     -0.552456
uc_rs    -0.482393
ux_rs    -0.433768
h_rs     -0.350481
ua_rs    -0.350213
r_rs     -0.277979
y_rs     -0.090915
Name: Z, dtype: float64

CDF.A.sort_values()

n_rs     -0.536434
n23_rs   -0.512422
n12_rs   -0.499379
y_rs     -0.499269
r_rs     -0.343654
h_rs     -0.342557
c_rs     -0.283099
ux_rs    -0.248295
uc_rs    -0.230229
x_rs     -0.176308
a_rs     -0.061210
ua_rs     0.008665
Name: A, dtype: float64

CDF.AB.sort_values()

a_rs     -0.630587
uc_rs    -0.600451
n12_rs   -0.582329
n23_rs   -0.574065
n_rs     -0.556397
c_rs     -0.521978
x_rs     -0.511124
ux_rs    -0.498820
ua_rs    -0.475065
r_rs     -0.326349
h_rs     -0.307109
y_rs     -0.098194
Name: AB, dtype: float64

CDF.B.sort_values()

ua_rs    -0.093051
uc_rs    -0.082309
c_rs     -0.077122
a_rs     -0.061738
ux_rs    -0.009396
x_rs     -0.001361
r_rs      0.022767
y_rs      0.027334
h_rs      0.178924
n_rs      0.995988
n23_rs    0.996334
n12_rs    0.996501
Name: B, dtype: float64

NDVs = ['y', 'r', 'ua', 'a', 'uc', 'c', 'ux', 'x', 'h', 'N', 'N1/2', 'N2/3']
# smaller (more negative) is better
fig, ax = plt.subplots()
ax.scatter(NDVs, CDF.A.values, marker='^', edgecolor='green', facecolor='none', label='A')
ax.scatter(NDVs, CDF.AB.values, edgecolor='orange', lw=2, s=60, facecolor='none', label='AB')
ax.scatter(NDVs, CDF.B.values, marker='s', edgecolor='magenta', facecolor='none', label='B')
ax.set_xticks(NDVs); ax.set_xticklabels(NDVs, fontsize=14)
ax.set_xlabel("LE <--- priority ---> SE", fontsize=14)
ax.set_title("Kamagata (20): correlation with exp. fold rate by SS type.", fontsize=14)
ax.legend(title="SStype", fontsize=12, loc="upper left", bbox_to_anchor=(1.0, 1.0))
ax.grid()

dp = Kcorrs_dr.filter(regex="_ps").T
db = dp.apply(lambda x: x < 0.05)
db

As observed previously with UZ, 'y' and 'r' perform better (but insig.) for A chains, and 'a' produces strongest (and sig.) corr. for AB. However, no ND variant produces a sig. corr. for B.
Except for ND variants 'a' an 'uc' on AB chains, the corrs. by chain SStype are insignificant (p-value < 0.05).
Data problem: A and B classes have fewer than 10 chains each.

# combine y-a-ua
alpha_df = y_df.query("SStype == 'a'")
alphabeta_df = a_df.query("SStype == 'ab'")
beta_df = ua_df.query("SStype == 'b'")
yaua_df = alpha_df.append(beta_df).append(alphabeta_df)
print(len(alpha_df), len(alphabeta_df), len(beta_df), len(yaua_df))
yaua_rs, yaua_ps = nd.analyze_peakE_by_SStype(yaua_df, ndv='y-a-ua') 
print(nd.mjes_range(yaua_df.peakE.values))

5 12 3 20
Pearson correlation with exp. fold rate: y-a-ua
All: [0.1176 0.6215]
A: [-0.4993  0.3918]
AB: [-0.6306  0.0279]
B: [-0.0931  0.9407]

3.5322812572893088

# combine ux-a-ua
alpha_df = ux_df.query("SStype == 'a'")
alphabeta_df = a_df.query("SStype == 'ab'")
beta_df = ua_df.query("SStype == 'b'")
uxaua_df = alpha_df.append(beta_df).append(alphabeta_df)
print(len(alpha_df), len(alphabeta_df), len(beta_df), len(uxaua_df))
uxaua_rs, uxaua_ps = nd.analyze_peakE_by_SStype(uxaua_df, ndv='ux-a-ua') 
print(nd.mjes_range(uxaua_df.peakE.values))

5 12 3 20
Pearson correlation with exp. fold rate: ux-a-ua
All: [-0.5784  0.0076]
A: [-0.2483  0.6871]
AB: [-0.6306  0.0279]
B: [-0.0931  0.9407]

3.0378335189840704

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12, 4), constrained_layout=True)
plt.suptitle("Kamagata(20): combined peak Es", fontsize=14)

ax1.set_title("y-a-ua " + str(np.round(yaua_rs[0], 3)))
nd.plot_peakE_by_SStype(yaua_df, ax1, yaua_rs) 
ax1.legend(title="SStype")

ax2.set_title("ux-a-ua " + str(np.round(uxaua_rs[0], 3)))
nd.plot_peakE_by_SStype(uxaua_df, ax2, uxaua_rs) 
ax2.legend(title="SStype")

ax3.set_title("a " + str(np.round(a_rs[0], 3)))
nd.plot_peakE_by_SStype(a_df, ax3, a_rs) 
_ = ax3.legend(title="SStype")

End of notebook¶

SKLC January 2021

	pid	N	SRIM	LRIM	k_f	SStype
0	1afi	72	-7.50	-6.25	0.26	ab
1	1aps	98	-9.41	-6.97	-0.64	ab
2	1ayi	86	-5.87	-5.75	3.13	a
3	1bf4	63	-5.77	-5.43	3.02	ab
4	1c9o	66	-7.54	-6.07	3.14	b

	n_rs	n_ps	n12_rs	n12_ps	n23_rs	n23_ps
SStype
Z	-0.435466	0.001253	-0.457243	0.000655	-0.450355	0.000808
A	-0.672115	0.006056	-0.684221	0.004902	-0.680697	0.005218
AB	-0.185007	0.434873	-0.198762	0.400861	-0.194397	0.411493
B	-0.742475	0.000641	-0.750195	0.000523	-0.747886	0.000556

	bid	fe	fsc	sc_mje	num_sc	num_nnsc
0	2	0.137536	0.000000	0.00	0	0
1	3	0.237822	0.371257	71.70	62	96
2	4	0.295129	0.479042	101.00	80	119
3	5	0.340974	0.485030	180.40	81	151
4	6	0.368195	0.491018	226.07	82	168

	pid	N	SRIM	LRIM	k_f	SStype	maxE	maxE_sd
0	1afi	72	-7.50	-6.25	0.26	ab	372.23	13
1	1aps	98	-9.41	-6.97	-0.64	ab	462.15	27
2	1ayi	86	-5.87	-5.75	3.13	a	213.28	10
3	1bf4	63	-5.77	-5.43	3.02	ab	95.47	9
4	1c9o	66	-7.54	-6.07	3.14	b	109.78	10

	bid	fe	fsc	sc_mje	num_sc	num_nnsc
0	2	0.109661	0.000000	0.00	0	0
1	3	0.210183	0.476821	24.92	72	83
2	4	0.297650	0.768212	-46.39	116	107
3	5	0.379896	0.814570	81.04	123	161
4	6	0.447781	0.821192	206.50	124	208

	H	S	T	hsr	SStype
2qjl	29	33	37	0.878788	ab
1ayi	50	0	36	50.000000	a
1e0m	0	13	24	0.000000	b
1div_c	34	37	22	0.918919	ab
1rlq	0	22	34	0.000000	b

	H	S	T	hsr	SStype
1ayi	50	0	36	50.000000	a
1enh	38	0	16	38.000000	a
1fex	31	0	28	31.000000	a
1hrc	43	2	59	21.500000	a
1idy	33	0	21	33.000000	a
1imq	46	0	40	46.000000	a
1lmb	62	0	25	62.000000	a
1ryk	45	0	24	45.000000	a
1ss1	43	0	17	43.000000	a
1w4e	19	4	22	4.750000	a
1w4j	23	3	25	7.666667	a
1yza	70	0	36	70.000000	a
2abd	52	0	34	52.000000	a
2bth	22	2	21	11.000000	a
2pdd	19	0	23	19.000000	a

	H	S	T	hsr	SStype
1c9o	3	41	22	0.073171	b
1csp	3	37	27	0.081081	b
1e0l	0	12	25	0.000000	b
1e0m	0	13	24	0.000000	b
1fnf_9	0	49	39	0.000000	b
1g6p	0	31	35	0.000000	b
1jo8	3	26	29	0.115385	b
1k9q	0	13	27	0.000000	b
1m9s	3	17	55	0.176471	b
1mjc	3	34	32	0.088235	b
1psf	0	21	48	0.000000	b
1rlq	0	22	34	0.000000	b
1shf	3	27	29	0.111111	b
1shg	3	29	25	0.103448	b
1ten	0	48	41	0.000000	b
1tit	0	33	56	0.000000	b
3ait	0	38	36	0.000000	b

	H	S	T	hsr	SStype
1afi	21	21	30	1.000000	ab
1aps	18	36	44	0.500000	ab
1bf4	15	33	15	0.454545	ab
1div_c	34	37	22	0.918919	ab
1div_n	26	13	18	2.000000	ab
1fkb	11	43	53	0.255814	ab
1n88	22	27	47	0.814815	ab
1o6x	25	16	40	1.562500	ab
1pgb	14	24	18	0.583333	ab
1poh	29	23	33	1.260870	ab
1rfa	15	21	42	0.714286	ab
1ris	28	47	22	0.595745	ab
1spr	18	31	54	0.580645	ab
1ubq	18	26	32	0.692308	ab
1urn	31	24	41	1.291667	ab
2acy	24	41	33	0.585366	ab
2ci2	14	18	33	0.777778	ab
2ptl	12	25	25	0.480000	ab
2qjl	29	33	37	0.878788	ab
2vik	27	30	69	0.900000	ab

SStype	Z	A	AB	B
y_rs	0.0664	-0.6905	-0.1091	-0.3274
r_rs	-0.2145	-0.5590	-0.1903	-0.3294
a_rs	-0.7267	-0.4121	-0.6481	-0.5744
c_rs	-0.7106	-0.4370	-0.5412	-0.6408
x_rs	-0.6995	-0.4700	-0.4938	-0.7052
h_rs	-0.4446	-0.5552	-0.3260	-0.5550
n_rs	-0.4355	-0.6721	-0.1850	-0.7425
n12_rs	-0.4572	-0.6842	-0.1988	-0.7502
n23_rs	-0.4504	-0.6807	-0.1944	-0.7479

SStype	Z	A	AB	B
y_rs	0.066354	-0.690474	-0.109097	-0.327428
r_rs	-0.214498	-0.559032	-0.190254	-0.329368
ua_rs	-0.429877	-0.220431	-0.390667	-0.201439
a_rs	-0.726708	-0.412113	-0.648068	-0.574378
uc_rs	-0.553336	-0.526444	-0.439166	-0.437884
c_rs	-0.710594	-0.437038	-0.541155	-0.640761
ux_rs	-0.541776	-0.572340	-0.354226	-0.454861
x_rs	-0.699528	-0.470023	-0.493830	-0.705247
h_rs	-0.444587	-0.555163	-0.325991	-0.555005
n12_rs	-0.457243	-0.684221	-0.198762	-0.750195

	pid	N	k_f	SStype
0	1pgb	56	2.78	ab
1	2cro	64	1.52	a
2	1cei	85	2.49	a
3	2abd	86	2.81	a
4	1hng	86	0.78	b

	H	S	T	hsr	SStype
1adw	23	45	55	0.511111	ab
1bni	24	25	59	0.960000	ab
2a5e	66	0	90	66.000000	a
1bta	37	16	36	2.312500	ab
2rn2	55	47	53	1.170213	ab

	H	S	T	hsr	SStype
2cro	40	0	24	40.0	a
1cei	49	2	34	24.5	a
2abd	52	0	34	52.0	a
1dwr	115	0	37	115.0	a
2a5e	66	0	90	66.0	a

	H	S	T	hsr	SStype
1pgb	14	24	18	0.583333	ab
1gxt	21	36	31	0.583333	ab
1bta	37	16	36	2.312500	ab
1bni	24	25	59	0.960000	ab
1adw	23	45	55	0.511111	ab
3chy	58	22	48	2.636364	ab
1joo	35	37	77	0.945946	ab
1i1b	8	72	71	0.111111	ab
2rn2	55	47	53	1.170213	ab
1ra9	40	50	69	0.800000	ab
1l63	107	15	40	7.133333	ab
1php	69	27	79	2.555556	ab

	S	T	SStype
1hng	41	45	b
1tit	33	56	b
1ttg	43	51	b

SStype	Z	A	AB	B
y_rs	-0.0909	-0.4993	-0.0982	0.0273
r_rs	-0.2780	-0.3437	-0.3263	0.0228
ua_rs	-0.3502	0.0087	-0.4751	-0.0931
a_rs	-0.5618	-0.0612	-0.6306	-0.0617
uc_rs	-0.4824	-0.2302	-0.6005	-0.0823
c_rs	-0.5525	-0.2831	-0.5220	-0.0771
ux_rs	-0.4338	-0.2483	-0.4988	-0.0094
x_rs	-0.5543	-0.1763	-0.5111	-0.0014
h_rs	-0.3505	-0.3426	-0.3071	0.1789
n_rs	-0.5623	-0.5364	-0.5564	0.9960
n12_rs	-0.5719	-0.4994	-0.5823	0.9965
n23_rs	-0.5691	-0.5124	-0.5741	0.9963

SStype	Z	A	AB	B
y_ps	False	False	False	False
r_ps	False	False	False	False
ua_ps	False	False	False	False
a_ps	True	False	True	False
uc_ps	True	False	True	False
c_ps	True	False	False	False
ux_ps	False	False	False	False
x_ps	True	False	False	False
h_ps	False	False	False	False
n_ps	True	False	False	False
n12_ps	True	False	True	False
n23_ps	True	False	False	False

	n_rs	n_ps	n12_rs	n12_ps	n23_rs	n23_ps
SStype
Z	-0.562293	0.009862	-0.571902	0.008422	-0.569059	0.008829
A	-0.536434	0.351332	-0.499379	0.391687	-0.512422	0.377363
AB	-0.556397	0.060275	-0.582329	0.046954	-0.574065	0.050948
B	0.995988	0.057047	0.996501	0.053271	0.996334	0.054532

	pid	N	k_f	SStype	maxE	maxE_sd
0	1pgb	56	2.78	ab	91.88	11
1	2cro	64	1.52	a	49.89	13
2	1cei	85	2.49	a	154.90	10
3	2abd	86	2.81	a	359.52	12
4	1hng	86	0.78	b	161.08	17