diffnets package

Submodules

diffnets.analysis module

class diffnets.analysis.Analysis(net, netdir, datadir)[source]

Bases: object

Core object for running analysis.

Parameters:
  • net (nnutils object) – Neural network to perform analysis with
  • netdir (str) – path to directory with neural network results
  • datadir (str) – path to directory with the data required to train the network. Includes cm.npy, wm.npy, uwm.npy, master.pdb, an aligned_xtcs dir, and an indicators dir.
assign_labels_to_variants(plot_labels=False)[source]
Map DiffNet labels to each variant, with the option to plot a histogram of the labels.
Parameters:plot_labels (optional, boolean) – Save a matplotlib figure of the label histogram.
Returns:lab_v – Dictionary mapping labels to their respective variants.
Return type:dictionary
encode_data()[source]

Calculate the latent space for all trajectory frames.

find_feats(inds, out_fn, n_states=2000, num2plot=100, clusters=None)[source]

Generate a .pml file that shows the distances whose changes are most correlated with changes in the classification score.

Parameters:
  • inds (np.ndarray,) – Indices of the topology file that are to be included in calculating what distances are most correlated with classification score.
  • out_fn (str) – Name of the output file.
  • n_states (int (default=2000)) – How many cluster centers to calculate and use for correlation measurement.
  • num2plot (int (default=100)) – Number of distances to be shown.
  • clusters (enspara cluster object) – Cluster object with center_indices attribute
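The ranking idea behind find_feats can be illustrated with a small NumPy sketch. This is not the package's implementation: the helper name rank_distances_by_correlation, the use of Pearson correlation, and ranking by absolute value are all assumptions made for illustration.

```python
import numpy as np

def rank_distances_by_correlation(distances, scores, num2plot=100):
    """Rank distance columns by |Pearson correlation| with the scores.

    distances: (n_states, n_distances) array, one row per cluster center.
    scores:    (n_states,) classification score of each cluster center.
    """
    d = distances - distances.mean(axis=0)
    s = scores - scores.mean()
    denom = np.sqrt((d ** 2).sum(axis=0) * (s ** 2).sum())
    # Constant columns have no defined correlation; report 0 for them.
    corr = np.divide(d.T @ s, denom, out=np.zeros(d.shape[1]), where=denom > 0)
    order = np.argsort(-np.abs(corr))
    return order[:num2plot], corr

# Synthetic data: column 3 is built to track the score closely.
rng = np.random.default_rng(0)
scores = rng.random(200)
distances = rng.random((200, 5))
distances[:, 3] = 2.0 * scores + 0.01 * rng.random(200)
top, corr = rank_distances_by_correlation(distances, scores, num2plot=2)
```

Here top[0] identifies the distance that varies most consistently with the classification score.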
get_labels()[source]

Calculate the classification score for all trajectory frames

get_rmsd()[source]

Calculate RMSD between actual trajectory frames and autoencoder reconstructed frames
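A minimal sketch of a per-frame RMSD between original and reconstructed coordinates, assuming both are stored as (n_frames, 3*n_atoms) arrays and that no additional superposition is applied. The helper frame_rmsd is hypothetical and may differ from what get_rmsd does internally.

```python
import numpy as np

def frame_rmsd(coords, recon):
    """Per-frame RMSD between two (n_frames, 3*n_atoms) coordinate arrays."""
    n_frames = coords.shape[0]
    diff = (coords - recon).reshape(n_frames, -1, 3)
    # Squared displacement per atom, averaged over atoms, then square-rooted.
    return np.sqrt((diff ** 2).sum(axis=2).mean(axis=1))

coords = np.zeros((2, 6))                        # 2 frames, 2 atoms
recon = coords.copy()
recon[1, :] = [3.0, 4.0, 0.0, 0.0, 0.0, 0.0]     # atom 0 displaced by 5 in frame 1
rmsd = frame_rmsd(coords, recon)
```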

morph(n_frames=10)[source]

Get representative structures for classification scores from 0 to 1.

Parameters:n_frames (int) – How many representative structures to output. Bins between 0 and 1 will be calculated with this number.
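One way to picture the binning described for n_frames is sketched below. How morph actually selects a representative structure within each bin is not stated here, so choosing the frame closest to the bin center is an assumption, and representative_frames is a hypothetical helper.

```python
import numpy as np

def representative_frames(labels, n_frames=10):
    """Pick one frame index per classification-score bin (None if a bin is empty)."""
    edges = np.linspace(0.0, 1.0, n_frames + 1)
    picks = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        center = 0.5 * (lo + hi)
        # Include the right edge only for the last bin so a score of 1.0 is kept.
        in_bin = (labels >= lo) & ((labels < hi) | (hi == 1.0))
        idx = np.flatnonzero(in_bin)
        if idx.size == 0:
            picks.append(None)
        else:
            picks.append(int(idx[np.argmin(np.abs(labels[idx] - center))]))
    return picks

labels = np.array([0.05, 0.12, 0.48, 0.52, 0.97, 1.0])
picks = representative_frames(labels, n_frames=2)
```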
recon_traj()[source]

Reconstruct all trajectory frames using the trained neural network

run_core()[source]

Wrapper to run the analysis functions that should be run after training.

diffnets.data_processing module

exception diffnets.data_processing.ImproperlyConfigured[source]

Bases: Exception

The given configuration is incomplete or otherwise not usable.

class diffnets.data_processing.ProcessTraj(traj_dir_paths, pdb_fn_paths, outdir, atom_sel=None, stride=1)[source]

Bases: object

Process raw trajectory data to select a subset of atoms and align all
frames to a reference pdb. Results in a directory structure that the training relies on.
Parameters:
  • traj_dir_paths (list of str’s, required) – One string/path for each variant to a dir that contains ALL trajectory files for that variant. ORDER MATTERS – when training you will set a value “act_map” that depends on this order.
  • pdb_fn_paths (list of str’s, required) – One string/path for each variant to a dir that contains the starting pdb file. Variants must be in same order as traj_dir_paths.
  • outdir (str) – Name of dir to output processed data to. This dir will be used as input during DiffNet training.
  • atom_sel (str, or array-like, shape=(n_variants, n_inds)) – (default=”name CA or name CB or name N or name C”) If str, it should follow the selection syntax used in MDTraj, e.g. for pdb.top.select(“name CA”), “name CA” would be appropriate. If list, there should be a list of indices for each variant, since choosing equivalent atoms may require different indexing for each variant.
  • stride (integer (default=1)) – Subsample every nth data frame. A value of 1 means no subsampling.
extract_default_inds()[source]
make_master_pdb()[source]

Creates a reference pdb centered at the origin using the first variant pdb specified in self.pdb_fn_paths.

make_traj_list()[source]

Makes a list of all variant trajectories where each item is a list that contains 1) a path to the trajectory, 2) a path to the corresponding topology (pdb) file, 3) a trajectory number - from 0 to n where n is total number of trajectories, and 4) an integer to indicate which variant simulation the trajectory came from.
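The four-item layout of each entry can be sketched in plain Python. The helper build_traj_list and the file paths are hypothetical; the real method discovers trajectory files from traj_dir_paths, and its ordering within a variant may differ from the sorted order assumed here.

```python
def build_traj_list(trajs_per_variant, pdb_fns):
    """trajs_per_variant: one list of trajectory paths per variant."""
    traj_list = []
    traj_num = 0
    for var_ind, (traj_fns, pdb_fn) in enumerate(zip(trajs_per_variant, pdb_fns)):
        for traj_fn in sorted(traj_fns):
            # [trajectory path, topology path, trajectory number, variant index]
            traj_list.append([traj_fn, pdb_fn, traj_num, var_ind])
            traj_num += 1
    return traj_list

traj_list = build_traj_list(
    [["wt/traj1.xtc", "wt/traj0.xtc"], ["mut/traj0.xtc"]],
    ["wt/start.pdb", "mut/start.pdb"],
)
```

Each entry has the shape that preprocess_traj expects for one row of its inputs argument.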

preprocess_traj(inputs)[source]
Strip all trajectories to a subset of atoms and align to a reference pdb. Also, calculate and write out the mean center of mass of all atoms across all trajectories. Will write out new trajectories (.xtc files) and corresponding “indicator” lists to indicate which variant simulation each data frame came from.
Parameters:inputs (array-like, shape=(n_trajectories,4)) – For each trajectory there should be 1) path to trajectory, 2) path to corresponding topology file, 3) output trajectory number, and 4) integer indicating which variant the trajectory came from.
run()[source]

Process raw trajectory data to select a subset of atoms and align all frames to a reference pdb. Results in a directory structure that the training relies on.

traj2samples()[source]

For every trajectory frame, write out a PyTorch tensor file, which will be used as input to the DiffNet

class diffnets.data_processing.WhitenTraj(data_dir)[source]

Bases: object

Normalize the trajectories with a data whitening procedure [1] that removes covariance between atoms in trajectories.

Parameters:data_dir (str) – Path to a directory that contains a topology file, a file with the mean center of mass of all atoms across all trajectories, and a dir named “aligned_xtcs” with all aligned trajectories.

References

[1] Wehmeyer C, Noé F. Time-lagged autoencoders: Deep learning of slow collective variables for molecular kinetics. J Chem Phys. 2018. doi:10.1063/1.5011399

apply_unwhitening(whitened, uwm, cm)[source]

Apply unwhitening to whitened XYZ coordinates.

Parameters:
  • whitened (np.ndarray, shape=(n_frames,3*n_atoms)) – Whitened XYZ coordinates of a trajectory.
  • uwm (np.ndarray, shape=(n_atoms*3,n_atoms*3)) – unwhitening matrix
  • cm (np.ndarray, shape=(3*n_atoms,)) – Avg. center of mass of each atom across all trajectories.
Returns:

coords – XYZ coordinates of a trajectory.

Return type:

np.ndarray, shape=(n_frames,3*n_atoms)

apply_whitening(coords, wm, cm)[source]

Apply whitening to XYZ coordinates.

Parameters:
  • coords (np.ndarray, shape=(n_frames,3*n_atoms)) – XYZ coordinates of a trajectory.
  • wm (np.ndarray, shape=(n_atoms*3,n_atoms*3)) – whitening matrix
  • cm (np.ndarray, shape=(3*n_atoms,)) – Avg. center of mass of each atom across all trajectories.
Returns:

whitened – Whitened XYZ coordinates of a trajectory.

Return type:

np.ndarray, shape=(n_frames,3*n_atoms)
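The relationship between apply_whitening and apply_unwhitening can be sketched as a pair of inverse linear maps. The convention used here (subtract cm, then right-multiply by the matrix) is an assumption chosen to match the documented shapes; the functions whiten and unwhiten are illustrative stand-ins, not the package's code.

```python
import numpy as np

def whiten(coords, wm, cm):
    """Center the coordinates on cm, then apply the whitening matrix."""
    return (coords - cm) @ wm

def unwhiten(whitened, uwm, cm):
    """Undo the whitening and restore the original center."""
    return whitened @ uwm + cm

rng = np.random.default_rng(1)
coords = rng.random((4, 6))              # 4 frames, 2 atoms
cm = coords.mean(axis=0)
wm = np.diag([2.0, 1.0, 0.5, 4.0, 1.0, 3.0])
uwm = np.linalg.inv(wm)                  # unwhitening is the inverse map
roundtrip = unwhiten(whiten(coords, wm, cm), uwm, cm)
```

When uwm is the inverse of wm, the round trip recovers the input coordinates up to floating-point error.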

apply_whitening_xtc_dir(xtc_dir, top, wm, cm, n_cores, outdir)[source]

Apply data whitening parallelized across many trajectories

Parameters:
  • xtc_dir (str) – Path to the directory containing the trajectories.
  • top (md.Trajectory object) – Topology corresponding to the trajectories
  • outdir (str) – Directory to output whitened trajectory
  • wm (np.ndarray, shape=(n_atoms*3,n_atoms*3)) – whitening matrix
  • cm (np.ndarray, shape=(3*n_atoms,)) – Avg. center of mass of each atom across all trajectories.
  • n_cores (int) – Number of threads to parallelize task across.
get_c00(coords, cm, traj_num)[source]

Calculates the covariance matrix.

Parameters:
  • coords (np.ndarray, shape=(n_frames,3*n_atoms)) – XYZ coordinates of a trajectory.
  • cm (np.ndarray, shape=(3*n_atoms,)) – Avg. center of mass of each atom across all trajectories.
  • traj_num (integer) – Used to name the covariance matrix we are going to write out for a trajectory.
get_c00_xtc_list(xtc_fns, top, cm, n_cores)[source]

Calculate the covariance matrix across all trajectories.

Parameters:
  • xtc_fns (list of str’s) – Paths to trajectories.
  • top (md.Trajectory object) – Topology corresponding to the trajectories
  • cm (np.ndarray, shape=(3*n_atoms,)) – Avg. center of mass of each atom across all trajectories.
  • n_cores (int) – Number of threads to parallelize task across.
Returns:

c00 – Covariance matrix across all trajectories

Return type:

np.ndarray, shape=(n_atoms*3,n_atoms*3)
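A covariance matrix across many trajectories can be accumulated one trajectory at a time, which is what makes the calculation easy to parallelize. The sketch below assumes normalization by the total number of frames; pooled_covariance is a hypothetical helper and the package may normalize differently.

```python
import numpy as np

def pooled_covariance(coord_arrays, cm):
    """Accumulate a covariance matrix over several (n_frames, 3*n_atoms) arrays."""
    n_total = 0
    c00 = np.zeros((cm.size, cm.size))
    for coords in coord_arrays:
        centered = coords - cm
        c00 += centered.T @ centered   # per-trajectory contribution
        n_total += coords.shape[0]
    return c00 / n_total

rng = np.random.default_rng(2)
trajs = [rng.random((50, 6)), rng.random((30, 6))]
all_coords = np.concatenate(trajs)
cm = all_coords.mean(axis=0)
c00 = pooled_covariance(trajs, cm)
```

With cm equal to the mean over all frames, the result agrees with np.cov(all_coords, rowvar=False, bias=True).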

get_wuw_mats(c00)[source]
Calculate whitening matrix and unwhitening matrix.
Method adapted from deeptime (https://github.com/markovmodel/deeptime/blob/master/time-lagged-autoencoder/tae/utils.py)
Parameters:c00 (np.ndarray, shape=(n_atoms*3,n_atoms*3)) – Covariance matrix
Returns:
  • uwm (np.ndarray, shape=(n_atoms*3,n_atoms*3)) – unwhitening matrix
  • wm (np.ndarray, shape=(n_atoms*3,n_atoms*3)) – whitening matrix
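A common way to obtain whitening and unwhitening matrices is from the eigendecomposition of the covariance matrix, taking its inverse square root and square root respectively. The sketch below follows that approach under stated assumptions: whitening_matrices is a hypothetical name, and the small eps regularizer is an illustrative choice rather than the value used by the package.

```python
import numpy as np

def whitening_matrices(c00, eps=1e-10):
    """Return (uwm, wm): symmetric square root and inverse square root of c00."""
    evals, evecs = np.linalg.eigh(c00)
    evals = np.abs(evals) + eps        # guard against tiny or negative eigenvalues
    wm = evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T
    uwm = evecs @ np.diag(np.sqrt(evals)) @ evecs.T
    return uwm, wm

c00 = np.array([[2.0, 0.5],
                [0.5, 1.0]])
uwm, wm = whitening_matrices(c00)
```

Applying wm on both sides of c00 gives (approximately) the identity, and wm and uwm are inverses of each other.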
run()[source]

Whiten existing processed trajectory data in self.data_dir to calculate and write out a covariance matrix (c00.npy), a whitening matrix (wm.npy) and an unwhitening matrix (uwm.npy).

diffnets.exmax module

Copyright 2015 by Washington University in Saint Louis. Authored by S. Joshua Swamidass. A license to use for strictly non-commerical use is granted. Derivative work is not permited without prior written authorization. All other rights reserved.

diffnets.exmax.distribution_of_sum(P, ignore_idx={})[source]

Given a set of binomial random variables parameterized by a vector P, and ignoring the variables in ignore_idx, what is the distribution of their sum?

Output is the discrete distribution D, where the probability of a specific sum s is D[s]. O(N^2) time in length of P.

For example, using this P…

>>> P = [0.5, 0.25, 0.5]
>>> distribution_of_sum(P)
array([ 0.1875,  0.4375,  0.3125,  0.0625])

Ignoring the 2nd (1 in zero indexing) variable, we have…

>>> distribution_of_sum(P, [1])
array([ 0.25, 0.5 , 0.25, 0. ])
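The O(N^2) behaviour can be reproduced with a short dynamic-programming sketch in plain Python. This is an independent illustration (distribution_of_sum_sketch is a hypothetical name, and it returns a list rather than an array), not the module's source.

```python
def distribution_of_sum_sketch(P, ignore_idx=()):
    """Distribution of the number of successes among independent 0/1 variables."""
    ignore = set(ignore_idx)
    D = [1.0] + [0.0] * len(P)          # before any variable, the sum is 0
    for i, p in enumerate(P):
        if i in ignore:
            continue
        # Walk downwards so D[s - 1] still holds the value from before this variable.
        for s in range(len(P), 0, -1):
            D[s] = D[s] * (1.0 - p) + D[s - 1] * p
        D[0] *= (1.0 - p)
    return D

D_all = distribution_of_sum_sketch([0.5, 0.25, 0.5])
D_skip = distribution_of_sum_sketch([0.5, 0.25, 0.5], [1])
```

For this P, D_all and D_skip hold the same values as the two doctest outputs above.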

diffnets.exmax.expectation_E_EXP(P, E_or)[source]

Given a set of binomial random variables parameterized by a vector P. Conditioned on E[at least one success] = E_or What is the expectation of each random variable?

Output is a vector E of expectations.

Alternate, equivalent implementation for error checking. The problem with this implementation is that it runs in exponential time.

diffnets.exmax.expectation_or(P, E_or)

Given a set of binomial random variables parameterized by a vector P. Conditioned on E[at least one success] = E_or What is the expectation of each random variable?

Output is a vector E of expectations.

All the implementations should produce the same results.

>>> R = rand(10)
>>>
>>> EL = expectation_or_LINEAR(R, 1)
>>> EC = expectation_or_CUBIC(R, 1)
>>> EE = expectation_E_EXP(R, 1)
>>> correlation, pvalue = pearsonr(EL, EC)
>>> correlation > .99 and pvalue < .01
True
>>> allclose(EL, EE)
True
>>> correlation, pvalue = pearsonr(EL, EE)
>>> correlation > .99 and pvalue < .01
True

This shows that all versions yield results that are > 99% correlated.

And we know the results for some simple cases.

>>> expectation_or([0.5, 0.5], 1)
array([ 0.66666667,  0.66666667])
>>> expectation_or([0.5, 0.5], .75)
array([ 0.5,  0.5])
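The two simple cases above are consistent with a closed form: if no variable succeeds, every variable is 0, and if at least one succeeds, the conditional probability that variable i succeeded is P[i] divided by the probability of at least one success. The sketch below implements that reasoning; it is an illustration derived from the stated problem, not the module's code, and expectation_or_sketch is a hypothetical name.

```python
import math

def expectation_or_sketch(P, E_or):
    """E[X_i] given that E[at least one success] equals E_or."""
    # Probability that at least one of the independent variables succeeds.
    p_any = 1.0 - math.prod(1.0 - p for p in P)
    # X_i = 1 implies "at least one success", so P(X_i = 1 | any) = p_i / p_any.
    return [E_or * p / p_any for p in P]

E_certain = expectation_or_sketch([0.5, 0.5], 1)
E_partial = expectation_or_sketch([0.5, 0.5], 0.75)
```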
diffnets.exmax.expectation_or_CUBIC(P, E_or)[source]

Given a set of binomial random variables parameterized by a vector P. Conditioned on E[at least one success] = E_or What is the expectation of each random variable?

Output is a vector E of expectations.

Alternate, equivalent implementation for error checking. The problem with this implementation is that it runs in O(N^3) time.

diffnets.exmax.expectation_or_LINEAR(P, E_or)[source]

Given a set of binomial random variables parameterized by a vector P. Conditioned on E[at least one success] = E_or What is the expectation of each random variable?

Output is a vector E of expectations.

All the implementations should produce the same results.

>>> R = rand(10)
>>>
>>> EL = expectation_or_LINEAR(R, 1)
>>> EC = expectation_or_CUBIC(R, 1)
>>> EE = expectation_E_EXP(R, 1)
>>> correlation, pvalue = pearsonr(EL, EC)
>>> correlation > .99 and pvalue < .01
True
>>> allclose(EL, EE)
True
>>> correlation, pvalue = pearsonr(EL, EE)
>>> correlation > .99 and pvalue < .01
True

This shows that all versions yield results that are > 99% correlated.

And we know the results for some simple cases.

>>> expectation_or([0.5, 0.5], 1)
array([ 0.66666667,  0.66666667])
>>> expectation_or([0.5, 0.5], .75)
array([ 0.5,  0.5])
diffnets.exmax.expectation_range(P, lower, upper)

Given a set of binomial random variables parameterized by a vector P. Conditioned on the number of successes between lower and upper (inclusive). What is the expectation of each random variable?

Output is a vector E of expectations. O(N^3) time in length of P

>>> R = rand(10)  # a random vector of probabilities 10 elements long.
>>>
>>> lower, upper = 3, 6
>>>
>>> EC = expectation_range_CUBIC(R, lower, upper)
>>> EE = expectation_range_EXP(R, lower, upper)
>>> correlation, pvalue = pearsonr(EE, EC)
>>> correlation > .99 and pvalue < .01
True

This shows that both versions yield results that are > 99% correlated.

diffnets.exmax.expectation_range_CUBIC(P, lower, upper)[source]

Given a set of binomial random variables parameterized by a vector P. Conditioned on the number of successes between lower and upper (inclusive). What is the expectation of each random variable?

Output is a vector E of expectations. O(N^3) time in length of P

>>> R = rand(10)  # a random vector of probabilities 10 elements long.
>>>
>>> lower, upper = 3, 6
>>>
>>> EC = expectation_range_CUBIC(R, lower, upper)
>>> EE = expectation_range_EXP(R, lower, upper)
>>> correlation, pvalue = pearsonr(EE, EC)
>>> correlation > .99 and pvalue < .01
True

This shows that both versions yield results that are > 99% correlated.

diffnets.exmax.expectation_range_EXP(P, lower, upper)[source]

Given a set of binomial random variables parameterized by a vector P. Conditioned on the number of successes between lower and upper (inclusive). What is the expectation of each random variable?

This version is slow, but more conceptually clear.

Output is a vector E of expectations. O(2^N) time in length of P

This version suffers from floating point error, and should not be used for anything other than testing.
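The exponential-time idea can be made concrete by enumerating all 2^N outcomes and keeping only those whose number of successes lies in the allowed range. The following is a sketch of that brute-force approach with a hypothetical name (expectation_range_bruteforce); it is meant for small N and for checking other implementations only.

```python
from itertools import product

def expectation_range_bruteforce(P, lower, upper):
    """E[X_i | lower <= sum(X) <= upper] by enumerating every outcome."""
    n = len(P)
    total = 0.0
    weighted = [0.0] * n
    for outcome in product((0, 1), repeat=n):
        if not lower <= sum(outcome) <= upper:
            continue
        prob = 1.0
        for x, p in zip(outcome, P):
            prob *= p if x else (1.0 - p)
        total += prob
        for i, x in enumerate(outcome):
            weighted[i] += prob * x
    return [w / total for w in weighted]

E_at_least_one = expectation_range_bruteforce([0.5, 0.5], 1, 2)
E_exactly_one = expectation_range_bruteforce([0.5, 0.5], 1, 1)
```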

diffnets.exmax.rand()

scipy.rand is deprecated and will be removed in SciPy 2.0.0, use numpy.random.rand instead

diffnets.nnutils module

class diffnets.nnutils.ae(layer_sizes, wm, uwm)[source]

Bases: torch.nn.modules.module.Module

Unsupervised autoencoder

Parameters:
  • layer_sizes (list) – List of integers indicating the size of each layer in the encoder including the latent layer. First two must be identical.
  • wm (np.ndarray, shape=(n_inputs,n_inputs)) – Whitening matrix – is applied to input data
  • uwm (np.ndarray, shape=(n_inputs,n_inputs)) – unwhitening matrix
decode(x)[source]

Pass the latent space vector through the decoder

Parameters:x (torch.cuda.FloatTensor or torch.FloatTensor) – Latent space vector
Returns:recon – Reconstruction of the original input data
Return type:torch.cuda.FloatTensor or torch.FloatTensor
encode(x)[source]

Pass the data through the encoder to the latent layer.

Parameters:x (torch.cuda.FloatTensor or torch.FloatTensor) – Input data for a given sample
Returns:latent – Latent space vector associated with encoder1
Return type:torch.cuda.FloatTensor or torch.FloatTensor
forward(x)[source]

Pass data through the entire network

Parameters:x (torch.cuda.FloatTensor or torch.FloatTensor) – Input data for a given sample
Returns:
  • recon (torch.cuda.FloatTensor or torch.FloatTensor) – Reconstruction of the original input data
  • latent (torch.cuda.FloatTensor or torch.FloatTensor) – Latent space vector
freeze_weights(old_net=None)[source]
Procedure to make the whitening matrix and unwhitening matrix untrainable layers. Additionally, freezes weights associated with a previously learned encoder layer.
Parameters:old_net (ae object) – Previously trained network with overlapping architecture. Weights learned in this previous network's encoder will be frozen in the new network.
unfreeze_weights()[source]

Makes all encoder weights trainable.

diffnets.nnutils.chunks(arr, chunk_size)[source]

Yield successive chunk_size chunks from arr.
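A generator with this behaviour is typically written as below. This is a sketch of the usual pattern under the name chunks_sketch, not a copy of the package's function; note that the final chunk may be shorter than chunk_size.

```python
def chunks_sketch(arr, chunk_size):
    """Yield successive chunk_size-sized slices of arr."""
    for i in range(0, len(arr), chunk_size):
        yield arr[i:i + chunk_size]

batches = list(chunks_sketch(list(range(5)), 2))
```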

class diffnets.nnutils.classify_ae(n_latent)[source]

Bases: torch.nn.modules.module.Module

Logistic Regression model

Parameters:n_latent (int) – Number of latent variables
classify(x)[source]

Perform classification task using latent space representation

Parameters:x (torch.cuda.FloatTensor or torch.FloatTensor) – Latent space vector
Returns:
Return type:Value between 0 and 1
forward(x)[source]

Perform classification task using latent space representation

Parameters:x (torch.cuda.FloatTensor or torch.FloatTensor) – Latent space vector
Returns:
Return type:Value between 0 and 1
diffnets.nnutils.my_l1(x, x_recon)[source]

Calculate l1 loss

Parameters:
  • x (torch.cuda.FloatTensor or torch.FloatTensor) – Input data
  • x_recon (torch.cuda.FloatTensor or torch.FloatTensor) – Reconstructed input
Returns:

Return type:

torch.cuda.FloatTensor or torch.FloatTensor

diffnets.nnutils.my_mse(x, x_recon)[source]

Calculate mean squared error loss

Parameters:
  • x (torch.cuda.FloatTensor or torch.FloatTensor) – Input data
  • x_recon (torch.cuda.FloatTensor or torch.FloatTensor) – Reconstructed input
Returns:

Return type:

torch.cuda.FloatTensor or torch.FloatTensor

class diffnets.nnutils.sae(layer_sizes, wm, uwm)[source]

Bases: diffnets.nnutils.ae

Supervised autoencoder

Parameters:
  • layer_sizes (list) – List of integers indicating the size of each layer in the encoder including the latent layer. First two must be identical.
  • wm (np.ndarray, shape=(n_inputs,n_inputs)) – Whitening matrix – is applied to input data
  • uwm (np.ndarray, shape=(n_inputs,n_inputs)) – unwhitening matrix

classify(latent)[source]

Perform classification task using latent space representation

Parameters:latent (torch.cuda.FloatTensor or torch.FloatTensor) – Latent space vector
Returns:
Return type:Value between 0 and 1
forward(x)[source]

Pass through the entire network

Parameters:x (torch.cuda.FloatTensor or torch.FloatTensor) – Input data for a given sample
Returns:
  • recon (torch.cuda.FloatTensor or torch.FloatTensor) – Reconstruction of the original input data
  • latent (torch.cuda.FloatTensor or torch.FloatTensor) – Latent space vector
  • label – Value between 0 and 1
class diffnets.nnutils.split_ae(layer_sizes, inds1, inds2, wm, uwm)[source]

Bases: torch.nn.modules.module.Module

Unsupervised autoencoder with a split input (i.e. 2 encoders)

Parameters:
  • layer_sizes (list) – List of integers indicating the size of each layer in the encoder including the latent layer. First two must be identical.
  • inds1 (np.ndarray) – Indices in the training input array that go into encoder1.
  • inds2 (np.ndarray) – Indices in the training input array that go into encoder2.
  • wm (np.ndarray, shape=(n_inputs,n_inputs)) – Whitening matrix – is applied to input data
  • uwm (np.ndarray, shape=(n_inputs,n_inputs)) – unwhitening matrix
decode(latent)[source]

Pass the latent space vector through the decoder

Parameters:latent (torch.cuda.FloatTensor or torch.FloatTensor) – Latent space vector
Returns:recon – Reconstruction of the original input data
Return type:torch.cuda.FloatTensor or torch.FloatTensor
encode(x)[source]

Pass the data through the encoder to the latent layer.

Parameters:x (torch.cuda.FloatTensor or torch.FloatTensor) – Input data for a given sample
Returns:
  • lat1 (torch.cuda.FloatTensor or torch.FloatTensor) – Latent space vector associated with encoder1
  • lat2 (torch.cuda.FloatTensor or torch.FloatTensor) – Latent space vector associated with encoder2
forward(x)[source]

Pass data through the entire network

Parameters:x (torch.cuda.FloatTensor or torch.FloatTensor) – Input data for a given sample
Returns:
  • recon (torch.cuda.FloatTensor or torch.FloatTensor) – Reconstruction of the original input data
  • latent (torch.cuda.FloatTensor or torch.FloatTensor) – Latent space vector
freeze_weights(old_net=None)[source]
Procedure to make the whitening matrix and unwhitening matrix untrainable layers. Additionally, freezes weights associated with a previously learned encoder layer.
Parameters:old_net (split_ae object) – Previously trained network with overlapping architecture. Weights learned in this previous network's encoder will be frozen in the new network.
split_inds
unfreeze_weights()[source]

Makes all encoder weights trainable.

diffnets.nnutils.split_inds(pdb, resnum, focus_dist)[source]
Identify indices close and far from a residue of interest.
Each index corresponds to an X,Y, or Z coordinate of an atom in the pdb.
Parameters:
  • pdb (md.Trajectory object) – Structure used to find close/far indices.
  • resnum (integer) – The residue number of interest.
  • focus_dist (float (nanometers)) – All indices within this distance of resnum will be selected as close indices.
Returns:

  • close_xyz_inds (np.ndarray) – Indices of x,y,z positions of atoms in pdb that are close to resnum.
  • non_close_xyz_inds (np.ndarray) – Indices of x,y,z positions of atoms in pdb that are not close to resnum.
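The mapping from atoms to x,y,z indices can be sketched with NumPy alone. The sketch assumes the flattened coordinate layout is [x0, y0, z0, x1, y1, z1, ...] and works on a plain (n_atoms, 3) position array instead of an md.Trajectory; atoms_to_xyz_inds and split_close_far are hypothetical helpers, and the real function selects the focus atoms from a residue number.

```python
import numpy as np

def atoms_to_xyz_inds(atom_inds):
    """Expand atom indices into indices of their x, y and z coordinates."""
    atom_inds = np.asarray(atom_inds, dtype=int)
    return (3 * atom_inds[:, None] + np.arange(3)).ravel()

def split_close_far(xyz, focus_atom_inds, focus_dist):
    """xyz: (n_atoms, 3) positions in nm; focus_atom_inds: atoms of the residue of interest."""
    diffs = xyz[:, None, :] - xyz[focus_atom_inds][None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    # An atom is "close" if it lies within focus_dist of any focus atom.
    close_atoms = np.flatnonzero(dists.min(axis=1) <= focus_dist)
    far_atoms = np.setdiff1d(np.arange(len(xyz)), close_atoms)
    return atoms_to_xyz_inds(close_atoms), atoms_to_xyz_inds(far_atoms)

xyz = np.array([[0.0, 0.0, 0.0],
                [0.5, 0.0, 0.0],
                [3.0, 0.0, 0.0]])
close_xyz_inds, non_close_xyz_inds = split_close_far(xyz, [0], focus_dist=1.0)
```

The two returned arrays play the role of inds1 and inds2 in the split_ae constructor.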

class diffnets.nnutils.split_sae(layer_sizes, inds1, inds2, wm, uwm)[source]

Bases: diffnets.nnutils.split_ae

Supervised autoencoder with split architecture

classify(latent)[source]

Perform classification task using latent space representation

Parameters:latent (torch.cuda.FloatTensor or torch.FloatTensor) – Latent space vector
Returns:
Return type:Value between 0 and 1
forward(x)[source]

Pass data through the entire network

Parameters:x (torch.cuda.FloatTensor or torch.FloatTensor) – Input data for a given sample
Returns:
  • recon (torch.cuda.FloatTensor or torch.FloatTensor) – Reconstruction of the original input data
  • latent (torch.cuda.FloatTensor or torch.FloatTensor) – Latent space vector
  • label – Value between 0 and 1
class diffnets.nnutils.svae(layer_sizes)[source]

Bases: diffnets.nnutils.vae

classify(latent)[source]
forward(x)[source]

Pass data through the entire network

Parameters:x (torch.cuda.FloatTensor or torch.FloatTensor) – Input data for a given sample
Returns:
  • recon (torch.cuda.FloatTensor or torch.FloatTensor) – Reconstruction of the original input data
  • latent (torch.cuda.FloatTensor or torch.FloatTensor) – Latent space vector
class diffnets.nnutils.vae(layer_sizes)[source]

Bases: diffnets.nnutils.ae

encode(x)[source]

Pass the data through the encoder to the latent layer.

Parameters:x (torch.cuda.FloatTensor or torch.FloatTensor) – Input data for a given sample
Returns:latent – Latent space vector associated with encoder1
Return type:torch.cuda.FloatTensor or torch.FloatTensor
forward(x)[source]

Pass data through the entire network

Parameters:x (torch.cuda.FloatTensor or torch.FloatTensor) – Input data for a given sample
Returns:
  • recon (torch.cuda.FloatTensor or torch.FloatTensor) – Reconstruction of the original input data
  • latent (torch.cuda.FloatTensor or torch.FloatTensor) – Latent space vector
reparameterize(mu, logvar)[source]

diffnets.training module

class diffnets.training.Dataset(train_inds, labels, data)[source]

Bases: torch.utils.data.dataset.Dataset

Characterizes a dataset for PyTorch

class diffnets.training.Trainer(job)[source]

Bases: object

apply_exmax(inputs)[source]

Apply expectation maximization to a batch of data.

Parameters:inputs (list) – list where the 0th index is a list of current classification labels of length == batch_size. 1st index is a corresponding list of variant simulation indicators. 2nd index is em_bounds.
Returns:
Return type:Updated labels – length == batch size
em_parallel(net, em_generator, train_inds, em_batch_size, indicators, em_bounds, em_n_cores, label_str, epoch)[source]
Use expectation maximization to update all training classification
labels.
Parameters:
  • net (nnutils neural network object) – Neural network

  • em_generator (Dataset object) – Training data

  • train_inds (np.ndarray) – Indices in data that are to be trained on

  • em_batch_size (int) – Number of examples that have their classification labels updated in a single round of expectation maximization.

  • indicators (np.ndarray, shape=(len(data),)) – Value to indicate which variant each data frame came from.

  • em_bounds (np.ndarray, shape=(n_variants,2)) – A range that sets what fraction of conformations you expect a variant to have the biochemical property. Rank order of variants is more important than the ranges themselves.

  • em_n_cores (int) – CPU cores to use for expectation maximization calculation

Returns:

new_labels – Updated classification labels for all training examples

Return type:

np.ndarray, shape=(len(data),)

get_targets(act_map, indicators, label_spread=None)[source]

Convert variant indicators into classification labels.

Parameters:
  • act_map (np.ndarray, shape=(n_variants,)) – Initial classification labels to give each variant.
  • indicators (np.ndarray, shape=(len(data),)) – Value to indicate which variant each data frame came from.
Returns:

targets – Classification labels for training.

Return type:

np.ndarray, shape=(len(data),)
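In its simplest form, converting indicators into labels is an array lookup: each frame receives the initial label of the variant it came from. The values below are hypothetical, and the sketch ignores the optional label_spread argument, whose behaviour is not described here.

```python
import numpy as np

act_map = np.array([0.0, 1.0, 0.5])        # hypothetical initial label per variant
indicators = np.array([0, 0, 1, 2, 1])     # variant index of each data frame
targets = act_map[indicators]              # one label per frame
```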

run(data_in_mem=False)[source]

Wrapper for running the training code

Parameters:data_in_mem (boolean) – If true, load all training data into memory. Training is faster this way.
Returns:net – Trained DiffNet
Return type:nnutils neural network object
set_training_data(job, train_inds, test_inds, labels, data)[source]

Construct generators out of the dataset for training, validation, and expectation maximization.

Parameters:
  • job (dict) – See training_dict.tx for all keys.
  • train_inds (np.ndarray) – Indices in data that are to be trained on
  • test_inds (np.ndarray) – Indices in data that are to be validated on
  • labels (np.ndarray,) – classification labels used for training
  • data (np.ndarray, shape=(n_frames,3*n_atoms) OR str to path) – All data
split_test_train(n, frac_test)[source]

Split data into training and validation sets.

Parameters:
  • n (int) – number of data points
  • frac_test (float between 0 and 1) – Fraction of dataset to reserve for validation set
Returns:

  • train_inds (np.ndarray) – Indices in data that are to be trained on
  • test_inds (np.ndarray) – Indices in data that are to be validated on
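A random index split of this kind can be sketched as follows. Shuffling with a permutation, sorting the resulting indices, and the seed argument are illustrative assumptions; split_test_train_sketch is not the package's method.

```python
import numpy as np

def split_test_train_sketch(n, frac_test, seed=None):
    """Randomly assign n indices to a training set and a validation set."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_test = int(n * frac_test)
    test_inds = np.sort(perm[:n_test])
    train_inds = np.sort(perm[n_test:])
    return train_inds, test_inds

train_inds, test_inds = split_test_train_sketch(10, 0.2, seed=0)
```

Every index lands in exactly one of the two sets, so the sets are disjoint and together cover range(n).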

train(data, training_generator, validation_generator, em_generator, targets, indicators, train_inds, test_inds, net, label_str, job, lr_fact=1.0)[source]

Core method for training

Parameters:
  • data (np.ndarray, shape=(n_frames,3*n_atoms) OR str to path) – Training data
  • training_generator (Dataset object) – Generator to sample training data
  • validation_generator (Dataset object) – Generator to sample validation data
  • em_generator (Dataset object) – Generator to sample training data in batches for expectation maximization
  • targets (np.ndarray, shape=(len(data),)) – classification labels used for training
  • indicators (np.ndarray, shape=(len(data),)) – Value to indicate which variant each data frame came from.
  • train_inds (np.ndarray) – Indices in data that are to be trained on
  • test_inds (np.ndarray) – Indices in data that are to be validated on
  • net (nnutils neural network object) – Neural network
  • label_str (int) – For file naming. Indicates what iteration of training we’re on. Training goes through several iterations where neural net architecture is progressively built deeper.
  • job (dict) – See training_dict.tx for all keys.
  • lr_fact (float) – Factor to multiply the learning rate by.
Returns:

  • best_nn (nnutils neural network object) – Neural network that has the lowest reconstruction error on the validation set.
  • targets (np.ndarray, shape=(len(data),)) – Classification labels after training.

diffnets.utils module

diffnets.utils.get_fns(dir_name, pattern)[source]
diffnets.utils.load_npy_dir(dir_name, pattern)[source]
diffnets.utils.load_traj_coords_dir(dir_name, pattern, top)[source]
diffnets.utils.mkdir(dir_name)[source]

Module contents

diffnets Supervised and self-supervised autoencoders to identify the mechanistic basis for biochemical differences between protein variants.