Feature customization#

Feature engineering is a crucial step in molecular science, utilizing the domain expertise to extract meaningful features from raw data using certain data mining techniques. Properly defining and selecting these features can significantly enhance the performance of machine learning algorithms. In this tutorial, we will delve into the process of feature customization within Nearl’s framework and present a case study on RFScore.

General concept#

When defining a new feature, you need to inherit the base class Features and implement the feature function. The cache method is designed to store the structural information (self.cached_props) that will be used during the following steps. It is designed to be called only once when processing each trajectory, unless the trajectory is reloaded from time to time. The returned variables from the query method are directly put to the run method. The following is a simple implementation of a feature.

import numpy as np
from nearl import commands, utils
from nearl.features import Features

class MyFirstFeature(Features):
  def __init__(self, **kwargs):
    super().__init__(**kwargs)
    # Add your own initialization

  def cache(self, trajectory):
    super().cache(trajectory)
    # this example is to generate a random number for each atom
    self.cached_props = np.full(trajectory.top.n_atoms, 1)

  def query(self, topology, frames, focus):
    # Retrieve the concerned substructure for further feature calculation
    mask_inbox, coords = super().query(topology, frame, focus)
    weights = self.cached_props[mask_inbox]
    return coords, weights

  def run(self, coords, weights):
    # Transform the frame-slice into a feature vector
    feature_vector = commands.frame_voxelize(coords, weights, self.dims, self.spacing, self.cutoff, self.sigma)
    return feature_vector

  def dump(self, feature_vector):
    # Put the output feature ``feature_vector`` into HDF file
    utils.append_hdf_data(self.outfile, self.outkey,
                          np.asarray([result], dtype=np.float32),
                          dtype=np.float32, maxshape=(None, *self.dims),
                          chunks=True, compression="gzip", compression_opts=4)

Case study: RFScore features#

The following code-block, we will customize the featurization process using the RFScore features (Ballester et al.). RFScore is a 36-dimensional feature that each element represents the count of specific protein–ligand atom type pairs within a certain distance cutoff. The original paper and the detailed description of RFScore is available at Bioinformatics.

Feature implementation#

The selected components (mask_interest) are designated as the ligand part (moiety of interest) and their counterparts (!mask_interest) are considered as the receptor part. In the cache method, the indices of ligand and receptor parts are stored in self.idx_interest and self.idx_counterpart. The resulting feature vector is a 4×9 matrix, where the rows represent the protein atoms (C, N, O, S) and the columns represent the ligand atoms (C, N, O, F, P, S, Cl, Br, I). Consequently, the hash map that maps all possible contacts to their internal indices if the resulting vector is stored in self.idx_hashmap. Since applying a bounding box is unnecessary in this example, the query method directly returns the frame-slice (frame_coords). The run method then receives the output coordinates from the query method and computes the contact map based on the cached properties and the cutoff.

Note that RFScore is a static feature, and we use only the first frame of each frame-slice. Since this is not a voxel-based feature, the outshape for feature initialization is manually set to [1, 36]. Hydrogen atoms are included during the structure processing, although they do not affect the final count of contacts.

import numpy as np
from scipy.spatial import KDTree
from nearl.features import Feature

# Result is a 4*9 (36 dimensions) contact map
# Rows (protein)   : C, N, O, S
# Columns (ligand) : C, N, O, F, P, S, Cl, Br, I
PROT_MAP = {6: 0, 7: 1, 8: 2, 16: 3}
LIG_MAP = {6: 0, 7: 1, 8: 2, 9: 3, 15: 4, 16: 5, 17: 6, 35: 7, 53: 8}

class RFScoreFeat(Feature):
  def __init__(self, moiety_of_interest, cutoff, **kwargs):
    super().__init__(outshape = [1, 36], **kwargs)
    self.moi = moiety_of_interest
    self.cutoff = cutoff

  def cache(self, trajectory):
    super().cache(trajectory)
    # Build the map of parts of interest and the counterpart
    self.idx_interest = trajectory.top.select(self.moi)  # The indices of the moiety of interest
    self.idx_counterpart = np.setdiff1d(np.arange(trajectory.top.n_atoms), self.idx_interest)
    self.atom_numbers = np.array([i.atomic_number for i in trajectory.top.atoms], dtype=int)

    # Construct a hashmap for fast lookup of all possible contacts
    self.idx_hashmap = {}
    for p, p_idx in PROT_MAP.items():
      for l, l_idx in LIG_MAP.items():
        self.idx_hashmap[f"{p}_{l}"] = (p_idx, l_idx)

  def query(self, topology, frame_coords, focal_point):
    return (frame_coords,)

  def run(self, coords):
    # Build a kd-tree for the counterpart coordinates
    kd_tree = KDTree(coords[0][self.idx_counterpart])
    # Initialize the feature vector
    rf_feature = np.zeros((4, 9), dtype=int)

    # Process atoms in the moiety of interest
    for idx in self.idx_interest:
      atom_number = self.atom_numbers[idx]
      atom_crd = coords[0][idx]
      inner_idxs = kd_tree.query_ball_point(atom_crd, self.cutoff)
      counterpart_indices = self.idx_counterpart[inner_idxs]
      for idx_prot in counterpart_indices:
        iidx = self.idx_hashmap.get(f"{self.atom_numbers[idx_prot]}_{atom_number}", None)
        if iidx is not None:
          rf_feature[iidx] += 1
    return rf_feature.reshape(-1)

Feature generation#

As in the Quick Start, we will use the simple trajectory in the example dataset for demonstration.

RFScore feature will focus on the ligand part (annotated as :LIG) and the cutoff distance is set to 5.5 Å. The resulting features will be dumped to the /tmp/rf_data.h5 file and the resulting vectors will be stored in the rf_feature key in the HDF5 file.

import nearl
import nearl.featurizer, nearl.io
# Initialize the trajectory loader and featurizer
EXAMPLE_DATA = nearl.get_example_data("/tmp/nearl_test")
loader = nearl.io.TrajectoryLoader(EXAMPLE_DATA["MINI_TRAJSET"])
featurizer = nearl.featurizer.Featurizer({
  "time_window": 10,
  "outfile": "/tmp/rf_data.h5",
})

# Register the feature and start the featurization
featurizer.register_feature(RFScoreFeat(":LIG", 5.5, outkey="rf_feature"))
featurizer.register_trajloader(loader)
featurizer.register_focus([":LIG"], "mask")
featurizer.run()

Result inspection#

The following code snippet retrieves the feature vectors under rf_feature we just generated.

import h5py
with h5py.File("/tmp/rf_data.h5", "r") as hdf:
  x_train = hdf["rf_feature"][:]
  print(x_train.shape)