Title: | A Toolkit for Using Peptide Sequences in Machine Learning |
---|---|
Description: | This toolkit is designed for the manipulation and analysis of peptides. It provides functionality to assist researchers in peptide engineering and proteomics. Users can manipulate peptides by adding amino acids at every position, count occurrences of each amino acid at each position, and transform amino acid counts based on probabilities. The package can select the best versus the worst peptides and analyze them, which includes counting specific residues, reducing peptide sequences, extracting features through One Hot Encoding (OHE), and using Quantitative Structure-Activity Relationship (QSAR) properties (based on the package 'Peptides' by Osorio et al. (2015) <doi:10.32614/RJ-2015-001>). This package is intended for both researchers and bioinformatics enthusiasts working on peptide-based projects, especially for use with machine learning. |
Authors: | Josep-Ramon Codina [aut, cre] |
Maintainer: | Josep-Ramon Codina <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.0.2 |
Built: | 2025-02-22 04:55:39 UTC |
Source: | https://github.com/jrcodina/peptoolkit |
This function transforms the counts of amino acids into a matrix of -1, 0, and 1 values based on the probability of appearance of each amino acid at each position.
appearance_to_binary(x, threshold = 1.65, group = "Best", percentage = 0.05)
x | A data frame containing peptide sequences. |
threshold | The probability threshold to determine the transformation. |
group | A character string indicating which part of the data to consider. Either 'Best' or 'Worst'. |
percentage | The percentage of the data to consider, if group is specified. |
A matrix with the same dimensions as the input where each cell has been transformed to -1, 0, or 1 based on the probability threshold.
# Generate a mock data frame
peptide_data <- data.frame(Sequence = c("ACGT", "TGCA", "GATC", "CGAT"))
# Apply the function to the mock data
appearance_to_binary(peptide_data, group = "Best", percentage = 0.5)
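The manual does not spell out how counts become -1, 0, or 1. One plausible reading, offered here only as an assumption, is a z-score cut-off: the default threshold of 1.65 is close to the one-sided 95% critical value of the standard normal distribution (1.645). The sketch below standardizes each position's counts and applies that cut-off. The helper `to_ternary` and the sample counts are hypothetical and are not part of the package.

```r
# Hypothetical helper, not part of peptoolkit.
# counts: numeric matrix of amino acid counts (rows = positions, columns = amino acids).
to_ternary <- function(counts, threshold = 1.65) {
  # Assumption: standardize each row (position) so counts become z-scores.
  z <- t(apply(counts, 1, function(row) {
    s <- sd(row)
    if (s == 0) rep(0, length(row)) else (row - mean(row)) / s
  }))
  # +1 above the threshold, -1 below its negative, 0 otherwise.
  out <- (z > threshold) - (z < -threshold)
  dimnames(out) <- dimnames(counts)
  out
}

# Made-up counts for two positions and six amino acids.
counts <- matrix(c(10, 0, 0, 0, 0, 0,
                    2, 2, 2, 2, 2, 2),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("pos1", "pos2"),
                                 c("A", "C", "D", "E", "F", "G")))
to_ternary(counts)
```

In this made-up input only A at the first position stands out enough to be marked 1; the flat second position stays at 0 throughout.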
This function counts the occurrence of each of the 20 amino acids at each of the first 'n' positions across a vector of peptide sequences.
count_aa(peptides, n = 4)
peptides | A character vector of peptide sequences. |
n | The number of initial positions to consider in each peptide sequence. |
A data frame with 'n' rows and 20 columns where each row represents a position in the peptide sequence and each column represents an amino acid. Each cell in the data frame contains the count of a particular amino acid at a particular position.
count_aa(c("ACDF", "BCDE", "ABCD"), n = 2)
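To make the shape of the returned table concrete, here is a base R sketch of position-wise counting. It is not the package's implementation; `count_positions`, the `pos1`, `pos2` row names, and the sample peptides are assumptions made for illustration.

```r
# Hypothetical helper, not part of peptoolkit.
amino_acids <- c("A", "C", "D", "E", "F", "G", "H", "I", "K", "L",
                 "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y")

count_positions <- function(peptides, n = 4) {
  # For every amino acid, count how often it occurs at positions 1..n.
  counts <- sapply(amino_acids, function(aa) {
    sapply(seq_len(n), function(pos) sum(substr(peptides, pos, pos) == aa))
  })
  # Force an n x 20 matrix even when n = 1.
  counts <- matrix(counts, nrow = n,
                   dimnames = list(paste0("pos", seq_len(n)), amino_acids))
  as.data.frame(counts)
}

position_counts <- count_positions(c("ACDF", "CCDE", "AGCD"), n = 2)
position_counts[, c("A", "C", "G")]
```

In this sketch, letters outside the 20 standard amino acids (such as B) fall in no column, so a row can sum to less than the number of peptides.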
This function takes a data frame or a vector of peptide sequences and generates a one-hot encoded data frame representing each amino acid in the sequences. It can also include additional data (such as docking information), if provided. Furthermore, it can generate a peptide library of specified length n.
extract_features_OHE(df = NULL, sequence_col = "Sequence", docking_col = NULL, n = NULL)
df | A data frame or a vector of peptide sequences. If 'df' is provided, 'n' will be ignored. |
sequence_col | A string representing the name of the column containing the peptide sequences. |
docking_col | A string representing the name of the column containing the docking information. |
n | An integer representing the length of the peptide library to be generated. If 'df' is provided, 'n' will be ignored. |
A data frame containing one-hot encoded peptide sequences and, if provided, docking information.
# Load required library caret
library(caret)
extract_features_OHE(df = c('ACA', 'EDE'))
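For readers unfamiliar with one-hot encoding, the following base R sketch shows the general idea: each position contributes 20 indicator columns, exactly one of which is 1. It deliberately avoids caret and is not the package's code; the helper `one_hot` and the `pos<i>_<residue>` column naming are assumptions.

```r
# Hypothetical helper, not part of peptoolkit.
amino_acids <- c("A", "C", "D", "E", "F", "G", "H", "I", "K", "L",
                 "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y")

one_hot <- function(peptides) {
  len <- nchar(peptides[1])
  # This sketch only handles peptides of equal length.
  stopifnot(all(nchar(peptides) == len))
  blocks <- lapply(seq_len(len), function(pos) {
    residue <- substr(peptides, pos, pos)
    # One indicator column per amino acid at this position.
    m <- sapply(amino_acids, function(aa) as.integer(residue == aa))
    m <- matrix(m, nrow = length(peptides))
    colnames(m) <- paste0("pos", pos, "_", amino_acids)
    m
  })
  data.frame(Sequence = peptides, do.call(cbind, blocks),
             stringsAsFactors = FALSE)
}

encoded <- one_hot(c("ACA", "EDE"))
encoded[, c("Sequence", "pos1_A", "pos1_E", "pos2_C", "pos2_D")]
```

A peptide of length 3 therefore yields 60 indicator columns, and a docking score, if available, could simply be bound on as one more column.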
This function extracts various Quantitative Structure-Activity Relationship (QSAR) features from peptide sequences. The extraction is based on a variety of amino acid properties and functions from the "Peptides" package (https://github.com/dosorio/Peptides/).
extract_features_QSAR(df = NULL, n = NULL, sequence_col = "Sequence", docking_col = NULL, pH = 7.4, normalize = FALSE)
df | A data frame or a vector of peptide sequences. If 'df' is provided, 'n' will be ignored. |
n | An integer representing the length of the peptide library to be generated. If 'df' is provided, 'n' will be ignored. |
sequence_col | A string representing the name of the column containing the peptide sequences. |
docking_col | A string representing the name of the column containing the docking information. |
pH | The pH used for calculating charge (default is 7.4). |
normalize | A boolean indicating if the data should be normalized (default is FALSE). |
A data frame with the calculated peptide properties.
extract_features_QSAR(df = c('ACA', 'EDE'))
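As an illustration of what a property-based descriptor looks like, the sketch below computes the mean Kyte-Doolittle hydropathy of each peptide in base R. It is only an example of this kind of feature: the manual does not list which properties the function returns, and the helper `mean_hydropathy` and the column names are assumptions.

```r
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
kyte_doolittle <- c(A = 1.8, R = -4.5, N = -3.5, D = -3.5, C = 2.5,
                    Q = -3.5, E = -3.5, G = -0.4, H = -3.2, I = 4.5,
                    L = 3.8, K = -3.9, M = 1.9, F = 2.8, P = -1.6,
                    S = -0.8, T = -0.7, W = -0.9, Y = -1.3, V = 4.2)

# Hypothetical helper, not part of peptoolkit: average hydropathy per peptide.
mean_hydropathy <- function(peptides) {
  vapply(strsplit(peptides, ""),
         function(residues) mean(kyte_doolittle[residues]),
         numeric(1))
}

features <- data.frame(Sequence = c("ACA", "EDE"), stringsAsFactors = FALSE)
features$Length <- nchar(features$Sequence)
features$Hydropathy <- mean_hydropathy(features$Sequence)
features
```

The hydrophobic peptide ACA scores positive and the acidic peptide EDE scores negative, which is the sort of contrast a downstream model can learn from.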
This function counts the number of specified residues in each peptide sequence and filters out the sequences that contain more than the specified limit. By default, it targets small aliphatic residues (A, V, I, L, G).
filter_residues(df, sequence_col = "Sequence", residues = c("A", "V", "I", "L", "G"), max_residues = 2)
df | A data frame containing peptide sequences. |
sequence_col | The name of the column that contains the sequences. |
residues | A character vector of residues to count. |
max_residues | The maximum number of allowed residues. |
A filtered data frame.
# Generate a mock data frame
peptide_data <- data.frame(Sequence = c("AVILG", "VILGA", "ILGAV", "LGAVI"))
# Apply the function to the mock data
filter_residues(peptide_data, residues = c("A", "V", "I", "L", "G"), max_residues = 2)
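The filtering rule can be sketched in base R as a count followed by a subset. This is a minimal illustration rather than the package's code; `count_residues` and the sample peptides are hypothetical, and the sketch assumes that a peptide with exactly `max_residues` matches is kept.

```r
# Hypothetical helper, not part of peptoolkit:
# number of characters in each sequence that belong to `residues`.
count_residues <- function(sequences, residues) {
  vapply(strsplit(sequences, ""),
         function(chars) sum(chars %in% residues),
         integer(1))
}

peptides <- data.frame(Sequence = c("AVKDE", "AVILG", "KDEST", "GAWKR"),
                       stringsAsFactors = FALSE)
n_aliphatic <- count_residues(peptides$Sequence, c("A", "V", "I", "L", "G"))

# Keep peptides with at most two small aliphatic residues.
kept <- peptides[n_aliphatic <= 2, , drop = FALSE]
kept
```

Here "AVILG" is dropped because all five of its residues are in the set, while the other three peptides contain two or fewer.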
This function generates new peptide sequences by adding each of the 20 amino acids to each position of the input peptide or peptides.
increment(peptides, num_added = 1)
peptides | A character vector of peptide sequences. |
num_added | The number of amino acids to be added to each position of the peptide. |
A character vector of new peptide sequences.
increment(c("AC", "DE"))
increment("ACDE", num_added = 2)
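One way to picture the single-residue case is as insertion of every amino acid at every gap of the peptide, including both ends. The base R sketch below does that for one peptide. It is an assumption about the behaviour, not the package's code; in particular, whether the package inserts at the ends or removes duplicates is not stated in this manual, and `insert_one` is a hypothetical helper.

```r
# Hypothetical helper, not part of peptoolkit.
amino_acids <- c("A", "C", "D", "E", "F", "G", "H", "I", "K", "L",
                 "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y")

insert_one <- function(peptide) {
  len <- nchar(peptide)
  # pos = 0 inserts before the first residue, pos = len after the last.
  out <- unlist(lapply(0:len, function(pos) {
    paste0(substr(peptide, 1, pos), amino_acids,
           substr(peptide, pos + 1, len))
  }))
  # The same sequence can arise from different gaps (e.g. "AAC"), so deduplicate.
  unique(out)
}

variants <- insert_one("AC")
length(variants)
head(variants)
```

For "AC" there are 3 gaps and 20 residues, so 60 raw insertions; "AAC" and "ACC" each arise twice, leaving 58 distinct sequences.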
This function takes a vector of peptide sequences and generates all possible sequences by removing one amino acid residue at a time. It can also associate each sequence with an ID, if provided.
reduce_sequences(peptides, id = NULL)
peptides | A character vector of peptide sequences. |
id | A character vector of IDs that correspond to the peptides. |
A list of data frames, each containing all possible sequences resulting from removing one amino acid from the original sequence.
# Generate a mock vector of peptide sequences
peptides <- c("AVILG", "VILGA", "ILGAV", "LGAVI")
# Apply the function to the mock data
reduce_sequences(peptides)
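The single-deletion step can be sketched in base R as follows. This is an illustration only: `remove_one` and the column names `Original`, `Removed_position`, and `Reduced` are assumptions, and the data frames returned by the package may be laid out differently.

```r
# Hypothetical helper, not part of peptoolkit:
# all sequences obtained by deleting exactly one residue.
remove_one <- function(peptide) {
  len <- nchar(peptide)
  reduced <- vapply(seq_len(len), function(pos) {
    paste0(substr(peptide, 1, pos - 1), substr(peptide, pos + 1, len))
  }, character(1))
  data.frame(Original = peptide,
             Removed_position = seq_len(len),
             Reduced = reduced,
             stringsAsFactors = FALSE)
}

# One data frame per input peptide, collected in a list.
reduced_list <- lapply(c("AVILG", "VILGA"), remove_one)
reduced_list[[1]]
```

Each peptide of length 5 yields five 4-residue sequences, one per deleted position.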
This function takes the output of *appearance_to_binary* for two groups, identifies the entries that are 1 in one group and 0 or -1 in the other, and expands them into a grid of all possible combinations.
select_best_vs_worst(appearance_best, appearance_worst)
appearance_best | A matrix with transformed counts for the 'best' group. |
appearance_worst | A matrix with transformed counts for the 'worst' group. |
A data frame with combinations of 'best' peptides.
# Generate some mock data
appearance_best <- matrix(c(1, -1, 0, 1, -1), nrow = 5, ncol = 4)
appearance_worst <- matrix(c(-1, 1, 0, -1, 1), nrow = 5, ncol = 4)
# Call the function
select_best_vs_worst(appearance_best, appearance_worst)
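The selection and expansion described above can be sketched with `expand.grid` from base R. The sketch assumes, purely for illustration, that rows are positions and columns are amino acids, and it uses a four-letter alphabet to keep the matrices readable; the orientation the package expects may differ.

```r
# Assumed layout: rows = positions, columns = amino acids (small alphabet for brevity).
aa <- c("A", "C", "D", "E")
best <- matrix(c(1, 0,  0,  1,
                 0, 1,  1, -1),
               nrow = 2, byrow = TRUE, dimnames = list(c("pos1", "pos2"), aa))
worst <- matrix(c(-1, 0,  1, 0,
                   0, 1, -1, 0),
                nrow = 2, byrow = TRUE, dimnames = list(c("pos1", "pos2"), aa))

# Enriched in the best group (1) and not enriched in the worst group (0 or -1).
favored <- best == 1 & worst != 1

# Residues allowed at each position, then every combination across positions.
choices <- lapply(seq_len(nrow(favored)), function(i) colnames(favored)[favored[i, ]])
grid <- expand.grid(choices, stringsAsFactors = FALSE)
candidates <- do.call(paste0, grid)
candidates
```

With these mock matrices, A and E qualify at the first position and only D at the second (C is also enriched in the worst group), so the grid contains "AD" and "ED".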