tfclass_predict

Classes

`PredictionManager`	Prediction manager class. Coordinates the prediction for a single bed file.
`Predictor`	Predictor class. Handles prediction and model execution.
`SequenceProcessor`	Class for processing sequences.
`IOInterface`

Package Contents

class tfclass_predict.PredictionManager(bed_file, genome_file, res_dir, bert_model, tfclass_model, tfargs=None)

Prediction manager class. Coordinates the prediction for a single bed file.

iointerface

bed_data

predictor

_init_BERT(bert_model)

Initializes the BERT tokenizer.

Parameters: bert_model – Path to the BERT model directory.

_init_devices(num_threads, num_gpus, num_cpus, memory_limit)

Initializes the devices, i.e. configures GPU and CPU usage.

Parameters:

num_threads – Number of threads used.
num_gpus – Number of GPUs.
num_cpus – Number of CPUs.
memory_limit – Memory limit per GPU.

_init_GPU(memory_limit=None)

Initializes GPU usage and enables memory growth.

predict(subseq_length=15, stride_length=1, batch_size=2000)

Starts the prediction.

Parameters:

subseq_length – Length of the subsequences into which each read is split.
stride_length – Number of base pairs the window moves per step (1 gives a sliding window; subseq_length gives non-overlapping k-mer splits).
batch_size – Number of intervals processed in one batch.

Returns:

Count vectors and prediction dictionary.
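To illustrate how subseq_length and stride_length interact, here is a minimal sketch of the windowing. The helper split_into_subsequences is hypothetical and not part of tfclass_predict; the package's own splitting logic may differ, for example in how it treats a trailing partial window.

```python
def split_into_subsequences(sequence, subseq_length=15, stride_length=1):
    """Return every full-length window of `sequence` (hypothetical helper)."""
    if subseq_length <= 0 or stride_length <= 0:
        raise ValueError("subseq_length and stride_length must be positive")
    last_start = len(sequence) - subseq_length
    # Assumption: trailing windows shorter than subseq_length are dropped.
    return [
        sequence[start:start + subseq_length]
        for start in range(0, last_start + 1, stride_length)
    ]


read = "ACGTACGTAC"
# stride_length=1 gives overlapping (sliding) windows.
sliding = split_into_subsequences(read, subseq_length=4, stride_length=1)
# stride_length=subseq_length gives non-overlapping splits.
disjoint = split_into_subsequences(read, subseq_length=4, stride_length=4)
```

With a 10 bp read and 4 bp windows, the sliding variant yields seven overlapping windows, while the non-overlapping variant yields two and leaves the last two bases unused.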

save_results()

Saves the prediction results to disk.

class tfclass_predict.Predictor(bed_data, tokenizer, model_path, genome_file)

Predictor class. Handles prediction and model execution.

tokenizer

bed_data

SequenceProcessor

model

_init_model(model_path)

Loads the TFClass model and initializes the TFBert model.

Parameters:

model_path – Path to the TFClass model.

Returns:

Initialized TFClass model.

predict_bed_data(subseq_length, stride_length, batch_size)

Processes genomic sequences from the bed_data DataFrame, extracts subsequences, converts them into tokenized k-mers, and uses the TFClass model to make predictions on these sequences. The predictions are aggregated and associated with their corresponding sequence indices.

Workflow:

1. Initializes lists to store aggregated predictions and their corresponding sequence indices.
2. Iterates over each row in the bed_data DataFrame.
3. For each row:
   - Extracts the genomic sequence based on ‘seqnames’, ‘start’, and ‘end’ with a desired length of 150 bp.
   - Skips sequences that are empty or shorter than the desired length.
   - Generates subsequences from the full sequence.
   - Converts each subsequence into k-mers and then tokenizes them.
   - Accumulates tokenized sequences until the batch size is reached.
   - Uses the TFClass model to make predictions on the batch of tokenized sequences.
   - Stores the predictions and their corresponding indices in the aggregated lists.
4. Processes any remaining sequences that did not form a complete batch.
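The batching part of this workflow can be sketched as follows. This is an illustration under stated assumptions, not the package's code: predict_in_batches, its arguments, and the stand-in fake_model are hypothetical names, and the sketch assumes that batches are filled with tokenized sequences across rows and flushed once more at the end.

```python
def predict_in_batches(tokenized_by_row, batch_size, predict_fn):
    """Batch tokenized sequences across rows, tracking each one's row index."""
    predictions, indices = [], []
    batch, batch_indices = [], []

    def flush():
        # Run the model on the current batch and record results with indices.
        if batch:
            predictions.extend(predict_fn(batch))
            indices.extend(batch_indices)
            batch.clear()
            batch_indices.clear()

    for row_index, tokenized_sequences in tokenized_by_row:
        for tokens in tokenized_sequences:
            batch.append(tokens)
            batch_indices.append(row_index)
            if len(batch) >= batch_size:
                flush()
    flush()  # process the remainder that did not fill a complete batch
    return predictions, indices


# Stand-in for the model (assumption): "predicts" the length of each token list.
batch_sizes_seen = []

def fake_model(batch):
    batch_sizes_seen.append(len(batch))
    return [len(tokens) for tokens in batch]


rows = [(0, [[1, 2], [3]]), (1, [[4, 5, 6]]), (2, [[7], [8, 9]])]
preds, idx = predict_in_batches(rows, batch_size=2, predict_fn=fake_model)
```

Keeping a parallel list of row indices is what lets each prediction be mapped back to the BED interval it came from, even when one batch spans several rows.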

Parameters:

subseq_length – Length of the subsequences into which each read is split (i.e. the k-mer size).
stride_length – Number of base pairs the window moves per step (1 gives a sliding window; subseq_length gives non-overlapping k-mer splits).
batch_size – Number of intervals processed in one batch.

Returns:

class tfclass_predict.SequenceProcessor(tokenizer, genome)

Class for processing sequences.

tokenizer

genome

extract_fasta_sequences(chromosome, start_str, end_str, desired_length=150)

Extracts a genomic sequence of a specified length from the provided chromosome coordinates. Standardizes chromosome names and adjusts coordinates to ensure the sequence meets the desired length.

Parameters:

chromosome – Chromosome name (hg38).
start_str – Start coordinate of the sequence in bp.
end_str – End coordinate of the sequence in bp.
desired_length – Desired length of the extracted sequence in bp.

Returns:

A genomic sequence of the specified length.
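How the coordinates are adjusted is not specified here. One plausible reading, shown purely as a sketch, is to center a window of desired_length on the midpoint of the interval and to normalize chromosome names to the "chr" prefix style. Both helpers below are hypothetical and both behaviors are assumptions, not confirmed details of extract_fasta_sequences.

```python
def standardize_chromosome(chromosome):
    """Assumption: names are normalized to the 'chr'-prefixed style."""
    name = str(chromosome).strip()
    return name if name.startswith("chr") else f"chr{name}"


def centered_window(start_str, end_str, desired_length=150):
    """Return (start, end) of a desired_length window centered on the interval."""
    start, end = int(start_str), int(end_str)
    midpoint = (start + end) // 2
    # Clamp at 0 so the window never starts before the chromosome begins.
    new_start = max(0, midpoint - desired_length // 2)
    return new_start, new_start + desired_length


window = centered_window("1000", "1400")
near_start = centered_window("10", "20")
```

The sketch does not clamp against the chromosome end, which would require the chromosome length from the genome file.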

sequence_to_kmers(sequence, k=6)

Splits a string into k-mers of size k.

Parameters:

sequence – String to split.
k – k-mer size.

Returns:

List of k-mers.
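As an illustration of k-mer splitting, the sketch below produces overlapping k-mers with a step of one base, which is the usual preprocessing for DNABERT-style models. Whether sequence_to_kmers uses exactly this step is an assumption, and sequence_to_kmers_sketch is a hypothetical stand-in, not the package's function.

```python
def sequence_to_kmers_sketch(sequence, k=6):
    """Return all overlapping k-mers of `sequence` (assumed step of 1)."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence shorter than k yields an empty list.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


kmers = sequence_to_kmers_sketch("ACGTACGT", k=6)
# DNABERT-style tokenizers typically take the k-mers joined by spaces.
kmer_text = " ".join(kmers)
```

A sequence of length n yields n - k + 1 overlapping k-mers, so a 15 bp subsequence with k=6 gives 10 k-mers.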

kmers_to_tokens(kmers, max_length=15)

Converts k-mers into tokens using DNABERT.

Parameters:

kmers – List of k-mers.
max_length – Maximum length of tokens.

Returns:

List of tokens.

class tfclass_predict.IOInterface(bed_file: str, genome_file: str, res_dir: str)

bed_file

file_name

genome_file

res_dir

read_atac_seq_data()

Reads ATAC-seq regions from the BED file given in the initializer.

Returns:

BED file input as a pd.DataFrame.
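For orientation, the sketch below parses the first three BED columns using only the standard library. It is not the package's reader, which returns a pd.DataFrame; read_bed_records is a hypothetical helper, and the column names seqnames, start, and end are taken from the predict_bed_data workflow description.

```python
import csv
import io


def read_bed_records(handle):
    """Parse chromosome, start, and end from tab-separated BED lines."""
    records = []
    for fields in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and the optional header lines allowed in BED files.
        if not fields or fields[0].startswith(("#", "track", "browser")):
            continue
        records.append(
            {"seqnames": fields[0], "start": int(fields[1]), "end": int(fields[2])}
        )
    return records


example = io.StringIO(
    "track name=peaks\n"
    "chr1\t100\t250\tpeak_1\n"
    "chr2\t500\t650\tpeak_2\n"
)
bed_records = read_bed_records(example)
```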

write_predictions(counts_vec, pred_dict, bed_data)

Writes predictions and count vectors to output files.

Parameters:

counts_vec – Count vectors from the Predictor.predict function.
pred_dict – Dictionary from the Predictor.predict function.
bed_data – DataFrame from the BED file.