tfclass_predict.predictor

Classes

`ClassAUC`	Metric used in the training steps; it must be kept so that the model can be used.
`Predictor`	Handles prediction, i.e. execution of the model.

Module Contents

class tfclass_predict.predictor.ClassAUC(name='ClassAUC', **kwargs)

Bases: tensorflow.metrics.AUC

Metric used in the training steps; it must be kept so that the model can be used.

class tfclass_predict.predictor.Predictor(bed_data, tokenizer, model_path, genome_file)

The Predictor class handles prediction, i.e. execution of the model.

tokenizer

bed_data

SequenceProcessor

model

_init_model(model_path): Load the TFClass model. Initializes the TFBert model. :param model_path: Path to TFClass model. :return: Initialized TFClass model.

predict_bed_data(subseq_length, batch_size)

Processes genomic sequences from the bed_data DataFrame, extracts subsequences, converts them into tokenized k-mers, and uses the TFClass model to make predictions on these sequences. The predictions are aggregated and associated with their corresponding sequence indices.

Workflow:

1. Initializes lists to store aggregated predictions and their corresponding sequence indices.
2. Iterates over each row in the bed_data DataFrame.
3. For each row:
   - Extracts the genomic sequence based on ‘seqnames’, ‘start’, and ‘end’ with a desired length of 150.
   - Skips sequences that are empty or shorter than the desired length.
   - Generates subsequences from the full sequence.
   - Converts each subsequence into k-mers and then tokenizes them.
   - Accumulates tokenized sequences until the batch size is reached.
   - Uses the TFClass model to make predictions on the batch of tokenized sequences.
   - Stores the predictions and their corresponding indices in the aggregated lists.
4. Processes any remaining sequences that did not form a complete batch.
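The subsequence and k-mer steps of the workflow can be illustrated with a minimal sketch. The helper names split_into_subsequences and to_kmers are hypothetical and not part of tfclass_predict. The sketch also assumes non-overlapping windows, a dropped trailing remainder, overlapping k-mers with stride 1, and k=3; the actual implementation may differ on each of these points.

```python
def split_into_subsequences(sequence, subseq_length):
    """Cut a sequence into consecutive, non-overlapping windows.

    Assumption: a trailing remainder shorter than subseq_length is dropped.
    """
    last_start = len(sequence) - subseq_length
    return [
        sequence[i:i + subseq_length]
        for i in range(0, last_start + 1, subseq_length)
    ]


def to_kmers(subsequence, k=3):
    """Turn a subsequence into overlapping k-mers separated by spaces.

    Assumption: stride 1 and space-joined output, as is common for
    BERT-style DNA tokenizers. The value of k is illustrative only.
    """
    return " ".join(
        subsequence[i:i + k] for i in range(len(subsequence) - k + 1)
    )


if __name__ == "__main__":
    for window in split_into_subsequences("ACGTACGTAC", 4):
        print(window, "->", to_kmers(window))
```

The space-joined k-mer string is the kind of input that would then be handed to the tokenizer before batching.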

Parameters:

subseq_length – Length of the subsequences into which a read is split.
batch_size – Number of intervals that should be processed in one batch.
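How batch_size drives the accumulate, predict, and flush pattern described in the workflow can be sketched as follows. This is an illustration under stated assumptions, not the package's code: predict_in_batches and predict_fn are hypothetical names, and predict_fn stands in for the TFClass model call.

```python
def predict_in_batches(indexed_items, batch_size, predict_fn):
    """Run predict_fn on groups of at most batch_size items.

    indexed_items yields (sequence_index, item) pairs. predict_fn is a
    hypothetical stand-in for the model; it must return one prediction
    per item in the batch it receives.
    """
    predictions, indices = [], []
    batch, batch_indices = [], []

    for index, item in indexed_items:
        batch.append(item)
        batch_indices.append(index)
        if len(batch) == batch_size:
            # A full batch is ready: predict and store the results.
            predictions.extend(predict_fn(batch))
            indices.extend(batch_indices)
            batch, batch_indices = [], []

    if batch:
        # Flush the remaining items that did not fill a complete batch.
        predictions.extend(predict_fn(batch))
        indices.extend(batch_indices)

    return predictions, indices


if __name__ == "__main__":
    items = [(0, "ACG CGT"), (0, "CGT GTA"), (1, "TTT")]
    # Dummy model: the "prediction" is just the length of the input string.
    print(predict_in_batches(items, 2, lambda b: [len(x) for x in b]))
```

Keeping the index list in step with the prediction list is what allows each prediction to be traced back to the bed_data row it came from.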

Returns: