tfclass_predict.predictor

Classes

ClassAUC

Metric used in the training steps. It must be kept so that the trained model can still be used.

Predictor

Predictor class that handles prediction and model execution.

Module Contents

class tfclass_predict.predictor.ClassAUC(name='ClassAUC', **kwargs)

Bases: tensorflow.metrics.AUC

Metric used in the training steps. It must be kept so that the trained model can still be used.

class tfclass_predict.predictor.Predictor(bed_data, tokenizer, model_path, genome_file)

Predictor class that handles prediction and model execution.

tokenizer
bed_data
SequenceProcessor
model
_init_model(model_path)

Load the TFClass model. Initializes the TFBert model.

Parameters:
  • model_path – Path to TFClass model.

Returns: Initialized TFClass model.

predict_bed_data(subseq_length, batch_size)

Processes genomic sequences from the bed_data DataFrame, extracts subsequences, converts them into tokenized k-mers, and uses the TFClass model to make predictions on these sequences. The predictions are aggregated and associated with their corresponding sequence indices.

Workflow:

  1. Initializes lists to store aggregated predictions and their corresponding sequence indices.

  2. Iterates over each row in the bed_data DataFrame.

  3. For each row:

  • Extracts the genomic sequence based on ‘seqnames’, ‘start’, and ‘end’ with a desired length of 150.

  • Skips sequences that are empty or shorter than the desired length.

  • Generates subsequences from the full sequence.

  • Converts each subsequence into k-mers and then tokenizes them.

  • Accumulates tokenized sequences until the batch size is reached.

  • Uses the TFClass model to make predictions on the batch of tokenized sequences.

  • Stores the predictions and their corresponding indices in the aggregated lists.

  4. Processes any remaining sequences that did not form a complete batch.
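The splitting, k-mer conversion, and batching steps of this workflow can be illustrated with a minimal, standalone sketch. It is not the package's implementation: the helper names (split_subsequences, to_kmers, batched, iter_kmer_batches), the k-mer size of 3, the non-overlapping windows, and the choice to count subsequences toward the batch size are all assumptions made for illustration. Tokenization and the model call are left out.

```python
from typing import Iterable, Iterator, List, Tuple


def split_subsequences(sequence: str, subseq_length: int) -> List[str]:
    """Cut a sequence into consecutive, non-overlapping windows (assumed scheme)."""
    last_start = len(sequence) - subseq_length
    return [
        sequence[i:i + subseq_length]
        for i in range(0, last_start + 1, subseq_length)
    ]


def to_kmers(subsequence: str, k: int = 3) -> List[str]:
    """Slide a window of width k over the subsequence, one base at a time."""
    return [subsequence[i:i + k] for i in range(len(subsequence) - k + 1)]


def batched(items: Iterable, batch_size: int) -> Iterator[list]:
    """Yield full batches, then any leftover items as a final, shorter batch."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # remaining items that did not form a complete batch
        yield batch


def iter_kmer_batches(
    sequences: List[str],
    desired_length: int,
    subseq_length: int,
    batch_size: int,
    k: int = 3,
) -> Iterator[Tuple[List[int], List[List[str]]]]:
    """Yield (row indices, k-mer lists) per batch, skipping unusable sequences."""

    def rows():
        for index, sequence in enumerate(sequences):
            # Skip sequences that are empty or shorter than the desired length.
            if not sequence or len(sequence) < desired_length:
                continue
            for sub in split_subsequences(sequence, subseq_length):
                yield index, to_kmers(sub, k)

    for batch in batched(rows(), batch_size):
        yield [i for i, _ in batch], [kmers for _, kmers in batch]
```

Keeping the row index next to each k-mer list is what allows predictions to be associated with their source rows after batching, including for the final, incomplete batch.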

Parameters:
  • subseq_length – Length of the subsequences into which a read is split.

  • batch_size – Number of intervals that should be processed in one batch.

Returns: