tfclass_predict

Submodules

Classes

PredictionManager

Prediction manager class. Coordinates the prediction for a single bed file.

Predictor

Predictor class. Handles prediction and model execution.

SequenceProcessor

Class for processing sequences.

IOInterface

Package Contents

class tfclass_predict.PredictionManager(bed_file, genome_file, res_dir, bert_model, tfclass_model, tfargs=None)

Prediction manager class. Coordinates the prediction for a single bed file.

iointerface
bed_data
predictor
_init_BERT(bert_model)

Initializes the BERT tokenizer.

Parameters:

bert_model – Path to BERT model directory.

_init_devices(num_threads, num_gpus, num_cpus, memory_limit)

Initializes the devices, i.e. GPU and CPU usages.

Parameters:
  • num_threads – Number of threads used.

  • num_gpus – Number of GPUs.

  • num_cpus – Number of CPUs.

  • memory_limit – Memory limit per GPU.

_init_GPU(memory_limit=None)

Initializes GPU usage and enables memory growth.

predict(subseq_length=15, stride_length=1, batch_size=2000)

Starts the prediction.

Parameters:
  • subseq_length – Length of the subsequences into which a read is split.

  • stride_length – Number of base pairs the window moves at each step (1 gives a sliding window; a value equal to subseq_length gives non-overlapping k-mer splits).

  • batch_size – Number of intervals processed in one batch.

Returns:

Count vectors and prediction dictionary.
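The interaction between subseq_length and stride_length can be illustrated with a minimal sketch. The helper name `split_into_subsequences` is hypothetical and is not part of tfclass_predict; it only shows how a stride of 1 produces a sliding window while a stride equal to the window length produces non-overlapping splits.

```python
def split_into_subsequences(sequence, subseq_length=15, stride_length=1):
    """Return every window of `subseq_length` characters, stepping by `stride_length`.

    Hypothetical helper for illustration only; the package may handle
    trailing bases or short inputs differently.
    """
    if subseq_length <= 0 or stride_length <= 0:
        raise ValueError("subseq_length and stride_length must be positive")
    # Last index at which a full window still fits into the sequence.
    last_start = len(sequence) - subseq_length
    return [
        sequence[start:start + subseq_length]
        for start in range(0, last_start + 1, stride_length)
    ]


# stride_length=1: overlapping sliding window
sliding = split_into_subsequences("ACGTACGTAC", subseq_length=4, stride_length=1)

# stride_length=subseq_length: non-overlapping splits (trailing "AC" is dropped here)
chunks = split_into_subsequences("ACGTACGTAC", subseq_length=4, stride_length=4)
```

In this sketch an incomplete trailing window is discarded, which is an assumption rather than documented behavior.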

save_results()

Saves the prediction results to disk.

class tfclass_predict.Predictor(bed_data, tokenizer, model_path, genome_file)

Predictor class. Handles prediction and model execution.

tokenizer
bed_data
SequenceProcessor
model
_init_model(model_path)

Loads the TFClass model and initializes the TFBert model.

Parameters:

model_path – Path to TFClass model.

Returns:

Initialized TFClass model.

predict_bed_data(subseq_length, stride_length, batch_size)

Processes genomic sequences from the bed_data DataFrame, extracts subsequences, converts them into tokenized k-mers, and uses the TFClass model to make predictions on these sequences. The predictions are aggregated and associated with their corresponding sequence indices.

Workflow:

  1. Initializes lists to store aggregated predictions and their corresponding sequence indices.

  2. Iterates over each row in the bed_data DataFrame.

  3. For each row:

      • Extracts the genomic sequence based on ‘seqnames’, ‘start’, and ‘end’ with a desired length of 150.

      • Skips sequences that are empty or shorter than the desired length.

      • Generates subsequences from the full sequence.

      • Converts each subsequence into k-mers and then tokenizes them.

      • Accumulates tokenized sequences until the batch size is reached.

      • Uses a machine learning model to make predictions on the batch of tokenized sequences.

      • Stores the predictions and their corresponding indices in the aggregated lists.

  4. Processes any remaining sequences that did not form a complete batch.
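The accumulate-predict-flush pattern in the workflow above can be sketched in isolation. This is a generic illustration, not the package's implementation: `predict_in_batches` and `fake_model` are hypothetical names, and the stand-in model simply returns the length of each tokenized item.

```python
def predict_in_batches(items, batch_size, predict_fn):
    """Run `predict_fn` on `items` in batches and keep each item's index.

    Hypothetical helper: accumulates items until `batch_size` is reached,
    predicts on the full batch, then flushes any incomplete final batch.
    """
    predictions, indices = [], []
    batch, batch_indices = [], []
    for index, item in enumerate(items):
        batch.append(item)
        batch_indices.append(index)
        if len(batch) == batch_size:
            predictions.extend(predict_fn(batch))
            indices.extend(batch_indices)
            batch, batch_indices = [], []
    # Remaining items that did not form a complete batch.
    if batch:
        predictions.extend(predict_fn(batch))
        indices.extend(batch_indices)
    return predictions, indices


batch_sizes_seen = []


def fake_model(batch):
    """Stand-in for a real model: records the batch size, returns item lengths."""
    batch_sizes_seen.append(len(batch))
    return [len(tokens) for tokens in batch]


tokenized = [["a"], ["b", "c"], [], ["d"], ["e", "f", "g"]]
preds, idx = predict_in_batches(tokenized, batch_size=2, predict_fn=fake_model)
```

With five items and a batch size of 2, the stand-in model is called on two full batches and one final batch of a single item.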

Parameters:
  • subseq_length – Length of the subsequences into which a read is split (i.e. k-mer size).

  • stride_length – Number of base pairs the window moves at each step (1 gives a sliding window; a value equal to subseq_length gives non-overlapping k-mer splits).

  • batch_size – Number of intervals that should be processed in one batch.

Returns:

class tfclass_predict.SequenceProcessor(tokenizer, genome)

Class for processing sequences.

tokenizer
genome
extract_fasta_sequences(chromosome, start_str, end_str, desired_length=150)

Extracts a genomic sequence of a specified length from the provided chromosome coordinates. Standardizes chromosome names and adjusts coordinates to ensure the sequence meets the desired length.

Parameters:
  • chromosome – Chromosome name (hg38 assembly).

  • start_str – Start of the sequence in bp.

  • end_str – End of the sequence in bp.

  • desired_length – Length of the sequence in bp.

Returns:

A genomic sequence of the specified length.
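One plausible reading of the name standardization and coordinate adjustment is sketched below. Both helpers are hypothetical, and the specific choices (adding a `chr` prefix, centering the window on the interval midpoint, clamping at position 0) are assumptions made for illustration; the package may use different rules.

```python
def standardize_chromosome(name):
    """Return a 'chr'-prefixed chromosome name (assumed convention)."""
    name = str(name).strip()
    return name if name.startswith("chr") else f"chr{name}"


def center_window(start_str, end_str, desired_length=150):
    """Return (start, end) of a fixed-length window centered on the interval.

    Assumption: the window is centered on the interval midpoint and shifted
    right if it would otherwise begin before position 0.
    """
    midpoint = (int(start_str) + int(end_str)) // 2
    new_start = max(0, midpoint - desired_length // 2)
    return new_start, new_start + desired_length


chrom = standardize_chromosome("1")
window = center_window("1000", "1100")      # interval of 100 bp widened to 150 bp
clamped = center_window("10", "30")         # window would start below 0, so it is shifted
```

Whatever rule is used, the returned window always spans exactly desired_length base pairs in this sketch, which matches the documented return value.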

sequence_to_kmers(sequence, k=6)

Splits a string into k-mers of a defined size.

Parameters:
  • sequence – String to split.

  • k – k-mer size.

Returns:

List of k-mers.
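A minimal sketch of k-mer splitting follows, assuming overlapping k-mers with a step of one base, the convention commonly used for DNABERT input. The function name `sequence_to_kmers_sketch` is hypothetical, and the actual method may split differently.

```python
def sequence_to_kmers_sketch(sequence, k=6):
    """Return all overlapping k-mers of `sequence`, moving one base at a time.

    Illustrative only: assumes a step of 1 between consecutive k-mers.
    """
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


kmers = sequence_to_kmers_sketch("ACGTACGT", k=6)
```

Under this assumption a sequence of length n yields n - k + 1 k-mers, and a sequence shorter than k yields an empty list.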

kmers_to_tokens(kmers, max_length=15)

Converts k-mers into tokens using DNABERT.

Parameters:
  • kmers – List of k-mers.

  • max_length – Maximum length of tokens.

Returns:

List of tokens.

class tfclass_predict.IOInterface(bed_file: str, genome_file: str, res_dir: str)
bed_file
file_name
genome_file
res_dir
read_atac_seq_data()

Reads ATAC-seq regions from the BED file that was given in the initializer.

Returns:

BED file input as pd.DataFrame.

write_predictions(counts_vec, pred_dict, bed_data)

Writes predictions and count vectors to output files.

Parameters:
  • counts_vec – Count vectors from Predictor.predict function.

  • pred_dict – Dictionary from Predictor.predict function.

  • bed_data – DataFrame from BED file.