tfclass_predict
===============

.. py:module:: tfclass_predict


Submodules
----------

.. toctree::
   :maxdepth: 1

   /autoapi/tfclass_predict/cmd_interface/index
   /autoapi/tfclass_predict/constants/index
   /autoapi/tfclass_predict/io_interface/index
   /autoapi/tfclass_predict/prediction_manager/index
   /autoapi/tfclass_predict/predictor/index
   /autoapi/tfclass_predict/sequence_processor/index


Classes
-------

.. autoapisummary::

   tfclass_predict.PredictionManager
   tfclass_predict.Predictor
   tfclass_predict.SequenceProcessor
   tfclass_predict.IOInterface


Package Contents
----------------

.. py:class:: PredictionManager(bed_file, genome_file, res_dir, bert_model, tfclass_model)

   Prediction manager class. Coordinates the prediction for a single BED file.


   .. py:attribute:: iointerface


   .. py:attribute:: tokenizer


   .. py:attribute:: bed_data


   .. py:attribute:: predictor


   .. py:method:: _init_BERT(bert_model)

      Initializes the BERT tokenizer.



   .. py:method:: _init_GPU()

      Initializes GPU usage and enables memory growth.



   .. py:method:: predict(subseq_length=15, batch_size=2000)

      Starts the prediction.

      :param subseq_length: Length of the subsequences into which a read is split.
      :param batch_size: Number of intervals processed in one batch.
      :return: Count vectors and prediction dictionary.



   .. py:method:: save_results()

      Saves the prediction results to disk.



.. py:class:: Predictor(bed_data, tokenizer, model_path, genome_file)

   Predictor class. Handles prediction and model execution.


   .. py:attribute:: tokenizer


   .. py:attribute:: bed_data


   .. py:attribute:: SequenceProcessor


   .. py:attribute:: model


   .. py:method:: _init_model(model_path)

      Loads the TFClass model and initializes the TFBert model.

      :param model_path: Path to the TFClass model.
      :return: Initialized TFClass model.



   .. py:method:: predict_bed_data(subseq_length, batch_size)

      Processes genomic sequences from the ``bed_data`` DataFrame, extracts
      subsequences, converts them into tokenized k-mers, and uses the TFClass
      model to make predictions on these sequences. The predictions are
      aggregated and associated with their corresponding sequence indices.

      Workflow:

      1. Initializes lists to store aggregated predictions and their
         corresponding sequence indices.
      2. Iterates over each row in the ``bed_data`` DataFrame.
      3. For each row:

         - Extracts the genomic sequence based on ``seqnames``, ``start``, and
           ``end`` with a desired length of 150.
         - Skips sequences that are empty or shorter than the desired length.
         - Generates subsequences from the full sequence.
         - Converts each subsequence into k-mers and then tokenizes them.
         - Accumulates tokenized sequences until the batch size is reached.
         - Uses the model to make predictions on the batch of tokenized
           sequences.
         - Stores the predictions and their corresponding indices in the
           aggregated lists.

      4. Processes any remaining sequences that did not form a complete batch.

      :param subseq_length: Length of the subsequences into which a read is split.
      :param batch_size: Number of intervals processed in one batch.
      :return:



.. py:class:: SequenceProcessor(tokenizer, genome)

   Class for processing sequences.


   .. py:attribute:: tokenizer


   .. py:attribute:: genome


   .. py:method:: extract_fasta_sequences(chromosome, start_str, end_str, desired_length=150)

      Extracts a genomic sequence of a specified length from the provided
      chromosome coordinates. Standardizes chromosome names and adjusts
      coordinates to ensure the sequence meets the desired length.

      :param chromosome: Chromosome name (hg38 coordinates).
      :param start_str: Start of the sequence in bp.
      :param end_str: End of the sequence in bp.
      :param desired_length: Length of the sequence in bp.
      :return: A genomic sequence of the specified length.



   .. py:method:: sequence_to_kmers(sequence, k=6)

      Splits a string into k-mers of the given size.

      :param sequence: String to split.
      :param k: k-mer size.
      :return: List of k-mers.



   .. py:method:: kmers_to_tokens(kmers, max_length=15)

      Converts k-mers into tokens using DNABERT.

      :param kmers: List of k-mers.
      :param max_length: Maximum length of the token sequence.
      :return: List of tokens.



.. py:class:: IOInterface(bed_file: str, genome_file: str, res_dir: str)

   .. py:attribute:: bed_file


   .. py:attribute:: _file_name


   .. py:attribute:: file_name


   .. py:attribute:: genome_file


   .. py:attribute:: res_dir


   .. py:method:: read_atac_seq_data()

      Reads ATAC-seq regions from the BED file that was passed to the
      initializer.

      :return: BED file input as ``pd.DataFrame``.



   .. py:method:: write_predictions(counts_vec, pred_dict, bed_data)

      Writes predictions and count vectors to output files.

      :param counts_vec: Count vectors from the Predictor.predict function.
      :param pred_dict: Dictionary from the Predictor.predict function.
      :param bed_data: DataFrame from the BED file.
      :return:
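
The k-mer split described for ``SequenceProcessor.sequence_to_kmers`` and the batching workflow described for ``Predictor.predict_bed_data`` can be illustrated with a minimal, standalone sketch. It is not the package's implementation. The overlapping stride-1 k-mers, the non-overlapping subsequence windows, the truncation to ``desired_length``, and the helper names ``split_subsequences`` and ``iter_kmer_batches`` are all assumptions made for illustration. Tokenization and model inference are left out.

```python
from typing import Iterable, Iterator, List, Tuple


def sequence_to_kmers(sequence: str, k: int = 6) -> List[str]:
    """Split a sequence into overlapping k-mers (stride 1 is an assumption)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def split_subsequences(sequence: str, subseq_length: int) -> List[str]:
    """Hypothetical helper: cut a sequence into consecutive, non-overlapping windows."""
    last_start = len(sequence) - subseq_length
    return [sequence[i:i + subseq_length]
            for i in range(0, last_start + 1, subseq_length)]


def iter_kmer_batches(
    regions: Iterable[Tuple[int, str]],
    desired_length: int = 150,
    subseq_length: int = 15,
    k: int = 6,
    batch_size: int = 2000,
) -> Iterator[Tuple[List[int], List[List[str]]]]:
    """Hypothetical helper: yield (region indices, k-mer lists) one batch at a time."""
    indices: List[int] = []
    kmers: List[List[str]] = []
    for index, sequence in regions:
        # Skip sequences that are empty or shorter than the desired length.
        if len(sequence) < desired_length:
            continue
        for subseq in split_subsequences(sequence[:desired_length], subseq_length):
            indices.append(index)
            kmers.append(sequence_to_kmers(subseq, k))
            # Emit a batch as soon as it is full.
            if len(kmers) == batch_size:
                yield indices, kmers
                indices, kmers = [], []
    # Emit whatever is left over as a final, smaller batch.
    if kmers:
        yield indices, kmers


if __name__ == "__main__":
    regions = [(0, "ACGT" * 10), (1, "ACG"), (2, "T" * 30)]
    for idx, batch in iter_kmer_batches(regions, desired_length=30, batch_size=3):
        print(idx, len(batch))
```

In the package, each batch would then be tokenized and passed to the model. Keeping the region index next to every subsequence is what allows the predictions to be mapped back to the rows of ``bed_data`` afterwards.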