tfclass_predict
===============

.. py:module:: tfclass_predict


Submodules
----------

.. toctree::
   :maxdepth: 1

   /autoapi/tfclass_predict/cmd_interface/index
   /autoapi/tfclass_predict/constants/index
   /autoapi/tfclass_predict/io_interface/index
   /autoapi/tfclass_predict/prediction_manager/index
   /autoapi/tfclass_predict/predictor/index
   /autoapi/tfclass_predict/sequence_processor/index


Classes
-------

.. autoapisummary::

   tfclass_predict.PredictionManager
   tfclass_predict.Predictor
   tfclass_predict.SequenceProcessor
   tfclass_predict.IOInterface


Package Contents
----------------

.. py:class:: PredictionManager(bed_file, genome_file, res_dir, bert_model, tfclass_model)

   Prediction manager class. Coordinates the prediction for a single BED file.


   .. py:attribute:: iointerface


   .. py:attribute:: tokenizer


   .. py:attribute:: bed_data


   .. py:attribute:: predictor


   .. py:method:: _init_BERT(bert_model)

      Initializes the BERT tokenizer.


   .. py:method:: _init_GPU()

      Initializes GPU usage and enables memory growth.


   .. py:method:: predict(subseq_length=15, batch_size=2000)

      Starts the prediction.

      :param subseq_length: Length of the subsequences into which each sequence is split.
      :param batch_size: Number of intervals to process in one batch.
      :return: Count vectors and prediction dictionary.


   .. py:method:: save_results()

      Saves the prediction results to disk.


.. py:class:: Predictor(bed_data, tokenizer, model_path, genome_file)

   Predictor class. Handles model execution and prediction.


   .. py:attribute:: tokenizer


   .. py:attribute:: bed_data


   .. py:attribute:: SequenceProcessor


   .. py:attribute:: model


   .. py:method:: _init_model(model_path)

      Loads the TFClass model and initializes the TFBert model.

      :param model_path: Path to the TFClass model.
      :return: Initialized TFClass model.


   .. py:method:: predict_bed_data(subseq_length, batch_size)

      Processes genomic sequences from the ``bed_data`` DataFrame, extracts subsequences, converts them into tokenized
      k-mers, and uses the TFClass model to make predictions on these sequences. The predictions are aggregated and
      associated with their corresponding sequence indices.

      Workflow:

      1. Initializes lists to store aggregated predictions and their corresponding sequence indices.
      2. Iterates over each row in the ``bed_data`` DataFrame.
      3. For each row:

         - Extracts the genomic sequence based on ``seqnames``, ``start``, and ``end`` with a desired length of 150.
         - Skips sequences that are empty or shorter than the desired length.
         - Generates subsequences from the full sequence.
         - Converts each subsequence into k-mers and then tokenizes them.
         - Accumulates tokenized sequences until the batch size is reached.
         - Runs the TFClass model on the batch of tokenized sequences.
         - Stores the predictions and their corresponding indices in the aggregated lists.

      4. Processes any remaining sequences that did not form a complete batch.


      :param subseq_length: Length of the subsequences into which each sequence is split.
      :param batch_size: Number of intervals to process in one batch.
      :return:
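
The batching workflow of ``predict_bed_data`` can be illustrated with a minimal, self-contained sketch. It is not the package's implementation: ``split_into_subsequences``, ``predict_in_batches``, and ``count_pieces`` are hypothetical names, the non-overlapping split is an assumption, and a plain callable stands in for the tokenizer and the TFClass model.

```python
def split_into_subsequences(sequence, subseq_length):
    """Cut a sequence into consecutive, non-overlapping pieces (assumed scheme)."""
    return [
        sequence[i:i + subseq_length]
        for i in range(0, len(sequence) - subseq_length + 1, subseq_length)
    ]


def predict_in_batches(sequences, predict_fn, subseq_length=15,
                       batch_size=2000, desired_length=150):
    """Group valid intervals into batches and flush the last partial batch."""
    predictions, indices = [], []
    batch, batch_indices = [], []

    def flush():
        # Run the stand-in model on whatever has accumulated so far.
        if batch:
            predictions.extend(predict_fn(batch))
            indices.extend(batch_indices)
            batch.clear()
            batch_indices.clear()

    for idx, seq in enumerate(sequences):
        if not seq or len(seq) < desired_length:
            continue  # skip empty or too-short sequences
        batch.append(split_into_subsequences(seq, subseq_length))
        batch_indices.append(idx)
        if len(batch) >= batch_size:
            flush()
    flush()  # remaining intervals that did not fill a complete batch
    return predictions, indices


# Toy run: the stand-in "model" only counts the pieces of each interval.
calls = []

def count_pieces(batch):
    calls.append(len(batch))
    return [len(pieces) for pieces in batch]

toy_sequences = ["A" * 150, "", "C" * 40, "G" * 150, "T" * 150]
preds, idxs = predict_in_batches(toy_sequences, count_pieces, batch_size=2)
```

In this toy run the empty and the 40 bp sequences are skipped, and the indices returned alongside the predictions show which input rows each prediction belongs to.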


.. py:class:: SequenceProcessor(tokenizer, genome)

   Class for processing sequences.


   .. py:attribute:: tokenizer


   .. py:attribute:: genome


   .. py:method:: extract_fasta_sequences(chromosome, start_str, end_str, desired_length=150)

      Extracts a genomic sequence of a specified length from the provided chromosome coordinates.
      Standardizes chromosome names and adjusts coordinates to ensure the sequence meets the desired length.

      :param chromosome: Chromosome name (hg38).
      :param start_str: Start of the sequence in bp.
      :param end_str: End of the sequence in bp.
      :param desired_length: Length of the sequence in bp.
      :return: A genomic sequence of the specified length.


   .. py:method:: sequence_to_kmers(sequence, k=6)

      Splits a sequence string into k-mers of length ``k``.

      :param sequence: String to split.
      :param k: k-mer size.
      :return: List of k-mers.


   .. py:method:: kmers_to_tokens(kmers, max_length=15)

      Converts k-mers into tokens using DNABERT.

      :param kmers: List of k-mers.
      :param max_length: Maximum length of the token sequence.
      :return: List of tokens.
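
As a rough illustration of the input that ``kmers_to_tokens`` works on, the sketch below splits one subsequence into k-mers and joins them into a single string. It rests on two assumptions that this page does not confirm: that ``sequence_to_kmers`` uses overlapping k-mers with a stride of one base, and that the tokenizer receives the k-mers separated by single spaces, as is common for DNABERT-style tokenizers. ``to_kmers`` is a hypothetical stand-in, and no tokenizer is called.

```python
def to_kmers(sequence, k=6):
    """Overlapping k-mers with a stride of one base (assumed scheme)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


subsequence = "ACGTACGTACGTACG"  # 15 bases, matching the default subseq_length
kmers = to_kmers(subsequence, k=6)

# Space-separated k-mers, the form assumed here as tokenizer input.
tokenizer_input = " ".join(kmers)
```

Under these assumptions a 15 bp subsequence yields ten 6-mers; a sequence shorter than ``k`` yields an empty list.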


.. py:class:: IOInterface(bed_file: str, genome_file: str, res_dir: str)

   .. py:attribute:: bed_file


   .. py:attribute:: _file_name


   .. py:attribute:: file_name


   .. py:attribute:: genome_file


   .. py:attribute:: res_dir


   .. py:method:: read_atac_seq_data()

      Reads ATAC-seq regions from the BED file passed to the initializer.

      :return: BED file input as pd.DataFrame.


   .. py:method:: write_predictions(counts_vec, pred_dict, bed_data)

      Writes predictions and count vectors to output files.

      :param counts_vec: Count vectors from Predictor.predict function.
      :param pred_dict: Dictionary from Predictor.predict function.
      :param bed_data: DataFrame from BED file.
      :return: