tfclass_predict.sequence_processor
==================================

.. py:module:: tfclass_predict.sequence_processor


Classes
-------

.. autoapisummary::

   tfclass_predict.sequence_processor.SequenceProcessor


Module Contents
---------------

.. py:class:: SequenceProcessor(tokenizer, genome)

   Class for processing sequences.


   .. py:attribute:: tokenizer


   .. py:attribute:: genome


   .. py:method:: extract_fasta_sequences(chromosome, start_str, end_str, desired_length=150)

      Extracts a genomic sequence of a specified length from the provided chromosome coordinates.
      Standardizes chromosome names and adjusts the coordinates so that the sequence has the desired length.

      :param chromosome: Chromosome name (hg38 assembly).
      :param start_str: Start coordinate of the sequence, in bp.
      :param end_str: End coordinate of the sequence, in bp.
      :param desired_length: Length of the returned sequence, in bp.
      :return: A genomic sequence of the specified length.


   .. py:method:: sequence_to_kmers(sequence, k=6)

      Splits a sequence string into k-mers of size ``k``.

      :param sequence: String to split.
      :param k: k-mer size.
      :return: List of k-mers.


   .. py:method:: kmers_to_tokens(kmers, max_length=15)

      Converts k-mers into tokens using DNABERT.

      :param kmers: List of k-mers.
      :param max_length: Maximum length of tokens.
      :return: List of tokens.
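
The coordinate adjustment and k-mer splitting described above can be illustrated with a minimal, standalone sketch. It is not the implementation of ``SequenceProcessor``: the helper names ``center_window`` and ``split_into_kmers`` are hypothetical, and the sketch assumes that the window is centered on the midpoint of the input interval and that k-mers overlap with a stride of 1 (the convention commonly used with DNABERT). The actual methods may behave differently.

.. code-block:: python

   def center_window(start, end, desired_length=150):
       """Return (start, end) of a window of desired_length bp.

       Assumption: the window is centered on the midpoint of the
       input interval and clamped so it never starts below 0.
       """
       midpoint = (start + end) // 2
       new_start = max(0, midpoint - desired_length // 2)
       return new_start, new_start + desired_length


   def split_into_kmers(sequence, k=6):
       """Return overlapping k-mers (stride 1, an assumption).

       A sequence shorter than k yields an empty list.
       """
       return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


   # Hypothetical coordinates and sequence, for illustration only.
   start, end = center_window(1000, 1100)
   print(end - start)                         # 150
   print(split_into_kmers("ACGTACGT", k=6))   # ['ACGTAC', 'CGTACG', 'GTACGT']

Under these assumptions, a sequence of length ``n`` produces ``n - k + 1`` k-mers, so a 150 bp sequence with ``k=6`` would yield 145 k-mers before tokenization.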