tfclass_predict.sequence_processor
Classes
SequenceProcessor – Class for processing sequences.
Module Contents
- class tfclass_predict.sequence_processor.SequenceProcessor(tokenizer, genome)
Class for processing sequences.
- tokenizer
- genome
- extract_fasta_sequences(chromosome, start_str, end_str, desired_length=150)
Extracts a genomic sequence of a specified length from the provided chromosome coordinates. Standardizes chromosome names and adjusts coordinates to ensure the sequence meets the desired length.
- Parameters:
chromosome – Chromosome name (hg38 assembly).
start_str – Start of the sequence in bp.
end_str – End of the sequence in bp.
desired_length – Desired length of the returned sequence in bp (default 150).
- Returns:
A genomic sequence of the specified length.
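The name standardization and coordinate adjustment described for extract_fasta_sequences can be sketched as follows. This is only an illustration under stated assumptions: the helper names normalize_chromosome and fit_window are hypothetical, the "chr" prefix convention and the centering of the window on the interval midpoint are guesses at one reasonable strategy, and the actual method may adjust coordinates differently.

```python
from typing import Tuple


def normalize_chromosome(chromosome: str) -> str:
    """Return a 'chr'-prefixed name such as 'chr1' (assumed naming convention)."""
    name = str(chromosome).strip()
    if name.lower().startswith("chr"):
        # Normalize the case of the prefix only; keep the rest untouched.
        return "chr" + name[3:]
    return "chr" + name


def fit_window(start_str: str, end_str: str, desired_length: int = 150) -> Tuple[int, int]:
    """Return (start, end) with end - start == desired_length.

    Assumption: the window is centered on the midpoint of the input
    interval and shifted right if it would start before position 0.
    """
    start, end = int(start_str), int(end_str)
    if end < start:
        raise ValueError("end must not be smaller than start")
    midpoint = (start + end) // 2
    new_start = max(0, midpoint - desired_length // 2)
    return new_start, new_start + desired_length


chrom = normalize_chromosome("1")
window = fit_window("1000", "1100")
```

The sketch does not clamp the window to the chromosome end; a real implementation would need the chromosome length from the genome object to do that.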
- sequence_to_kmers(sequence, k=6)
Splits a string into kmers of size k.
- Parameters:
sequence – String to split.
k – kmer size.
- Returns:
List of kmers.
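A minimal sketch of kmer splitting, assuming overlapping windows with a stride of 1 (the usual input format for DNABERT-style models). The function name overlapping_kmers is hypothetical, and the documented method may use a different stride or handle short sequences differently.

```python
from typing import List


def overlapping_kmers(sequence: str, k: int = 6) -> List[str]:
    """Slide a window of width k over the sequence, one base at a time."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence shorter than k yields an empty list.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


kmers = overlapping_kmers("ACGTACGT", k=6)
```

With this scheme a sequence of length n produces n - k + 1 kmers, so a 150 bp sequence with k=6 gives 145 kmers.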
- kmers_to_tokens(kmers, max_length=15)
Converts kmers into tokens using DNABERT.
- Parameters:
kmers – List of kmers.
max_length – Max length of tokens.
- Returns:
List of tokens.
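To illustrate what a max_length limit typically means for a BERT-style tokenizer, the sketch below maps kmers to integer ids, wraps them in start and end markers, and truncates or pads to a fixed length. It is a stand-in, not the DNABERT tokenizer: the toy vocabulary, the special-token ids, and the names build_toy_vocab and encode_kmers are all assumptions made for this example.

```python
from typing import Dict, List

# Special tokens in the style of BERT vocabularies (ids here are arbitrary).
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]"]


def build_toy_vocab(kmers: List[str]) -> Dict[str, int]:
    """Assign an id to each special token, then to each distinct kmer."""
    vocab = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    for kmer in kmers:
        vocab.setdefault(kmer, len(vocab))
    return vocab


def encode_kmers(kmers: List[str], vocab: Dict[str, int], max_length: int = 15) -> List[int]:
    """Return exactly max_length ids: [CLS] + kmers + [SEP], truncated or padded."""
    # Reserve two positions for [CLS] and [SEP]; unknown kmers map to [UNK].
    body = [vocab.get(k, vocab["[UNK]"]) for k in kmers][: max_length - 2]
    ids = [vocab["[CLS]"]] + body + [vocab["[SEP]"]]
    return ids + [vocab["[PAD]"]] * (max_length - len(ids))


vocab = build_toy_vocab(["ACGTAC", "CGTACG"])
ids = encode_kmers(["ACGTAC", "CGTACG", "TTTTTT"], vocab, max_length=6)
```

Note that with a limit of 15 only the first 13 kmers survive truncation in this scheme, so the limit should be chosen with the sequence length and k in mind.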