tfclass_predict.sequence_processor

Classes

SequenceProcessor

Class for processing sequences.

Module Contents

class tfclass_predict.sequence_processor.SequenceProcessor(tokenizer, genome)

Class for processing sequences.

tokenizer
genome
extract_fasta_sequences(chromosome, start_str, end_str, desired_length=150)

Extracts a genomic sequence of a specified length from the provided chromosome coordinates. Standardizes chromosome names and adjusts coordinates to ensure the sequence meets the desired length.

Parameters:
  • chromosome – Chromosome name (hg38 assembly).

  • start_str – Start of the sequence in bp.

  • end_str – End of the sequence in bp.

  • desired_length – Length of the sequence in bp.

Returns:

A genomic sequence of the specified length.
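The description above can be illustrated with a minimal sketch. It is not the class's actual implementation: the helper names (`standardize_chromosome`, `centered_window`, `extract_sequence`), the "chr" prefix convention, the choice to center the window on the interval midpoint, and the dictionary standing in for `genome` are all assumptions made for illustration.

```python
def standardize_chromosome(chromosome):
    """Return a 'chr'-prefixed name such as 'chr1' (assumed naming convention)."""
    name = str(chromosome).strip()
    return name if name.startswith("chr") else f"chr{name}"


def centered_window(start_str, end_str, desired_length=150):
    """Return (start, end) for a window of desired_length centered on the interval.

    Centering on the midpoint is an assumption; the documented method only
    states that coordinates are adjusted to reach the desired length.
    """
    start, end = int(start_str), int(end_str)
    midpoint = (start + end) // 2
    new_start = max(0, midpoint - desired_length // 2)
    return new_start, new_start + desired_length


def extract_sequence(genome, chromosome, start_str, end_str, desired_length=150):
    """Slice the adjusted window out of a dict-like genome (hypothetical stand-in)."""
    chrom = standardize_chromosome(chromosome)
    start, end = centered_window(start_str, end_str, desired_length)
    return genome[chrom][start:end].upper()


# Toy genome: a plain dict mapping chromosome names to sequence strings.
toy_genome = {"chr1": "ACGT" * 100}
seq = extract_sequence(toy_genome, "1", "100", "120", desired_length=20)
```

Here the coordinates are passed as strings, mirroring the `start_str` and `end_str` parameter names, and converted to integers before any arithmetic.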

sequence_to_kmers(sequence, k=6)

Splits a string into kmers of size k.

Parameters:
  • sequence – String to split.

  • k – kmer size.

Returns:

List of kmers.
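A minimal sketch of k-mer splitting follows. It assumes overlapping k-mers with a stride of 1, which is the convention commonly used with DNABERT; the documentation does not state whether the actual method uses overlapping or non-overlapping windows.

```python
def sequence_to_kmers(sequence, k=6):
    """Split a sequence into overlapping k-mers (stride 1, an assumption)."""
    # A sequence shorter than k yields an empty list.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


kmers = sequence_to_kmers("ACGTACGT", k=6)
# kmers == ["ACGTAC", "CGTACG", "GTACGT"]
```

With this scheme a sequence of length n produces n - k + 1 kmers, so a 150 bp sequence with k=6 would give 145 kmers.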

kmers_to_tokens(kmers, max_length=15)

Converts kmers into tokens using DNABERT.

Parameters:
  • kmers – List of kmers.

  • max_length – Max length of tokens.

Returns:

List of tokens.
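The sketch below illustrates the general mechanics of turning kmers into a fixed-length list of token IDs. It does not call DNABERT: the toy vocabulary, the special tokens, the helper names (`build_vocab`, `kmers_to_token_ids`), and the pad/truncate behavior are all assumptions standing in for whatever the real `tokenizer` does.

```python
from itertools import product

# Special tokens in BERT style; the IDs here are illustrative only.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]"]


def build_vocab(k=3):
    """Build a toy vocabulary of special tokens plus every DNA k-mer."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    return {token: index for index, token in enumerate(SPECIAL_TOKENS + kmers)}


def kmers_to_token_ids(kmers, vocab, max_length=15):
    """Map kmers to IDs, wrap in [CLS]/[SEP], then truncate or pad to max_length."""
    # Reserve two positions for [CLS] and [SEP]; unknown kmers map to [UNK].
    body = [vocab.get(kmer, vocab["[UNK]"]) for kmer in kmers][: max_length - 2]
    ids = [vocab["[CLS]"]] + body + [vocab["[SEP]"]]
    return ids + [vocab["[PAD]"]] * (max_length - len(ids))


vocab = build_vocab(k=3)
token_ids = kmers_to_token_ids(["ACG", "CGT", "GTN"], vocab, max_length=8)
# token_ids == [2, 10, 31, 1, 3, 0, 0, 0]
```

In this toy version, "GTN" is not in the vocabulary and maps to `[UNK]`, and the output always has exactly `max_length` entries regardless of how many kmers are supplied.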