The lps data structure is designed for efficient parsing and representation of strings, with advanced features such as reverse complement transformations, offset-based indexing, and binary file serialization/deserialization.
This document provides detailed instructions on initializing, using, and managing the lps data structure.
The lps structure is composed of two main components:
- lps structure: Represents the main object, with attributes like level, size, and an array of core objects.
- core structure: Represents the individual elements stored in the lps, with attributes for bit representation, labels, and positional information.
struct lps {
    int level;
    int size;
    struct core *cores;
};
struct core {
    ubit_size bit_size; // Size of the bit representation
    ublock *bit_rep;    // Pointer to the bit representation
    ulabel label;       // Unique label for the core
    uint64_t start;     // Start index in the string
    uint64_t end;       // End index in the string
};
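As a rough illustration of how the two structures fit together, the sketch below walks the cores array and reports the longest span covered by a single core. The typedefs for ubit_size, ublock, and ulabel are assumptions made only so the example compiles on its own, and longest_core_span is a hypothetical helper, not part of the library. It also assumes that size counts the entries in cores and that end is never smaller than start.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Assumed typedefs: the real definitions live in the library headers.
typedef uint32_t ubit_size;
typedef uint32_t ublock;
typedef uint32_t ulabel;

struct core {
    ubit_size bit_size; // Size of the bit representation
    ublock *bit_rep;    // Pointer to the bit representation
    ulabel label;       // Unique label for the core
    uint64_t start;     // Start index in the string
    uint64_t end;       // End index in the string
};

struct lps {
    int level;
    int size;           // Assumed here to be the number of cores
    struct core *cores;
};

// Hypothetical helper: largest (end - start) among all cores, 0 if there are none.
uint64_t longest_core_span(const struct lps *p) {
    assert(p != NULL);
    uint64_t best = 0;
    for (int i = 0; i < p->size; i++) {
        uint64_t span = p->cores[i].end - p->cores[i].start;
        if (span > best) {
            best = span;
        }
    }
    return best;
}
```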
init_lps initializes an lps object using a given string.
Parameters:
- struct lps *lps_ptr: Pointer to the lps object to initialize.
- const char *str: Input string to be parsed.
- int len: Length of the input string.
Usage:
struct lps my_lps;
init_lps(&my_lps, "ACGTACGT", 8);
init_lps_offset initializes an lps object with offset-based indexing. It adds the offset value to each core's indices.
Parameters:
- struct lps *lps_ptr: Pointer to the lps object to initialize.
- const char *str: Input string to be parsed.
- int len: Length of the input string.
- uint64_t offset: Offset value for indexing.
Usage:
struct lps my_lps;
init_lps_offset(&my_lps, "ACGTACGT", 8, 100);
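To make the effect of the offset concrete, here is a minimal sketch that shifts a set of start/end positions by a fixed amount. The span struct and shift_spans are hypothetical names standing in for the positional fields of a core; the library's own implementation may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical stand-in for the positional fields of a core.
struct span {
    uint64_t start;
    uint64_t end;
};

// Add `offset` to the start and end of every span in place.
void shift_spans(struct span *spans, size_t n, uint64_t offset) {
    assert(spans != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        spans[i].start += offset;
        spans[i].end += offset;
    }
}
```

With an offset of 100, a span covering positions 0 to 3 of the input string would be reported as 100 to 103, which is useful when the string is a slice of a longer sequence.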
init_lps2 initializes an lps object and includes a reverse complement transformation.
Parameters:
- struct lps *lps_ptr: Pointer to the lps object to initialize.
- const char *str: Input string to be parsed.
- int len: Length of the input string.
Usage:
struct lps my_lps;
init_lps2(&my_lps, "ACGTACGT", 8);
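For readers unfamiliar with the transformation, the following sketch computes the reverse complement of a DNA string at the character level: the string is read backwards and each base is swapped with its pairing partner (A with T, C with G). The function names are hypothetical, and the sketch does not reflect how the library applies the transformation on its internal encoding.

```c
#include <assert.h>
#include <stddef.h>

// Complement of a single uppercase DNA base; other characters pass through unchanged.
static char complement_base(char b) {
    switch (b) {
        case 'A': return 'T';
        case 'T': return 'A';
        case 'C': return 'G';
        case 'G': return 'C';
        default:  return b;
    }
}

// Write the reverse complement of src[0..len) into dst and NUL-terminate it.
// dst must have room for len + 1 bytes and must not overlap src.
void reverse_complement(const char *src, size_t len, char *dst) {
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < len; i++) {
        dst[i] = complement_base(src[len - 1 - i]);
    }
    dst[len] = '\0';
}
```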
init_lps3 initializes an lps object by reading from a binary file.
Parameters:
- struct lps *lps_ptr: Pointer to the lps object to initialize.
- FILE *in: File pointer to the binary file containing serialized lps data.
Usage:
struct lps my_lps;
FILE *input_file = fopen("lps_data.bin", "rb");
init_lps3(&my_lps, input_file);
fclose(input_file);
init_lps4 initializes an lps object using a divide-and-conquer approach.
Parameters:
- struct lps *lps_ptr: Pointer to the lps object to initialize.
- const char *str: Input string to be parsed.
- int len: Length of the input string.
- int chunk_size: Size of the chunks to be processed.
Usage:
struct lps my_lps;
init_lps4(&my_lps, "ACGTACGT", 8, 100000);
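The chunking step of a divide-and-conquer pass can be pictured as follows. This sketch only computes how many chunks cover the input and where each one begins and ends; chunk_count and chunk_bounds are hypothetical helpers, and the sketch assumes plain non-overlapping chunks, which may not match how the library splits or merges its input.

```c
#include <assert.h>
#include <stddef.h>

// Number of chunks needed to cover `len` characters (ceiling division).
size_t chunk_count(size_t len, size_t chunk_size) {
    assert(chunk_size > 0);
    return (len + chunk_size - 1) / chunk_size;
}

// Compute the half-open range [*begin, *end) of the chunk at `index`.
// The last chunk is shortened so that it never runs past `len`.
void chunk_bounds(size_t len, size_t chunk_size, size_t index,
                  size_t *begin, size_t *end) {
    assert(chunk_size > 0 && begin != NULL && end != NULL);
    assert(index < chunk_count(len, chunk_size));
    *begin = index * chunk_size;
    *end = *begin + chunk_size;
    if (*end > len) {
        *end = len;
    }
}
```

In the usage above, the 8-character string fits in a single chunk of size 100000, so chunking only matters for inputs longer than chunk_size.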
free_lps frees the dynamically allocated memory associated with an lps object.
Parameters:
- struct lps *lps_ptr: Pointer to the lps object to free.
Usage:
free_lps(&my_lps);
write_lps serializes an lps object and writes it to a binary file.
Parameters:
- struct lps *lps_ptr: Pointer to the lps object to write.
- FILE *out: File pointer to the binary file for writing.
Usage:
FILE *output_file = fopen("lps_data.bin", "wb");
write_lps(&my_lps, output_file);
fclose(output_file);
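The general pattern behind binary serialization and deserialization can be sketched with fwrite and fread: write a count, then each field of each element, and read them back in the same order. The record struct and the field order below are hypothetical and chosen only for illustration; they are not the on-disk layout used by write_lps or init_lps3.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

// Hypothetical fixed-size record; not the library's on-disk layout.
struct record {
    uint32_t label;
    uint64_t start;
    uint64_t end;
};

// Write the record count followed by the fields of each record.
// Returns 0 on success, -1 on a write error.
int write_records(FILE *out, const struct record *recs, uint32_t count) {
    assert(out != NULL);
    if (fwrite(&count, sizeof count, 1, out) != 1) return -1;
    for (uint32_t i = 0; i < count; i++) {
        if (fwrite(&recs[i].label, sizeof recs[i].label, 1, out) != 1) return -1;
        if (fwrite(&recs[i].start, sizeof recs[i].start, 1, out) != 1) return -1;
        if (fwrite(&recs[i].end, sizeof recs[i].end, 1, out) != 1) return -1;
    }
    return 0;
}

// Read records written by write_records into recs, which holds up to `max` entries.
// Returns the number of records read, or -1 on a read error or if the file holds more than `max`.
int read_records(FILE *in, struct record *recs, uint32_t max) {
    assert(in != NULL);
    uint32_t count;
    if (fread(&count, sizeof count, 1, in) != 1 || count > max) return -1;
    for (uint32_t i = 0; i < count; i++) {
        if (fread(&recs[i].label, sizeof recs[i].label, 1, in) != 1) return -1;
        if (fread(&recs[i].start, sizeof recs[i].start, 1, in) != 1) return -1;
        if (fread(&recs[i].end, sizeof recs[i].end, 1, in) != 1) return -1;
    }
    return (int)count;
}
```

Writing fields one at a time avoids storing struct padding in the file, but the result still depends on the byte order of the machine that wrote it.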
This section provides functions to manage the encoding of standard DNA bases (A, C, G, T) and their complements, used in the Locally Consistent Parsing (LCP) tool. Please note that any custom alphabet encoding can be provided to the program.
- Description: Displays the summary of the alphabet encoding, including coefficients and dictionary bit size. This function helps you understand the encoding setup and verify that the parameters are correctly configured.
- Usage: Call LCP_SUMMARY() to print the encoding details to the console.
void LCP_SUMMARY();
- Description: Initializes the encoding coefficients for the standard DNA bases (A, C, G, T) and their complements. The function sets default values for the coefficients and dictionary bit size.
- Usage: Call LCP_INIT() to initialize the encoding with the default values.
void LCP_INIT();
- Description: Initializes the encoding coefficients for standard DNA bases (A, C, G, T) and their reverse complements. Sets default values for coefficients and dictionary bit size. The verbosity of the output is controlled by the verbose parameter:
  - If verbose is 0, no encoding summary is printed.
  - If verbose is 1, the encoding summary is printed after initialization.
- Usage: Call LCP_INIT2(0) or LCP_INIT2(1) based on whether you want the encoding summary printed.
void LCP_INIT2(int verbose);
- Description: Initializes the encoding coefficients by reading them from a file. The file must contain three columns: the character (DNA base), the encoding value, and the complement encoding value for each base. After initializing the encoding coefficients, the function prints the encoding summary if verbose is set to 1.
- Parameters:
  - filename: The path to the file containing the character encodings. The file format should have the following columns: character (A, C, G, T), encoding value, and complement encoding value.
  - verbose: If true (1), the encoding summary is printed after initialization. If false (0), no summary is printed.
- Returns: 0 upon successful initialization.
- Throws: std::invalid_argument if any invalid data is found in the file (e.g., missing or incorrect entries).
- Usage: Call LCP_INIT_FILE("path/to/encoding_file.txt", 1) to initialize the encoding from a file and print the summary.
int LCP_INIT_FILE(const char *filename, int verbose);
The file for initializing encoding should be in the following format:
A 0 1
C 1 0
G 2 3
T 3 2
Where:
- The first column represents the DNA base (A, C, G, T).
- The second column represents the encoding value.
- The third column represents the complement encoding value.
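A reader for this three-column format might look like the sketch below, which parses a single line with sscanf. The function parse_encoding_line is hypothetical and is not how LCP_INIT_FILE is implemented; it only checks that all three fields are present and does not restrict the character to A, C, G, or T, since custom alphabets are allowed.

```c
#include <assert.h>
#include <stdio.h>

// Parse one line of the form "<character> <encoding> <complement encoding>".
// Returns 0 and fills the outputs on success, -1 if a field is missing or malformed.
int parse_encoding_line(const char *line, char *base, int *encoding, int *complement) {
    assert(line != NULL && base != NULL && encoding != NULL && complement != NULL);
    char b;
    int enc, comp;
    // The leading space in the format skips any whitespace before the character.
    if (sscanf(line, " %c %d %d", &b, &enc, &comp) != 3) {
        return -1;
    }
    *base = b;
    *encoding = enc;
    *complement = comp;
    return 0;
}
```

Calling this once per line read with fgets, and stopping at the first -1, would be one way to reject a file with missing or incorrect entries.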
#include <stdio.h>
#include "lps.h"

int main(void) {
    LCP_INIT();
    struct lps my_lps;

    // Initialize from a string
    init_lps(&my_lps, "ACGTACGT", 8);

    // Serialize to a file
    FILE *output_file = fopen("lps_data.bin", "wb");
    if (output_file == NULL) {
        perror("lps_data.bin");
        free_lps(&my_lps);
        return 1;
    }
    write_lps(&my_lps, output_file);
    fclose(output_file);

    // Free memory
    free_lps(&my_lps);

    // Deserialize from a file
    FILE *input_file = fopen("lps_data.bin", "rb");
    if (input_file == NULL) {
        perror("lps_data.bin");
        return 1;
    }
    init_lps3(&my_lps, input_file);
    fclose(input_file);

    // Free memory again
    free_lps(&my_lps);
    return 0;
}
- The encoding coefficients define how each DNA base and its complement are represented internally. These values are essential for performing efficient parsing and comparison of DNA sequences.
- Ensure that the encoding file is properly formatted to avoid errors during initialization.
- Always ensure memory is freed after use to avoid leaks.
- Use valid and open binary files for serialization/deserialization functions.
- Use init_lps_offset when core indices should be shifted by a fixed offset, and init_lps2 when the parse should include the reverse complement transformation.