FASTA I/O in GWASLab
GWASLab provides efficient functions for reading and writing FASTA format files, with support for compressed formats and optimized memory usage.
Supported File Formats
GWASLab supports the following FASTA file extensions:
- .fa, .fasta - Plain text FASTA files
- .fa.gz, .fasta.gz - Gzip-compressed FASTA files
- .fa.bgz, .fasta.bgz - BGZF-compressed FASTA files (treated as gzip)
Reading FASTA Files
parse_fasta()
Parse a FASTA file and return records as a dictionary or iterator.
from gwaslab.io.io_fasta import parse_fasta
# Load as dictionary (default)
records = parse_fasta("reference.fasta")
print(records["chr1"][:10]) # First 10 bases of chr1
# Load as iterator (memory efficient for large files)
for title, seq in parse_fasta("reference.fasta", as_dict=False):
print(f"{title}: {len(seq)} bp")
| Parameter | DataType | Description | Default |
|---|---|---|---|
path |
string |
Path to FASTA file. Supports plain and gzipped formats | Required |
as_dict |
boolean |
If True, return a dictionary mapping titles to sequences. If False, return an iterator of (title, sequence) tuples | True |
Returns:
- If as_dict=True: dict[str, str] - Dictionary mapping record titles to sequences
- If as_dict=False: Iterator[tuple[str, str]] - Iterator yielding (title, sequence) tuples
Memory Efficiency
When as_dict=False, the file handle is kept open until iteration completes. This is more memory-efficient for large files.
load_fasta_auto()
Automatically load FASTA or gzipped FASTA files, returning FastaRecord objects or tuples.
from gwaslab.io.io_fasta import load_fasta_auto
# Get FastaRecord objects (default)
for record in load_fasta_auto("reference.fasta.gz"):
print(f"{record.id}: {len(record.seq._data)} bp")
# Get simple tuples
for title, seq in load_fasta_auto("reference.fasta.gz", as_seqrecord=False):
print(f"{title}: {len(seq)} bp")
| Parameter | DataType | Description | Default |
|---|---|---|---|
path |
string |
Path to FASTA file. Supports: .fa, .fasta, .fa.gz, .fasta.gz, .fa.bgz, .fasta.bgz | Required |
as_seqrecord |
boolean |
If True, return FastaRecord objects with .id and .seq._data attributes. If False, return (title, sequence) tuples | True |
Returns:
- Iterator[FastaRecord] or Iterator[tuple[str, str]] - Iterator yielding FastaRecord objects or (title, sequence) tuples
File Handle Management
The file handle is kept open until iteration completes. For proper resource management, ensure you consume the entire iterator or use it within a context manager.
load_fasta_filtered()
Load and filter FASTA records in a single pass for better performance. Only creates records for chromosomes that are needed.
from gwaslab.io.io_fasta import load_fasta_filtered
from gwaslab.info.g_Log import Log
chromlist_set = {"1", "2", "3"}
chroms_in_sumstats_set = {"1", "2"}
chr_dict = {"1": "1", "2": "2", "3": "3"}
chr_dict_keys = set(chr_dict.keys())
log = Log()
records = load_fasta_filtered(
path="reference.fasta.gz",
chromlist_set=chromlist_set,
chroms_in_sumstats_set=chroms_in_sumstats_set,
chr_dict=chr_dict,
chr_dict_keys=chr_dict_keys,
log=log,
verbose=True
)
| Parameter | DataType | Description | Default |
|---|---|---|---|
path |
string |
Path to FASTA file | Required |
chromlist_set |
set |
Set of valid chromosome identifiers | Required |
chroms_in_sumstats_set |
set |
Set of chromosomes present in the summary statistics | Required |
chr_dict |
dict |
Dictionary mapping chromosome names to standardized format | Required |
chr_dict_keys |
set |
Set of keys in chr_dict for fast lookup | Required |
log |
gwaslab.g_Log.Log |
Logging object | Log() |
verbose |
boolean |
If True, print progress messages | True |
Returns:
- dict[str, FastaRecord] - Dictionary mapping chromosome identifiers to FastaRecord objects
get_fasta_record()
Get a specific FASTA record by title.
from gwaslab.io.io_fasta import get_fasta_record
seq = get_fasta_record("reference.fasta", "chr1")
if seq:
print(f"chr1 length: {len(seq)}")
| Parameter | DataType | Description | Default |
|---|---|---|---|
path |
string |
Path to FASTA file | Required |
title |
string |
Title of the record to retrieve (without the '>' character) | Required |
Returns:
- str or None - Sequence for the requested record, or None if not found
Writing FASTA Files
write_fasta()
Write FASTA records to a file.
from gwaslab.io.io_fasta import write_fasta
# Write from dictionary
records = {"chr1": "ATCGATCG", "chr2": "GCTAGCTA"}
write_fasta(records, "output.fasta")
# Write from iterator
records = [("chr1", "ATCGATCG"), ("chr2", "GCTAGCTA")]
write_fasta(records, "output.fasta.gz", wrap=80)
| Parameter | DataType | Description | Default |
|---|---|---|---|
records |
dict[str, str] or Iterator[tuple[str, str]] |
FASTA records to write. Can be a dictionary mapping titles to sequences or an iterator of (title, sequence) tuples | Required |
path |
string |
Output file path. Supports plain and gzipped formats based on extension | Required |
wrap |
int |
Line length for sequence wrapping. Use 0 or None for no wrapping | 60 |
mode |
string |
File open mode. Use "w" for write or "a" for append | "w" |
Supported output formats:
- Plain text: .fa, .fasta
- Gzip-compressed: .fa.gz, .fasta.gz
- BGZF: .fa.bgz, .fasta.bgz (output will be standard gzip format)
Advanced Functions
load_and_build_fasta_records()
Load, filter, and build numpy FASTA records in a single pass for maximum performance. This function combines loading, filtering, and numpy array conversion in one pass, avoiding the creation of intermediate FastaRecord objects.
from gwaslab.io.io_fasta import load_and_build_fasta_records
from gwaslab.info.g_Log import Log
chromlist_set = {"1", "2"}
chroms_in_sumstats_set = {"1", "2"}
chr_dict = {"1": "1", "2": "2"}
chr_dict_keys = set(chr_dict.keys())
log = Log()
record, starting_positions, records_len_dict = load_and_build_fasta_records(
path="reference.fasta.gz",
chromlist_set=chromlist_set,
chroms_in_sumstats_set=chroms_in_sumstats_set,
chr_dict=chr_dict,
chr_dict_keys=chr_dict_keys,
pos_as_dict=True,
log=log,
verbose=True
)
| Parameter | DataType | Description | Default |
|---|---|---|---|
path |
string |
Path to FASTA file | Required |
chromlist_set |
set |
Set of valid chromosome identifiers | Required |
chroms_in_sumstats_set |
set |
Set of chromosomes present in the summary statistics | Required |
chr_dict |
dict |
Dictionary mapping chromosome names to standardized format | Required |
chr_dict_keys |
set |
Set of keys in chr_dict for fast lookup | Required |
pos_as_dict |
boolean |
If True, return starting_positions and records_len as dictionaries | True |
log |
gwaslab.g_Log.Log |
Logging object | Log() |
verbose |
boolean |
If True, print progress messages | True |
Returns:
- tuple - (record, starting_positions, records_len_dict) where:
- record: concatenated numpy array of uint8 integers
- starting_positions: dict or array of starting positions for each chromosome
- records_len_dict: dict or array of lengths for each chromosome
build_fasta_records()
Build numpy FASTA records from a dictionary of FastaRecord objects. This function converts FASTA records to a single numpy array of integers for fast lookup.
from gwaslab.io.io_fasta import build_fasta_records, load_fasta_filtered
from gwaslab.info.g_Log import Log
# First load records
fasta_records_dict = load_fasta_filtered(...)
# Then build numpy records
record, starting_positions, records_len_dict = build_fasta_records(
fasta_records_dict,
pos_as_dict=True,
log=Log(),
verbose=True
)
| Parameter | DataType | Description | Default |
|---|---|---|---|
fasta_records_dict |
dict |
Dictionary mapping chromosome names to FastaRecord objects | Required |
pos_as_dict |
boolean |
If True, return starting_positions and records_len as dictionaries | True |
log |
gwaslab.g_Log.Log |
Logging object | Log() |
verbose |
boolean |
If True, print progress messages | True |
Returns:
- tuple - (record, starting_positions, records_len_dict) where:
- record: concatenated numpy array of uint8 integers
- starting_positions: dict or array of starting positions for each chromosome
- records_len_dict: dict or array of lengths for each chromosome
FastaRecord Class
The FastaRecord class provides compatibility with code that expects FASTA record objects with .id and .seq._data attributes.
from gwaslab.io.io_fasta import FastaRecord
record = FastaRecord(id="chr1", seq="ATCGATCG")
print(record.id) # "chr1"
print(len(record.seq._data)) # 8
Attributes:
- id (str): Record identifier (title line without '>')
- seq._data (bytes): Sequence data as bytes
Examples
Basic FASTA reading
Memory-efficient iteration
from gwaslab.io.io_fasta import parse_fasta
# Process large file without loading everything into memory
total_length = 0
for title, seq in parse_fasta("large_reference.fasta.gz", as_dict=False):
total_length += len(seq)
print(f"Processed {title}: {len(seq)} bp")
print(f"Total length: {total_length} bp")
Writing FASTA files
Filtered loading for harmonization
from gwaslab.io.io_fasta import load_fasta_filtered
from gwaslab.info.g_Log import Log
# Only load chromosomes present in your sumstats
chroms_in_sumstats = {"1", "2", "3"}
valid_chroms = {"1", "2", "3", "4", "5"}
chr_dict = {str(i): str(i) for i in range(1, 23)}
chr_dict_keys = set(chr_dict.keys())
records = load_fasta_filtered(
path="reference.fasta.gz",
chromlist_set=valid_chroms,
chroms_in_sumstats_set=chroms_in_sumstats,
chr_dict=chr_dict,
chr_dict_keys=chr_dict_keys,
log=Log(),
verbose=True
)
High-performance numpy array conversion
from gwaslab.io.io_fasta import load_and_build_fasta_records
from gwaslab.info.g_Log import Log
# Load and convert to numpy arrays in one pass
record, positions, lengths = load_and_build_fasta_records(
path="reference.fasta.gz",
chromlist_set={"1", "2"},
chroms_in_sumstats_set={"1", "2"},
chr_dict={"1": "1", "2": "2"},
chr_dict_keys={"1", "2"},
pos_as_dict=True,
log=Log(),
verbose=True
)
# Access sequence data
chr1_start = positions["1"]
chr1_length = lengths["1"]
chr1_seq = record[chr1_start:chr1_start + chr1_length]
Performance Tips
-
Use iterators for large files: When processing large FASTA files, use
as_dict=Falseorload_fasta_auto()to avoid loading everything into memory. -
Filter during loading: Use
load_fasta_filtered()to only load chromosomes you need, avoiding unnecessary memory usage. -
Use numpy arrays for fast lookup: For repeated sequence lookups, use
load_and_build_fasta_records()orbuild_fasta_records()to convert to numpy arrays. -
Compressed files: GWASLab automatically handles gzip and BGZF compression, so you can work directly with compressed files without manual decompression.