Python Basics
Introduction
This section provides a minimum introduction to Python programming for handling genomic data and conducting GWAS analyses. Python is a versatile, high-level programming language that is widely used in bioinformatics, data science, and statistical genetics.
If you are a beginner with no background in programming, this tutorial will help you learn the fundamental concepts needed to work with genomic data in Python.
For beginners
This tutorial assumes no prior programming experience. We'll start with the basics and gradually build up to more complex concepts relevant to genomic data analysis.
Why Python for genomics?
- Readability: Python code is easy to read and write
- Rich ecosystem: Extensive libraries for data analysis (pandas, numpy, scipy)
- Bioinformatics tools: Many genomics tools have Python interfaces
- Integration: Works well with command-line tools and other languages
- Community: Large community and extensive documentation
Table of Contents
- Getting Started
- Installation and setup
- Running Python code
- Basic Data Types
- Numbers, strings, booleans
- Variables and Operators
- Data Structures
- Lists, dictionaries, tuples, sets
- Control Flow
- Conditional statements, loops
- Functions
- Defining and using functions
- File Input/Output
- Reading and writing files
- Working with Libraries
- NumPy, pandas basics
- Examples for Genomics
- Practical genomics examples
Getting Started
Installation
Python is typically pre-installed on Linux and Mac systems. You can check if Python is installed:
Check Python version
$ python --version
Python 3.9.5
# Or try python3
$ python3 --version
Python 3.9.5
Python 2 vs Python 3
Python 2 is deprecated. Always use Python 3 (3.9 or higher recommended). Most systems use python3 command for Python 3.
Running Python Code
There are several ways to run Python code:
- Interactive Python (REPL): Type
pythonorpython3in terminal - Python scripts: Save code in
.pyfiles and run withpython script.py - Jupyter notebooks: Interactive environment for data analysis
Interactive Python session
$ python3
Python 3.9.5 (default, ...)
Type "help", "copyright", "credits" or "license" for more information.
>>> print("Hello, World!")
Hello, World!
>>> exit()
Python script
Create a file hello.py:
# hello.py
print("Hello, World!")
Run it:
$ python3 hello.py
Hello, World!
Comments
Comments help document your code. Use # for single-line comments:
Comments in Python
# This is a comment
print("Hello") # This is also a comment
"""
This is a multi-line comment
(actually a docstring)
"""
Basic Data Types
Python has several built-in data types. The most common ones are:
Numbers
Python supports integers and floating-point numbers:
Numbers
# Integers
x = 42
y = -10
# Floating-point numbers
pi = 3.14159
height = 175.5
# Basic arithmetic
sum_result = 10 + 5 # 15
product = 3 * 4 # 12
division = 10 / 3 # 3.333...
floor_division = 10 // 3 # 3 (integer division)
remainder = 10 % 3 # 1 (modulo)
power = 2 ** 3 # 8 (2 to the power of 3)
Strings
Strings are sequences of characters, enclosed in single or double quotes:
Strings
# String creation
name = "Alice"
chromosome = 'chr1'
variant_id = "rs123456"
# String operations
full_name = name + " Smith" # Concatenation: "Alice Smith"
repeated = "AT" * 3 # "ATATAT"
length = len(name) # 5
# String methods
text = " hello world "
text.strip() # "hello world" (remove whitespace)
text.upper() # " HELLO WORLD "
text.lower() # " hello world "
text.replace("world", "Python") # " hello Python "
# String formatting (f-strings, Python 3.6+)
variant = "rs123456"
p_value = 0.0001
message = f"Variant {variant} has p-value {p_value}"
# "Variant rs123456 has p-value 0.0001"
Working with genomic sequences
# DNA sequence
sequence = "ATGCGATCG"
sequence_length = len(sequence) # 9
gc_count = sequence.count("G") + sequence.count("C") # 4
gc_content = gc_count / sequence_length # 0.444...
# Reverse complement (simple example)
complement = sequence.replace("A", "t").replace("T", "a").replace("G", "c").replace("C", "g")
reverse_complement = complement.upper()[::-1]
Booleans
Booleans represent truth values: True or False:
Booleans
is_significant = True
is_rare = False
# Boolean operations
result1 = True and False # False
result2 = True or False # True
result3 = not True # False
# Comparisons return booleans
p_value = 0.0001
is_significant = p_value < 0.05 # True
Variables and Operators
Variables
Variables store values. In Python, you don't need to declare variable types:
Variables
# Variable assignment
variant_id = "rs123456"
chromosome = 1
position = 12345
p_value = 0.0001
is_significant = p_value < 5e-8
# Variable names should be descriptive
sample_size = 10000
allele_frequency = 0.25
effect_size = 0.15
Variable naming conventions
- Use lowercase with underscores:
variant_id,p_value - Be descriptive:
chris better thanc,p_valueis better thanp - Avoid Python keywords:
if,for,def,class, etc.
Operators
Python supports various operators:
Common operators
| Operator | Description | Example |
|---|---|---|
+ |
Addition | 3 + 5 → 8 |
- |
Subtraction | 10 - 3 → 7 |
* |
Multiplication | 4 * 5 → 20 |
/ |
Division | 10 / 3 → 3.333... |
// |
Floor division | 10 // 3 → 3 |
% |
Modulo (remainder) | 10 % 3 → 1 |
** |
Exponentiation | 2 ** 3 → 8 |
== |
Equality | 5 == 5 → True |
!= |
Inequality | 5 != 3 → True |
< |
Less than | 3 < 5 → True |
> |
Greater than | 5 > 3 → True |
<= |
Less than or equal | 3 <= 3 → True |
>= |
Greater than or equal | 5 >= 3 → True |
and |
Logical AND | True and False → False |
or |
Logical OR | True or False → True |
not |
Logical NOT | not True → False |
Operators in genomics context
# Calculate odds ratio from beta
beta = 0.2
odds_ratio = 2.718 ** beta # e^beta
# Check if variant passes QC filters
maf = 0.01
call_rate = 0.98
hwe_p = 0.001
passes_qc = (maf > 0.01) and (call_rate > 0.95) and (hwe_p > 0.0001)
# Check genome-wide significance
p_value = 1e-8
is_gws = p_value < 5e-8 # True
Data Structures
Python provides several built-in data structures for organizing data:
Lists
Lists are ordered, mutable sequences of items:
Lists
# Create a list
chromosomes = [1, 2, 3, 4, 5]
variants = ["rs123", "rs456", "rs789"]
mixed = [1, "rs123", 0.05, True]
# Access elements (indexing starts at 0)
first_chr = chromosomes[0] # 1
last_variant = variants[-1] # "rs789" (negative index from end)
# Modify elements
variants[0] = "rs111" # Change first element
# List operations
len(chromosomes) # 5
chromosomes.append(6) # Add element: [1, 2, 3, 4, 5, 6]
chromosomes.extend([7, 8]) # Add multiple: [1, 2, 3, 4, 5, 6, 7, 8]
# Slicing
first_three = chromosomes[0:3] # [1, 2, 3]
last_two = chromosomes[-2:] # [7, 8]
# List methods
p_values = [0.1, 0.05, 0.01, 0.001]
p_values.sort() # Sort in place: [0.001, 0.01, 0.05, 0.1]
min_p = min(p_values) # 0.001
max_p = max(p_values) # 0.1
Working with variant data in lists
# Store variant information
variant_ids = ["rs123456", "rs234567", "rs345678"]
chromosomes = [1, 1, 2]
positions = [12345, 23456, 34567]
p_values = [0.05, 1e-8, 0.001]
# Find significant variants
significant_indices = []
for i, p in enumerate(p_values):
if p < 5e-8:
significant_indices.append(i)
# Get significant variant IDs
sig_variants = [variant_ids[i] for i in significant_indices]
Dictionaries
Dictionaries store key-value pairs. They're very useful for organizing data:
Dictionaries
# Create a dictionary
variant_info = {
"rsid": "rs123456",
"chromosome": 1,
"position": 12345,
"p_value": 0.0001,
"effect_size": 0.15
}
# Access values
rsid = variant_info["rsid"] # "rs123456"
pval = variant_info.get("p_value", 1.0) # 0.0001 (with default)
# Modify values
variant_info["p_value"] = 1e-8
variant_info["maf"] = 0.25 # Add new key-value pair
# Dictionary methods
keys = variant_info.keys() # dict_keys(['rsid', 'chromosome', ...])
values = variant_info.values() # dict_values(['rs123456', 1, ...])
items = variant_info.items() # dict_items([('rsid', 'rs123456'), ...])
# Check if key exists
if "rsid" in variant_info:
print(variant_info["rsid"])
Storing multiple variants
# Dictionary of variants
variants = {
"rs123456": {
"chr": 1,
"pos": 12345,
"p": 1e-8,
"beta": 0.15
},
"rs234567": {
"chr": 2,
"pos": 23456,
"p": 0.05,
"beta": 0.08
}
}
# Access nested dictionary
rs123_chr = variants["rs123456"]["chr"] # 1
rs123_p = variants["rs123456"]["p"] # 1e-8
Tuples
Tuples are ordered, immutable sequences. Use them when you need a fixed collection:
Tuples
# Create a tuple
coordinates = (1, 12345) # (chromosome, position)
variant = ("rs123456", 1, 12345, 0.0001) # (rsid, chr, pos, p)
# Access elements
chromosome = coordinates[0] # 1
position = coordinates[1] # 12345
# Tuples are immutable (cannot be changed)
# coordinates[0] = 2 # This would cause an error
# Unpacking
rsid, chr, pos, p = variant
When to use tuples vs lists
- Tuples: Use when the collection shouldn't change (e.g., coordinates, fixed parameters)
- Lists: Use when you need to modify the collection (e.g., adding/removing variants)
Sets
Sets are unordered collections of unique elements:
Sets
# Create a set
chromosomes = {1, 2, 3, 4, 5}
unique_variants = {"rs123", "rs456", "rs789"}
# Set operations
chromosomes.add(6) # Add element
chromosomes.remove(1) # Remove element
len(chromosomes) # Get size
# Set operations (union, intersection, difference)
set1 = {1, 2, 3, 4}
set2 = {3, 4, 5, 6}
union = set1 | set2 # {1, 2, 3, 4, 5, 6}
intersection = set1 & set2 # {3, 4}
difference = set1 - set2 # {1, 2}
Finding unique variants
# Remove duplicates from a list
variant_list = ["rs123", "rs456", "rs123", "rs789", "rs456"]
unique_variants = set(variant_list) # {"rs123", "rs456", "rs789"}
unique_list = list(unique_variants) # Convert back to list
Control Flow
Control flow statements allow you to execute code conditionally or repeatedly.
Conditional Statements
Use if, elif, and else for conditional execution:
If statements
# Basic if statement
p_value = 0.0001
if p_value < 5e-8:
print("Genome-wide significant!")
elif p_value < 0.05:
print("Nominally significant")
else:
print("Not significant")
# Multiple conditions
maf = 0.01
call_rate = 0.98
if maf > 0.01 and call_rate > 0.95:
print("Passes QC")
else:
print("Fails QC")
Filtering variants by significance
variants = [
{"rsid": "rs123", "p": 1e-8, "chr": 1},
{"rsid": "rs456", "p": 0.05, "chr": 2},
{"rsid": "rs789", "p": 1e-9, "chr": 3}
]
significant_variants = []
for variant in variants:
if variant["p"] < 5e-8:
significant_variants.append(variant["rsid"])
print(significant_variants) # ["rs123", "rs789"]
Loops
Loops allow you to repeat code. Python has for and while loops:
For loops
# Iterate over a list
chromosomes = [1, 2, 3, 4, 5]
for chr in chromosomes:
print(f"Processing chromosome {chr}")
# Iterate with index
variants = ["rs123", "rs456", "rs789"]
for i, variant in enumerate(variants):
print(f"Index {i}: {variant}")
# Iterate over dictionary
variant_info = {"rsid": "rs123", "chr": 1, "pos": 12345}
for key, value in variant_info.items():
print(f"{key}: {value}")
# Range function
for i in range(5): # 0, 1, 2, 3, 4
print(i)
for i in range(1, 6): # 1, 2, 3, 4, 5
print(i)
While loops
# While loop
count = 0
while count < 5:
print(count)
count += 1
# Process until condition is met
p_value = 1.0
iterations = 0
while p_value > 0.05 and iterations < 100:
# Some calculation that updates p_value
p_value = p_value * 0.9
iterations += 1
List comprehensions
# List comprehension (concise way to create lists)
p_values = [0.1, 0.05, 0.01, 1e-8, 0.001]
# Traditional approach
significant = []
for p in p_values:
if p < 5e-8:
significant.append(p)
# List comprehension (more Pythonic)
significant = [p for p in p_values if p < 5e-8]
# More complex example
variants = ["rs123", "rs456", "rs789"]
variant_lengths = [len(v) for v in variants] # [5, 5, 5]
# With transformation
chromosomes = [1, 2, 3]
chr_strings = [f"chr{chr}" for chr in chromosomes] # ["chr1", "chr2", "chr3"]
Processing genomic data with loops
# Process variants from a file-like structure
variant_data = [
("rs123", 1, 12345, 0.05),
("rs456", 1, 23456, 1e-8),
("rs789", 2, 34567, 0.001)
]
# Count significant variants per chromosome
chr_counts = {}
for rsid, chr, pos, p in variant_data:
if p < 5e-8:
if chr in chr_counts:
chr_counts[chr] += 1
else:
chr_counts[chr] = 1
print(chr_counts) # {1: 1}
Functions
Functions allow you to organize code into reusable blocks:
Defining functions
# Simple function
def greet(name):
return f"Hello, {name}!"
result = greet("Alice")
print(result) # "Hello, Alice!"
# Function with multiple parameters
def calculate_odds_ratio(beta):
"""Calculate odds ratio from beta coefficient."""
import math
return math.exp(beta)
or_value = calculate_odds_ratio(0.2)
print(or_value) # ~1.22
Function with default parameters
def check_significance(p_value, threshold=5e-8):
"""Check if p-value is significant."""
return p_value < threshold
# Use default threshold
is_sig1 = check_significance(1e-8) # True (uses 5e-8)
# Use custom threshold
is_sig2 = check_significance(0.01, threshold=0.05) # True
Genomics-related functions
def calculate_gc_content(sequence):
"""Calculate GC content of a DNA sequence."""
sequence = sequence.upper()
gc_count = sequence.count("G") + sequence.count("C")
return gc_count / len(sequence) if len(sequence) > 0 else 0
def filter_variants(variants, min_maf=0.01, max_p=0.05):
"""Filter variants by MAF and p-value."""
filtered = []
for variant in variants:
if variant["maf"] >= min_maf and variant["p"] <= max_p:
filtered.append(variant)
return filtered
# Usage
seq = "ATGCGATCG"
gc = calculate_gc_content(seq)
print(f"GC content: {gc:.2%}") # GC content: 44.44%
variants = [
{"rsid": "rs123", "maf": 0.05, "p": 0.01},
{"rsid": "rs456", "maf": 0.005, "p": 0.03},
{"rsid": "rs789", "maf": 0.02, "p": 0.1}
]
filtered = filter_variants(variants, min_maf=0.01, max_p=0.05)
print([v["rsid"] for v in filtered]) # ["rs123"]
File Input/Output
Reading from and writing to files is essential for working with genomic data:
Reading files
# Read entire file
with open("variants.txt", "r") as f:
content = f.read()
# Read line by line
with open("variants.txt", "r") as f:
for line in f:
print(line.strip()) # strip() removes newline
# Read all lines into a list
with open("variants.txt", "r") as f:
lines = f.readlines()
# More Pythonic: read lines directly
with open("variants.txt", "r") as f:
lines = [line.strip() for line in f]
Writing files
# Write to file
variants = ["rs123", "rs456", "rs789"]
with open("output.txt", "w") as f:
for variant in variants:
f.write(f"{variant}\n")
# Write multiple lines at once
data = ["rs123\t1\t12345", "rs456\t2\t23456"]
with open("output.txt", "w") as f:
f.write("\n".join(data))
Processing tab-separated files (common in genomics)
# Read TSV file (like sumstats)
significant_variants = []
with open("sumstats.txt", "r") as f:
header = f.readline().strip().split("\t")
print(f"Columns: {header}")
for line in f:
fields = line.strip().split("\t")
variant_dict = dict(zip(header, fields))
# Convert p-value to float and check significance
p_value = float(variant_dict["P"])
if p_value < 5e-8:
significant_variants.append(variant_dict)
print(f"Found {len(significant_variants)} significant variants")
Writing TSV files
# Write variant data to TSV
variants = [
{"rsid": "rs123", "chr": 1, "pos": 12345, "p": 1e-8},
{"rsid": "rs456", "chr": 2, "pos": 23456, "p": 1e-9}
]
with open("significant_variants.txt", "w") as f:
# Write header
f.write("RSID\tCHR\tPOS\tP\n")
# Write data
for variant in variants:
f.write(f"{variant['rsid']}\t{variant['chr']}\t{variant['pos']}\t{variant['p']}\n")
File modes
"r": Read mode (default)"w": Write mode (overwrites existing file)"a": Append mode (adds to end of file)"r+": Read and write mode
Working with Libraries
Python's power comes from its extensive library ecosystem. For genomics and data analysis, key libraries include:
NumPy
NumPy provides arrays and numerical operations:
NumPy basics
import numpy as np
# Create arrays
p_values = np.array([0.1, 0.05, 0.01, 1e-8, 0.001])
# Array operations
log_p = -np.log10(p_values) # -log10(p)
significant = p_values < 5e-8 # Boolean array
sig_count = np.sum(significant) # Count of True values
# Statistical functions
mean_p = np.mean(p_values)
min_p = np.min(p_values)
max_p = np.max(p_values)
# Array indexing
first_three = p_values[0:3] # [0.1, 0.05, 0.01]
significant_p = p_values[significant] # Get only significant p-values
Pandas
Pandas provides DataFrames for working with tabular data (like Excel spreadsheets):
Pandas basics
import pandas as pd
# Read TSV file into DataFrame
df = pd.read_csv("sumstats.txt", sep="\t")
# Basic operations
print(df.head()) # First 5 rows
print(df.shape) # (rows, columns)
print(df.columns) # Column names
print(df.dtypes) # Data types
# Access columns
p_values = df["P"]
chromosomes = df["CHROM"]
# Filtering
significant = df[df["P"] < 5e-8]
chr1_variants = df[df["CHROM"] == 1]
# Multiple conditions
filtered = df[(df["P"] < 5e-8) & (df["CHROM"] == 1)]
# Add new column
df["MLOG10P"] = -np.log10(df["P"])
# Group operations
chr_counts = df.groupby("CHROM").size()
mean_p_by_chr = df.groupby("CHROM")["P"].mean()
# Write to file
significant.to_csv("significant.txt", sep="\t", index=False)
Common pandas operations for GWAS
import pandas as pd
import numpy as np
# Read sumstats
df = pd.read_csv("sumstats.txt", sep="\t")
# Basic QC filters
qc_passed = df[
(df["MAF"] > 0.01) & # Minor allele frequency > 1%
(df["INFO"] > 0.8) & # Imputation quality > 0.8
(df["P"] > 0) & # Valid p-values
(df["P"] <= 1)
]
# Calculate -log10(p)
qc_passed["MLOG10P"] = -np.log10(qc_passed["P"])
# Find genome-wide significant variants
gws = qc_passed[qc_passed["P"] < 5e-8]
# Summary statistics
print(f"Total variants: {len(df)}")
print(f"After QC: {len(qc_passed)}")
print(f"Genome-wide significant: {len(gws)}")
print(f"Mean p-value: {qc_passed['P'].mean():.2e}")
# Save results
gws.to_csv("gws_variants.txt", sep="\t", index=False)
Installing libraries
Use pip to install Python packages:
pip install numpy pandas
# or
pip3 install numpy pandas
Examples for Genomics
Here are some practical examples combining the concepts above:
Parse VCF-like data
def parse_vcf_line(line):
"""Parse a line from a VCF file."""
fields = line.strip().split("\t")
if len(fields) < 8:
return None
return {
"chr": fields[0],
"pos": int(fields[1]),
"id": fields[2],
"ref": fields[3],
"alt": fields[4],
"qual": fields[5],
"filter": fields[6],
"info": fields[7]
}
# Read and parse VCF
variants = []
with open("variants.vcf", "r") as f:
for line in f:
if line.startswith("#"):
continue # Skip header lines
variant = parse_vcf_line(line)
if variant:
variants.append(variant)
print(f"Found {len(variants)} variants")
Calculate allele frequencies
def calculate_maf(genotype_counts):
"""
Calculate minor allele frequency from genotype counts.
genotype_counts: dict with keys 'AA', 'Aa', 'aa' and counts as values
"""
total = sum(genotype_counts.values())
if total == 0:
return 0.0
# Count alleles
a_count = 2 * genotype_counts.get('AA', 0) + genotype_counts.get('Aa', 0)
total_alleles = 2 * total
maf = a_count / total_alleles
# Return the minor (less frequent) allele frequency
return min(maf, 1 - maf)
# Example usage
counts = {'AA': 100, 'Aa': 50, 'aa': 10}
maf = calculate_maf(counts)
print(f"Minor allele frequency: {maf:.3f}")
Filter and summarize GWAS results
def process_gwas_results(filename, p_threshold=5e-8):
"""Process GWAS summary statistics."""
results = {
"total": 0,
"significant": 0,
"by_chromosome": {}
}
with open(filename, "r") as f:
header = f.readline().strip().split("\t")
chr_idx = header.index("CHROM")
p_idx = header.index("P")
for line in f:
fields = line.strip().split("\t")
if len(fields) <= max(chr_idx, p_idx):
continue
results["total"] += 1
chromosome = fields[chr_idx]
p_value = float(fields[p_idx])
if p_value < p_threshold:
results["significant"] += 1
if chromosome not in results["by_chromosome"]:
results["by_chromosome"][chromosome] = 0
results["by_chromosome"][chromosome] += 1
return results
# Usage
results = process_gwas_results("sumstats.txt")
print(f"Total variants: {results['total']}")
print(f"Genome-wide significant: {results['significant']}")
print("By chromosome:")
for chr, count in sorted(results["by_chromosome"].items()):
print(f" Chr {chr}: {count}")
Convert between data formats
def convert_sumstats_to_bed(sumstats_file, output_file):
"""Convert sumstats to BED format (chr, start, end, name)."""
with open(sumstats_file, "r") as infile, open(output_file, "w") as outfile:
header = infile.readline().strip().split("\t")
chr_idx = header.index("CHROM")
pos_idx = header.index("POS")
id_idx = header.index("ID")
# BED format: chr, start, end, name
outfile.write("chr\tstart\tend\tname\n")
for line in infile:
fields = line.strip().split("\t")
if len(fields) <= max(chr_idx, pos_idx, id_idx):
continue
chromosome = fields[chr_idx]
position = int(fields[pos_idx])
variant_id = fields[id_idx]
# BED uses 0-based coordinates, end is exclusive
# For point variants, use position-1 as start, position as end
outfile.write(f"{chromosome}\t{position-1}\t{position}\t{variant_id}\n")
# Usage
convert_sumstats_to_bed("sumstats.txt", "variants.bed")
Best Practices
Code organization
- Use meaningful variable names
- Add comments for complex logic
- Break code into functions
- Keep functions focused on one task
Error handling
# Use try-except for error handling
try:
p_value = float(variant["P"])
except (KeyError, ValueError) as e:
print(f"Error processing variant: {e}")
p_value = None
Code style
- Follow PEP 8 style guide
- Use 4 spaces for indentation (not tabs)
- Keep lines under 79-99 characters
- Use descriptive function and variable names
Next Steps
Now that you understand Python basics, you can:
- Learn pandas in depth: Essential for working with genomic data tables
- Explore bioinformatics libraries: Biopython, pysam, pyvcf
- Practice with real data: Work with actual GWAS summary statistics
- Learn visualization: Matplotlib, seaborn for plotting
- Advanced topics: Object-oriented programming, modules, packages
References
- Official Python documentation: https://docs.python.org/3/
- Pandas documentation: https://pandas.pydata.org/docs/
- NumPy documentation: https://numpy.org/doc/
- Python for Data Analysis (book by Wes McKinney)