
Introduction

This section provides a minimal introduction to the Linux command line for handling genomic data. (If you are already familiar with Linux commands, it is completely fine to skip this section.)

If you are a beginner with no background in programming, it helps to learn some basic commands before starting any analysis. In this section, we introduce the most basic commands, which enable you to handle genomic files from the terminal on a Linux system.

For Mac users

This tutorial will probably work with no problems. Simply open your terminal and follow the tutorial. (Note: A few commands might be different on macOS.)

For Windows users

You can simply install WSL to get a linux environment. Please check here for how to install WSL.


Linux System Introduction

What is Linux?

Term Description
Linux refers to a family of open-source Unix-like operating systems based on the Linux kernel.
Linux kernel a free and open-source Unix-like operating system kernel, which controls the software and hardware of the computer.
Linux distributions refer to operating systems made from a software collection that is based upon the Linux kernel.

Main functions of the Linux kernel

  • System memory management
  • Software process management
  • Hardware device drivers
  • File system management

Some of the most common linux distributions

image

Linux and Linus

Linux is named after Linus Benedict Torvalds, a legendary Finnish software engineer who led the development of the Linux kernel. He also developed the amazing version control software Git.

Reference: https://en.wikipedia.org/wiki/Linux

How do we interact with computers?

  • Graphical User Interface (GUI): allows users to interact with computers through graphical icons
  • Character User Interface (CUI): allows users to interact with computers through command lines

GUI and CUI

image

Shell

  • A shell provides the actual interface for you to interact with the Linux system. When you type commands in a shell, it collects and executes them.
  • $ is the prompt for the bash shell, which indicates that you can type commands after the $ sign.
  • Different shells might use other signs for the prompt. For example, the default zsh on Mac uses %, and C shell uses > as the prompt sign.
  • There are multiple available shells which differ in their features. For a typical linux system, the default shell is bash.

A general comparison between CUI and GUI

GUI CUI
Interaction Graphics Command line
Precision LOW HIGH
Speed LOW HIGH
Memory required HIGH LOW
Ease of operation Easier More difficult
Flexibility MORE flexible LESS flexible

Tip

The reason why we want to use CUI for large-scale data analysis is that CUI is better in terms of precision, memory usage and processing speed.

Overview of the basic commands in Linux

Unlike clicking and dragging files in Windows or MacOS, in Linux, we usually handle files by typing commands in the terminal.

image

Here is a list of the basic commands we are going to cover in this brief tutorial:

Basic Linux commands

Function group Commands Description
Directories pwd, ls, mkdir, rmdir Commands for checking, creating and removing directories
Files touch,cp,mv,rm Commands for creating, copying, moving and removing files
Checking files cat,zcat,head,tail,less,more,wc Commands for inspecting files
Archiving and compression tar,gzip,gunzip,zip,unzip Commands for Archiving and Compressing files
Manipulating text sort,uniq,cut,join,tr Commands for manipulating text files
Modifying permission chmod,chown, chgrp Commands for changing the permissions of files and directories
Links ln Commands for creating symbolic and hard links
Pipe, redirect and others pipe, >,>>,*,.,.. A group of miscellaneous commands
Advanced text editing awk, sed Commands for more complicated text manipulation and editing

How to check the usage of a command using man:

The first command we might want to learn is man, which shows the manual for a certain command. When you forget how to use a command, you can always use man to check.

man : Check the manual of a command (e.g., man chmod) or --help option (e.g., chmod --help)

For example, we want to check the usage of pwd:

Use man to get the manual for commands

$ man pwd
Then you will see the manual of pwd in your terminal.
PWD(1)                                              User Commands                                              PWD(1)

NAME
       pwd - print name of current/working directory

SYNOPSIS
       pwd [OPTION]...

DESCRIPTION
       Print the full filename of the current working directory.
....

Explain shell

Or you can use this wonderful website to get explanations for your commands.

URL : https://explainshell.com/

image

Using LLMs in the AI era

In the AI era, Large Language Models (LLMs) like ChatGPT, Claude, or GitHub Copilot can be powerful tools for understanding and learning Linux commands. You can ask questions like:

  • "What does the command chmod 755 file.txt do?"
  • "How do I find all files modified in the last 7 days?"
  • "Explain the difference between grep and awk"

LLMs can provide explanations, examples, and even help troubleshoot command errors. However, always verify LLM responses, especially for critical operations, as they may occasionally provide incorrect or outdated information.

Commands

Directories

The first set of commands are: pwd , cd , ls, mkdir and rmdir, which are related to directories (like the folders in a Windows system).


pwd

pwd : Print working directory, which means printing the path of the current directory (working directory)

Use pwd to print the current directory you are in

$ pwd
/home/he/work/GWASTutorial/02_Linux_basics

This command prints the absolute path.

An example of Linux file system and file paths

image

Type Description Example
Absolute path path starting from root (the orange path) /home/User3/GWASTutorial/02_Linux_basics/README.md
Relative path path starting from the current directory (the blue path) ./GWASTutorial/02_Linux_basics/README.md

Tip: use readlink to obtain the absolute path of a file

To get the absolute path of a file, you can use readlink -f [filename].

$ readlink -f README.md 
/home/he/work/GWASTutorial/02_Linux_basics/README.md

cd

cd: Change the current working directory.

Use cd to change directory to 02_Linux_basics and then print the current directory

$ cd 02_Linux_basics
$ pwd
/home/he/work/GWASTutorial/02_Linux_basics
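
To make the difference between relative and absolute paths concrete, here is a minimal sketch that builds a throwaway directory tree and reads the same file both ways. The names project, data and notes.txt are made up for illustration, and mktemp -d is used only so that nothing in your real directories is touched.

```shell
# Work in a temporary directory so that no real files are affected
workdir=$(mktemp -d)
mkdir -p "$workdir/project/data"
echo "hello" > "$workdir/project/data/notes.txt"

cd "$workdir/project"

# Relative path: resolved starting from the current directory
relative_content=$(cat ./data/notes.txt)

# Absolute path: starts from the root (/), so it works from any directory
absolute_content=$(cat "$workdir/project/data/notes.txt")

echo "$relative_content"
echo "$absolute_content"
```

Both commands print the same content; only the absolute path would keep working after a cd to a different directory.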

ls

ls : List the contents in the working directory

Some frequently used options for ls :

  • -l: in a list-like format
  • -h: convert file size into a human readable format (KB,MB,GB...)
  • -a: list all files (including hidden files, namely those files with a period at the beginning of the filename)

Simply list the files and directories in the current directory

$ ls
README.md  sumstats.txt

List the files and directories with options -lha

$ ls -lha
drwxr-xr-x   4 he  staff   128B Dec 23 14:07 .
drwxr-xr-x  17 he  staff   544B Dec 23 12:13 ..
-rw-r--r--   1 he  staff     0B Oct 17 11:24 README.md
-rw-r--r--   1 he  staff    31M Dec 23 14:07 sumstats.txt

Tip: use tree to visualize the structure of a directory

You can use tree command to visualize the structure of a directory.

$ tree ./02_Linux_basics/
./02_Linux_basics/
├── README.md
└── sumstats.txt

0 directories, 2 files

mkdir & rmdir

  • mkdir : Create a new empty directory
  • rmdir: Delete an empty directory

Make a directory and delete it

$ mkdir new_directory
$ ls
new_directory  README.md  sumstats.txt
$ rmdir new_directory/
$ ls
README.md  sumstats.txt
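
As a small extension of the example above, the sketch below shows two behaviours that often surprise beginners: mkdir needs the -p option to create nested directories in one step, and rmdir refuses to remove a directory that is not empty. The directory names (results, chr1, plots) are hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# -p creates any missing parent directories in one step
mkdir -p results/chr1/plots

# rmdir refuses to delete a directory that still has contents
if rmdir results 2>/dev/null; then
    rmdir_status="removed"
else
    rmdir_status="not empty"
fi
echo "$rmdir_status"

# Removing from the innermost directory outwards works
rmdir results/chr1/plots
rmdir results/chr1
rmdir results
```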

Manipulating files

This set of commands includes: touch, mv , rm and cp


touch

touch command is used to create a new empty file.

Create an empty text file called newfile.txt in this directory

$ ls -l
total 64048
-rw-r--r--  1 he  staff         0 Oct 17 11:24 README.md
-rw-r--r--  1 he  staff  32790417 Dec 23 14:07 sumstats.txt

touch newfile.txt

$ touch newfile.txt
$ ls -l
total 64048
-rw-r--r--  1 he  staff         0 Oct 17 11:24 README.md
-rw-r--r--  1 he  staff         0 Dec 23 14:14 newfile.txt
-rw-r--r--  1 he  staff  32790417 Dec 23 14:07 sumstats.txt

mv

mv has two functions:

  • (1) move files to another path
  • (2) rename files

The following commands create a new directory called new_directory and move sumstats.txt into that directory, just like dragging a file into a folder in a Windows system.

Move a file to a different directory

# make a new directory
$ mkdir new_directory

#move sumstats to the new directory
$ mv sumstats.txt new_directory/

# list the item in new_directory
$ ls new_directory/
sumstats.txt

Now, let's move it back to the current directory and rename it to sumstats_new.txt.

Rename a file using mv

$ mv ./new_directory/sumstats.txt ./

Note: ./ means the current directory.

You can also use mv to rename a file:

# rename
$ mv sumstats.txt sumstats_new.txt


rm

rm : Remove files or directories

Remove a file and a directory

# remove a file
$ rm file

# remove a directory and everything inside it (recursive mode)
$ rm -r directory/

There is no trash can in Linux command-line interface

If you delete a file with rm , it will be very difficult to restore it. Please be careful when using rm.


cp

cp command is used to copy files or directories.

Copy a file and a directory

#cp files
$ cp file1 file2

# copy directory
$ cp -r directory1/ directory2/

Finding files

find

find is a powerful command for locating files and directories in the file system. It's especially useful when working with large genomic datasets spread across multiple directories.

Basic syntax: find [path] [expression]

Common options:

  • -name: search by filename (supports wildcards)
  • -type: specify file type (f for files, d for directories)
  • -size: search by file size
  • -mtime: search by modification time
  • -exec: execute a command on found files

Find files by name

# Find all .txt files in current directory and subdirectories
$ find . -name "*.txt"

# Find files named exactly "sumstats.txt"
$ find . -name "sumstats.txt"

# Case-insensitive search
$ find . -iname "*.TXT"

Find files by type and size

# Find all files (not directories) larger than 100MB
$ find . -type f -size +100M

# Find all directories
$ find . -type d

# Find files between 1MB and 10MB
$ find . -type f -size +1M -size -10M

Find and execute commands

# Find all .gz files and show their sizes
$ find . -name "*.gz" -exec ls -lh {} \;

# Find and delete old log files (be careful!)
$ find . -name "*.log" -mtime +30 -delete
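
Because options such as -delete act immediately, it can help to preview the matches before acting on them. The following is only a sketch on a made-up directory layout (run1, run2/logs) created in a temporary directory; here the matches are listed and counted rather than deleted.

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/run1" "$workdir/run2/logs"
touch "$workdir/run1/result.txt" "$workdir/run2/result.txt" "$workdir/run2/logs/run.log"

# Step 1: preview which files match the pattern
find "$workdir" -type f -name "*.log"

# Step 2: act on the matches; here they are only counted
n_txt=$(find "$workdir" -type f -name "*.txt" | wc -l)
echo "$((n_txt))"
```

Once the preview shows exactly the files you expect, the same find expression can be reused with -exec or -delete.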

Finding files vs locate

  • find: searches in real-time, more accurate but slower
  • locate: uses a database, faster but may be outdated

Use find when you need current results, especially for recently created files.

A symbolic link is like a shortcut in a Windows system: it is a special type of file that points to another file.

It is very useful when you want to organize your toolbox or working space.

You can use ln -s pathA pathB to create such a link: pathA is the original file, and pathB is where the link is created.

Create a symbolic link for plink

Let's create a symbolic link for plink first.

# /home/he/tools/plink/plink is the original file
# /home/he/tools/bin is the path for the symbolic link 
ln -s /home/he/tools/plink/plink /home/he/tools/bin

And then check the link.

cd /home/he/tools/bin
ls -lha
lrwxr-xr-x  1 he  staff    27B Aug 30 11:30 plink -> /home/he/tools/plink/plink
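
The same mechanism is also handy for data files. The sketch below, which uses hypothetical directory names (shared_data, project) inside a temporary directory, links a file into a working directory instead of copying it, reads it through the link, and shows that removing the link leaves the original untouched.

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/shared_data" "$workdir/project"
echo "CHROM POS P" > "$workdir/shared_data/sumstats.txt"

# ln -s TARGET LINK_NAME: the link itself takes almost no disk space
ln -s "$workdir/shared_data/sumstats.txt" "$workdir/project/sumstats.txt"

# Reading through the link returns the content of the original file
link_content=$(cat "$workdir/project/sumstats.txt")
echo "$link_content"

# readlink prints the path the link points to
link_target=$(readlink "$workdir/project/sumstats.txt")
echo "$link_target"

# Removing the link does not remove the original file
rm "$workdir/project/sumstats.txt"
ls "$workdir/shared_data"
```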

Archiving and Compression

Results for millions of variants are usually very large, sometimes >10GB, or consist of multiple files.

To save space and make it easier to transfer, we need to archive and compress these files.

  • Archiving: combine multiple files in a single file.
  • Compression: make the file size smaller without losing any information by encoding the data in a more compact binary form.

Archiving and Compression

image

Commonly used commands for archiving and compression:

Extensions Create Extract Functions
file.gz gzip gunzip compress
files.tar tar -cvf tar -xvf archive
files.tar.gz or files.tgz tar -czvf tar -xvzf archive and compress
file.zip zip unzip archive and compress

Compress and decompress a file using gzip and gunzip

$ ls -lh
-rw-r--r--  1 he  staff    31M Dec 23 14:07 sumstats.txt

$ gzip sumstats.txt
$ ls -lh
-rw-r--r--  1 he  staff   9.9M Dec 23 14:07 sumstats.txt.gz

$ gunzip sumstats.txt.gz
$ ls -lh
-rw-r--r--   1 he  staff    31M Dec 23 14:07 sumstats.txt

Create and extract a tar archive

# Create a tar archive (archive only, no compression)
$ tar -cvf archive.tar file1.txt file2.txt directory/

# Extract a tar archive
$ tar -xvf archive.tar

# Create a compressed tar archive (.tar.gz)
$ tar -czvf archive.tar.gz file1.txt file2.txt directory/

# Extract a compressed tar archive
$ tar -xzvf archive.tar.gz

# List contents of a tar archive without extracting
$ tar -tzvf archive.tar.gz

tar options explained

  • -c: create archive
  • -x: extract archive
  • -v: verbose (show files being processed)
  • -f: specify filename
  • -z: use gzip compression
  • -t: list contents

Read and check files

We have a group of handy commands to check part of or the entire file, including cat, zcat, less, more, head, tail, wc, grep


cat

cat command can print the contents of files or concatenate the files.

Create and then cat the file a_text_file.txt

$ ls -lha > a_text_file.txt
$ cat a_text_file.txt 
total 32M
drwxr-x---  2 he staff 4.0K Apr  2 00:37 .
drwxr-x--- 29 he staff 4.0K Apr  1 22:20 ..
-rw-r-----  1 he staff    0 Apr  2 00:37 a_text_file.txt
-rw-r-----  1 he staff 5.0K Apr  1 22:20 README.md
-rw-r-----  1 he staff  32M Mar 30 18:17 sumstats.txt

Warning

Be careful not to cat a text file with a huge number of lines. You can try to cat sumstats.txt and see what happens.

By the way, > a_text_file.txt here means redirect the output to file a_text_file.txt.


zcat

zcat is similar to cat, but can only be applied to compressed files.

cat and zcat a gzipped text file

$ gzip a_text_file.txt 
$ cat a_text_file.txt.gz
(Binary output - compressed file content)

$ zcat a_text_file.txt.gz 
total 32M
drwxr-x---  2 he staff 4.0K Apr  2 00:37 .
drwxr-x--- 29 he staff 4.0K Apr  1 22:20 ..
-rw-r-----  1 he staff    0 Apr  2 00:37 a_text_file.txt
-rw-r-----  1 he staff 5.0K Apr  1 22:20 README.md
-rw-r-----  1 he staff  32M Mar 30 18:17 sumstats.txt

gzcat

Use gzcat instead of zcat if your device is running MacOS.
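
If you want one command that behaves the same on Linux and macOS, gzip -dc (decompress to standard output) is a common alternative. The sketch below builds a tiny stand-in for a compressed summary statistics file (the file name small.txt and its three columns are invented for the example) and inspects it without creating a decompressed copy on disk.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Build a small tab-separated file and compress it
printf 'CHROM\tPOS\tP\n1\t319\t0.39\n1\t418\t0.59\n' > small.txt
gzip small.txt

# gzip -dc writes the decompressed content to standard output
# and leaves small.txt.gz untouched
header=$(gzip -dc small.txt.gz | head -n 1)
echo "$header"

# Count the lines of the compressed file without decompressing it on disk
n_lines=$(gzip -dc small.txt.gz | wc -l)
echo "$((n_lines))"
```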


less and more

less and more are file viewers that allow you to view files page by page, which is much better than cat for large files. less is more advanced and recommended.

  • less: View file content page by page (can scroll up and down)
  • more: View file content page by page (can only scroll forward)

Common operations in less:

  • Space or Page Down: move forward one page
  • b or Page Up: move backward one page
  • q: quit
  • /pattern: search for a pattern (press n for next match)
  • G: go to end of file
  • g: go to beginning of file

View a large file using less

$ less sumstats.txt
This will open the file in less viewer. Use q to quit.

Why use less instead of cat?

For large genomic files, cat will print everything to the terminal, which can be overwhelming and slow. less allows you to navigate through the file efficiently without loading everything at once.


head

head: Print the first 10 lines of a file.

-n: option to change the number of lines.

Check the first 10 lines and only the first line of the file sumstats.txt

$ head sumstats.txt 
CHROM   POS ID  REF ALT A1  TEST    OBS_CT  OR  LOG(OR)_SE  Z_STAT  P   ERRCODE
1   319 17  2   1   1   ADD 10000   1.04326 0.0495816   0.854176    0.393008    .
1   319 22  1   2   2   ADD 10000   1.03347 0.0493972   0.666451    0.505123    .
1   418 23  1   2   2   ADD 10000   1.02668 0.0498185   0.528492    0.597158    .
1   537 30  1   2   2   ADD 10000   1.01341 0.0498496   0.267238    0.789286    .
1   546 31  2   1   1   ADD 10000   1.02051 0.0336786   0.60284 0.546615    .
1   575 33  2   1   1   ADD 10000   1.09795 0.0818305   1.14199 0.25346 .
1   752 44  2   1   1   ADD 10000   1.02038 0.0494069   0.408395    0.682984    .
1   913 50  2   1   1   ADD 10000   1.07852 0.0493585   1.53144 0.12566 .
1   1356    77  2   1   1   ADD 10000   0.947521    0.0339805   -1.5864 0.112649    .

$ head -n 1 sumstats.txt 
CHROM   POS ID  REF ALT A1  TEST    OBS_CT  OR  LOG(OR)_SE  Z_STAT  P   ERRCODE

tail

Similar to head, you can use tail to check the last 10 lines. -n works in the same way.

Check the last 10 lines of the file sumstats.txt

$ tail sumstats.txt 
22  99996057    9959945 2   1   1   ADD 10000   1.03234 0.0335547   0.948413    0.342919    .
22  99996465    9959971 2   1   1   ADD 10000   1.04755 0.0337187   1.37769 0.1683  .
22  99997041    9960013 2   1   1   ADD 10000   1.01942 0.0937548   0.205195    0.837419    .
22  99997608    9960051 2   1   1   ADD 10000   0.969928    0.0397711   -0.767722   0.442652    .
22  99997629    9960055 2   1   1   ADD 10000   0.986949    0.0395305   -0.332315   0.739652    .
22  99997742    9960061 2   1   1   ADD 10000   0.990829    0.0396614   -0.232298   0.816307    .
22  99998121    9960086 2   1   1   ADD 10000   1.04448 0.0335879   1.29555 0.19513 .
22  99998455    9960106 2   1   1   ADD 10000   0.880953    0.152754    -0.829771   0.406668    .
22  99999208    9960146 2   1   1   ADD 10000   0.944604    0.065187    -0.874248   0.381983    .
22  99999382    9960164 2   1   1   ADD 10000   0.970509    0.033978    -0.881014   0.37831 .

wc

wc: short for word count, which counts the lines, words, and bytes in a file (for plain ASCII text, the number of bytes equals the number of characters).

For example,

Count the lines, words, and characters in sumstats.txt

$ wc sumstats.txt 
  445933  5797129 32790417 sumstats.txt
This means that sumstats.txt has 445933 lines, 5797129 words, and 32790417 bytes (the same number that ls -l reports as the file size).
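
In practice, the line count is usually the number of interest. A minimal sketch, using an invented three-line file in place of sumstats.txt, of how to get only that number and derive the number of data rows from it:

```shell
workdir=$(mktemp -d)
printf 'CHROM\tPOS\tP\n1\t319\t0.39\n1\t418\t0.59\n' > "$workdir/small.txt"

# -l prints only the line count; reading the file via < omits the filename
total_lines=$(wc -l < "$workdir/small.txt")

# Subtract the header line to get the number of data rows
n_variants=$((total_lines - 1))
echo "$n_variants"
```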


grep

grep is one of the most powerful and frequently used commands for searching patterns in files. It's essential for filtering genomic data.

Basic syntax: grep [options] pattern [file(s)]

Common options:

  • -i: case-insensitive search
  • -v: invert match (show lines that don't match)
  • -n: show line numbers
  • -c: count matching lines
  • -r or -R: recursive search in directories
  • -l: show only filenames with matches

Search for a pattern in a file

# Search for lines containing "CHROM" in sumstats.txt
$ grep "CHROM" sumstats.txt
CHROM   POS ID  REF ALT A1  TEST    OBS_CT  OR  LOG(OR)_SE  Z_STAT  P   ERRCODE

# Search with line numbers
$ grep -n "CHROM" sumstats.txt
1:CHROM POS ID  REF ALT A1  TEST    OBS_CT  OR  LOG(OR)_SE  Z_STAT  P   ERRCODE

# Count how many lines contain "ADD"
$ grep -c "ADD" sumstats.txt
445932

# Case-insensitive search
$ grep -i "chrom" sumstats.txt

Search in multiple files

# Search for "variant" in all .txt files
$ grep "variant" *.txt

# Recursive search in all subdirectories
$ grep -r "pattern" /path/to/directory/

Combine grep with pipes

# Find lines containing "ADD" and count them
$ grep "ADD" sumstats.txt | wc -l

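# Text match for values such as 0.012 or 0.049 in ANY column (not a numeric test on the P column)
$ grep -E "0\.0[0-4][0-9]" sumstats.txt | head

# Numeric filter on the P column (column 12 in this file), skipping the header line
$ awk 'NR > 1 && $12 < 0.05' sumstats.txt | head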

Regular expressions with grep

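grep supports regular expressions. Use -E for extended regex. Note that grep -E does not treat \t as a tab character, so in bash the tab is written with $'...' quoting:

# Find lines starting with "1" followed by a tab (chromosome 1)
$ grep -E $'^1\t' sumstats.txt | head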

Edit files

Vim is a handy text editor for the command line.

Vim - text editor

vim README.md

image

Simple workflow using Vim

  1. vim file_to_edit.txt
  2. Press i to enter the INSERT mode.
  3. Edit the file.
  4. When finished, just press Esc key to escape the INSERT mode.
  5. Then enter :wq to quit and also save the file.

Vim is a little bit hard to learn for beginners, but when you get familiar with it, it will be a mighty and convenient tool. For more detailed tutorials on Vim, you can check: Learn-Vim

Other common command line text editors

Permission

The permissions of a file or directory are represented as a 10-character string (1+3+3+3) :

For example, this represents a directory(the initial d) which is readable, writable and executable for the owner(the first 3: rwx), users in the same group(the 3 characters in the middle: rwx) and others (last 3 characters: rwx).

drwxrwxrwx

-> d (directory or file) rwx (permissions for owner) rwx (permissions for users in the same group) rwx (permissions for other users)

Notation Description
r readable
w writable
x executable
d directory
- file

Command for checking the permissions of files in the current directory: ls -l

Commands for changing permissions and ownership: chmod (permissions), chown (owner), chgrp (group)

Syntax:

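chmod [3-digit number notation] [path]

Each digit sets the permissions for one group of users, in the order owner, group, others. The table below lists the digits; the binary column shows which of r, w and x are switched on.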

Number notation Permission 3-digit Binary notation
7 rwx 111
6 rw- 110
5 r-x 101
4 r-- 100
3 -wx 011
2 -w- 010
1 --x 001
0 --- 000

Change the permissions of the file README.md to 660

# there is a readme file in the directory, and its permissions are -rw-r----- 
$ ls -lh
total 4.0K
-rw-r----- 1 he staff 2.1K Feb 24 01:16 README.md

# let's change the permissions to 660, which is the numeric notation of -rw-rw---- based on the table above
$ chmod 660 README.md 

# check again, and it was changed.
$ ls -lh
total 4.0K
-rw-rw---- 1 he staff 2.1K Feb 24 01:16 README.md
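
Each digit is simply the sum of r = 4, w = 2 and x = 1. The sketch below applies this to a hypothetical file data.txt and also shows the equivalent symbolic notation (u for owner, g for group, o for others), which is an alternative not covered in the table above.

```shell
workdir=$(mktemp -d)
cd "$workdir"
touch data.txt

# 6 = 4 (r) + 2 (w); 4 = 4 (r); 0 = no permissions
# so 640 means rw- for the owner, r-- for the group, --- for others
chmod 640 data.txt
numeric_mode=$(ls -l data.txt | cut -c 1-10)
echo "$numeric_mode"

# The same permissions written in symbolic notation
chmod u=rw,g=r,o= data.txt
symbolic_mode=$(ls -l data.txt | cut -c 1-10)
echo "$symbolic_mode"
```

For files containing human genomic data, a restrictive mode such as 640 or 600 is usually a safer starting point than anything that grants access to others.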

Note

These commands are very important because we work with genomic data, which could raise severe ethical and privacy issues if there is a data leak.

Warning

Please always be cautious when handling human genomic data.

Disk space management

Genomic files can be very large, so it's important to monitor disk space usage. Two essential commands for this are df and du.

df

df (disk free) shows disk space usage for file systems.

Common options:

  • -h: human-readable format (KB, MB, GB)
  • -T: show file system type

Check disk space usage

# Show disk space in human-readable format
$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       500G  200G  275G  43% /
/dev/sdb1       1.0T  800G  150G  80% /data

# Show file system types
$ df -hT

du

du (disk usage) shows the size of directories and files.

Common options:

  • -h: human-readable format
  • -s: summarize (show total only)
  • -d or --max-depth: limit depth of directory traversal
  • -a: show files as well as directories

Check directory sizes

# Show size of current directory and subdirectories
$ du -h

# Show total size of current directory only
$ du -sh .

# Show sizes up to 2 levels deep
$ du -h -d 2

# Find largest directories
$ du -h | sort -hr | head -10

Managing large genomic files

Regularly check disk space with df -h and identify large directories with du -sh * to avoid running out of space during analysis.
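
One way to find out what is taking up the space is to sort the per-item sizes. The sketch below creates two hypothetical directories of very different sizes in a temporary location and ranks them. It uses du -sk (sizes in kilobytes) with a plain numeric sort, an assumption made for portability; on systems where sort supports -h, du -sh * | sort -hr gives the same ranking in human-readable units.

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir small_dir big_dir

# Create files of clearly different sizes (about 4 KB and about 4 MB)
head -c 4096 /dev/urandom > small_dir/small.bin
head -c 4194304 /dev/urandom > big_dir/big.bin

# Size of each item in kilobytes, largest first
du -sk * | sort -rn

# Name of the largest item (du separates size and name with a tab)
largest=$(du -sk * | sort -rn | head -n 1 | cut -f 2)
echo "$largest"
```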

Process management

When running long genomic analyses, you need to manage processes effectively. Here are essential commands for process management.

ps

ps shows information about running processes.

Common options:

  • ps aux: show all processes with detailed info
  • ps -ef: show all processes in full format
  • ps -u username: show processes for a specific user

View running processes

# Show all processes
$ ps aux

# Show processes for current user
$ ps -u $USER

# Find a specific process
$ ps aux | grep plink

top and htop

top displays real-time information about running processes, CPU usage, and memory.

  • top: built-in process monitor
  • htop: enhanced version (may need installation: sudo apt install htop)

Common operations in top:

  • q: quit
  • k: kill a process (enter PID)
  • M: sort by memory usage
  • P: sort by CPU usage

Monitor system resources

$ top
# or
$ htop

Running processes in the background

For long-running analyses, you can run processes in the background.

  • &: run command in background
  • nohup: run command immune to hangups (continues after logout)
  • jobs: list background jobs
  • fg: bring background job to foreground
  • bg: resume suspended job in background

Run processes in background

# Run a command in background
$ plink --file data --out results &
[1] 12345  # job number and process ID

# Run with nohup (survives terminal closure)
$ nohup plink --file data --out results > plink.log 2>&1 &

# Check background jobs
$ jobs

# Bring job to foreground
$ fg %1

# Check if process is still running
$ ps aux | grep plink

Best practices for long analyses

  • Use nohup for analyses that take hours or days
  • Redirect output to log files: nohup command > output.log 2>&1 &
  • Check process status periodically with ps or top
  • Monitor disk space to ensure you have enough storage
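
These practices can be tried out safely with a stand-in command. In the sketch below, sleep followed by echo plays the role of a long analysis (it is not a real plink run), and the file name analysis.log is made up. It also uses $! and kill -0, two shell features not introduced above, to keep track of the background process.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# sleep + echo stands in for a long analysis; all output goes to a log file
nohup sh -c 'sleep 1; echo "analysis finished"' > analysis.log 2>&1 &

# $! holds the process ID of the most recent background command
pid=$!
echo "started process $pid"

# kill -0 sends no signal; it only checks whether the process still exists
if kill -0 "$pid" 2>/dev/null; then
    was_running="yes"
else
    was_running="no"
fi
echo "running: $was_running"

# wait blocks until the background process has finished
wait "$pid"
last_line=$(tail -n 1 analysis.log)
echo "$last_line"
```

When started from an interactive terminal, nohup may add its own notice at the top of the log file, which is why the sketch reads only the last line.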

Others

There is a group of very handy and flexible commands which will greatly improve your efficiency. These include |, >, >>, *, ., .., ~, and -.

| (pipe)

A pipe passes the output of the previous command to the next command as input, instead of printing it in the terminal. Using pipes, you can do very complicated manipulations of files.

An example of Pipe

cat sumstats.txt | sort | uniq | wc
This means (1) print sumstats, (2) sort the output, (3) then keep the unique lines and finally (4) count the lines, words, and bytes.

>

> redirects output to a new file (if the file already exists, it will be overwritten)

Redirects the output of cat sumstats.txt | sort | uniq | wc to count.txt

cat sumstats.txt | sort | uniq | wc > count.txt

>>

>> redirects output to a file by appending to the end of the file (if the file already exists, its existing content is kept)

Redirects the output of cat sumstats.txt | sort | uniq | wc to count.txt by appending

cat sumstats.txt | sort | uniq | wc >> count.txt
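
The difference between > and >> is easiest to see on a small file. A minimal sketch, using a hypothetical log.txt in a temporary directory:

```shell
workdir=$(mktemp -d)
cd "$workdir"

echo "first run" > log.txt     # creates log.txt
echo "second run" > log.txt    # overwrites: only "second run" remains
overwritten=$(cat log.txt)

echo "third run" >> log.txt    # appends: the file now has two lines
line_count=$(wc -l < log.txt)

echo "$overwritten"
cat log.txt
```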

Other useful commands include :

Command Description Example Code Example code meaning
* represent zero or more characters - -
? represent a single character - -
. the current directory - -
.. the parent directory of the current directory. cd .. change to the parent directory of the current directory
~ the home directory cd ~ change to the current user's home directory
- the last directory you are working in. cd - change to the last directory you are working in.

Wildcards

The asterisk * and the question mark ? are called wildcard characters or wildcards in Linux, which are special symbols that can represent other normal characters. Wildcards are especially useful when handling multiple files with similar pattern in their names.
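
A quick way to see what a wildcard pattern expands to is to pass it to echo before using it with a command that changes files. The sketch below uses invented file names in a temporary directory.

```shell
workdir=$(mktemp -d)
cd "$workdir"
touch chr1.txt chr2.txt chr10.txt chr1.log

# * matches zero or more characters: the three names starting with chr1
star_matches=$(echo chr1*)

# ? matches exactly one character: chr1.txt and chr2.txt, but not chr10.txt
question_matches=$(echo chr?.txt)

echo "$star_matches"
echo "$question_matches"

# Preview what a pattern expands to before handing it to rm
echo rm *.log
```

The last line only prints the rm command that would be run; nothing is deleted.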

which

which shows the full path of a command executable. This is useful to find where a program is installed or to check if a command is available in your PATH.

Find the location of a command

# Find where plink is located
$ which plink
/usr/local/bin/plink

# Find where python is located
$ which python
/usr/bin/python

# Check if a command exists
$ which nonexistent_command
# (no output means command not found)

which vs whereis

  • which: shows the executable that will be run (first in PATH)
  • whereis: shows all locations (binary, source, manual pages)

md5sum and checksums

Checksums are used to verify file integrity, which is crucial when downloading or transferring large genomic files. md5sum calculates the MD5 hash of a file.

Common checksum commands:

Command Description Key Difference
md5sum Calculate MD5 checksum Fast, 128-bit hash, but has known collision vulnerabilities. Good for quick integrity checks.
sha256sum Calculate SHA-256 checksum More secure, 256-bit hash, slower than MD5 but recommended for security-sensitive applications.
sha1sum Calculate SHA-1 checksum 160-bit hash, faster than SHA-256 but also has known vulnerabilities. Less commonly used now.

Calculate and verify checksums

# Calculate MD5 checksum of a file
$ md5sum sumstats.txt
<32-character hexadecimal MD5 hash>  sumstats.txt

# Save checksum to a file
$ md5sum sumstats.txt > sumstats.txt.md5

# Verify file integrity using saved checksum
$ md5sum -c sumstats.txt.md5
sumstats.txt: OK

# Calculate SHA-256 checksum (more secure)
$ sha256sum sumstats.txt

Verify downloaded files

# Download a file and its checksum
$ wget https://example.com/data.vcf.gz
$ wget https://example.com/data.vcf.gz.md5

# Verify the download
$ md5sum -c data.vcf.gz.md5
data.vcf.gz: OK

Why use checksums?

  • Verify file integrity after download or transfer
  • Detect corruption in large genomic files
  • Ensure files haven't been modified
  • Compare files without comparing entire contents
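
To see what a failed verification looks like, the sketch below records a checksum, confirms that it matches, then changes the file and checks again. The file data.txt and its content are invented; the example assumes sha256sum is installed (it is part of GNU coreutils on most Linux systems).

```shell
workdir=$(mktemp -d)
cd "$workdir"
echo "rs123 0.05" > data.txt

# Record the checksum before "transferring" the file
sha256sum data.txt > data.txt.sha256

# Unchanged file: verification succeeds (exit status 0)
if sha256sum -c data.txt.sha256 > /dev/null 2>&1; then
    before="OK"
else
    before="FAILED"
fi

# Simulate corruption by changing the content
echo "rs123 0.50" > data.txt
if sha256sum -c data.txt.sha256 > /dev/null 2>&1; then
    after="OK"
else
    after="FAILED"
fi

echo "before: $before, after: $after"
```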

Warning

Be extremely careful when you use rm and *. It is disastrous when you mistakenly type rm *

Bash scripts

If you have a lot of commands to run, or if you want to automate some complex manipulations, bash scripts are a good way to address this issue.

We can use vim to create a bash script called hello.sh

A simple example of bash scripts:

Example

hello.sh
#!/bin/bash
echo "Hello, world1"
echo "Hello, world2"

#! is called shebang, which tells the system which interpreter to use to execute the shell script.

Then use chmod to give it permission to execute.

chmod +x hello.sh 

Now we can run the script by ./hello.sh:

./hello.sh
Hello, world1
Hello, world2

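Scripts become more useful once they take arguments and loop over files. The following is a minimal sketch, not part of the original hello.sh example: the script name count_lines.sh and the input files are hypothetical, and the script is written with a here-document (cat > file << 'EOF') instead of an editor so that the whole example can be pasted into a terminal.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf 'a\nb\n' > chr1.txt
printf 'a\nb\nc\n' > chr2.txt

# Write a small script that reports the line count of every file given to it
cat > count_lines.sh << 'EOF'
#!/bin/bash
# "$@" expands to all arguments passed to the script
for file in "$@"; do
    n=$(wc -l < "$file")
    echo "$file: $((n)) lines"
done
EOF

# Running a script with "bash script.sh" works even without chmod +x
script_output=$(bash count_lines.sh chr1.txt chr2.txt)
echo "$script_output"
```

After chmod +x count_lines.sh, the same script could also be started as ./count_lines.sh chr1.txt chr2.txt.
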
Advanced text editing

(optional: awk, sed, cut, sort, join, uniq)

  • cut : cutting out columns from files.
  • sort: sorting the lines of a file.
  • uniq: filter the duplicated lines in a file.
  • join: join two tabular files based on specified keys.
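
As a sketch of how these commands combine in a pipeline, the example below counts the number of variants per chromosome in a tiny, invented tab-separated table (small.txt). With a real summary statistics file, the column number given to cut would need to match the chromosome column.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# A tiny tab-separated table: chromosome, position, P value
printf 'CHROM\tPOS\tP\n2\t100\t0.5\n1\t200\t0.01\n1\t300\t0.2\n2\t400\t0.03\n1\t500\t0.7\n' > small.txt

# tail -n +2 skips the header; cut -f 1 keeps the first column;
# sort groups identical values together so that uniq -c can count them
counts=$(tail -n +2 small.txt | cut -f 1 | sort | uniq -c)
echo "$counts"

# Number of variants on chromosome 1
chr1_count=$(tail -n +2 small.txt | cut -f 1 | grep -c '^1$')
echo "$chr1_count"
```

The sort step matters because uniq only merges identical lines that are adjacent.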

Advanced commands:

Git and Github

Git is a powerful version control software and GitHub is a platform where you can share your code.

Currently you just need to learn git clone, which simply downloads an existing repository.

git clone https://github.com/Cloufield/GWASTutorial.git

You can also check here for more information.

Download

We can use wget [option] [url] command to download files to local machine.

The -O option specifies the file name under which the downloaded file is saved.

Use wget to download the hg19 reference genome from UCSC

# Download hg19 reference genome from UCSC
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz

# Download hg19 reference genome from UCSC and rename it to  my_refgenome.fa.gz
wget -O my_refgenome.fa.gz https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz

Exercise

The questions are generated by Microsoft Bing!

What is the command to list all files and directories in your current working directory?

  • A) ls
  • B) cd
  • C) pwd
  • D) mkdir

What is the command to create a new directory named “test”?

  • A) cd test
  • B) pwd test
  • C) mkdir test
  • D) ls test

What is the command to copy a file named “data.txt” from your current working directory to another directory named “backup”?

  • A) cp data.txt backup/
  • B) mv data.txt backup/
  • C) rm data.txt backup/
  • D) cat data.txt backup/

What is the command to display the first 10 lines of a file named “results.csv”?

  • A) head results.csv
  • B) tail results.csv
  • C) less results.csv
  • D) more results.csv

What is the command to count the number of lines, words, and characters in a file named “report.txt”?

  • A) wc report.txt
  • B) count report.txt
  • C) size report.txt
  • D) stat report.txt

What is the command to search for a pattern in a file named “log.txt” and print only the matching lines?

  • A) grep pattern log.txt
  • B) find pattern log.txt
  • C) locate pattern log.txt
  • D) search pattern log.txt

What is the command to sort the contents of a file named “names.txt” in alphabetical order and save the output to a new file named “sorted_names.txt”?

  • A) sort names.txt > sorted_names.txt
  • B) sort names.txt < sorted_names.txt
  • C) sort names.txt >> sorted_names.txt
  • D) sort names.txt << sorted_names.txt

What is the command to display the difference between two files named “old_version.py” and “new_version.py”?

  • A) diff old_version.py new_version.py
  • B) cmp old_version.py new_version.py
  • C) diffy old_version.py new_version.py
  • D) compare old_version.py new_version.py

What is the command to change the permissions of a file named “script.sh” to make it executable by everyone?

  • A) chmod +x script.sh
  • B) chmod 777 script.sh
  • C) chmod ugo+x script.sh
  • D) All of the above

What is the command to run a program named “program.exe” in the background and redirect its output to a file named “output.log”?

  • A) program.exe & > output.log
  • B) program.exe > output.log &
  • C) program.exe < output.log &
  • D) program.exe & < output.log