AWK
AWK Introduction
'awk' is one of the most powerful text processing tools for tabular text files.
AWK syntax
awk OPTION 'CONDITION {PROCESS}' FILENAME
Some special variables in awk:
$0
: all columns$n
: column n. For example, $1 means the first column. $4 means column 4.NR
: Row number.
Examples
Using the sample sumstats, we will demonstrate some simple but useful one-liners.
# sample sumstats
head ../02_Linux_basics/sumstats.txt
#CHROM POS ID REF ALT A1 TEST OBS_CT OR LOG(OR)_SE Z_STAT P ERRCODE
1 13273 1:13273:G:C G C C ADD 503 0.7461490.282904 -1.03509 0.300628 .
1 14599 1:14599:T:A T A A ADD 503 1.676930.240899 2.14598 0.0318742 .
1 14604 1:14604:A:G A G G ADD 503 1.676930.240899 2.14598 0.0318742 .
1 14930 1:14930:A:G A G G ADD 503 1.643590.242872 2.04585 0.0407708 .
1 69897 1:69897:T:C T C T ADD 503 1.691420.200238 2.62471 0.00867216 .
1 86331 1:86331:A:G A G G ADD 503 1.418870.238055 1.46968 0.141649 .
1 91581 1:91581:G:A G A A ADD 503 0.9313040.123644 -0.575598 0.564887 .
1 122872 1:122872:T:G T G G ADD 503 1.048280.182036 0.259034 0.795609 .
1 135163 1:135163:C:T C T T ADD 503 0.6766660.242611 -1.60989 0.107422 .
Example 1
Select variants on chromosome 2 (keeping the headers)
awk 'NR==1 || $1==2 {print $0}' ../02_Linux_basics/sumstats.txt | head
#CHROM POS ID REF ALT A1 TEST OBS_CT OR LOG(OR)_SE Z_STAT P ERRCODE
2 22398 2:22398:C:T C T T ADD 503 1.287540.161017 1.56962 0.116503 .
2 24839 2:24839:C:T C T T ADD 503 1.318170.179754 1.53679 0.124344 .
2 26844 2:26844:C:T C T T ADD 503 1.3173 0.161302 1.70851 0.0875413 .
2 28786 2:28786:T:C T C C ADD 503 1.3043 0.161184 1.64822 0.0993082 .
2 30091 2:30091:C:G C G G ADD 503 1.3043 0.161184 1.64822 0.0993082 .
2 30762 2:30762:A:G A G A ADD 503 1.099560.158614 0.598369 0.549594 .
2 34503 2:34503:G:T G T T ADD 503 1.323720.179789 1.55988 0.118789 .
2 39340 2:39340:A:G A G G ADD 503 1.3043 0.161184 1.64822 0.0993082 .
2 55237 2:55237:T:C T C C ADD 503 1.314860.161988 1.68983 0.0910614 .
The NR
here means row number. The condition here NR==1 || $1==2
means if it is the first row or the first column is equal to 2, conduct the process print $0
, which mean print all columns.
Example 2
Select all genome-wide significant variants (p<5e-8)
awk 'NR==1 || $13 <5e-8 {print $0}' ../02_Linux_basics/sumstats.txt | head
#CHROM POS ID REF ALT A1 TEST OBS_CT OR LOG(OR)_SE Z_STAT P ERRCODE
1 13273 1:13273:G:C G C C ADD 503 0.7461490.282904 -1.03509 0.300628 .
1 14599 1:14599:T:A T A A ADD 503 1.676930.240899 2.14598 0.0318742 .
1 14604 1:14604:A:G A G G ADD 503 1.676930.240899 2.14598 0.0318742 .
1 14930 1:14930:A:G A G G ADD 503 1.643590.242872 2.04585 0.0407708 .
1 69897 1:69897:T:C T C T ADD 503 1.691420.200238 2.62471 0.00867216 .
1 86331 1:86331:A:G A G G ADD 503 1.418870.238055 1.46968 0.141649 .
1 91581 1:91581:G:A G A A ADD 503 0.9313040.123644 -0.575598 0.564887 .
1 122872 1:122872:T:G T G G ADD 503 1.048280.182036 0.259034 0.795609 .
1 135163 1:135163:C:T C T T ADD 503 0.6766660.242611 -1.60989 0.107422 .
Example 3
Create a bed-like format for annotation
awk 'NR>1 {print $1,$2,$2,$4,$5}' ../02_Linux_basics/sumstats.txt | head
1 13273 13273 G C
1 14599 14599 T A
1 14604 14604 A G
1 14930 14930 A G
1 69897 69897 T C
1 86331 86331 A G
1 91581 91581 G A
1 122872 122872 T G
1 135163 135163 C T
1 233473 233473 C G
AWK workflow
The workflow of awk can be summarized in the following figure:
awk workflow
AWK variables
Frequently used awk variables
Variable | Desciption |
---|---|
NR | The number of input records |
NF | The number of input fields |
FS | The input field separator. The default value is " " |
OFS | The output field separator. The default value is " " |
RS | The input record separator. The default value is "\n" |
ORS | The output record separator.The default value is "\n" |
FILENAME | The name of the current input file. |
FNR | The current record number in the current file |
Handle csv and tsv files
head ../03_Data_formats/sample_data.csv
#CHROM,POS,ID,REF,ALT,A1,FIRTH?,TEST,OBS_CT,OR,LOG(OR)_SE,Z_STAT,P,ERRCODE
1,13273,1:13273:G:C,G,C,C,N,ADD,503,0.750168,0.280794,-1.02373,0.305961,.
1,14599,1:14599:T:A,T,A,A,N,ADD,503,1.80972,0.231595,2.56124,0.0104299,.
1,14604,1:14604:A:G,A,G,G,N,ADD,503,1.80972,0.231595,2.56124,0.0104299,.
1,14930,1:14930:A:G,A,G,G,N,ADD,503,1.70139,0.240245,2.21209,0.0269602,.
1,69897,1:69897:T:C,T,C,T,N,ADD,503,1.58002,0.194774,2.34855,0.0188466,.
1,86331,1:86331:A:G,A,G,G,N,ADD,503,1.47006,0.236102,1.63193,0.102694,.
1,91581,1:91581:G:A,G,A,A,N,ADD,503,0.924422,0.122991,-0.638963,0.522847,.
1,122872,1:122872:T:G,T,G,G,N,ADD,503,1.07113,0.180776,0.380121,0.703856,.
1,135163,1:135163:C:T,C,T,T,N,ADD,503,0.711822,0.23908,-1.42182,0.155079,.
awk -v FS=',' -v OFS="\t" '{print $1,$2}' sample_data.csv
#CHROM POS
1 13273
1 14599
1 14604
1 14930
1 69897
1 86331
1 91581
1 122872
1 135163
convert csv to tsv
awk 'BEGIN { FS=","; OFS="\t" } {$1=$1; print}' sample_data.csv
Skip and replace headers
awk -v FS=',' -v OFS="\t" 'BEGIN{print "CHR\tPOS"} NR>1 {print $1,$2}' sample_data.csv
CHR POS
1 13273
1 14599
1 14604
1 14930
1 69897
1 86331
1 91581
1 122872
1 135163
Extract a line
awk 'NR==4' sample_data.csv
1,14604,1:14604:A:G,A,G,G,N,ADD,503,1.80972,0.231595,2.56124,0.0104299,.
Print the last two columns
awk -v FS=',' '{print $(NF-1),$(NF)}' sample_data.csv
P ERRCODE
0.305961 .
0.0104299 .
0.0104299 .
0.0269602 .
0.0188466 .
0.102694 .
0.522847 .
0.703856 .
0.155079 .
AWK operators
Arithmetic Operators
Arithmetic Operators | Desciption |
---|---|
+ |
add |
- |
subtract |
* |
multiply |
\ |
divide |
% |
modulus division |
** |
x**y : x raised to the y-th power |
Logical Operators
Logical Operators | Desciption |
---|---|
\|\| |
or |
&& |
and |
! |
not |
AWK functions
Numeric functions in awk
- int(x) : truncate x to integer
- log(x) : the natural logarithm of x
- exp(x) : natural exponential function
- sqrt(x) : square root of x
Convert OR and P to BETA and -log10(P)
awk -v FS=',' -v OFS="\t" 'BEGIN{print "SNPID\tBETA\tMLOG10P"}NR>1{print $3,log($10),-log($13)/log(10)}' sample_data.csv
SNPID BETA MLOG10P
1:13273:G:C -0.287458 0.514334
1:14599:T:A 0.593172 1.98172
1:14604:A:G 0.593172 1.98172
1:14930:A:G 0.531446 1.56928
1:69897:T:C 0.457438 1.72477
1:86331:A:G 0.385303 0.988455
1:91581:G:A -0.0785866 0.281625
1:122872:T:G 0.0687142 0.152516
1:135163:C:T -0.339927 0.809447
String manipulating functions in awk
- length([string])
- split(string, array [, fieldsep [, seps ] ])
- sub(regexp, replacement [, target])
- gsub(regexp, replacement [, target])
- substr(string, start [, length ])
- tolower(string)
- toupper(string)
AWK options
$ awk --help
Usage: awk [POSIX or GNU style options] -f progfile [--] file ...
Usage: awk [POSIX or GNU style options] [--] 'program' file ...
POSIX options: GNU long options: (standard)
-f progfile --file=progfile
-F fs --field-separator=fs
-v var=val --assign=var=val
Short options: GNU long options: (extensions)
-b --characters-as-bytes
-c --traditional
-C --copyright
-d[file] --dump-variables[=file]
-D[file] --debug[=file]
-e 'program-text' --source='program-text'
-E file --exec=file
-g --gen-pot
-h --help
-i includefile --include=includefile
-l library --load=library
-L[fatal|invalid] --lint[=fatal|invalid]
-M --bignum
-N --use-lc-numeric
-n --non-decimal-data
-o[file] --pretty-print[=file]
-O --optimize
-p[file] --profile[=file]
-P --posix
-r --re-interval
-S --sandbox
-t --lint-old
-V --version
To report bugs, see node `Bugs' in `gawk.info', which is
section `Reporting Problems and Bugs' in the printed version.
gawk is a pattern scanning and processing language.
By default it reads standard input and writes standard output.
Examples:
gawk '{ sum += $1 }; END { print sum }' file
gawk -F: '{ print $1 }' /etc/passwd
Reference
- https://www.gnu.org/software/gawk/manual/gawk.html