grep, sed and awk are powerful Linux tools that every developer should know. They provide flexible and robust text processing. Their benefit is greatest when working with large files: not 30,000 lines, but millions of lines, which most interactive text editors cannot handle efficiently. I will take a pragmatic look at each of them to get you started.
First of all, the number of lines in a file can be determined as follows:
wc -l filename
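As a small sketch of the two common ways to call it, the following uses a throwaway file (the name sample.txt and its contents are made up for illustration). Reading from STDIN makes wc print only the number, which is convenient inside scripts:

```shell
# Create a three-line sample file in a temporary directory (hypothetical data).
tmpdir=$(mktemp -d)
printf 'one\ntwo\nthree\n' > "$tmpdir/sample.txt"

# With a filename argument, wc prints the count followed by the file name.
wc -l "$tmpdir/sample.txt"

# Reading from STDIN prints only the count.
line_count=$(wc -l < "$tmpdir/sample.txt")
echo "lines: $line_count"
```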
grep
grep stands for "global regular expression print", after the ed command g/re/p. It is a text search utility that searches a file or STDIN for a given regular expression and prints the matching lines to STDOUT. The basic syntax looks as follows:
grep [options] regex [filename]
Here are some examples of using grep:
# Show grep version
grep --version
# Print lines containing sys
grep sys /etc/passwd
# Print lines containing SYS, case insensitive
grep -i SYS /etc/passwd
# Count lines containing sys
grep -c sys /etc/passwd
# Print lines containing sys among the last 10 lines of the file
tail -n 10 /etc/passwd | grep sys
# Print lines matching a regex (put the regex in single quotes)
grep '^[a-z]' /etc/passwd
# Print non-empty lines
grep -v '^$' /etc/passwd
# Print the matching line and the 2 lines after it
# Use -B for lines before the match, -C for lines before and after
grep -A2 proxy /etc/passwd
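To try these options without depending on the contents of a real /etc/passwd, here is a minimal sketch that runs a few of them against a small passwd-like file. The entries and the variable names (sample, match_count, context) are made up for illustration:

```shell
# Build a small passwd-like sample file (made-up entries, not real accounts).
sample=$(mktemp)
cat > "$sample" <<'EOF'
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
sys:x:3:3:sys:/dev:/usr/sbin/nologin
proxy:x:13:13:proxy:/bin:/usr/sbin/nologin
syslog:x:104:110::/home/syslog:/usr/sbin/nologin
EOF

# Count matching lines, ignoring case: the "sys" and "syslog" entries match.
match_count=$(grep -c -i SYS "$sample")
echo "matches: $match_count"

# Invert the match: print the lines that do not contain "nologin".
grep -v nologin "$sample"

# Show the matching line plus one line of trailing context.
context=$(grep -A1 '^proxy' "$sample")
echo "$context"
```

Note that a plain pattern such as sys also matches syslog, since grep looks for the pattern anywhere in the line.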
Working with grep requires basic knowledge of regular expressions, so a short summary follows. Note that grep needs the -E option for ?, + and {n} to act as quantifiers, and that \s, \S, \b and \B are GNU extensions:
# Anchors
^ Start of line
$ End of line
# Ranges
[A-Za-z] any letter
[0-9] any digit
[349] matches 3, 4 or 9
[^5] any character except 5 (negation)
# Character classes and boundaries
\s whitespace
\S non-whitespace
\b word boundary
\B non-word boundary
# Quantifiers
* zero or more times
? zero or one time
+ one or more times
{n} exactly n times
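The quantifiers in this summary behave differently depending on the regex flavour grep uses. The following is a minimal sketch, assuming a made-up list of tokens, that contrasts the default basic syntax with the extended syntax enabled by -E:

```shell
# Sample input, one token per line (made-up data).
sample=$(mktemp)
printf '%s\n' 'ab' 'aab' 'b' 'a+b' '2024' '20245' > "$sample"

# Basic regex (the default): + is a literal plus sign, so only "a+b" matches.
basic=$(grep 'a+b' "$sample")
echo "basic: $basic"

# Extended regex (-E): + means "one or more", so "ab" and "aab" match.
extended=$(grep -E '^a+b$' "$sample")
echo "$extended"

# {n} combined with anchors: lines consisting of exactly four digits.
four_digits=$(grep -E '^[0-9]{4}$' "$sample")
echo "four digits: $four_digits"
```

Without the anchors, the last pattern would also match 20245, because four consecutive digits occur inside it.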
sed
sed (stream editor) is a command-line text editor. It is one of the "veterans" of the Linux world and is included in virtually every Linux installation. It performs common text editing tasks such as printing, substituting, inserting, deleting and appending lines. The basic syntax looks as follows:
sed [options] sed-script [filename]
Here are some examples of using sed:
# Print every line (each line appears twice: once from the p command, once from sed's default output)
sed 'p' /etc/passwd
# Print every line once; -n suppresses the default output
sed -n 'p' /etc/passwd
# Print lines 1 to 5
sed -n '1,5 p' /etc/passwd
# Print lines matching a regex
sed -n '/^root/ p' /etc/passwd
# Substitute the first bin on each line with binary
# An optional address range can be specified in front
sed 's/bin/binary/' /etc/passwd
# Substitute every bin on each line with binary
sed 's/bin/binary/g' /etc/passwd
# Substitute /bin/bash with /bin/sh
# As the search and replacement strings contain /, another delimiter can be chosen, here @
sed 's@/bin/bash@/bin/sh@' /etc/passwd
# Substitute and print only the changed lines
sed -n 's/bin/binary/p' /etc/passwd
# Write the changes to the file and back up the original file with the -i option
sed -i.bak 's/bin/binary/' /etc/passwd
# Insert a line before the line starting with 'root'
sed '/^root/ i line to be inserted' /etc/passwd
# Append a line after the line starting with 'root'
sed '/^root/ a line to be appended' /etc/passwd
# Delete the line starting with 'root'
sed '/^root/ d' /etc/passwd
# Multiple sed commands in one script
sed '{
/^root/ i line to be inserted
/^root/ a line to be appended
/^root/ d
}' /etc/passwd
# Using a sed script file
sed -f myscript.sed /etc/passwd
# Uppercase 1st column, lowercase 2nd column in a comma-separated file using capture groups
# The group \([^,]*\) captures a run of characters other than a comma; \U and \L are GNU sed extensions
sed 's/\([^,]*\),\([^,]*\)/\U\1,\L\2/' file.csv
# Substitute and then execute the resulting line as a command (e flag, GNU sed)
sed 's/^/sudo useradd /e' user.list
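The difference between substituting with and without the g flag is easiest to see on a line that contains the search string several times. Here is a minimal sketch on a made-up passwd-like file, restricted to commands that do not modify any file (the variable names are illustrative only):

```shell
# Build a small passwd-like sample file (made-up entries).
sample=$(mktemp)
cat > "$sample" <<'EOF'
root:x:0:0:root:/root:/bin/bash
bin:x:2:2:bin:/bin:/usr/sbin/nologin
alice:x:1000:1000::/home/alice:/bin/bash
EOF

# Without g, only the first "bin" on line 2 is replaced.
first_only=$(sed -n '2 s/bin/binary/p' "$sample")
echo "$first_only"

# With g, every "bin" on line 2 is replaced, including the one inside "sbin".
all_matches=$(sed -n '2 s/bin/binary/gp' "$sample")
echo "$all_matches"

# The @ delimiter avoids escaping slashes; -n with p prints only changed lines.
shells=$(sed -n 's@/bin/bash$@/bin/sh@p' "$sample")
echo "$shells"

# Delete the line starting with "root" and print the rest.
without_root=$(sed '/^root/ d' "$sample")
echo "$without_root"
```

The second command turns /usr/sbin into /usr/sbinary, a reminder that sed replaces substrings and not whole words unless the regex says so.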
awk
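awk is a scripting language for editing and analyzing text. Input data is processed record by record, which by default means line by line. The name awk comes from the initials of its developers: Aho, Weinberger and Kernighan. The basic syntax looks as follows: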
awk [options] awk-script filename
awk comes with a number of built-in variables:
- FS: field separator
- OFS: output field separator
- RS: record separator
- ORS: output record separator
- NR: number of records read so far (the total once the END block runs)
- NF: number of fields in record
- FILENAME: name of file being read
- FNR: number of records read so far in the current file
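A minimal sketch of how these variables interact, assuming a made-up two-line passwd-like file (the variable names sample, report and counters are illustrative only):

```shell
# Build a two-line passwd-like sample file (made-up entries).
sample=$(mktemp)
cat > "$sample" <<'EOF'
root:x:0:0:root:/root:/bin/bash
alice:x:1000:1000::/home/alice:/bin/bash
EOF

# FS splits the input on ":", OFS joins the printed fields with " | ".
# NR is the running record number, NF the number of fields, $NF the last field.
report=$(awk 'BEGIN { FS = ":"; OFS = " | " }
              { print NR, NF, $1, $NF }
              END { print "records read: " NR }' "$sample")
echo "$report"

# Reading the same file twice shows NR counting on while FNR restarts at 1.
counters=$(awk '{ print NR, FNR }' "$sample" "$sample")
echo "$counters"
```

The empty fifth field in the second line still counts as a field, so NF is 7 for both records.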
Here are some examples of using awk:
# Execute an awk script file
awk -f users.awk /etc/passwd
# Print a specific column and show the total number of lines processed
BEGIN { FS=":" ; print "Username"}
{print $1}
END {print "Total users= " NR}
# Print the column where the line meets a criterion
BEGIN { FS=":" ; print "Username"}
$3>499 {print $1}
# Print and count lines beginning with 'root', then print the count
BEGIN { FS=":" ; print "Username"}
/^root/ {print $1 ; count++}
END {print "Total users= " count}
# Uppercase 1st column, lowercase 2nd column in a comma-separated file
# Compare with the equivalent sed command above; this is much easier
awk -F"," '{print toupper($1), tolower($2), $3}' file.csv
# Extract XML records which are separated by two newlines (contents of xml.awk, then the call)
BEGIN { RS="\n\n"}
$0 ~ search {print}
awk -f xml.awk search=example xmlfile
# Count the occurrences of a specific element
BEGIN { FS=" "; print "Log access"}
{ip[$1]++} # the value of $1 is the key (associative array)
END { for (i in ip)
    print i, " has accessed ", ip[i], "times."
}
# Print the element with the highest count
BEGIN { FS=" "; print "Most popular browser"}
{browser[$1]++}
END { for (b in browser)
    if (max < browser[b]) {
        max = browser[b];
        maxbrowser = b;
    }
    print "Most access was from ", maxbrowser, " and ", max, " times."
}
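The two counting scripts can be tried on a small input. The following is a minimal sketch, assuming a made-up access log whose first field is the client IP. Because the iteration order of for (i in ip) is unspecified, the ranking is piped through sort:

```shell
# Build a tiny access log (made-up data; the first field is the client IP).
log=$(mktemp)
cat > "$log" <<'EOF'
10.0.0.1 GET /index.html
10.0.0.2 GET /about.html
10.0.0.1 GET /contact.html
10.0.0.3 GET /index.html
10.0.0.1 GET /index.html
EOF

# Count requests per IP, then sort numerically, highest count first.
ranking=$(awk '{ ip[$1]++ }
               END { for (i in ip) print ip[i], i }' "$log" | sort -rn)
echo "$ranking"

# Track the maximum directly inside awk instead of sorting.
top=$(awk '{ ip[$1]++ }
           END { for (i in ip) if (ip[i] > max) { max = ip[i]; top = i }
                 print top, max }' "$log")
echo "$top"
```

Sorting outside awk gives the full ranking, while the in-awk maximum avoids a second process when only the top entry matters.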