How to Split a File of Strings with Awk

29/12/2020
The Linux awk command (named after its developers: Aho, Weinberger, and Kernighan) is a great way to process and analyze a file of strings. Awk is most useful when the file is organized into rows and columns. You can then use awk on such files to:

  • Scan the files, line by line.
  • Split each line into fields/columns.
  • Specify patterns and compare the lines of the file to those patterns.
  • Perform various actions on the lines that match a given pattern.

In this article, we will explain the basic usage of the awk command and how it can be used to split a file of strings. We ran the examples in this article on a Debian 10 Buster system, but they can be easily replicated on most Linux distros.

The sample file we will be using

The examples below use a sample file of strings named sample_file.txt, organized in four columns.

This is what each column of the sample file indicates:

  • The first column contains the name of employees/teachers in a school
  • The second column contains the subject that the employee teaches
  • The third column indicates whether the employee is a professor or assistant professor
  • The fourth column contains the pay of the employee
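For readers who want to follow along, here is a minimal sketch that builds a stand-in for the sample file. The names, subjects, and pay figures are invented for illustration and are not the original data. The rank is written as a single word (assistant_professor) so that every line splits into exactly four whitespace-separated fields. The file is created in a temporary directory, which is also an assumption of this sketch.

```shell
# Hypothetical stand-in for sample_file.txt; the names and pay figures are invented.
workdir=$(mktemp -d)
sample="$workdir/sample_file.txt"

cat > "$sample" <<'EOF'
Aiden   english  professor            9000
Bella   maths    assistant_professor  6500
Carlos  english  assistant_professor  6000
Dana    physics  professor            9500
EOF

# Check that every line splits into the same number of whitespace-separated fields.
field_counts=$(awk '{ print NF }' "$sample" | sort -u)
echo "Fields per line: $field_counts"
```

If the layout is consistent, this prints a single value, here 4.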

Example 1: Use awk to print all lines of a file

When no pattern is given, awk applies its action to every line of the specified file. In the following syntax, we do not specify any pattern for awk to match, so the command applies the "print" action to all the lines of the file.

Syntax:

$ awk '{print}' filename.txt

Example:

In this example, I am telling the awk command to print the contents of my sample file, line by line.

$ awk '{print}' sample_file.txt

Example 2: Use awk to print only the lines that match a given pattern

With awk, you can specify a pattern and the command will print only the lines matching that pattern.

Syntax:

$ awk '/pattern_to_be_matched/ {print}' filename.txt

Example:

From the sample file, if I want to print only the line(s) that contain the character 'B', I can use the following command:

$ awk '/B/ {print}' sample_file.txt

To make the example more meaningful, let me print only the information about employees who are professors.

$ awk '/professor/ {print}' sample_file.txt

The command prints only the lines/entries that contain the string "professor", so we derive more valuable information from the data. The match is case-sensitive and applies to the whole line, so any line that contains the string anywhere is printed.

Example 3: Use awk to split the file so that only specific fields/columns are printed

Instead of printing the whole file, you can make awk print only specific columns. By default, awk treats each line as a record and each word separated by white space as a field (column). The fields are available as $N variables: $1 holds the first word, $2 the second, $3 the third, and so on. $0 holds the whole line, which is why the whole line is printed in Example 1.

Syntax:

$ awk '{print $N, ...}' filename.txt

Example:

The following command will print only the first column (name) and the second column (subject) of my sample file:

$ awk '{print $1, $2}' sample_file.txt
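The field variables can be tried on a single line without any file. The sketch below pipes one invented record into awk; the record text and the shell variable names are assumptions made for illustration.

```shell
# Hypothetical one-line record standing in for a row of the sample file.
record='Aiden english professor 9000'

# $1 is the first field, $3 the third, and $0 the whole line.
first=$(printf '%s\n' "$record" | awk '{ print $1 }')
third=$(printf '%s\n' "$record" | awk '{ print $3 }')
whole=$(printf '%s\n' "$record" | awk '{ print $0 }')

# A comma in print joins the fields with a single space (the default output separator).
name_and_pay=$(printf '%s\n' "$record" | awk '{ print $1, $4 }')

printf '%s\n' "$first" "$third" "$whole" "$name_and_pay"
```

The last line printed is "Aiden 9000": the comma between $1 and $4 is what inserts the space.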

Example 4: Use awk to count and print the number of lines in which a pattern is matched

You can tell awk to count the number of lines in which a specified pattern is matched and then output that count.

Syntax:

$ awk '/pattern_to_be_matched/{++cnt} END {print "Count = ", cnt}' filename.txt

Example:

In this example, I want to count the number of persons teaching the subject "english". Therefore, I will tell the awk command to match the pattern "english" and print the number of lines in which this pattern is matched.

$ awk '/english/{++cnt} END {print "Count = ", cnt}' sample_file.txt

According to the sample file records, the count shows that 2 people teach English.
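One detail worth knowing: if no line matches, the counter is never set and prints as an empty string. A small sketch, using invented data and a hypothetical "history" pattern, shows how adding 0 to the counter forces a numeric result:

```shell
# Hypothetical data written to a temporary file; the rows are invented.
workdir=$(mktemp -d)
sample="$workdir/sample_file.txt"
printf '%s\n' \
  'Aiden english professor 9000' \
  'Bella maths assistant_professor 6500' \
  'Carlos english assistant_professor 6000' > "$sample"

# Count the lines that contain "english".
english_count=$(awk '/english/ { ++cnt } END { print "Count =", cnt }' "$sample")

# No line contains "history", so cnt is never set; cnt + 0 turns the empty value into 0.
history_count=$(awk '/history/ { ++cnt } END { print "Count =", cnt + 0 }' "$sample")

echo "$english_count"
echo "$history_count"
```

With this data, the two lines printed are "Count = 2" and "Count = 0".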

Example 5: Use awk to print only lines with more than a specific number of characters

For this task, we will use the built-in awk function called "length". This function returns the length of the input string. Thus, if we want awk to print only the lines with more than, or fewer than, a given number of characters, we can use the length function in the following manner:

For printing lines with more than n characters:

$ awk 'length($0) > n' filename.txt

For printing lines with fewer than n characters:

$ awk 'length($0) < n' filename.txt

Here, n is the number of characters you want to use as the threshold; replace it with an actual number.

Example:

The following command will print only the lines from my sample file that are longer than 30 characters:

$ awk 'length($0) > 30' sample_file.txt
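Instead of typing the number into the program text, the threshold can be passed in with awk's -v option. The following sketch uses two invented lines and assumes a threshold of 30:

```shell
# Two hypothetical lines: the first is 10 characters long, the second 34.
lines=$(printf '%s\n' 'short line' 'a considerably longer line of text')

# Print the length of each line.
lengths=$(printf '%s\n' "$lines" | awk '{ print length($0) }')

# Pass the threshold as an awk variable with -v rather than hard-coding it.
long_lines=$(printf '%s\n' "$lines" | awk -v n=30 'length($0) > n')

printf '%s\n' "$lengths"
printf '%s\n' "$long_lines"
```

Only the second line passes the filter, since 34 is greater than 30.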

Example 6: Use awk to save the command output to another file

By using the redirection operator '>', you can send the output of the awk command to another file. This is how you can use it:

$ awk 'criteria_to_print' filename.txt > outputfile.txt

Example:

In this example, I will use the redirection operator with my awk command to print only the names of the employees (column 1) to a new file:

$ awk '{print $1}' sample_file.txt > employee_names.txt

I verified with the cat command that the new file contains only the names of the employees.
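The whole round trip, including the verification step, can be sketched as follows. The data rows are invented, and the files live in a temporary directory so that nothing in the current directory is overwritten.

```shell
# Hypothetical data in a temporary directory; the rows are invented.
workdir=$(mktemp -d)
printf '%s\n' \
  'Aiden english professor 9000' \
  'Bella maths assistant_professor 6500' \
  'Carlos english assistant_professor 6000' > "$workdir/sample_file.txt"

# Write only the first column to a new file. Note that > replaces any existing file.
awk '{ print $1 }' "$workdir/sample_file.txt" > "$workdir/employee_names.txt"

# Verify the result.
cat "$workdir/employee_names.txt"
```

Because '>' truncates the target file, use the shell's '>>' operator instead when the output should be appended to an existing file.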

Example 7: Use awk to print only non-empty lines from a file

Awk has some built-in variables that you can use to filter the output. For example, the NF variable holds the number of fields in the current input record. Here, we will use NF to print only the non-empty lines of the file:

$ awk 'NF > 0' sample_file.txt

Conversely, you can use the following command to print only the empty lines, which have zero fields:

$ awk 'NF == 0' sample_file.txt
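Since NF counts fields, a line that contains only spaces also has NF equal to 0 and is treated as empty by this test. A short sketch on invented input illustrates both filters:

```shell
# Hypothetical input: three text lines, one empty line, and one whitespace-only line.
input=$(printf 'first\n\nsecond\n   \nthird')

# NF > 0 keeps the lines that have at least one field.
non_empty=$(printf '%s\n' "$input" | awk 'NF > 0')

# NF == 0 selects the lines with no fields; whitespace-only lines are included.
empty_count=$(printf '%s\n' "$input" | awk 'NF == 0 { ++n } END { print n + 0 }')

printf '%s\n' "$non_empty"
echo "Empty lines: $empty_count"
```

Here the filter keeps first, second, and third, and the count of empty lines is 2.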

Example 8: Use awk to count the total lines in a file

Another built-in variable, NR, keeps a count of the input records (usually lines) read so far. In an END block it holds the total, so you can use it as follows to count the number of lines in a file:

$ awk 'END { print NR }' sample_file.txt
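NR is also available while the lines are being read, where it acts as the current line number. The sketch below, on three invented lines, shows both uses:

```shell
# Three hypothetical input lines.
lines=$(printf '%s\n' 'alpha' 'beta' 'gamma')

# In the END block, NR holds the total number of records read.
total=$(printf '%s\n' "$lines" | awk 'END { print NR }')

# In the main block, NR is the number of the current record.
numbered=$(printf '%s\n' "$lines" | awk '{ print NR ": " $0 }')

echo "Total lines: $total"
printf '%s\n' "$numbered"
```

This reports a total of 3 and prefixes each line with its number, for example "1: alpha".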

This is the basic information you need to start splitting files with the awk command. You can combine these examples to fetch more meaningful information from your file of strings.
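As one possible combination, the sketch below tests a single field, prints selected columns, and reports a count in the same command. The data is invented, and comparing $3 exactly is a choice made for this sketch: with these rows, a plain /professor/ pattern would also match the assistant_professor lines.

```shell
# Hypothetical stand-in data; the names and pay figures are invented.
workdir=$(mktemp -d)
sample="$workdir/sample_file.txt"
cat > "$sample" <<'EOF'
Aiden   english  professor            9000
Bella   maths    assistant_professor  6500
Carlos  english  assistant_professor  6000
Dana    physics  professor            9500
EOF

# Match on the third field only, print name and pay, and count the matches.
summary=$(awk '$3 == "professor" { print $1, $4; ++cnt }
               END { print "Professors:", cnt + 0, "of", NR }' "$sample")

printf '%s\n' "$summary"
```

For this data, the command lists Aiden and Dana with their pay and ends with "Professors: 2 of 4".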
