
How to read a file line by line

This article introduces the concept of reading a file line by line in Linux, with examples and tips, along with a guided tour of initiating a loop. It discusses the errors commonly committed while reading a file line by line on the Linux platform. With samples and illustrations, it shows how the 'for loop' and the 'while loop' differ in their respective outputs. It also provides tips on how to use the while loop and shows its syntax. It concludes with the process behind initiating a loop, along with the side effects the while loop can exhibit.

One of the most common errors in bash scripts on GNU/Linux is to read a file line by line with a for loop (for line in $(cat file.txt) do ...), which in this form evaluates every word of the file rather than every line. It is possible to get the expected result with a for loop, provided the value of the $IFS variable (Internal Field Separator) is changed before starting the loop.

Sample output with a for loop:

for line in $(cat file.txt)
do
    echo "$line"
done

This
is
row
No
1
This
is
row
No
2
This
[...]

The solution is to use a while loop coupled with the read builtin.

While loop

The while loop remains the most appropriate and easiest way to read a file line by line.

Syntax

while read line
do
    command
done < file

Example
The starting file:

This is line 1

This is line 2

This is line 3

This is line 4

This is line 5

The instructions on the command line:

while read line; do echo -e "$line\n"; done < file.txt

or in a "bash" script:

#!/bin/bash
while read line
do
    echo -e "$line\n"
done < file.txt

The output on the screen (stdout):

This is line 1
This is line 2
This is line 3
This is line 4
This is line 5

Tips
It is entirely possible, from a structured file (like an address book or /etc/passwd, for example), to retrieve the values of each field and assign them to several variables with the 'read' command. Be careful to set the IFS variable to the proper field separator (a space by default). Example:

#!/bin/bash
while IFS=: read user pass uid gid full home shell
do
    echo -e "$full :\n\
Pseudo : $user\n\
UID :\t $uid\n\
GID :\t $gid\n\
Home :\t $home\n\
Shell :\t $shell\n\n"
done < /etc/passwd

Bonus
while read i; do echo -e "Parameter : $i"; done < <(echo -e "a\nab\nc")

Initiate a Loop
Although the while loop is the easiest method, it has a side effect: read strips the leading and trailing spaces and tabs from each line, so the original formatting is lost.

The for loop coupled with a change of IFS, on the other hand, preserves the structure of the document in the output.

Syntax

old_IFS=$IFS        # save the field separator
IFS=$'\n'           # new field separator: the end of line
for line in $(cat file)
do
    command
done
IFS=$old_IFS        # restore the default field separator
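As an aside (not from the original article), the formatting loss described above can also be avoided while keeping the while loop: clear IFS just for the read call and pass -r so backslashes are left alone. A minimal sketch, assuming bash and a file named file.txt:

#!/bin/bash
# Preserve leading/trailing spaces, tabs and literal backslashes on each line.
while IFS= read -r line
do
    printf '%s\n' "$line"
done < file.txt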

How to Read a File Line by Line in a Shell Script


There are many ways to handle any task on a Unix platform, but some techniques used to process a file waste a lot of CPU time. Most of the wasted time goes into unnecessary variable assignment and continuously opening and closing the same file over and over. Using a pipe also has a negative impact on the timing. In this article I will explain various techniques for parsing a file line by line. Some techniques are very fast and some make you wait for half a day. The techniques used in this article are measurable, and I tested each one with the time command so that you can see which technique suits your needs. I don't explain everything in depth, but if you know basic shell scripting, I hope you can follow easily.

I extracted the last five lines from my /etc/passwd file and stored them in a file named "file_passwd":

[root@www blog]# tail -5 /etc/passwd > file_passwd
[root@www blog]# cat file_passwd
venu:x:500:500:venu madhav:/home/venu:/bin/bash
padmin:x:501:501:Project Admin:/home/project:/bin/bash
king:x:502:503:king:/home/project:/bin/bash
user1:x:503:501::/home/project/:/bin/bash
user2:x:504:501::/home/project/:/bin/bash

I use this file whenever a sample file is required.

Method 1:

PIPED while-read loop

#!/bin/bash
# SCRIPT: method1.sh
# PURPOSE: Process a file line by line with a PIPED while-read loop.

FILENAME=$1
count=0
cat $FILENAME | while read LINE
do
    let count++
    echo "$count $LINE"
done
echo -e "\nTotal $count Lines read"

By catting the file and piping its output into a while-read loop, a single line of text is read into a variable named LINE on each loop iteration. The loop runs until all of the lines in the file have been processed, one at a time. Because the loop sits on the right-hand side of a pipe, bash runs it in a subshell, so any variable set within the loop is lost (unset) outside of the loop. Therefore $count returns 0, its initialized value, after the loop.

Output:

[root@www blog]# sh method1.sh file_passwd
1 venu:x:500:500:venu madhav:/home/venu:/bin/bash
2 padmin:x:501:501:Project Admin:/home/project:/bin/bash
3 king:x:502:503:king:/home/project:/bin/bash
4 user1:x:503:501::/home/project/:/bin/bash
5 user2:x:504:501::/home/project/:/bin/bash

Total 0 Lines read
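As a side note not covered in the original text, newer bash versions (4.2 and later) offer a lastpipe option that runs the last command of a pipeline in the current shell when job control is off, as it normally is in a script; with it, the counter survives the loop. A sketch under those assumptions (run it with bash, not sh):

#!/bin/bash
# Hypothetical variant of method1.sh using shopt -s lastpipe (bash 4.2+).
shopt -s lastpipe

FILENAME=$1
count=0
cat $FILENAME | while read LINE
do
    let count++
    echo "$count $LINE"
done
echo -e "\nTotal $count Lines read"    # count now keeps its final value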

Method 2:

Redirected "while-read" loop

#!/bin/bash
# SCRIPT: method2.sh
# PURPOSE: Process a file line by line with a redirected while-read loop.

FILENAME=$1
count=0
while read LINE
do
    let count++
    echo "$count $LINE"
done < $FILENAME
echo -e "\nTotal $count Lines read"

We still use the while read LINE syntax, but this time we feed the loop from the bottom (using file redirection) instead of using a pipe. You will find that this is one of the fastest ways to process each line of a file. The first time you see it, it looks a little unusual, but it works very well. Unlike method 1, method 2 gives you the total number of lines outside of the loop.

Output:

[root@www blog]# sh method2.sh file_passwd
1 venu:x:500:500:venu madhav:/home/venu:/bin/bash
2 padmin:x:501:501:Project Admin:/home/project:/bin/bash
3 king:x:502:503:king:/home/project:/bin/bash
4 user1:x:503:501::/home/project/:/bin/bash
5 user2:x:504:501::/home/project/:/bin/bash

Total 5 Lines read

Note: In some older shells, the redirected loop would also run in a subshell.

Method 3: while read LINE Using File Descriptors


A file descriptor is simply a number that the operating system assigns to an open file to keep track of it. Consider it a simplified version of a file pointer; it is analogous to a file handle in C. There are always three default "files" open: stdin (the keyboard), stdout (the screen), and stderr (error messages output to the screen). These, and any other open files, can be redirected. Redirection simply means capturing output from a file, command, program, script, or even a code block within a script and sending it as input to another file, command, program, or script.

Each open file gets assigned a file descriptor. The file descriptors for stdin, stdout, and stderr are 0, 1, and 2, respectively. For opening additional files, descriptors 3 to 9 remain (this may vary depending on the OS). It is sometimes useful to assign one of these additional file descriptors to stdin, stdout, or stderr as a temporary duplicate link. This simplifies restoring things to normal after complex redirection and reshuffling.

There are two steps in the method we are going to use. The first step is to save file descriptor 0 by duplicating it to our new file descriptor 3. We use the following syntax for this step:

exec 3<&0

Now the original standard input (normally the keyboard) is also available on file descriptor 3. The second step is to send our input file, specified by the variable $FILENAME, into file descriptor 0 (zero), which is standard input. This second step is done using the following syntax:

exec 0< $FILENAME

At this point any command requiring input will receive it from the $FILENAME file. Now is a good time for an example.

#!/bin/bash
# SCRIPT: method3.sh
# PURPOSE: Process a file line by line with while read LINE using
#          file descriptors.

FILENAME=$1
count=0
exec 3<&0
exec 0< $FILENAME
while read LINE
do
    let count++
    echo "$count $LINE"
done
exec 0<&3
echo -e "\nTotal $count Lines read"

The while loop reads one line of text at a time, but the beginning of this script does a little file descriptor redirection. The first exec command duplicates stdin to file descriptor 3. The second exec command redirects the $FILENAME file into stdin, which is file descriptor 0. Now the while loop can just execute without our having to worry about how we assign a line of text to the LINE variable. When the while loop exits, we redirect the previously saved stdin, held on file descriptor 3, back to file descriptor 0:

exec 0<&3

In other words, we set it back to its original value.

Output:

[root@www tempdir]# sh method3.sh file_passwd
1 venu:x:500:500:venu madhav:/home/venu:/bin/bash
2 padmin:x:501:501:Project Admin:/home/project:/bin/bash
3 king:x:502:503:king:/home/project:/bin/bash
4 user1:x:503:501::/home/project/:/bin/bash
5 user2:x:504:501::/home/project/:/bin/bash

Total 5 Lines read
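A related variant, not shown in the original article, leaves stdin alone and reads from a spare descriptor directly with read -u. A minimal sketch, assuming bash:

#!/bin/bash
# Read lines from file descriptor 3 instead of redirecting stdin.
FILENAME=$1
count=0
exec 3< "$FILENAME"          # open the input file on fd 3
while read -u 3 LINE
do
    let count++
    echo "$count $LINE"
done
exec 3<&-                    # close fd 3 when done
echo -e "\nTotal $count Lines read"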

Method 4: Process a file line by line using awk


awk is a pattern scanning and text processing language. It is useful for manipulating data files and for text retrieval and processing, and it is good at manipulating and/or extracting fields (columns) in structured text files. Its name comes from the surnames of its authors: Alfred Aho, Peter Weinberger, and Brian Kernighan. I am not going to explain everything here; to learn more about awk, just Google it.

At the command line, enter the following command:

$ awk '{ print }' /etc/passwd

You should see the contents of your /etc/passwd file appear before your eyes. Now, for an explanation of what awk did. When we called awk, we specified /etc/passwd as our input file. When we executed awk, it evaluated the print command for each line in /etc/passwd, in order. All output is sent to stdout, and we get a result identical to catting /etc/passwd.

Now, for an explanation of the { print } code block. In awk, curly braces are used to group blocks of code together, similar to C. Inside our block of code, we have a single print command. In awk, when a print command appears by itself, the full contents of the current line are printed. Here is another awk example that does exactly the same thing:

$ awk '{ print $0 }' /etc/passwd

In awk, the $0 variable represents the entire current line, so print and print $0 do exactly the same thing. Now is a good time for an example.

#!/bin/bash
# SCRIPT: method4.sh
# PURPOSE: Process a file line by line with awk.

FILENAME=$1
awk '{kount++; print kount, $0} END{print "\nTotal " kount " lines read"}' $FILENAME

Output:

[root@www blog]# sh method4.sh file_passwd
1 venu:x:500:500:venu madhav:/home/venu:/bin/bash
2 padmin:x:501:501:Project Admin:/home/project:/bin/bash
3 king:x:502:503:king:/home/project:/bin/bash
4 user1:x:503:501::/home/project/:/bin/bash
5 user2:x:504:501::/home/project/:/bin/bash

Total 5 lines read

awk is really good at handling text that has been broken into multiple logical fields, and it allows you to effortlessly reference each individual field from inside your awk script. The following script will print out a list of all user accounts on your system:

awk -F":" '{ print $1 "\t " $3

}' /etc/passwd

Above, when we called awk, we used the -F option to specify ":" as the field separator. By default, whitespace (spaces and tabs) acts as the field separator; you can set a new field separator with the -F option. When awk processes the print $1 "\t " $3 command, it prints the first and third fields that appear on each line of the input file, with "\t" used to separate the fields with a tab.
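For reference, the same field separator can also be set inside the awk program itself rather than on the command line; a short sketch (the NR line-number column is just an extra illustration, not part of the original example):

# Equivalent to the -F":" form: set FS in a BEGIN block, then print
# the line number, user name, and UID, separated by tabs.
awk 'BEGIN { FS=":" } { print NR "\t" $1 "\t" $3 }' /etc/passwd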

Method 5: A little trick with head and tail commands


#!/bin/bash
# SCRIPT: method5.sh
# PURPOSE: Process a file line by line with head and tail commands.

FILENAME=$1
Lines=`wc -l < $FILENAME`
count=0
while [ $count -lt $Lines ]
do
    let count++
    LINE=`head -n $count $FILENAME | tail -1`
    echo "$count $LINE"
done
echo -e "\nTotal $count lines read"

On each iteration the head command extracts the top $count lines, then the tail command extracts the bottom line from those lines. A very stupid method, but some people are still using it.

Output:

[root@www blog]# sh method5.sh file_passwd
1 venu:x:500:500:venu madhav:/home/venu:/bin/bash
2 padmin:x:501:501:Project Admin:/home/project:/bin/bash
3 king:x:502:503:king:/home/project:/bin/bash
4 user1:x:503:501::/home/project/:/bin/bash
5 user2:x:504:501::/home/project/:/bin/bash

Total 5 lines read
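For completeness (this is not part of the original article), the same one-line-at-a-time extraction is sometimes written with sed instead of head and tail, and it is just as wasteful because the whole file is re-read on every iteration:

#!/bin/bash
# Same idea as method5.sh, but sed -n "Np" pulls out line number $count.
FILENAME=$1
Lines=`wc -l < $FILENAME`
count=0
while [ $count -lt $Lines ]
do
    let count++
    LINE=`sed -n "${count}p" $FILENAME`
    echo "$count $LINE"
done
echo -e "\nTotal $count lines read"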

Time Comparison for the Five Methods


Now take a long breath; we are going to test each technique. Before you get into testing each method of parsing a file line by line, create a large file that has the exact number of lines you want to process. Use the bigfile.sh script to create a large file:

$ sh bigfile.sh 900000

Running bigfile.sh with 900000 lines as an argument took more than two hours to generate bigfile.4227; I don't know exactly how much time it took. This file is extremely large for parsing line by line, but I needed a large file to get timing data greater than zero.

[root@www blog]# du -h bigfile.4227
70M     bigfile.4227
[root@www blog]# wc -l bigfile.4227
900000 bigfile.4227
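The bigfile.sh helper itself is not listed in the article; a purely hypothetical sketch of such a generator (the output name bigfile.$$ and the sample text are assumptions) could look like this:

#!/bin/bash
# Hypothetical bigfile.sh: write the requested number of lines to bigfile.$$
LINES=$1
OUT=bigfile.$$
: > $OUT                     # create / truncate the output file
count=0
while [ $count -lt $LINES ]
do
    let count++
    echo "sample line $count for the timing test" >> $OUT
done
echo "Created $OUT with $LINES lines"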

[root@www blog]# time ./method1.sh bigfile.4227 > /dev/null

real    6m2.911s
user    2m58.207s
sys     2m58.811s

[root@www blog]# time ./method2.sh bigfile.4227 > /dev/null

real    2m48.394s
user    2m39.714s
sys     0m8.089s

[root@www blog]# time ./method3.sh bigfile.4227 > /dev/null

real    2m48.218s
user    2m39.322s
sys     0m8.161s

[root@www blog]# time ./method4.sh bigfile.4227 > /dev/null

real    0m2.054s
user    0m1.924s
sys     0m0.120s

[root@www blog]# time ./method5.sh bigfile.4227 > /dev/null

I waited more than half a day and still didn't get a result, so I created a 10000-line file to test this method:

[root@www tempdir]# time ./method5.sh file.10000 > /dev/null

real    2m25.739s
user    0m21.857s
sys     1m12.705s

Method 4 came in first place; it took only 2.05 seconds. But we can't really compare method 4 with the other methods, because awk is not just a command but a programming language in its own right. Methods 2 and 3 are tied for second place; they produce almost the same real execution time, at about 2 minutes 48 seconds. Method 1 came in third at 6 minutes 2.9 seconds. Method 5 took more than half a day on the big file, and 2 minutes 25 seconds to process just a 10000-line file, which shows how wasteful it is.

Note: If the file contains escape characters, use read -r instead of read. Then the backslash does not act as an escape character and is considered part of the line; in particular, a backslash-newline pair may not be used as a line continuation.
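A quick way to see that difference (the sample string is made up; bash's default echo is assumed):

# Without -r, read treats the backslash as an escape and drops it:
echo 'a\tb' | while read line;    do echo "$line"; done    # prints: atb
# With -r, the backslash is kept as part of the line:
echo 'a\tb' | while read -r line; do echo "$line"; done    # prints: a\tb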
The following script accepts a file name as its first argument. It assumes that:

1. All the files are named *.csv
2. All the fields are split with whitespace

#!/bin/sh

grep "\.csv" $1 | while read LINE; do
    FILENAME=`echo $LINE | cut -d ' ' -f 9`
    SIZE=`echo $LINE | cut -d ' ' -f 5`
    echo "File: " $FILENAME ", size: " $SIZE
done

Example script to read a file:

FILE=/home/file.txt

if [ -f $FILE ]; then
    echo "File $FILE exists"
    cnt=$(cat $FILE | wc -l) # deliberate UUOC
    if [ $cnt -gt 3 ]; then
        echo "$FILE is larger than 3 lines"
    fi
else
    echo "File $FILE does not exist"
fi

Another one:

awk '{x++} END{ print x }' filename
