How to chop large sequences in Linux

awk 'BEGIN {n_seq=0;} /^>/ {if(n_seq%1000==0){file=sprintf("myseq%d.fa",n_seq);} print >> file; n_seq++; next;} { print >> file; }' < sequences.fa

Here is what the above code is doing:
1. The whole command is a single shell line. It contains no shebang and no comments: the shell runs awk, and the text between the single quotes is the awk program.
2. The awk program does the following:
a. When awk starts, set the variable n_seq to 0. (This is optional, because an uninitialized awk variable already counts as 0 in arithmetic.)
b. When awk sees a line that starts with ">" (a FASTA header), do the following:
i. If n_seq is evenly divisible by 1000, set the variable file to "myseq" followed by the value of n_seq, followed by ".fa". The output files are therefore named myseq0.fa, myseq1000.fa, myseq2000.fa, and so on, and each holds up to 1000 sequences.
ii. Append the current line to the file whose name is stored in the variable file.
iii. Add 1 to n_seq.
iv. Skip to the next line of input without running the remaining rule.
c. For all other lines (the sequence data), append the line to the file whose name is stored in the variable file.
3. The "< sequences.fa" at the end tells the shell to feed the file sequences.fa to awk on standard input.
Note that ">>" appends, so delete any myseq*.fa files left over from an earlier run before running the command again.
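
To see the naming scheme on something small, the following sketch runs the same logic with a chunk size of 2 on a five-record file. The sample records, the chunk variable, the scratch directory, and the close() call are assumptions added for illustration; they are not part of the original command.

```shell
# Work in a scratch directory so the append (>>) never hits leftover files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical five-record FASTA file, invented for this demonstration.
printf '>s1\nACGT\n>s2\nGGCC\nTTAA\n>s3\nAAAA\n>s4\nCCCC\n>s5\nGGGG\n' > sequences.fa

# Same logic as the one-liner, with a chunk size of 2 instead of 1000.
# close() releases each finished chunk (an addition, not in the original),
# which keeps the number of simultaneously open files at one.
awk -v chunk=2 '
/^>/ {
    if (n_seq % chunk == 0) {
        if (file != "") close(file)
        file = sprintf("myseq%d.fa", n_seq)
    }
    n_seq++
}
{ print >> file }
' sequences.fa

# List the chunks that were written.
ls myseq*.fa
```

This should list myseq0.fa, myseq2.fa, and myseq4.fa, holding two, two, and one sequence respectively. Concatenating the chunks in that order reproduces sequences.fa.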