UNIX Commands (3)
=================

.. note:: It is recommended to follow Tutorial 2 before reading this.

Modify the data with sed
------------------------

Overview
^^^^^^^^

``sed`` stands for "stream editor". It performs basic text transformations on an input stream (a file or the output of a pipeline). It is very powerful (Turing complete), but it is mostly used to work on a line-by-line basis.

The syntax is a bit exotic, but once you get used to it, it is not very difficult as it never changes:

.. code-block:: bash

  $ sed [options] 'location action' [files]

More commonly, you use it in a pipeline:

.. code-block:: bash

  $ (do output) | sed [options] 'location action'

It works by reading a line from the input (file or pipeline) and applying the action to that line if the location matches. It then repeats this process for every line in the input.

For example, to delete the second line of the input, you specify "2" as the location and "d" as the action:

.. code-block:: bash

  $ (do output) | sed '2d'
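
As a minimal sketch of this read-match-apply cycle, the following uses ``printf`` (our own stand-in for ``(do output)``, with made-up sample lines) to produce three lines and deletes the second one:

.. code-block:: bash

  # Generate three lines, then delete line 2.
  # Lines 1 and 3 do not match the location, so they are printed unchanged.
  printf 'one\ntwo\nthree\n' | sed '2d'
  # prints:
  #   one
  #   three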

More examples will be shown once the syntax has been explained in more detail.

Locations
^^^^^^^^^

Locations can be

- a line number: ``2`` (you can also use ``$`` to specify the last line)

- a range of line numbers: ``2,5``, ``2,$``, ``2,+3`` (the last form, a GNU extension, means line 2 and the 3 lines that follow it)

- a regular expression: ``/pattern/``

- nothing: this means the action is applied to every line.

Locations can also be negated by adding a ``!`` after the location. For example, ``2!`` means "every line except the second line".
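
The different kinds of locations can be compared with a short sketch. The ``numbers`` helper is hypothetical and only produces five sample lines; each command combines one location with the ``d`` action, so the lines that remain are the ones the location did *not* select:

.. code-block:: bash

  # Hypothetical helper: prints the numbers 1 to 5, one per line.
  numbers() { printf '%s\n' 1 2 3 4 5; }

  numbers | sed '2d'          # line number: leaves 1 3 4 5
  numbers | sed '$d'          # last line: leaves 1 2 3 4
  numbers | sed '2,4d'        # range: leaves 1 5
  numbers | sed '/[135]/d'    # regular expression: leaves 2 4
  numbers | sed '2!d'         # negation: leaves only 2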

Actions
^^^^^^^

Some simple actions include

- ``p``: print the line (generally used with the ``-n`` option)

- ``d``: delete the line

- ``q``: quit the program without processing the rest of the input

- ``a text``: append a line with the specified text after the current line

- ``i text``: insert a line with the specified text before the current line

- ``c text``: change the current line to the specified text
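
Here is a small sketch of these actions applied to line 2 of a three-line input. The ``fruits`` helper and its data are made up for illustration, and the one-line ``a text``, ``i text`` and ``c text`` forms assume GNU ``sed``:

.. code-block:: bash

  # Hypothetical helper: prints three sample lines.
  fruits() { printf '%s\n' apple banana cherry; }

  fruits | sed -n '2p'         # p with -n: prints only "banana"
  fruits | sed '2q'            # q: prints "apple" and "banana", then stops
  fruits | sed '2a inserted'   # a: "inserted" appears after "banana"
  fruits | sed '2i inserted'   # i: "inserted" appears before "banana"
  fruits | sed '2c replaced'   # c: "banana" is replaced by "replaced"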

The most powerful action is probably ``s``, which substitutes a replacement for the text matched by a pattern. The syntax is ``s/pattern/replacement/flags``. For example, to replace "foo" with "bar" in the input:

.. code-block:: bash

  $ (do output) | sed 's/foo/bar/g'

The ``g`` flag replaces all occurrences of the pattern in the line. Without this flag, only the first occurrence in each line is replaced. You can also use a number ``N`` as the flag to replace only the N-th occurrence.
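
The effect of the flag can be seen on a single sample line (the input is made up for illustration):

.. code-block:: bash

  echo 'foo foo foo' | sed 's/foo/bar/'     # no flag: prints "bar foo foo"
  echo 'foo foo foo' | sed 's/foo/bar/g'    # g flag:  prints "bar bar bar"
  echo 'foo foo foo' | sed 's/foo/bar/2'    # 2 flag:  prints "foo bar foo"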

The replacement can also refer back to the text matched by the pattern:

- ``&``: the whole matched text:

.. code-block:: bash

  $ (do output) | sed 's/foo/&tball/g'

- ``\N``: the N-th backreference, that is, the text matched by the N-th group enclosed in ``\(`` and ``\)`` in the pattern:

.. code-block:: bash

  $ (do output) | sed 's/\(foo\) \(bar\)/\2 \1/g'
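
To make the two kinds of backreferences concrete, here is what both commands print for the sample input ``foo bar`` (chosen only for illustration):

.. code-block:: bash

  # & is replaced by the matched text "foo", so "tball" is appended to it.
  echo 'foo bar' | sed 's/foo/&tball/g'              # prints "football bar"

  # \1 holds "foo" and \2 holds "bar", so the two words are swapped.
  echo 'foo bar' | sed 's/\(foo\) \(bar\)/\2 \1/g'   # prints "bar foo"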

.. note:: If you use vim, you can use the same syntax in the substitute command to perform a "search and replace" operation. You just need to add a ``%`` at the beginning of the command to specify that you want to apply the command to the whole file: ``:%s/foo/bar/g``.

If you need to use a slash in the pattern or the replacement, you can escape it with a backslash. For example, to comment out the 12th line of a C file:

.. code-block:: bash

  $ (do output) | sed '12 s/^/\/\//'

However, it is generally more readable to use a different delimiter for the ``s`` command. For example here, we use a dash as the delimiter:

.. code-block:: bash

  $ (do output) | sed '12 s-^-//-'
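
The same idea is especially handy with file paths. In this sketch the path is a made-up example and ``|`` is used as the delimiter; both commands should produce the same result:

.. code-block:: bash

  # With "/" as the delimiter, every slash in the path must be escaped.
  echo '/usr/local/bin' | sed 's/\/usr\/local/\/opt/'   # prints "/opt/bin"

  # With "|" as the delimiter, the path can be written as-is.
  echo '/usr/local/bin' | sed 's|/usr/local|/opt|'      # prints "/opt/bin"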


Options
^^^^^^^

The most useful options, in our opinion, are

- ``-n``: without this option, sed will print every line of the input after applying the action. With this option, it will not print anything unless you explicitly tell it to do so.

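- ``-e``: this option adds an action to the list of actions to run, so it can be repeated to specify multiple actions. You can also put multiple actions in a single script, separated by semicolons, but repeating ``-e`` is often more readable.

- ``-r``: this option enables extended regular expressions (``-E`` is a more portable spelling of the same option). Without it, you have to escape some characters, such as ``(``, ``)``, ``{``, ``}`` and ``+``, to give them their special meaning.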

- ``-i``: this option is used to modify the input file in place. Without this option, sed will print the modified input to the standard output.
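
The following sketch exercises each option once. It assumes GNU ``sed`` (in particular, ``-i`` without an argument and ``-r`` behave differently on some other implementations), and the sample data and temporary file are made up for illustration:

.. code-block:: bash

  # -n with p: print only the lines that match.
  printf '%s\n' apple banana cherry | sed -n '/an/p'               # prints "banana"

  # -e: delete line 1, then substitute on the remaining lines.
  printf '%s\n' apple banana cherry | sed -e '1d' -e 's/an/AN/g'   # prints "bANANa" and "cherry"

  # -r: groups and "+" need no backslashes.
  echo 'aaa bbb' | sed -r 's/(a+) (b+)/\2 \1/'                     # prints "bbb aaa"

  # -i: edit a temporary file in place, then show its new content.
  tmp=$(mktemp)
  printf 'foo\n' > "$tmp"
  sed -i 's/foo/bar/' "$tmp"
  cat "$tmp"                                                       # prints "bar"
  rm -f "$tmp"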

More examples
^^^^^^^^^^^^^

Print the 10th line of the input:

.. code-block:: bash

  $ (do output) | sed -n '10p'

Delete all the empty lines of the input:

.. code-block:: bash

  $ (do output) | sed '/^$/d'

Print the lines longer than 80 characters:

.. code-block:: bash

  $ (do output) | sed -n '/.\{81,\}/p'

Add "Comment" after the 7th line:

.. code-block:: bash

  $ (do output) | sed '7a Comment'

Delete leading whitespace (spaces and tabs) from each line:

.. code-block:: bash

  $ (do output) | sed 's/^[ \t]*//'

Delete everything after the first blank line:

.. code-block:: bash

  $ (do output) | sed '/^$/q'

Modify the data with awk
------------------------

Overview
^^^^^^^^

``sed`` is useful for performing operations line by line. ``awk``, on the other hand, is useful for performing operations column by column. It is also Turing complete, and it is even more powerful than ``sed`` because it is a fully fledged scripting language (with variables, loops, etc.).

This means that the syntax is a bit more complex, but the general idea is similar to ``sed``: for each *record* that matches a *pattern*, apply an *action*.

The syntax is:

.. code-block:: bash

  $ awk [options] 'pattern {action}' [files]

The pattern or the action can be omitted. If the pattern is omitted, the action is applied to every record. If the action is omitted, the default action is to print the record.

``awk`` uses *records* and *fields* to represent the input. By default, a record is a line and a field is a word. All the records are processed one by one, and for each record, some operations are performed on the fields. 

Special variables
^^^^^^^^^^^^^^^^^

Fields and records are specified by special variables:

- ``$0``: the whole record

- ``$1``: the first field

- ``$2``: the second field

- ``NF``: the number of fields in the current record

- ``NR``: the number of records read so far, i.e. the number of the current record

- ``FS``: the field separator (by default, any run of spaces or tabs)

- ``RS``: the record separator (default is a newline)

For example, to print the first field (word) of each record (line), you can use the following command:

.. code-block:: bash

  $ (do output) | awk '{print $1}'

For each line that contains more than 5 words, print the line number and the line:

.. code-block:: bash

  $ (do output) | awk 'NF > 5 {print NR, $0}'
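
As a small sketch of how these variables change from record to record, the command below prints ``NR``, ``NF``, the first field and the last field of two made-up lines. Since ``NF`` holds the number of fields, ``$NF`` refers to the last field:

.. code-block:: bash

  printf 'alpha beta gamma\ndelta epsilon\n' | awk '{print NR, NF, $1, $NF}'
  # prints:
  #   1 3 alpha gamma
  #   2 2 delta epsilon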

Patterns
^^^^^^^^

Patterns can be

- a regular expression: ``/pattern/``

- a condition: ``NR > 5``, ``NF == 3``, ``$1 ~ /pattern/``, ``/pattern1/ && /pattern2/``

- ``BEGIN``: a special pattern whose action runs once, before the first record is read

- ``END``: a special pattern whose action runs once, after the last record has been read
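
The kinds of patterns listed above can be tried on a small table. The ``data`` helper and its contents (a name and a number per line) are hypothetical:

.. code-block:: bash

  # Hypothetical helper: prints three records with two fields each.
  data() { printf '%s\n' 'alice 30' 'bob 25' 'carol 41'; }

  data | awk '/^b/'                  # regular expression: prints "bob 25"
  data | awk '$2 > 28'               # condition: prints "alice 30" and "carol 41"
  data | awk '$1 ~ /o/ && NR > 2'    # combined conditions: prints "carol 41"

  # BEGIN runs before any input is read, END after all of it.
  data | awk 'BEGIN {print "start"} END {print NR, "records"}'
  # prints:
  #   start
  #   3 records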

Actions
^^^^^^^

Actions can be

- a statement: ``{print $1}``

- a block of statements: ``{print $1; print $2}``

- a control statement: ``{if (condition) {action} else {action}}``, ``{while (condition) {action}}``, ``{for (i = 1; i <= 10; i++) {action}}``

``awk`` also supports most of the C arithmetic operators: ``+``, ``-``, ``*``, ``/``, ``%``, ``^``, ``+=``, ``++``. Note that ``^`` is exponentiation in ``awk``, not bitwise XOR as in C.

You can also use variables, which are initialized to either 0 or an empty string if they are used without being initialized. For example, to print the sum of the first field of each record:

.. code-block:: bash

  $ (do output) | awk '{s += $1} END {print s}'
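
Control statements and uninitialized variables combine naturally. In this sketch (sample numbers made up for illustration), ``big``, ``small`` and ``total`` are our own variable names and start at 0 without being declared:

.. code-block:: bash

  # Count how many values are above 6 and how many are not.
  printf '%s\n' 3 8 5 12 | awk '{ if ($1 > 6) big++; else small++ } END { print big, small }'
  # prints "2 2"

  # Loop over the fields of a single record and add them up.
  echo '1 2 3 4' | awk '{ for (i = 1; i <= NF; i++) total += $i; print total }'
  # prints "10"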

There are several printing mechanisms:

- ``print item1, item2, ...``: print the items separated by the Output Field Separator (= OFS, default is a space)

- ``print item1 item2 ...``: concatenate the items and print them without any separator

- ``printf format, item1, item2, ...``: like in C
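
The difference between the three forms is easiest to see side by side (the inputs are made up for illustration):

.. code-block:: bash

  echo 'a b' | awk '{print $1, $2}'                        # comma: prints "a b"
  echo 'a b' | awk '{print $1 $2}'                         # no comma: prints "ab"
  echo 'a b' | awk 'BEGIN {OFS="-"} {print $1, $2}'        # custom OFS: prints "a-b"
  echo 'pi 3.14159' | awk '{printf "%s=%.2f\n", $1, $2}'   # printf: prints "pi=3.14"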

Finally, you can also use some predefined functions, such as ``tolower``, ``toupper``, ``length``, ``gsub(pattern, replacement, string)``, ``substr(string, start, length)``, etc.
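
A brief sketch of these functions on a made-up input line. Note that ``gsub`` modifies its third argument in place and returns the number of replacements, which is stored here in a variable we call ``n``:

.. code-block:: bash

  echo 'Hello World' | awk '{print toupper($1), tolower($2), length($0)}'
  # prints "HELLO world 11"

  echo 'Hello World' | awk '{print substr($0, 1, 5)}'
  # prints "Hello"

  echo 'Hello World' | awk '{n = gsub(/o/, "0", $0); print n, $0}'
  # prints "2 Hell0 W0rld"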

Examples
^^^^^^^^

Change the first field to ``>``:

.. code-block:: bash

  $ (do output) | awk '{$1 = ">"; print $0}'

Print every line with more than 4 fields:

.. code-block:: bash

  $ (do output) | awk 'NF > 4'

Right align all text on a 79-column width:

.. code-block:: bash

  $ (do output) | awk '{printf "%79s\n", $0}'

Print the even-numbered lines:

.. code-block:: bash

  $ (do output) | awk 'NR % 2 == 0'

Swap the first two fields:

.. code-block:: bash

  $ (do output) | awk '{t=$1;$1=$2;$2=t;print}'

Add a new field at the end of each line:

.. code-block:: bash

  $ (do output) | awk '{$(NF+1)="new"; print}'

Print the sum of the fields of each line:

.. code-block:: bash

  $ (do output) | awk '{s=0; for(i=1;i<=NF;i++) s+=$i; print s}'

Example with multiline records
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Say you have this file:

.. code-block:: text

  Jimmy the Weasel
  100 Pleasant Drive
  San Francisco, CA 12345
  
  Big Tony
  200 Incognito Ave.
  Suburbia, WA 67890

and you want to print it like this:

.. code-block:: text

  Jimmy the Weasel, 100 Pleasant Drive, San Francisco, CA 12345
  Big Tony, 200 Incognito Ave., Suburbia, WA 67890

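You can set ``RS`` to the empty string so that records are separated by blank lines, ``FS`` to a newline so that each line of a record is a field, and ``OFS`` to ``", "`` so that the fields are joined as in the desired output:

.. code-block:: bash

  $ awk 'BEGIN {RS=""; FS="\n"; OFS=", "} {print $1,$2,$3}' file.txt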