This is the note while learning the course Missing Semester.

Ref: https://missing.csail.mit.edu/

Shell

Environment Variable

If the shell is asked to execute a command that doesn’t match one of its programming keywords, it consults an environment variable called $PATH that lists which directories the shell should search for programs when it is given a command:

1
2
3
4
5
6
missing:~$ echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
missing:~$ which echo
/bin/echo
missing:~$ /bin/echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

When we run the echo command, the shell sees that it should execute the program echo, and then searches through the :-separated list of directories in $PATH for a file by that name. When it finds it, it runs it (assuming the file is executable; more on that later). We can find out which file is executed for a given program name using the which program. We can also bypass $PATH entirely by giving the path to the file we want to execute.

Permission of directory

1
2
missing:~$ ls -l /home
drwxr-xr-x 1 missing users 4096 Jun 15 2019 missing

Above, only the owner is allowed to modify (w) the missing directory (i.e., add/remove files in it). To enter a directory, a user must have “search” (represented by “execute”: x) permissions on that directory (and its parents). To list its contents, a user must have read (r) permissions on that directory.

Connect programs

The simplest form of redirection is < file and > file. These let you rewire the input and output streams of a program to a file respectively:

1
2
3
4
5
6
7
8
missing:~$ echo hello > hello.txt
missing:~$ cat hello.txt
hello
missing:~$ cat < hello.txt
hello
missing:~$ cat < hello.txt > hello2.txt
missing:~$ cat hello2.txt
hello

You can also use >> to append to a file.

The | operator lets you “chain” programs such that the output of one is the input of another:

1
2
3
4
missing:~$ ls -l / | tail -n1
drwxr-xr-x 1 root root 4096 Jun 20 2019 var
missing:~$ curl --head --silent google.com | grep --ignore-case content-length | cut --delimiter=' ' -f2
219

Shell Script

Variable and String

To assign variables in bash, use the syntax foo=bar and access the value of the variable with $foo. Note that foo = bar will not work since it is interpreted as calling the foo program with arguments = and bar. In general, in shell scripts the space character will perform argument splitting.

Strings in bash can be defined with ' and " delimiters, but they are not equivalent. Strings delimited with ' are literal strings and will not substitute variable values whereas " delimited strings will.

1
2
3
4
5
foo=bar
echo "$foo"
# prints bar
echo '$foo'
# prints $foo

Function and Argument

As with most programming languages, bash supports control flow techniques including if, case, while and for. Similarly, bash has functions that take arguments and can operate with them. Here is an example of a function that creates a directory and cds into it.

1
2
3
4
mcd () {
mkdir -p "$1"
cd "$1"
}

Here $1 is the first argument to the script/function. Unlike other scripting languages, bash uses a variety of special variables to refer to arguments, error codes, and other relevant variables. Below is a list of some of them. A more comprehensive list can be found here.

  • $0 - Name of the script
  • $1 to $9 - Arguments to the script. $1 is the first argument and so on.
  • $@ - All the arguments
  • $# - Number of arguments
  • $? - Return code of the previous command
  • $$ - Process identification number (PID) for the current script
  • !! - Entire last command, including arguments. A common pattern is to execute a command only for it to fail due to missing permissions; you can quickly re-execute the command with sudo by doing sudo !!
  • $_ - Last argument from the last command. If you are in an interactive shell, you can also quickly get this value by typing Esc followed by .

Return Value

Commands will often return output using STDOUT, errors through STDERR, and a Return Code to report errors in a more script-friendly manner. The return code or exit status is the way scripts/commands have to communicate how execution went. A value of 0 usually means everything went OK; anything different from 0 means an error occurred.

Exit codes can be used to conditionally execute commands using && (and operator) and || (or operator), both of which are short-circuiting operators. Commands can also be separated within the same line using a semicolon ;. The true program will always have a 0 return code and the false command will always have a 1 return code. Let’s see some examples

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
false || echo "Oops, fail"
# Oops, fail

true || echo "Will not be printed"
#

true && echo "Things went well"
# Things went well

false && echo "Will not be printed"
#

true ; echo "This will always run"
# This will always run

false ; echo "This will always run"
# This will always run

Temporary file

Another common pattern is wanting to get the output of a command as a variable. This can be done with command substitution. Whenever you place $( CMD ) it will execute CMD, get the output of the command and substitute it in place. For example, if you do for file in $(ls), the shell will first call ls and then iterate over those values. A lesser known similar feature is process substitution, <( CMD ) will execute CMD and place the output in a temporary file and substitute the <() with that file’s name. This is useful when commands expect values to be passed by file instead of by STDIN. For example, diff <(ls foo) <(ls bar) will show differences between files in dirs foo and bar.

An example of shell script

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
#!/bin/bash

echo "Starting program at $(date)" # Date will be substituted

echo "Running program $0 with $# arguments with pid $$"

for file in "$@"; do
grep foobar "$file" > result.txt
# When pattern is not found, grep has exit status 1
# We redirect STDOUT and STDERR to a null register since we do not care about them
if [[ $? -ne 0 ]]; then
echo "File $file does not have any foobar, adding one"
echo "# foobar" >> "$file"
fi
done

When performing comparisons in bash, try to use double brackets [[ ]] in favor of simple brackets [ ].

Shell globbing

When launching scripts, you will often want to provide arguments that are similar. Bash has ways of making this easier, expanding expressions by carrying out filename expansion. These techniques are often referred to as shell globbing.

  • Wildcards - Whenever you want to perform some sort of wildcard matching, you can use ? and * to match one or any amount of characters respectively. For instance, given files foo, foo1, foo2, foo10 and bar, the command rm foo? will delete foo1 and foo2 whereas rm foo* will delete all but bar.
  • Curly braces {} - Whenever you have a common substring in a series of commands, you can use curly braces for bash to expand this automatically. This comes in very handy when moving or converting files.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
convert image.{png,jpg}
# Will expand to
convert image.png image.jpg

cp /path/to/project/{foo,bar,baz}.sh /newpath
# Will expand to
cp /path/to/project/foo.sh /path/to/project/bar.sh /path/to/project/baz.sh /newpath

# Globbing techniques can also be combined
mv *{.py,.sh} folder
# Will move all *.py and *.sh files


mkdir foo bar
# This creates files foo/a, foo/b, ... foo/h, bar/a, bar/b, ... bar/h
touch {foo,bar}/{a..h}
touch foo/x bar/y
# Show differences between files in foo and bar
diff <(ls foo) <(ls bar)
# Outputs
# < x
# ---
# > y

Shell check

https://www.shellcheck.net/

Finding files

One of the most common repetitive tasks that every programmer faces is finding files or directories. All UNIX-like systems come packaged with find, a great shell tool to find files. find will recursively search for files matching some criteria. Some examples:

1
2
3
4
5
6
7
8
# Find all directories named src
find . -name src -type d
# Find all python files that have a folder named test in their path
find . -path '*/test/*.py' -type f
# Find all files modified in the last day
find . -mtime -1
# Find all zip files with size in range 500k to 10M
find . -size +500k -size -10M -name '*.tar.gz'

Beyond listing files, find can also perform actions over files that match your query. This property can be incredibly helpful to simplify what could be fairly monotonous tasks.

1
2
3
4
# Delete all files with .tmp extension
find . -name '*.tmp' -exec rm {} \;
# Find all PNG files and convert them to JPG
find . -name '*.png' -exec convert {} {}.jpg \;

Despite find’s ubiquitousness, its syntax can sometimes be tricky to remember. For instance, to simply find files that match some pattern PATTERN you have to execute find -name '*PATTERN*' (or -iname if you want the pattern matching to be case insensitive).

my example:

1
2
3
find . -name 'source*'
./source2.txt
./source1.txt

Finding code

Finding files by name is useful, but quite often you want to search based on file content. A common scenario is wanting to search for all files that contain some pattern, along with where in those files said pattern occurs. To achieve this, most UNIX-like systems provide grep, a generic tool for matching patterns from the input text. grep is an incredibly valuable shell tool that we will cover in greater detail during the data wrangling lecture.

For now, know that grep has many flags that make it a very versatile tool. Some I frequently use are -C for getting Context around the matching line and -v for inverting the match, i.e. print all lines that do not match the pattern. For example, grep -C 5 will print 5 lines before and after the match. When it comes to quickly searching through many files, you want to use -R since it will Recursively go into directories and look for files for the matching string.

But grep -R can be improved in many ways, such as ignoring .git folders, using multi CPU support, &c. Many grep alternatives have been developed, including ack, ag and rg. All of them are fantastic and pretty much provide the same functionality. For now I am sticking with ripgrep (rg), given how fast and intuitive it is. Some examples:

1
2
3
4
5
6
7
8
# Find all python files where I used the requests library
rg -t py 'import requests'
# Find all files (including hidden files) without a shebang line
rg -u --files-without-match "^#!"
# Find all matches of foo and print the following 5 lines
rg foo -A 5
# Print statistics of matches (# of matched lines and files )
rg --stats PATTERN

Vim

Vim’s design is based on the idea that a lot of programmer time is spent reading, navigating, and making small edits, as opposed to writing long streams of text. For this reason, Vim has multiple operating modes.

  • Normal: for moving around a file and making edits
  • Insert: for inserting text
  • Replace: for replacing text
  • Visual (plain, line, or block): for selecting blocks of text
  • Command-line: for running a command

Keystrokes have different meanings in different operating modes. For example, the letter x in Insert mode will just insert a literal character ‘x’, but in Normal mode, it will delete the character under the cursor, and in Visual mode, it will delete the selection.

In its default configuration, Vim shows the current mode in the bottom left. The initial/default mode is Normal mode. You’ll generally spend most of your time between Normal mode and Insert mode.

You change modes by pressing <ESC> (the escape key) to switch from any mode back to Normal mode. From Normal mode, enter Insert mode with i, Replace mode with R, Visual mode with v, Visual Line mode with V, Visual Block mode with <C-v> (Ctrl-V, sometimes also written ^V), and Command-line mode with :.

Command-line

Command mode can be entered by typing : in Normal mode. Your cursor will jump to the command line at the bottom of the screen upon pressing :. This mode has many functionalities, including opening, saving, and closing files, and quitting Vim.

  • :q quit (close window)

  • :w save (“write”)

  • :wq save and quit

  • :e {name of file} open file for editing

  • :ls show open buffers

  • :help {topic}
    
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    44
    45
    46
    47
    48
    49
    50
    51
    52
    53
    54
    55
    56
    57
    58
    59
    60
    61
    62
    63
    64
    65
    66
    67
    68
    69
    70
    71
    72
    73
    74
    75
    76
    77
    78



    open help

    - `:help :w` opens help for the `:w` command
    - `:help w` opens help for the `w` movement

    ### Movement

    You should spend most of your time in Normal mode, using movement commands to navigate the buffer. Movements in Vim are also called “nouns”, because they refer to chunks of text.

    - Basic movement: `hjkl` (left, down, up, right)
    - Words: `w` (next word), `b` (beginning of word), `e` (end of word)
    - Lines: `0` (beginning of line), `^` (first non-blank character), `$` (end of line)
    - Screen: `H` (top of screen), `M` (middle of screen), `L` (bottom of screen)
    - Scroll: `Ctrl-u` (up), `Ctrl-d` (down)
    - File: `gg` (beginning of file), `G` (end of file)
    - Line numbers: `:{number}<CR>` or `{number}G` (line {number})
    - Misc: `%` (corresponding item)
    - Find: `f{character}`, `t{character}`, `F{character}`, `T{character}`
    - find/to forward/backward {character} on the current line
    - `,` / `;` for navigating matches
    - Search: `/{regex}`, `n` / `N` for navigating matches

    ### Selection

    Visual modes:

    - Visual
    - Visual Line
    - Visual Block

    Can use movement keys to make selection.

    ### Edits

    Everything that you used to do with the mouse, you now do with the keyboard using editing commands that compose with movement commands. Here’s where Vim’s interface starts to look like a programming language. Vim’s editing commands are also called “verbs”, because verbs act on nouns.

    - `i`enter Insert mode
    - but for manipulating/deleting text, want to use something more than backspace
    - `o` / `O` insert line below / above
    - `d{motion}`delete {motion}
    - e.g. `dw` is delete word, `d$` is delete to end of line, `d0` is delete to beginning of line
    - `c{motion}`change {motion}
    - e.g. `cw` is change word
    - like `d{motion}` followed by `i`
    - `x` delete character (equal do `dl`)
    - `s` substitute character (equal to `xi`)
    - Visual mode + manipulation
    - select text, `d` to delete it or `c` to change it
    - `u` to undo, `<C-r>` to redo
    - `y` to copy / “yank” (some other commands like `d` also copy)
    - `p` to paste
    - Lots more to learn: e.g. `~` flips the case of a character

    ### Counts

    You can combine nouns and verbs with a count, which will perform a given action a number of times.

    - `3w` move 3 words forward
    - `5j` move 5 lines down
    - `7dw` delete 7 words



    > To be continued



    # Data Wrangling



    ### Filtering in remote server

    ```shell
    ssh myserver 'journalctl | grep sshd | grep "Disconnected from"' | less

Why the additional quoting? Well, our logs may be quite large, and it’s wasteful to stream it all to our computer and then do the filtering. Instead, we can do the filtering on the remote server, and then massage the data locally. less gives us a “pager” that allows us to scroll up and down through the long output. To save some additional traffic while we debug our command-line, we can even stick the current filtered logs into a file so that we don’t have to access the network while developing:

1
2
$ ssh myserver 'journalctl | grep sshd | grep "Disconnected from"' > ssh.log
$ less ssh.log

sed

sed is a “stream editor” that builds on top of the old ed editor. In it, you basically give short commands for how to modify the file, rather than manipulate its contents directly (although you can do that too). There are tons of commands, but one of the most common ones is s: substitution. For example, we can write:

1
2
3
4
ssh myserver journalctl
| grep sshd
| grep "Disconnected from"
| sed 's/.*Disconnected from //'

What we just wrote was a simple regular expression; a powerful construct that lets you match text against patterns. The s command is written on the form: s/REGEX/SUBSTITUTION/, where REGEX is the regular expression you want to search for, and SUBSTITUTION is the text you want to substitute matching text with.