Introduction to Unix Command-line - Part 2

Natalie Elphick

March 12th 2024

Introductions

Natalie Elphick
Bioinformatician I


Yihang Xin (TA)
Software Engineer III


Setup

Run the following commands if you did not attend part 1:

mkdir unix_workshop
cd unix_workshop
curl -L -o unix_workshop_2024.tar.gz 'https://www.dropbox.com/scl/fi/o8msrl3a1k986jvjll4mv/unix_workshop_2024.tar.gz?rlkey=m7jfkvpz0iq12zdzphq7013l5&dl=0'
tar -xzf unix_workshop_2024.tar.gz
cd unix_workshop_2024
curl -o part_2/homo_sapiens.refseq.tsv.gz https://ftp.ensembl.org/pub/current_tsv/homo_sapiens/Homo_sapiens.GRCh38.111.refseq.tsv.gz

File Compression

Command-line tools for compression

  • Compression reduces the size of a file
  • gzip : compresses a file and replaces it with a compressed version (.gz)
  • tar : create and manipulate archive files


Archive: a single file that bundles one or more files and/or folders, and is often compressed

gzip/gunzip: compress/uncompress a file

gunzip part_2/homo_sapiens.refseq.tsv.gz
du -h part_2/homo_sapiens.refseq.tsv
 33M    part_2/homo_sapiens.refseq.tsv
  • The uncompressed file is 33 megabytes
gzip part_2/homo_sapiens.refseq.tsv
du -h part_2/homo_sapiens.refseq.tsv.gz
3.2M    part_2/homo_sapiens.refseq.tsv.gz
  • Compressing it reduces it to about a tenth of its original size

Note

  • The magnitude of the compression depends on the type of data
  • The units for file sizes are not the same across all systems
    • Some systems define a kilobyte as 1000 bytes, while others define it as 1024 bytes
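One way to sidestep the unit ambiguity is to compare exact byte counts with wc -c. A minimal sketch, assuming a throwaway directory from mktemp and a made-up file called demo.tsv (neither is part of the workshop data):

```shell
# Throwaway directory; demo.tsv is a made-up file, not workshop data
demo_dir=$(mktemp -d)

# Write 2000 identical lines: repetitive text compresses very well
for i in $(seq 1 2000); do
    printf 'ENSG00000142611\tRefSeq_mRNA\tDIRECT\n'
done > "$demo_dir/demo.tsv"

# -c sends the compressed data to stdout, so demo.tsv itself is kept
gzip -c "$demo_dir/demo.tsv" > "$demo_dir/demo.tsv.gz"

# wc -c counts bytes exactly, whatever a "kilobyte" means on this system
orig_bytes=$(wc -c < "$demo_dir/demo.tsv")
gz_bytes=$(wc -c < "$demo_dir/demo.tsv.gz")
echo "original: $orig_bytes bytes, compressed: $gz_bytes bytes"
```

Real data is usually less repetitive than this demo file, so the ratio seen here is likely better than what a typical file would achieve.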

tar: compressing folders into archives

  • tar does not compress on its own; with -z it uses gzip to create compressed archive files
tar -czf part_1.tar.gz part_1
ls -l
total 8
drwx---rw-@ 4 nelphick  staff  128 Mar 12 09:36 part_1
-rw-r--r--  1 nelphick  staff  803 Mar 12 12:52 part_1.tar.gz
drwxr-xr-x@ 4 nelphick  staff  128 Mar 12 12:52 part_2
  • -c: create a new archive
  • -f: specify the name of the archive file
  • -z: compress the archive with gzip

Unarchiving

  • We did this in part 1 to unarchive the workshop folders
tar -xzf part_1.tar.gz
  • -x: extract an archive
  • -z: uncompress the archive with gzip
  • -f: specify the name of the archive file
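The create and extract steps can be checked as a round trip. This is a sketch on made-up files in temporary directories; it also uses two tar flags not covered on the slides, -t (list the contents) and -C (change to a directory before archiving or extracting):

```shell
# Throwaway directories; all names here are made up for the demo
src_dir=$(mktemp -d)
out_dir=$(mktemp -d)
mkdir "$src_dir/demo_folder"
echo "hello" > "$src_dir/demo_folder/a.txt"
echo "world" > "$src_dir/demo_folder/b.txt"

# -C changes into src_dir first, so the archive stores relative paths
tar -czf "$src_dir/demo_folder.tar.gz" -C "$src_dir" demo_folder

# -t lists the contents of an archive without extracting anything
tar -tzf "$src_dir/demo_folder.tar.gz"

# Extract into a different directory and confirm the files match
tar -xzf "$src_dir/demo_folder.tar.gz" -C "$out_dir"
diff -r "$src_dir/demo_folder" "$out_dir/demo_folder" && echo "round trip OK"
```

Listing with -t before extracting is a cheap way to see where an archive will put its files.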

gunzip -c: cat compressed files

  • To avoid uncompressing a large file just to read its contents, we can use gunzip -c
  • This will output the file's contents to the terminal
gunzip -c part_2/homo_sapiens.refseq.tsv.gz | head
gene_stable_id  transcript_stable_id    protein_stable_id   xref    db_name info_type   source_identity xref_identity   linkage_type
ENSG00000228037 ENST00000424215 -   NR_121638   RefSeq_ncRNA    DIRECT  -   -   -
ENSG00000142611 ENST00000378391 ENSP00000367643 NP_955533   RefSeq_peptide  DIRECT  100 100 -
ENSG00000142611 ENST00000378391 ENSP00000367643 NM_199454   RefSeq_mRNA DIRECT  99  62  -
ENSG00000142611 ENST00000270722 ENSP00000270722 NP_071397   RefSeq_peptide  DIRECT  100 100 -
ENSG00000142611 ENST00000270722 ENSP00000270722 NM_022114   RefSeq_mRNA DIRECT  100 100 -
ENSG00000157911 ENST00000288774 ENSP00000288774 NP_001361354    RefSeq_peptide  INFERRED_PAIR   -   -   -
ENSG00000157911 ENST00000288774 ENSP00000288774 NP_001361355    RefSeq_peptide  INFERRED_PAIR   -   -   -
ENSG00000157911 ENST00000288774 ENSP00000288774 NP_722540   RefSeq_peptide  DIRECT  100 100 -
ENSG00000157911 ENST00000288774 ENSP00000288774 NM_001374425    RefSeq_mRNA DIRECT  99  100 -
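Because gunzip -c writes to standard output, it can feed any other command, and the compressed file stays as it was. A small sketch with a made-up three-line file (the file name and its contents are assumptions for illustration):

```shell
# Made-up compressed file: one header row plus two data rows
demo_dir=$(mktemp -d)
printf 'gene_stable_id\txref\nENSG01\tNM_1\nENSG02\tNM_2\n' \
    | gzip -c > "$demo_dir/demo.tsv.gz"

# Count the lines without ever writing an uncompressed copy to disk
n_lines=$(gunzip -c "$demo_dir/demo.tsv.gz" | wc -l)
echo "lines: $n_lines"

# Only the .gz file exists in the directory
ls "$demo_dir"
```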

System Variables

What are system variables?

  • Special variables that contain information about the system’s configuration and state
  • Used by the OS and programs to change their behavior based on the system’s state

Example:

echo $HOME
/Users/nelphick

Common System Variables

  • $PWD : The working directory
  • $HOME : The current user’s home directory
  • $PS1 : the shell prompt string
  • $TMPDIR : location of temporary files

These can change depending on the specific OS or program. For example, TMPDIR can also be named TEMP, TEMPDIR or TMP.
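Since a variable such as TMPDIR may not be set on every system, a script can supply a fallback. A minimal sketch, assuming /tmp is an acceptable default (that choice is an assumption, not something the slides specify):

```shell
# ${VAR:-default} expands to $VAR if it is set and non-empty,
# otherwise to the default; /tmp is an assumed fallback
tmp_base="${TMPDIR:-/tmp}"
echo "Temporary files go under: $tmp_base"

# The other variables from the list can be printed the same way
echo "HOME is: $HOME"
echo "PWD is:  $PWD"
```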

PATH: locations of executable files

  • When you enter a command, the shell searches the directories in $PATH to find its associated executable file
echo $PATH
/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/go/bin:/usr/local/mysql/bin
  • The shell checks these directories in the order they appear and uses the first matching executable it finds
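The search order is easier to read with one directory per line. A short sketch using tr and nl, two standard tools not otherwise covered here:

```shell
# Swap each ':' for a newline, then number the lines in search order
echo "$PATH" | tr ':' '\n' | nl

# The same pipeline can count how many directories are searched
n_dirs=$(echo "$PATH" | tr ':' '\n' | wc -l)
echo "directories in PATH: $n_dirs"
```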

export: set system variables

  • Useful for setting variables you want to be used across programs
  • You can add new software to your $PATH like this:
export PATH="/path/to/new/software:$PATH"
  • This will modify the $PATH for the current terminal session
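The difference between a plain assignment and export shows up when another program is started from the shell. A small sketch; plain_var and shared_var are made-up names:

```shell
# A plain assignment is visible only in the current shell
plain_var="not exported"

# export also passes the variable to programs started from this shell
export shared_var="exported"

# bash -c starts a child process; the single quotes delay expansion
# so the variables are looked up inside the child
child_output=$(bash -c 'echo "plain=[$plain_var] shared=[$shared_var]"')
echo "$child_output"
```

The child prints plain=[] shared=[exported], since only the exported variable reaches it.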

Modifying the PATH for all future terminal sessions

  • Add the export line to your ~/.bashrc or ~/.zshrc
  • Proceed with caution
    • Make backups of these and read this guide
    • Changing the $PATH incorrectly can break system functionality
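One cautious way to do this is to back up the file first and append the line only if it is missing. The sketch below works on a stand-in file in a temporary directory rather than the real ~/.bashrc, and the software path is the same placeholder used on the export slide; add_path_line is a hypothetical helper name:

```shell
# Stand-in for ~/.bashrc so the real file is not touched
rc_file="$(mktemp -d)/demo_bashrc"
echo "# existing settings" > "$rc_file"

# Single quotes keep $PATH literal, so it is expanded at login time
path_line='export PATH="/path/to/new/software:$PATH"'

# 1. Keep a backup before editing
cp "$rc_file" "$rc_file.bak"

# 2. Append the line only if it is not already present
#    (grep: -F fixed string, -x whole line, -q quiet)
add_path_line() {
    grep -qxF "$path_line" "$rc_file" || echo "$path_line" >> "$rc_file"
}
add_path_line
add_path_line   # the second call changes nothing

cat "$rc_file"
```

If something goes wrong, copying the .bak file back restores the previous settings.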

which: locate the executable associated with a command

  • This command shows the location of the executable that is found first in your $PATH
which ls
/bin/ls
  • Useful to check if there are multiple versions of a software installed
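Inside scripts, command -v is a commonly used alternative to which, because its exit status can be tested directly. A minimal sketch; no_such_tool_xyz is a deliberately made-up command name:

```shell
# command -v prints where a command resolves to
ls_path=$(command -v ls)
echo "ls resolves to: $ls_path"

# It exits non-zero when nothing is found, which makes it easy to test
if ! command -v no_such_tool_xyz > /dev/null
then
    missing_msg="no_such_tool_xyz is not installed"
    echo "$missing_msg"
fi
```

In an interactive bash session, type -a ls additionally lists every match in $PATH order, which helps when several versions are installed.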

Shell Scripting

What is a script?

  • Scripts are executable files for reusing code
  • By convention scripts end in .sh
  • The first line of the script is called the shebang
nano part_2/example_script.sh
#!/bin/bash
  • The text that follows #! tells the OS where the interpreter is
which bash
/bin/bash

chmod: making a script executable

  • By default, files are not executable
ls -l part_2/example_script.sh
-rw-r--r--  1 nelphick  staff  287 Mar 12 12:52 part_2/example_script.sh
  • We can set the execute bit like this
chmod u+x part_2/example_script.sh
ls -l part_2/example_script.sh
-rwxr--r--  1 nelphick  staff  287 Mar 12 12:52 part_2/example_script.sh
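The effect of chmod can also be checked from within the shell using the -x test. A sketch on a made-up script in a temporary directory, assuming bash lives at /bin/bash as on the slides:

```shell
demo_dir=$(mktemp -d)
script="$demo_dir/hello.sh"

# Write a two-line script: the shebang plus one command
printf '#!/bin/bash\necho "hello from a script"\n' > "$script"

# [ -x FILE ] is true only when the file is executable
if [ -x "$script" ]; then before="executable"; else before="not executable"; fi

chmod u+x "$script"
if [ -x "$script" ]; then after="executable"; else after="not executable"; fi

echo "before chmod: $before, after chmod: $after"

# Now the script can be run by its path
"$script"
```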

Example

#!/bin/bash

# This is a comment. Comments are ignored by the shell.

# $1 is the first argument passed to the script
echo "Counting the genes in $1"

# count the unique genes in the file
u_genes=$(gunzip -c "$1" | cut -f 1 | sort -u | wc -l)

echo "There are $u_genes unique genes in $1"

Let’s run it

./part_2/example_script.sh part_2/homo_sapiens.refseq.tsv.gz
Counting the genes in part_2/homo_sapiens.refseq.tsv.gz
There are    33338 unique genes in part_2/homo_sapiens.refseq.tsv.gz
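Two refinements are worth considering. The pipeline in the script appears to count the header value gene_stable_id as one more unique entry, and the script does not check that an argument was given. The sketch below addresses both in a hypothetical function called count_genes, run on made-up data rather than the workshop file:

```shell
# count_genes: hypothetical helper with a usage check and header skipping
count_genes() {
    if [ "$#" -ne 1 ] || [ ! -f "$1" ]; then
        echo "usage: count_genes FILE.tsv.gz" >&2
        return 1
    fi
    # tail -n +2 starts output at line 2, dropping the header row;
    # tr -d ' ' removes the padding that some versions of wc add
    gunzip -c "$1" | tail -n +2 | cut -f 1 | sort -u | wc -l | tr -d ' '
}

# Made-up data: one header row, three data rows, two distinct gene IDs
demo_file="$(mktemp -d)/demo.tsv.gz"
printf 'gene_stable_id\txref\nENSG01\tNM_1\nENSG01\tNP_1\nENSG02\tNM_2\n' \
    | gzip -c > "$demo_file"

n_genes=$(count_genes "$demo_file")
echo "There are $n_genes unique genes in $demo_file"
```

The tr step also removes the leading spaces visible in the output above.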

Loops

  • Useful for iterating over lines of a file or lists
for i in {1..3}
do
    echo $i
done
1
2
3
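A for loop can iterate over file names just as it does over numbers. A sketch using a wildcard to loop over made-up files in a temporary directory:

```shell
demo_dir=$(mktemp -d)
printf 'a\nb\n' > "$demo_dir/sample_1.txt"
printf 'a\nb\nc\n' > "$demo_dir/sample_2.txt"

total=0
# The pattern *.txt expands to the list of matching files
for f in "$demo_dir"/*.txt
do
    # Count the lines in this file (tr strips any padding from wc)
    n=$(wc -l < "$f" | tr -d ' ')
    echo "$(basename "$f"): $n lines"
    total=$((total + n))
done
echo "total: $total lines"
```

Quoting "$f" keeps the loop working when a file name contains spaces.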

While loops

count=0

while [ $count -lt 5 ]        # loop while count is less than 5
do
    echo $count
    count=$((count+1))
done
0
1
2
3
4
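A while loop combined with read is a common way to process a file line by line. A minimal sketch on a made-up file:

```shell
demo_file="$(mktemp -d)/samples.txt"
printf 'sample_A\nsample_B\nsample_C\n' > "$demo_file"

line_no=0
# IFS= preserves leading/trailing spaces; -r keeps backslashes literal
while IFS= read -r line
do
    line_no=$((line_no + 1))
    echo "$line_no: $line"
done < "$demo_file"
```

read returns a non-zero status at the end of the file, which is what ends the loop.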

If statements

x=5

if [ $x -gt 10 ]                      # check if x is greater than 10
then
    echo "x is greater than 10"
else
    echo "x is not greater than 10"
fi                                    # end if statement
x is not greater than 10
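If statements can also test files, and elif adds further branches. A sketch with a hypothetical function named describe_path, using the -d (is a directory) and -f (is a regular file) tests:

```shell
# describe_path: hypothetical helper that classifies its argument
describe_path() {
    if [ -d "$1" ]
    then
        echo "directory"
    elif [ -f "$1" ]
    then
        echo "file"
    else
        echo "missing"
    fi
}

demo_dir=$(mktemp -d)
touch "$demo_dir/notes.txt"

describe_path "$demo_dir"
describe_path "$demo_dir/notes.txt"
describe_path "$demo_dir/nope.txt"
```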

Other Useful Commands

sed : stream editor

  • Parses and transforms text, using a compact programming language
  • It reads and modifies text line by line from a file or input stream
  • Supports regular expressions
  • Useful for replacing text

Example:

sed 's/search_string/replace_string/g' input.txt > output.txt
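A concrete version of the same command, as a sketch on a made-up input file. The second command uses the regular expression anchor ^ to match only at the start of a line; the replacement text gene_ is arbitrary:

```shell
demo_dir=$(mktemp -d)
printf 'ENSG01\tRefSeq_mRNA\nENSG02\tRefSeq_peptide\nENSG03\tRefSeq_mRNA\n' \
    > "$demo_dir/input.txt"

# Replace every occurrence of RefSeq_mRNA with mRNA; input.txt is unchanged
sed 's/RefSeq_mRNA/mRNA/g' "$demo_dir/input.txt" > "$demo_dir/output.txt"
cat "$demo_dir/output.txt"

# ^ anchors the match to the start of each line
sed 's/^ENSG/gene_/' "$demo_dir/input.txt" > "$demo_dir/renamed.txt"
cat "$demo_dir/renamed.txt"
```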

ssh : secure shell - connect to a remote server

  • Logging in to a remote server
  • Remote desktop for the terminal
ssh username@remote
  • username is your account name on the remote server, and remote is the hostname or IP address of that server or computer

scp : secure copy

  • Copy files to or from a remote server or computer
scp [options] [source] [destination]
  • Copy from local to remote
scp /path/to/local/file.txt username@remote:/path/to/remote/directory/
  • Copy from remote to local
scp username@remote:/path/to/file.txt /path/to/local/directory/
  • -r : copy a whole folder

AWK

awk : processing structured data

  • A small programming language that is designed to work with structured data
  • Has a more complicated syntax than the commands covered so far, but processes large files quickly
  • Designed to read a file or input stream line by line
  • Operates on records (lines) and fields (columns)

Basic command:

awk options 'pattern {action}' input_file

Example : Sum the first 2 columns of a file

awk -F '\t' '{print $1+$2}' part_1/list_numbers.tsv
4
15
17
  • -F : provides the field separator
  • $1,$2 : the first and second fields

Example : Find the average of a column

  • For this example we only want the average if the 5th column equals “RefSeq_mRNA”
gunzip -c part_2/homo_sapiens.refseq.tsv.gz | \
awk -F '\t' '$5 == "RefSeq_mRNA" {sum += $7; count++} \
END {print sum / count}'
64.1533
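A slightly more defensive variant is sketched below on made-up data. It makes two assumptions that go beyond the slides: rows whose 7th column is "-" should be left out (awk would otherwise treat "-" as 0 while still counting the row), and "NA" is a reasonable result when no rows match. mean_identity is a hypothetical function name:

```shell
demo_file="$(mktemp -d)/demo.tsv"
# Made-up rows mimicking the Ensembl layout: db_name is column 5,
# source_identity is column 7
printf 'g1\tt1\tp1\tNM_1\tRefSeq_mRNA\tDIRECT\t100\n'        > "$demo_file"
printf 'g2\tt2\tp2\tNM_2\tRefSeq_mRNA\tDIRECT\t50\n'        >> "$demo_file"
printf 'g3\tt3\tp3\tNM_3\tRefSeq_mRNA\tINFERRED_PAIR\t-\n'  >> "$demo_file"
printf 'g4\tt4\tp4\tNP_4\tRefSeq_peptide\tDIRECT\t10\n'     >> "$demo_file"

# mean_identity DB_NAME FILE: average of column 7 for rows matching DB_NAME
mean_identity() {
    awk -F '\t' -v db="$1" '
        $5 == db && $7 != "-" {sum += $7; count++}
        END {
            if (count > 0) { printf "%.2f\n", sum / count }
            else           { print "NA" }
        }' "$2"
}

mean_identity RefSeq_mRNA "$demo_file"
mean_identity RefSeq_ncRNA "$demo_file"
```

On this demo data the first call prints 75.00 and the second prints NA, because no RefSeq_ncRNA rows exist and the division is skipped.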

Resources for learning AWK and sed

End of Part 2

Additional learning materials

Upcoming Data Science Training Program Workshops

Introduction to Pathway Analysis
April 2, 2024 1:00-4:00pm PDT

Statistics of Enrichment Analysis Methods
April 11-April 12, 2024 1:00-3:00pm PDT

Working on Wynton
April 15, 2024 1:00-4:00pm PDT

Introduction to Linear Mixed Effects Models
April 25-April 26, 2024 1:00-3:00pm PDT

Complete Schedule