This assignment is due Tuesday, March 26, 2023 at 11:59PM Central time.
HW1 had you use a sandbox to exploit a buffer overflow vulnerability and to construct a payload using MetaSploit. In this homework, we will use Ghidra, a tool created by the NSA for analyzing binaries, and LLVM, a compiler framework, to analyze and transform programs. Both are quite common tools used in the security community to conduct a variety of analyses.
While you should read this spec in its entirety, you can find the specific items to turn in by searching for "Turn-in".
Read this assignment document and complete the tasks described. The final deliverable is a zip file containing your written PDF report, your modified crackme (if you patched it), and a copy of your InstrumentFunctions.cpp LLVM pass. There is no length requirement or limit for the report, but I expect this will take 5 pages or fewer (depending on the size of your screenshots where appropriate).
Ensure that your name, VUNetID, and email address appear at the top of each of page.
Ensure that you have two sections to your submission labeled: Task 1 and Task 2.
I strongly recommend the use of LaTeX to complete this (and all) assignments. LaTeX is the lingua franca for scientific communication in computer science — every peer-reviewed publication I have submitted and reviewed has been written in LaTeX. You can use overleaf.com to help you write LaTeX documents. I have also created a template that you can copy, available here.
You can find the starter package at kjl.name/cs8395/hw2-baked.zip. You will find a file called crackme contained in the part-1 directory, and starter code in the part-2 directory.
A Crackme is a toy program that contains a small puzzle to solve. They are training tools for reverse engineering, a critical part of security. "Cracking" the crackme often consists of determining a "flag" or special value that causes the crackme to behave in some desirable way. In this assignment, you will download a crackme and use Ghidra to help determine or circumvent two value checks.
You can download the Crackme for this assignment in the starter package. This is a stripped 32-bit Linux ELF binary. You can run the program in the terminal with ./crackme password1 password2.
Your goal is to either (1) find the correct values of password1 and password2 or (2) to patch the program so that it produces the desired output:
These instructions assume you are continuing to use your Kali VM from HW1. If you are not using a Kali VM, Ghidra works across platforms, but you may have a slightly different version or process for getting it running on your own system.
Ghidra is a powerful binary analysis tool produced by the NSA. It is open source and free to download. If you are using your VM from HW1, you can install it from the command line:
sudo apt install ghidra
When you have installed Ghidra, you can start it by running ghidra on the command line. When it opens, it will prompt you about projects. You can create a new empty project (you can call it hw2 or whatever you like), then click File and click Import File..., then navigate to the HW2 Crackme file provided above. When you open the file, it will prompt you with information it determines about the crackme:
Press OK, then double click the file to open it. It will prompt you to analyze the file; do so with the default options.
Once the analysis is complete, you are shown a window with a large amount of assembly code in it. This is the main result of lifting the crackme binary to assembly. The analysis Ghidra conducts makes several pieces of information available for helping to understand how the program works. First, on the left side of the window, you can use the Symbol Tree to navigate to the start of the main function for this program.
Within this view, you can navigate the assembly and decompiled code on the right to see how the program executes. Note that I don't want to give away details of the inner workings of the program (indeed, that is your goal in this HW). Instead, I will provide several hints to help you crack this crackme.
Tools like Ghidra often provide a "Cross reference" feature that lets you see places in the code that refer to the same location. For example, with local_24 described above, you can find it in the disassembly, right click it, go to References, and click "Show references to local_24":
You can also rename such symbols as you figure out what they mean (e.g., "local_24" might better be described as "score"; you don't have to do this, but it may help you keep track of things).
You can extract a control flow graph from each function. From the Graphs menu, click Code Flow and it will generate a graphical representation of the function you are currently analyzing. For main, it looks something like this:
(you'll have to zoom in on your Ghidra to see the details). This graph shows you the instructions that execute, and shows edges where the flow of execution may diverge. For example, in main, there appears to be a branch (specifically, a jz instruction) whose outcome determines whether the function y or z gets called. Note that such graphs are generated only for a single function — function calls do not produce edges but are instead treated as a single instruction.
You can patch instructions to change what happens in the program. For example, if you find in a function that contains a branch, you can overwrite the branch instruction with a NOP (no operation) instruction to "skip over" the evaluation done before the branch. That is, you can force a particular path to execute if you carefully overwrite instructions. If you want to do so, right click an instruction, then click "Patch Instruction." When you patch an instruction, you must replace both the opcode and, where relevant, the operands. In addition, the x86 instruction set is variable length, which means that each instruction takes a different number of bytes to represents. For example, suppose we wanted to patch this JZ instruction with a NOP:
After patching with a NOP, we get:
Because the NOP is a single 0x90 byte, but the JZ was two bytes long, we have a leftover byte that also must be patched. Moreover, if you need to insert a longer instruction, you must be careful of overwriting too many bytes.
For this task, you must document your experience using Ghidra to crack the input program. Turn in the following:
For this task, you will use LLVM to analyze and instrument a given program to manually implement stack canaries. You will do this by creating an LLVM Pass to transform an input program so that each function contains additional code that implements the stack canary.
When you write software, you generally rely on a tool to build, compile, or interpret it. gcc or clang for C/C++, javac for Java, python for Python. These language processing tools automatically convert your source code to an executable form that the CPU is happy with. Suppose that, over time, we figure out what different vulnerabilities look like in source code. Rather than manually updating all of our projects, can we automatically transform our programs to contain fewer vulnerabilities?
We can build programs that analyze and transform programs for us. This is often referred to as static analysis — the systematic examination of programs to discover properties that are important. For example, we might use a static analysis to determine if it's ever possible that a particular buffer is transmitted over the network (see Information Flow; you could check if variable x contains data that ever reaches a function call, send()). Static analysis is used for a variety of powerful security analyses. In this assignment, you will build a program transformation tool that hardens input programs against stack overflow attacks.
LLVM is a set of tools that lets us do these automated program analyses (it used to stand for "Low-level virtual machine," but this is a terribly outdated name, so we just call it LLVM). It is an extremely powerful, widely-used compiler framework (indeed, if you've used clang before, LLVM is its parent project; the Clang front-end uses LLVM IR under the hood). LLVM allows you to work with, analyze, and transform programs during the compilation process. While it has historically been used to implement compiler optimizations (e.g., to make programs faster or smaller), people use LLVM for a variety of creative analyses.
At its core, LLVM is part of a compiler toolchain (clang) that produces a robust "Intermediate Representation" (IR). In Compiler Theory, an IR is a representation of a program that permits some analysis that is helpful for compilation. For example, a parser produces an Abstract Syntax Tree of the program to verify that a given source code file contains syntactically valid source code.
We frequently work with an IR called a "Control Flow Graph" (CFG). You saw CFGs in part 1 of HW2. This is a graph structure that represents how the CPU flows through instructions in the program. For example, consider the function below:
The foo function would have a CFG that looks like:
ENTRY | printf("Hello..."); if (x < 5) / \ true false x = x + 1 x = x - 1 \ / printf("Goodbye...");
Each function in a program can be represented as a CFG. Indeed, LLVM gives us access to Basic Blocks (nodes) in the CFG. Both Abstract Syntax Trees and Control Flow Graphs are examples of Intermediate Representations used by compiler toolchains like LLVM.
For this assignment, you will use LLVM to make a pass over the instructions contained in a program's CFG to create and enforce stack canaries.
Recall from lecture we discussed how stack overflow attacks work. An attacker controls a buffer in which they place malicious code to run. The attacker overwrites a saved return address that is placed on the stack when a function is first called. One way of detecting stack overflow attacks is through the use of "stack canaries." Loosely, we place a known random value on the stack at the beginning of a function. If an attacker smashes the stack, they will overwrite that stack canary. Thus, we can check whether the canary has been changed right before the function returns. Unless the attacker knows the exact value we chose as the canary, we can tell if that value has been overwritten.
In this task, your job is to write a tool that can automatically transform a given C program so that it contains code to generate and check stack canaries. You will use LLVM to produce this tool.
Suppose we have a program that contains a stack overflow vulnerability. We could potentially protect the program by detecting if the stack has been corrupted. In particular, if we place a magic value on the stack, then check if that value changed later, we can see if an attacker overwrote parts of the stack they were not supposed to. Consider a "before" and "after" version of such a program:
In LLVM, you can create "passes" that run over a given input program. A pass takes in a program as input, and produces a program as output. You will create a pass that, when applied to a given program, transforms all of the functions in that program to contain code that creates and checks stack canaries. The resulting transformed program can be compiled and executed like any other.
First, you must install LLVM. For this assignment, please use LLVM 13. On Kali, you can use
sudo apt-get install llvm-13
After installing it, there's a whole lot of code placed under /usr/lib/llvm-13/include/llvm.
If you ever need clarification on how to invoke a particular LLVM method, you can often grep through that directory. For example,
grep -r 'ReturnInst' /usr/lib/llvm-13/include/llvm
First, ensure you have llvm 13 installed
sudo apt-get install llvm-13
You will also need to set the LLVM_DIR environment variable to point to LLVM's location.
export LLVM_DIR=/usr/lib/llvm-13
(you can add this to your ~/.bashrc file to ensure that gets set every time you start your VM).
In this package, there is starter code under the "pass" directory. To build the LLVM pass, we use cmake. If you are working in the part-2 directory:
cd pass mkdir build cd build cmake -DLLVM_INSTALL_DIR=$LLVM_DIR ../ make
If the build is successful, it produces a file called pass/build/lib/libInstrumentFunctions.so. The .so file is your LLVM pass module. We use it to automatically transform an input file.
To use the pass against an input file, we first create LLVM IR of the subject program (i.e., the program to be transformed) by using clang:
clang -m32 -emit-llvm -S -fno-stack-protector -static subject.c -o subject.ll
This produces a file called subject.ll that contains the raw LLVM IR. This is similar to assembly code, but it is in a particular format that can be readily analyzed by LLVM passes and related tools.
Once you have the LLVM IR, you can run the pass against it:
$LLVM_DIR/bin/opt -load-pass-plugin ./pass/build/lib/libInstrumentFunctions.so --passes="cs8395-hw" subject.ll -S -o instrumented.llAlternatively, there is also a helper script to run the pass once you have it built:
./plug.sh subject.ll instrumented.ll
This produces a file that is transformed according to the pass tool you generate. Once you have the instrumented LLVM IR, you can compile it:
$LLVM_DIR/bin/llc -m32 instrumented.ll clang -m32 -fno-stack-protector -static instrumented.s -o instrumented.o
You use llc to lower the LLVM IR to raw x86 assembly. Then, we compile it with clang to produce a binary (you can likely use gcc as well). Finally, instrumented is a normal program you can run directly:
./instrumented.o
There is also an invoke.sh script you can use (it's slightly better than the one for HW1).
./invoke.sh ./instrumented.o input_file
The starter code contains a functioning LLVM pass that instruments the program to print out every function the program executes. For example, consider the program below:
By default, this program just prints out 13. However, after running the starter LLVM transformation, the program outputs:
In main In foo In bar In foo 13
That is, we have instrumented the program so that, when it runs, it prints out each function that is executed. We have used a static analysis to change the behavior of the program once it is compiled. Your task is to implement an LLVM transformation pass that uses and enforces a stack canary in each function.
You must modify only the file ./pass/lib/InstrumentFunctions.cpp to implement the stack canary protection.
The starter code instruments each function in the subject program with a call to printf that makes the program print the name of each function that executes. In addition, the starter code also creates a (currently unused) stack canary variable in each function. You must make your LLVM pass emit instructions that (1) load the existing canary from memory, and (2) confirm that the canary has not changed, all before the function returns.
At a minimum, your code must transform the subject.c program such that:
Notably, you don't have to do anything fancier — ordinarily, you might expect the program to abort (or similar). However, to make the implementation more straightforward, you only have to detect and print when the canary was overwritten with a different value.
Original, uninstrumented program with malicious payload that shells out to echo Hello, world:
Instrumented program, running with a clean, non-malicious input:
Instrumented program, running with a malicious input:
The way LLVM works is by considering "instructions" in the IR. Instructions in LLVM can be things like add, subtract, compare, jump, branch, call, return, etc. You can see how the starter code emits an instruction to allocate space for the new canary, then stores a randomly-generated value in that new space.
I will note that LLVM has a bit of a learning curve. There are tons of instructions, methods, and types available throughout its codebase. My suggestion is to focus on the following keywords (documentation available via google):
You can likely start at return instructions and work backwards to add code that checks if the canary has been changed. The canary is already created for you by the starter code in each function — you just need to complete the implementation by adding the conditional check before the function returns.
At the top of each function, the starter code uses an IRBuilder to create (1) an AllocaInst and (2) a StoreIntr. In LLVM IR, programs allocate memory (like on the stack) by using an "Allocate" instruction. In particular, we build a 32-bit Integer and allocate it on the stack. Then, we store a randomly-initialized value in that newly-allocated space. You can see in the code how these LLVM instructions get placed into the existing control flow graph of the function (there are helper methods from the IRBuilder class that make it more straightforward).
In addition, the starter code also places a printf call in the function which, at runtime, tells you the name of the function that is currently executed. Since the requirement is that you print out the name of the function whose stack is smashed, you can adapt this code. The difference is that we only want the print to execute if the canary was changed. Given what you can see about how LLVM instructions are placed, how would you change the printf-related instructions so that they are guarded by an if-then check before the function returns?
You must implement manual stack canary protection with the LLVM pass as described above. In your submission, please include:
you must submit a single .zip file called vunetid.zip. While you can work with others in the class conceptually, please submit your own copy of the assignment. Your zip file must contain:
Use the submission system (VU login required) to submit your zip file.