GSoC 2020: SLEIGH Disassembler Backend
September 6, 2020
Introduction
Hello all, I’m Jiaxiang Zhou from China. I was lucky to be selected as a participant of Radare2 project this year. My main work was to integrate SLEIGH as a disassembly backend into Radare2. r2ghidra-dec was my main working repository, aiming to delivering Ghidra’s decompiler to Radare2. It could be renamed as r2ghidra
since it would become not only a decompiler but a complete bridge between Radare2 and Ghidra after this project.
Special thanks should be given to my mentors, Florian Märkl, Giovanni, and Anton. Your patience and guidance are all well appreciated. I couldn’t have completed this project without your support.
Here’s the slides made for r2con2020.
RAsm plugin
SLEIGH disassembler has been deeply embedded into C++ codebase of decompiler. So the solution is clear:
Isolate SLEIGH from C++ codebase of Ghidra’s decompiler
To get full access of Sleigh
and low level spec file and interfaces, I implemented a class(SleighAsm
) just like lite version of Architecture
. This class will export P-codes and registers’ info parsed from spec file. It enable us to disassemble all valid instructions on demand:
RAnal plugin
SLEIGH will give out P-codes as IR to describe what instruction does. So I had to analysis on P-codes to extract control flow, type info. When it came to ESIL, things got more tricky because P-code’s model and ESIL are different. What’s more, P-codes support float number operation, which ESIL doesn’t.
Port SleighInstructionPrototype
Ghidra’s C++ codebase concentrate on decompiler, so it focus on function-level analysis. There’s classes like Funcdata
to analysis intra-function flow. But instruction-level analysis tool only exists in JAVA codebase. So I had to port SleighInstructionPrototype
from JAVA to C++. This enables control flow info extraction on Constructors
(lower than P-code). This port work was tough to minimize the changes on original Ghidra codebase. And I eventually managed to port whole SleighInstructionPrototype
with only two private fields exported!
diff --git a/Ghidra/Features/Decompiler/src/decompile/cpp/context.hh b/Ghidra/Features/Decompiler/src/decompile/cpp/context.hh
index 3372e5ac5..5d2e108f7 100644
--- a/Ghidra/Features/Decompiler/src/decompile/cpp/context.hh
+++ b/Ghidra/Features/Decompiler/src/decompile/cpp/context.hh
@@ -89,6 +89,8 @@ private:
ConstructState *base_state;
int4 alloc; // Number of ConstructState's allocated
int4 delayslot; // delayslot depth
+protected:
+ ConstructState **getBaseState(void) { return &base_state; }
public:
ParserContext(ContextCache *ccache);
~ParserContext(void) { if (context != (uintm *)0) delete [] context; }
diff --git a/Ghidra/Features/Decompiler/src/decompile/cpp/sleigh.hh b/Ghidra/Features/Decompiler/src/decompile/cpp/sleigh.hh
index 22b9c7a21..5e6617db8 100644
--- a/Ghidra/Features/Decompiler/src/decompile/cpp/sleigh.hh
+++ b/Ghidra/Features/Decompiler/src/decompile/cpp/sleigh.hh
@@ -113,6 +113,7 @@ protected:
ParserContext *obtainContext(const Address &addr,int4 state) const;
void resolve(ParserContext &pos) const;
void resolveHandles(ParserContext &pos) const;
+ ContextCache *getContextCache(void) { return cache; }
public:
Sleigh(LoadImage *ld,ContextDatabase *c_db);
virtual ~Sleigh(void);
And this is control flow extracted only from P-code results:
P-code to ESIL translation
Ghidra’s emulate system based on P-code is quite different from ESIL. First, it’s not stack-based, and it can randomly access to any middle variables.
Here’s my design:
-
Add
PICK
to take values from stack to stack topI will leave middle variables intentionally on stack and retrieve them when this middle variable is needed.
-
Add
GET
to retrieve register’s value to stackSometimes register’s name is just one char, ESIL will confuse if it’s compared with a number. Most importantly, I need
GET
to retrieve float number to stack without changing ESIL’s original codes in Radare2. -
Override
=
to store value from stack back to registerThis is to pair with
GET
for float number handling. You will notice that all element except register(used as destination) are immediate value. -
Override
[4]
and[8]
to read float number from memoryWhen a float number is written into a register/memory location, I will record its register name/memory address to track until it’s overwritten by anything except float number.
-
Override
=[4]
and=[8]
to store float number to memoryAs above.
-
Add serials of float operation.
When real translation is running, plugin will employ a stack to emulate middle variables left on stack. This will help calculate offset of middle variables(unique varnodedata) used as argument of current P-code.
Here’s example:
Pattern match on P-code to analysis type of instruction
SLEIGH only provide P-codes. But P-codes doesn’t tell what the instruction is. And the multi-arch support of SLEIGH make thing even more complex.
I made an overview on all R_ANAL_OP_TYPE_*
and summary their patterns. Hope to do the pattern match based on P-codes and know what type the instruction is. Sounds crazy, but I not only typed instructions successfully, I also managed to recover arguments of associated instructions!
Here’s example:
Summary
RAsm and RAnal plugins are both workable. Ready to provide information recovered from Ghidra’s SLEIGH disassembler.
Related commits:
https://github.com/thestr4ng3r/ghidra/commit/50acb42da120072ce1c02a15004bc7e74a682599
https://github.com/thestr4ng3r/ghidra/commit/7bde3b54b43230601363f89b0214ab4bdba8bf6f
https://github.com/radareorg/r2ghidra-dec/commit/59473e5c53d8724cfee489968e0990a31396ef90
https://github.com/radareorg/r2ghidra/commit/e083b183c9954c899fd4f945547683689c9d00a4