Curious case of User Mode Scheduling

In Windows 7 x64 Microsoft introduced an interesting mechanism called User Mode Scheduling. It allows us to opt out of the kernel-mode thread scheduler and write our own scheduler in user mode. In this video Windows kernel architect Dave Probert explains the motivation behind creating such a possibility. In a nutshell:

  • context switching in user mode can be faster because it does not require going back and forth between user mode and kernel mode.
  • User Mode Scheduling introduces cooperative multitasking, which can lead to performance improvements. Threads can now assume that they won't be preempted by other threads (from the same UMS group) unless they block in the kernel (e.g. while waiting for an IO operation to complete). This assumption could be used, for example, to build more efficient locks.
  • a customized scheduler gives more control over thread execution. For example, we can write a round-robin scheduler which selects threads one by one in a fair manner (useful in applications with real-time constraints), or a scheduler that reacts to application-specific events – events the kernel-mode scheduler knows nothing about.

It all sounds good enough to check how User Mode Scheduling works in practice. And that is exactly what I have done.

The Rules of the Game
A detailed description of the User Mode Scheduling API can be found on MSDN. To briefly sum up how the feature is used:

  • The application creates worker threads that perform the actual work. A worker thread can be preempted only if it blocks in kernel mode – for example when a page fault occurs or when it waits for an IO operation to complete. It can also give up its processor time voluntarily by calling UmsThreadYield.
  • The application creates a scheduler thread for each core on which worker threads will run. A scheduler thread is notified every time a worker thread blocks in the kernel or yields its execution. When control is passed to the scheduler, it is responsible for selecting the next thread to run.
  • Code executed by the scheduler is quite limited. During notifications it cannot touch any lock that may be held by a worker thread, because that would result in a deadlock (thread A waits to be scheduled while the scheduler waits for a lock held by thread A – see the circular dependency?). This means the scheduler cannot call any function that acquires locks (such as global heap allocation) unless you know for sure that worker threads never take the same locks.

The notification about blocking in the kernel seems to be a double-edged sword.
The good side is that it gives more control and allows more sophisticated scheduling strategies. For example, we can create a pool of threads assigned to one logical processor; when the currently executing thread blocks, the scheduler can pass execution to the next thread assigned to the same processor to improve throughput. In such a scenario processor cache pollution is minimized because each task is done entirely on one core, which can lead to better performance.
The bad side is that a transition has to be made from kernel mode to user mode, which of course takes time. With the original system scheduler this step is not needed because scheduling takes place in kernel mode. So notifications can actually decrease performance.
Whether overall performance increases or decreases depends on the nature of the tasks. If tasks block in the kernel a lot, we can expect performance to drop.

Basic UMS
To test the performance of User Mode Scheduling I have written a benchmark which compares the traditional system scheduler with a round-robin user-mode scheduler. The job is embarrassingly parallel: computing a mathematical function over a range of values. Each chunk of the range is assigned to a new thread which performs the computation. You can read the code here:

#include "stdafx.h"
#include <Windows.h>
#include <new>
#include <memory>
#include <vector>
#include <functional>
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <cmath>
 
 
typedef std::function<void()> TTask;
 
unsigned g_ProcCount;
HANDLE* g_ThreadArray;
PUMS_COMPLETION_LIST g_CompletionList;
HANDLE g_CompletionListEvent;
HANDLE g_ShutdownEvent;
__declspec(thread) PUMS_CONTEXT g_UmsContext;
std::vector<TTask>* g_Tasks;
 
 
DWORD WINAPI WorkerThreadProc(LPVOID lpThreadParameter)
{
    size_t taskNum = reinterpret_cast<size_t>(lpThreadParameter);
    auto& task = (*g_Tasks)[taskNum];
    task();
 
    return 0;
}
 
HANDLE UtilsCreateUmsThread(LPVOID lpThreadParameter)
{
    SIZE_T size = 0;
    InitializeProcThreadAttributeList(nullptr, 1, 0, &size);
    std::unique_ptr<char[]> mem(new(std::nothrow) char[size]);
    if(!mem)
    {
        return nullptr;
    }
 
    LPPROC_THREAD_ATTRIBUTE_LIST attrLst = reinterpret_cast<LPPROC_THREAD_ATTRIBUTE_LIST>(mem.get());
 
 
    if(!InitializeProcThreadAttributeList(attrLst, 1, 0, &size))
    {
        return nullptr;
    }
 
    PUMS_CONTEXT umsCtx;
    if(!CreateUmsThreadContext(&umsCtx))
    {
        DeleteProcThreadAttributeList(attrLst);
        return nullptr;
    }
 
    UMS_CREATE_THREAD_ATTRIBUTES umsAttrs;
    umsAttrs.UmsCompletionList = g_CompletionList;
    umsAttrs.UmsContext = umsCtx;
    umsAttrs.UmsVersion = UMS_VERSION;
 
    if(!UpdateProcThreadAttribute(attrLst, 0, PROC_THREAD_ATTRIBUTE_UMS_THREAD, &umsAttrs, sizeof(umsAttrs), nullptr, nullptr))
    {
        DeleteProcThreadAttributeList(attrLst);
        return nullptr;
    }
 
 
    HANDLE hUmsThread = CreateRemoteThreadEx(GetCurrentProcess(), nullptr, 0, WorkerThreadProc, lpThreadParameter, 0, attrLst, nullptr);
    if(!hUmsThread)
    {
        DeleteUmsThreadContext(umsCtx);
        DeleteProcThreadAttributeList(attrLst);
        return nullptr;
    }
 
    DeleteProcThreadAttributeList(attrLst);
    return hUmsThread;
}
 
 
VOID NTAPI UmsSchedulerEntryProc(RTL_UMS_SCHEDULER_REASON reason, ULONG_PTR activationPayload, PVOID schedulerParam)
{
    while(true)
    {
        if(!g_UmsContext)
        {
            HANDLE eventArr[] = {g_ShutdownEvent, g_CompletionListEvent};
            DWORD res = WaitForMultipleObjects(2, eventArr, FALSE, INFINITE);
            if(res == WAIT_FAILED) abort();
            if(res == WAIT_OBJECT_0) return;
            if(!DequeueUmsCompletionListItems(g_CompletionList, 0, &g_UmsContext)) abort();
        }
 
        PUMS_CONTEXT ctx = g_UmsContext;
        g_UmsContext = GetNextUmsListItem(ctx);
 
        BOOLEAN isTerminated = TRUE;
        QueryUmsThreadInformation(ctx, UmsThreadIsTerminated, &isTerminated, sizeof(isTerminated), nullptr);
        if(isTerminated)
        {
            DeleteUmsThreadContext(ctx);
        }
        else
        {
            while(!ExecuteUmsThread(ctx));
        }   
    }
}
 
DWORD WINAPI SchedulerThreadProc(LPVOID lpThreadParameter)
{   
    g_UmsContext = nullptr;
    UMS_SCHEDULER_STARTUP_INFO umsStartupInfo;
    umsStartupInfo.UmsVersion = UMS_VERSION;
    umsStartupInfo.SchedulerParam = nullptr;
    umsStartupInfo.SchedulerProc = UmsSchedulerEntryProc;
    umsStartupInfo.CompletionList = g_CompletionList;
    if(!EnterUmsSchedulingMode(&umsStartupInfo))
    {
        return false;
    }
 
 
    return true;
}
 
unsigned UtilsGetNumberOfProcessors()
{
    SYSTEM_INFO sysInfo;
    GetSystemInfo(&sysInfo);
    return sysInfo.dwNumberOfProcessors;
}
 
 
bool InitializeUmsScheduling()
{
    g_ProcCount = UtilsGetNumberOfProcessors();
    g_ThreadArray = new(std::nothrow) HANDLE[g_ProcCount];
    if(!g_ThreadArray) return false;
    memset(g_ThreadArray, 0, g_ProcCount*sizeof(HANDLE));
 
 
    if(!CreateUmsCompletionList(&g_CompletionList))
    {
        return false;
    }
 
    if(!GetUmsCompletionListEvent(g_CompletionList, &g_CompletionListEvent))
    {
        DeleteUmsCompletionList(g_CompletionList);
        return false;
    }
 
    g_ShutdownEvent = CreateEvent(nullptr, TRUE, FALSE, nullptr);
    if(!g_ShutdownEvent)
    {
        CloseHandle(g_CompletionListEvent);
        DeleteUmsCompletionList(g_CompletionList);
        return false;
    }
 
    bool result = true;
    for(unsigned i = 0; i < g_ProcCount; ++i)
    {
        g_ThreadArray[i] = CreateThread(nullptr, 0, SchedulerThreadProc, nullptr, CREATE_SUSPENDED, nullptr); 
        if(!g_ThreadArray[i])
        {
            result = false;
            break;
        }
 
        if(!SetThreadAffinityMask(g_ThreadArray[i], static_cast<DWORD_PTR>(1) << i))
        {
            result = false;
            break;
        }
 
        if(ResumeThread(g_ThreadArray[i]) == static_cast<DWORD>(-1))
        {
            result = false;
            break;
        }
    }
 
    if(!result)
    {
        for(unsigned i = 0; i < g_ProcCount; ++i)
        {
            if(g_ThreadArray[i])
            {
                TerminateThread(g_ThreadArray[i], false);
                CloseHandle(g_ThreadArray[i]);
            }
        }
        CloseHandle(g_CompletionListEvent);
        DeleteUmsCompletionList(g_CompletionList);
        return false;
    }
 
    return result;
}
 
void ReleaseUmsScheduling()
{
    printf("Shutting down\n");
    SetEvent(g_ShutdownEvent);
 
    for(unsigned i = 0; i < g_ProcCount; ++i)
    {
        printf("Waiting for (%d)\n", i);
        if(WaitForSingleObject(g_ThreadArray[i], INFINITE) == WAIT_FAILED)
        {
            TerminateThread(g_ThreadArray[i], false);
        }
        CloseHandle(g_ThreadArray[i]);
    }
     
    CloseHandle(g_ShutdownEvent);
    CloseHandle(g_CompletionListEvent);
    DeleteUmsCompletionList(g_CompletionList);
}
 
bool DoParallel(std::vector<TTask>& tasks, bool umsScheduling)
{
    g_Tasks = &tasks;
 
 
    std::unique_ptr<HANDLE[]> threadArr(new(std::nothrow) HANDLE[g_Tasks->size()]);
    if(!threadArr)
    {
        return false;
    }
 
    bool result = true;
    size_t i;
    for(i = 0; i < g_Tasks->size(); ++i)
    {
        //printf("Creating thread (%d)\n", i);
        if(!umsScheduling)
            threadArr[i] = CreateThread(nullptr, 0, WorkerThreadProc, reinterpret_cast<LPVOID>(i), 0, nullptr);
        else
            threadArr[i] = UtilsCreateUmsThread(reinterpret_cast<LPVOID>(i));
 
        if(!threadArr[i])
        {
            result = false;
            break;
        }
    }
 
    if(!result)
    {
        for(size_t j = 0; j < i; ++j)
        {
            TerminateThread(threadArr[j], false);
            CloseHandle(threadArr[j]);
        }
        return false;
    }
         
         
    for(i = 0; i < g_Tasks->size(); ++i)
    {
        //printf("Releasing thread (%d)\n", i);
        if(WaitForSingleObject(threadArr[i], INFINITE) == WAIT_FAILED)
        {
            TerminateThread(threadArr[i], false);
            result = false;
        }
        CloseHandle(threadArr[i]);
    }
    return result;
 
}
 
const unsigned RANGE_SIZE = 1000000000;
const unsigned THREADS_COUNT = 6000;
const bool USER_MODE_SCHEDULING = true;
 
int _tmain(int argc, _TCHAR* argv[])
{
    if(USER_MODE_SCHEDULING && !InitializeUmsScheduling())
    {
        return -1;
    }
 
    unsigned chunkSize = RANGE_SIZE / THREADS_COUNT;
 
    std::vector<TTask> tasks;
    for(unsigned i = 0; i < THREADS_COUNT; ++i)
    {
        tasks.push_back([=]
        {
            unsigned begin = i*chunkSize;
            unsigned end = min(begin+chunkSize, RANGE_SIZE);
            volatile unsigned long long result = 0;
            for(unsigned i = begin; i < end; ++i)
                result += sin(cos(sin(i))); 
        }
        );
    }
 
    typedef std::chrono::high_resolution_clock clock;
    typedef std::chrono::milliseconds milliseconds;
 
    clock::time_point t0 = clock::now();
    if(!DoParallel(tasks, USER_MODE_SCHEDULING))
    {
        printf("DoParallel failed\n");
        if(USER_MODE_SCHEDULING) ReleaseUmsScheduling();
        return -2;
    }
    clock::time_point t1 = clock::now();
    milliseconds totalMs = std::chrono::duration_cast<milliseconds>(t1 - t0);
    printf("Elapsed (%lld)", totalMs.count());
 
    if(USER_MODE_SCHEDULING) ReleaseUmsScheduling();
    return 0;
}

The benchmark results are surprising – no matter how many worker threads are used, the system scheduler always performs slightly better. Process Monitor reveals the reason – with our basic user-mode scheduler, not all cores are utilized uniformly. Each scheduler thread dequeues some number of ready threads from the shared completion list, and that number is not always the same, which means the work does not split uniformly across cores. This lowers throughput. To fix it, we would have to complicate the program and add a shared queue of ready threads or implement a work-stealing strategy (a rough sketch of the idea follows). Ain't no free lunch this time.
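
One way to rebalance the work – shown here only as a rough sketch, not as part of the benchmark above – is to let all scheduler threads share a single lock-free ready queue built on the Windows SList API. The names (ReadyItem, g_ReadyQueue, PushReadyContext, PopReadyContext) are hypothetical, and the nodes are assumed to be preallocated when the worker threads are created (e.g. attached to the context via SetUmsThreadInformation with UmsThreadUserContext), so the scheduler entry point never touches the heap:

#include <Windows.h>

// Hypothetical shared ready queue for UMS contexts, built on the lock-free
// Windows SList. SLIST_ENTRY requires MEMORY_ALLOCATION_ALIGNMENT, so nodes
// must be allocated with matching alignment (e.g. _aligned_malloc) - and
// they must be allocated outside the scheduler entry point.
struct ReadyItem
{
    SLIST_ENTRY entry;
    PUMS_CONTEXT ctx;
};

SLIST_HEADER g_ReadyQueue;

void InitReadyQueue()
{
    InitializeSListHead(&g_ReadyQueue);
}

// Called by a scheduler thread for every context drained from the completion
// list; item is the node preallocated for that worker thread.
void PushReadyContext(ReadyItem* item, PUMS_CONTEXT ctx)
{
    item->ctx = ctx;
    InterlockedPushEntrySList(&g_ReadyQueue, &item->entry);
}

// Any scheduler thread (including an idle one) can pop the next ready
// context, so the surplus produced by one core is picked up by another.
PUMS_CONTEXT PopReadyContext()
{
    PSLIST_ENTRY e = InterlockedPopEntrySList(&g_ReadyQueue);
    return e ? CONTAINING_RECORD(e, ReadyItem, entry)->ctx : nullptr;
}

In UmsSchedulerEntryProc the per-core loop would then push every dequeued context into g_ReadyQueue and pop exactly one context to execute, instead of keeping the whole drained batch to itself.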

Conclusions
It seems that the folks at Microsoft also realized that User Mode Scheduling is not the best way to achieve greater performance in simple cases. UMS used to be the default scheduling policy in ConcRT (the Microsoft library for code parallelization). That is no longer the case – the default scheduling policy was changed to the Win32 system scheduler. ConcRT developer Don McCrady explains:

UMS support in Concrt is being retired because although it’s super fast at context-switching, well-structured parallel programs shouldn’t do a lot of context switching. In other words, in most real PPL scenarios, the UMS scheduler wasn’t any faster than the regular Win32 thread scheduler.

Although User Mode Scheduling does not improve performance in common cases, it can still be useful when tighter control over thread execution is needed (e.g. SQL Server uses it to achieve better scalability). It is also a good replacement for fibers – an old Windows mechanism that provides cooperative multitasking. Fibers have several limitations, such as no support for multiple cores and constraints on the functions that can be called during execution (when a function blocks, all fibers on that thread block). None of these limitations apply to User Mode Scheduling.

Adventures in Parallel Universe

Writing multithreaded programs is a challenging task, both in terms of programming and debugging. Let's see what traps await the lone rider entering the Parallel Universe.

More is less
Psychological research shows that people become unhappy when they are overloaded with too many possibilities. If that is true, then analyzing a concurrent program could cause a serious nervous breakdown in an unprepared mind. The number of possible executions grows exponentially because of thread-scheduler non-determinism. Let's assume that we operate in a sequentially consistent model (e.g. execution on a single core) and that we have only two threads executing instructions:

T_1: a_{1};a_{2};...;a_{n}; \\  T_2: b_{1};b_{2};...;b_{m};

The number of possible interleaved executions is equal to the number of permutations with repetition:

\frac{(m+n)!}{m!n!}

Assuming that m = n

\frac{(n+n)!}{n!\,n!}\sim\frac{4^n}{\sqrt{\pi n}}

which means the number of possibilities grows exponentially with the number of instructions.

Even with only 2 threads and 3 instructions per thread, there are already 20 possible cases to analyze (enumerate them yourself if you have too much time). That is enough to make your life miserable, and real-world programs are much more complicated than this. Your mind is simply incapable of analyzing that many possibilities.
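
To get a feel for the numbers, here is a small helper (my own sketch, not part of the original argument) that counts the interleavings with the lattice recurrence: the next executed instruction comes either from T1 or from T2.

#include <cstdio>

// Number of interleavings of two threads with n and m instructions left.
unsigned long long Interleavings(unsigned n, unsigned m)
{
    if(n == 0 || m == 0) return 1;
    return Interleavings(n - 1, m) + Interleavings(n, m - 1);
}

int main()
{
    printf("%llu\n", Interleavings(3, 3));   // 20
    printf("%llu\n", Interleavings(10, 10)); // 184756
    return 0;
}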

To ensure correctness we have to use locks, which also reduce the number of possible execution scenarios (several instructions become atomic and can be viewed as a single one). But locks also introduce new ways to break your program (such as deadlocks or priority inversion) and new ways to lose performance.
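
For example, the classic lock-ordering deadlock takes only a few lines (a minimal C++11 sketch, unrelated to the code above):

#include <mutex>
#include <thread>

std::mutex m1, m2;

// Each thread takes the two locks in the opposite order. If both threads
// manage to acquire their first lock, each then waits forever for the
// lock held by the other - a deadlock.
void ThreadA()
{
    std::lock_guard<std::mutex> a(m1);
    std::lock_guard<std::mutex> b(m2);
}

void ThreadB()
{
    std::lock_guard<std::mutex> a(m2);
    std::lock_guard<std::mutex> b(m1);
}

int main()
{
    std::thread t1(ThreadA), t2(ThreadB);
    t1.join();
    t2.join(); // may never return
    return 0;
}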

Your friend becomes your enemy
In single-threaded programming the compiler guarantees that, even after optimizations, your program behaves the same as before them. This is no longer the case in multithreaded programming. The compiler cannot analyze all instruction interleavings when multiple threads are present, yet it still optimizes the program as if it were a single-threaded application. It means that you are now fighting the compiler to suppress certain optimizations.

The simplest optimization that can break things is dead code elimination. Consider the following code:

#include <cstdio>
#include <pthread.h>
#include <unistd.h>

bool flag = false;

void* thread_func(void*)
{
	while(!flag); //spin
	pthread_exit(0);
}

int main()
{
	pthread_t t;
	pthread_create(&t, 0, thread_func, 0);
	sleep(1);
	flag = true;

	pthread_join(t, 0);
	return 0;
}

The new thread waits for flag to become true and exits; after that the main thread terminates. But compiled with -O3 the program runs forever. What happened? The disassembly of thread_func is the following:

_Z11thread_funcPv()
_Z11thread_funcPv+20: jmp    0x400744 <_Z11thread_funcPv+20>

It is literally an infinite loop. The compiler assumed the program is single-threaded, so flag = true can never happen during the execution of thread_func. Therefore the code can be transformed into an endless loop (notice that even the call to pthread_exit has been wiped out as dead code).

Another optimization that can be deadly is reordering of variable stores and loads. The compiler may reorganize reads and writes of independent variables to squeeze out more performance. The assumption about single-threaded execution of course still holds, so the compiler will not bother about dependencies between threads. If such dependencies exist, this will likely result in an erroneous execution.
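
As an illustration (a hypothetical producer/consumer pair; the variable names are mine), consider a flag-guarded handoff:

#include <cstdio>

int g_Data = 0;
bool g_Ready = false;

void Producer()               // thread 1
{
    g_Data = 42;              // the compiler sees no dependency between these
    g_Ready = true;           // two stores and is free to emit them in reverse order
}

void Consumer()               // thread 2
{
    while(!g_Ready);          // spin (also vulnerable to the dead-code problem above)
    printf("%d\n", g_Data);   // may print 0 if the stores were reordered
}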

In C/C++ both optimizations can be found relatively easily by looking at disassembly listings and fought back with proper use of the volatile keyword (e.g. in the previous program declaring flag as volatile bool fixes it). But if you think you have defeated the final boss, you are unfortunately wrong…

The Grey Eminence of all optimizations
The processor. It has the last word on when and in what order variables are stored in memory. The rules are similar to the compiler's: if the program executes on a single core, optimizations performed by the CPU do not affect the result of the execution (except its performance). But when the program runs on multiple cores and each core can reorder instructions in its pipeline, you can watch things break.

To fight back against processor reordering you first have to know when reordering may take place. The rules of memory-operation ordering are written in the processor specification. Intel has a relatively strong consistency model, which means there are not many cases in which it can reorder memory operations. The specification says:

Reads may be reordered with older writes to different locations but not with older writes to the same location.

This can break algorithms like Dekker's or Peterson's mutual exclusion, while other constructs like the double-checked locking pattern may still work correctly. Other processors such as the Alpha are not so forgiving and can reorder almost any memory operation (which means double-checked locking won't work there without explicit memory barriers).
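
A minimal sketch of that store-load reordering, in the spirit of Dekker's entry protocol (variable names are mine; the fence intrinsic is shown commented out):

#include <emmintrin.h>   // _mm_mfence

volatile bool flag0 = false, flag1 = false;

void Thread0()
{
    flag0 = true;            // announce intent
    // _mm_mfence();         // without a full fence the load below may be
                             // satisfied before the store above becomes visible
    if(!flag1)
    {
        // critical section - with the fence missing, both threads
        // may end up here at the same time
    }
}

void Thread1()
{
    flag1 = true;
    // _mm_mfence();
    if(!flag0)
    {
        // critical section
    }
}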

Once you know where reordering can cause problems, you can suppress it with a proper memory barrier instruction. In contrast with compiler optimizations, you cannot see whether a processor runtime optimization took place – you can only see its effects. And fighting an invisible opponent can be pretty damn hard.

In the next parts of Adventures in Parallel Universe we will explore some of the concepts presented here more deeply, along with new material.