The Curious Case of User Mode Scheduling

In Windows 7 x64, Microsoft introduced an interesting mechanism called User Mode Scheduling (UMS). It allows an application to opt out of the kernel-mode thread scheduler and write its own scheduler in user mode. In this video, Windows Kernel Architect – Dave Probert – explains the motivation behind adding such a capability. In a nutshell:

  • Context switching in user mode can be faster because it does not require going back and forth between user mode and kernel mode.
  • User Mode Scheduling introduces cooperative multitasking, which can lead to performance improvements. Threads can now assume that they won’t be preempted by other threads (from the same UMS group) unless they block in the kernel (e.g. while waiting for an IO operation to complete). This assumption can be used, for example, to build more efficient locks – see the sketch after this list.
  • Having a customized scheduler gives more control over thread execution. For example, we can write a round-robin scheduler which selects threads one by one in a fair manner (useful in applications with real-time constraints), or a scheduler that reacts to application-specific events – events that the kernel-mode scheduler knows nothing about.
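To make the lock idea from the second point concrete, here is a minimal sketch of my own (UmsSpinLock is a hypothetical name, not a Windows type): a spinlock that backs off by cooperatively yielding to the UMS scheduler via UmsThreadYield instead of entering a kernel-mode wait.

#include <Windows.h>

struct UmsSpinLock
{
    volatile LONG state; // 0 = free, 1 = held

    UmsSpinLock() : state(0) {}

    void lock()
    {
        // An interlocked operation is still required because workers driven
        // by different scheduler threads run on different cores in parallel.
        while(InterlockedCompareExchange(&state, 1, 0) != 0)
        {
            // Hand the core to another ready worker instead of spinning or
            // blocking in the kernel; the scheduler resumes this thread later.
            UmsThreadYield(nullptr);
        }
    }

    void unlock()
    {
        InterlockedExchange(&state, 0);
    }
};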

It all sounds promising enough to check how User Mode Scheduling works in practice. And that is exactly what I did.

The Rules of the Game
A detailed description of the User Mode Scheduling API can be found on MSDN. To briefly sum up how to use this feature:

  • The application creates Worker Threads that perform the actual work. Worker threads can be preempted only if they block in kernel mode – for example, when a page fault occurs or when they wait for an IO operation to complete. They can also give up their processor time voluntarily by calling UmsThreadYield.
  • The application creates User Mode Scheduler Threads for the cores on which the Worker Threads will run. A User Mode Scheduler is notified every time a Worker Thread blocks in the kernel or yields its execution. When control passes to the scheduler, it is responsible for selecting the next thread to run.
  • The code executed by the scheduler is quite restricted. During notifications it cannot touch any lock that might be held by a worker thread, because doing so can deadlock (thread A waits to be scheduled while the scheduler waits for a lock held by thread A – a textbook deadlock condition). This means the scheduler cannot call any function which acquires locks (such as global heap allocation) unless you know for sure that worker threads never acquire the same locks. The sketch after this list shows the shape of a scheduler entry point and the notifications it receives.
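Here is a minimal sketch of such an entry point (the reason codes are real values of RTL_UMS_SCHEDULER_REASON; the do-nothing body is my illustration, not code from the benchmark below):

#include <Windows.h>

VOID NTAPI MinimalSchedulerProc(RTL_UMS_SCHEDULER_REASON reason, ULONG_PTR activationPayload, PVOID schedulerParam)
{
    switch(reason)
    {
    case UmsSchedulerStartup:
        // EnterUmsSchedulingMode has just been called on this thread
        break;
    case UmsSchedulerThreadBlocked:
        // a worker blocked in the kernel (page fault, IO wait, ...)
        break;
    case UmsSchedulerThreadYield:
        // a worker gave up its processor time by calling UmsThreadYield
        break;
    }
    // a real scheduler now dequeues ready workers from its completion list
    // and runs one of them with ExecuteUmsThread
}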

The notification about blocking in the kernel seems to be a double-edged sword.
The good side is that it gives more control and allows more sophisticated scheduling strategies. For example, we can create a pool of threads assigned to one logical processor. When the currently executing thread blocks, the scheduler can pass execution to the next thread assigned to the same processor to improve throughput. In such a scenario processor cache pollution is minimized, because one task is done entirely on one core, which can lead to better performance.
The bad side is that a transition has to be made from kernel mode to user mode, which of course takes time. With the original system scheduler this step is not needed, because scheduling takes place in kernel mode. So the notifications can actually decrease performance.
Whether overall performance increases or decreases depends on the nature of the tasks. If tasks block a lot in the kernel, we can expect performance to drop.

Basic UMS
To test the performance of User Mode Scheduling I have written a benchmark which compares the traditional system scheduler with a round-robin User Mode Scheduler. The job is embarrassingly parallel: computing a mathematical function over a range of values. Each chunk of the range is assigned to a new thread which performs the computation. You can read the code here:

#include "stdafx.h"
#include <Windows.h>
#include <new>
#include <memory>
#include <vector>
#include <functional>
#include <algorithm>
#include <chrono>
 
 
typedef std::function<void()> TTask;
 
unsigned g_ProcCount;                         // number of logical processors
HANDLE* g_ThreadArray;                        // scheduler threads, one per core
PUMS_COMPLETION_LIST g_CompletionList;        // kernel-managed list of ready UMS workers
HANDLE g_CompletionListEvent;                 // signaled when the completion list is non-empty
HANDLE g_ShutdownEvent;                       // signaled to stop the scheduler threads
__declspec(thread) PUMS_CONTEXT g_UmsContext; // per-scheduler list of dequeued workers
std::vector<TTask>* g_Tasks;                  // tasks executed by the worker threads
 
 
DWORD WINAPI WorkerThreadProc(LPVOID lpThreadParameter)
{
    size_t taskNum = reinterpret_cast<size_t>(lpThreadParameter);
    auto& task = (*g_Tasks)[taskNum];
    task();
 
    return 0;
}
 
HANDLE UtilsCreateUmsThread(LPVOID lpThreadParameter)
{
    // The first call fails with ERROR_INSUFFICIENT_BUFFER and reports the required buffer size.
    SIZE_T size = 0;
    InitializeProcThreadAttributeList(nullptr, 1, 0, &size);
    std::unique_ptr<char[]> mem(new(std::nothrow) char[size]);
    if(!mem)
    {
        return nullptr;
    }
 
    LPPROC_THREAD_ATTRIBUTE_LIST attrLst = reinterpret_cast<LPPROC_THREAD_ATTRIBUTE_LIST>(mem.get());
 
 
    if(!InitializeProcThreadAttributeList(attrLst, 1, 0, &size))
    {
        return nullptr;
    }
 
    PUMS_CONTEXT umsCtx;
    if(!CreateUmsThreadContext(&umsCtx))
    {
        DeleteProcThreadAttributeList(attrLst);
        return nullptr;
    }
 
    UMS_CREATE_THREAD_ATTRIBUTES umsAttrs;
    umsAttrs.UmsCompletionList = g_CompletionList;
    umsAttrs.UmsContext = umsCtx;
    umsAttrs.UmsVersion = UMS_VERSION;
 
    if(!UpdateProcThreadAttribute(attrLst, 0, PROC_THREAD_ATTRIBUTE_UMS_THREAD, &umsAttrs, sizeof(umsAttrs), nullptr, nullptr))
    {
        DeleteUmsThreadContext(umsCtx);
        DeleteProcThreadAttributeList(attrLst);
        return nullptr;
    }
 
 
    // UMS worker threads must be created with CreateRemoteThreadEx so the UMS attribute list can be passed in.
    HANDLE hUmsThread = CreateRemoteThreadEx(GetCurrentProcess(), nullptr, 0, WorkerThreadProc, lpThreadParameter, 0, attrLst, nullptr);
    if(!hUmsThread)
    {
        DeleteUmsThreadContext(umsCtx);
        DeleteProcThreadAttributeList(attrLst);
        return nullptr;
    }
 
    DeleteProcThreadAttributeList(attrLst);
    return hUmsThread;
}
 
 
VOID NTAPI UmsSchedulerEntryProc(RTL_UMS_SCHEDULER_REASON reason, ULONG_PTR activationPayload, PVOID schedulerParam)
{
    // Invoked on EnterUmsSchedulingMode (UmsSchedulerStartup) and every time
    // a worker blocks in the kernel or yields; it must pick the next worker.
    while(true)
    {
        if(!g_UmsContext)
        {
            // No dequeued workers left – wait until the completion list has
            // ready workers or shutdown is requested.
            HANDLE eventArr[] = {g_ShutdownEvent, g_CompletionListEvent};
            DWORD res = WaitForMultipleObjects(2, eventArr, FALSE, INFINITE);
            if(res == WAIT_FAILED) abort();
            if(res == WAIT_OBJECT_0) return;
            if(!DequeueUmsCompletionListItems(g_CompletionList, 0, &g_UmsContext)) abort();
        }

        // Take the next worker from the locally dequeued list (round robin).
        PUMS_CONTEXT ctx = g_UmsContext;
        g_UmsContext = GetNextUmsListItem(ctx);

        BOOLEAN isTerminated = TRUE;
        QueryUmsThreadInformation(ctx, UmsThreadIsTerminated, &isTerminated, sizeof(isTerminated), nullptr);
        if(isTerminated)
        {
            DeleteUmsThreadContext(ctx);
        }
        else
        {
            // ExecuteUmsThread does not return on success; it fails (e.g.
            // ERROR_RETRY) if the worker is temporarily locked in the kernel.
            while(!ExecuteUmsThread(ctx));
        }
    }
}
 
DWORD WINAPI SchedulerThreadProc(LPVOID lpThreadParameter)
{   
    g_UmsContext = nullptr;
    UMS_SCHEDULER_STARTUP_INFO umsStartupInfo;
    umsStartupInfo.UmsVersion = UMS_VERSION;
    umsStartupInfo.SchedulerParam = nullptr;
    umsStartupInfo.SchedulerProc = UmsSchedulerEntryProc;
    umsStartupInfo.CompletionList = g_CompletionList;
    // EnterUmsSchedulingMode converts this thread into a UMS scheduler
    // thread and drives UmsSchedulerEntryProc until it returns.
    if(!EnterUmsSchedulingMode(&umsStartupInfo))
    {
        return false;
    }
 
 
    return true;
}
 
unsigned UtilsGetNumberOfProcessors()
{
    SYSTEM_INFO sysInfo;
    GetSystemInfo(&sysInfo);
    return sysInfo.dwNumberOfProcessors;
}
 
 
bool InitializeUmsScheduling()
{
    g_ProcCount = UtilsGetNumberOfProcessors();
    g_ThreadArray = new(std::nothrow) HANDLE[g_ProcCount];
    if(!g_ThreadArray) return false;
    memset(g_ThreadArray, 0, g_ProcCount*sizeof(HANDLE));
 
 
    if(!CreateUmsCompletionList(&g_CompletionList))
    {
        return false;
    }
 
    if(!GetUmsCompletionListEvent(g_CompletionList, &g_CompletionListEvent))
    {
        DeleteUmsCompletionList(g_CompletionList);
        return false;
    }
 
    g_ShutdownEvent = CreateEvent(nullptr, TRUE, FALSE, nullptr);
    if(!g_ShutdownEvent)
    {
        CloseHandle(g_CompletionListEvent);
        DeleteUmsCompletionList(g_CompletionList);
        return false;
    }
 
    bool result = true;
    for(unsigned i = 0; i < g_ProcCount; ++i)
    {
        g_ThreadArray[i] = CreateThread(nullptr, 0, SchedulerThreadProc, nullptr, CREATE_SUSPENDED, nullptr); 
        if(!g_ThreadArray[i])
        {
            result = false;
            break;
        }
 
        // DWORD_PTR shift so the affinity mask also works for processor indices above 31 on x64
        if(!SetThreadAffinityMask(g_ThreadArray[i], static_cast<DWORD_PTR>(1) << i))
        {
            result = false;
            break;
        }
 
        if(ResumeThread(g_ThreadArray[i]) == static_cast<DWORD>(-1))
        {
            result = false;
            break;
        }
    }
 
    if(!result)
    {
        for(unsigned i = 0; i < g_ProcCount; ++i)
        {
            if(g_ThreadArray[i])
            {
                TerminateThread(g_ThreadArray[i], false);
                CloseHandle(g_ThreadArray[i]);
            }
        }
        CloseHandle(g_ShutdownEvent);
        CloseHandle(g_CompletionListEvent);
        DeleteUmsCompletionList(g_CompletionList);
        return false;
    }
 
    return result;
}
 
void ReleaseUmsScheduling()
{
    printf("Shutting down\n");
    SetEvent(g_ShutdownEvent);
 
    for(unsigned i = 0; i < g_ProcCount; ++i)
    {
        printf("Waiting for (%d)\n", i);
        if(WaitForSingleObject(g_ThreadArray[i], INFINITE) == WAIT_FAILED)
        {
            TerminateThread(g_ThreadArray[i], false);
        }
        CloseHandle(g_ThreadArray[i]);
    }
     
    CloseHandle(g_ShutdownEvent);
    CloseHandle(g_CompletionListEvent);
    DeleteUmsCompletionList(g_CompletionList);
}
 
bool DoParallel(std::vector<TTask>& tasks, bool umsScheduling)
{
    g_Tasks = &tasks;
 
 
    std::unique_ptr<HANDLE[]> threadArr(new(std::nothrow) HANDLE[g_Tasks->size()]);
    if(!threadArr)
    {
        return false;
    }
 
    bool result = true;
    size_t i;
    for(i = 0; i < g_Tasks->size(); ++i)
    {
        //printf("Creating thread (%d)\n", i);
        if(!umsScheduling)
            threadArr[i] = CreateThread(nullptr, 0, WorkerThreadProc, reinterpret_cast<LPVOID>(i), 0, nullptr);
        else
            threadArr[i] = UtilsCreateUmsThread(reinterpret_cast<LPVOID>(i));
 
        if(!threadArr[i])
        {
            result = false;
            break;
        }
    }
 
    if(!result)
    {
        for(size_t j = 0; j < i; ++j)
        {
            TerminateThread(threadArr[j], false);
            CloseHandle(threadArr[j]);
        }
        return false;
    }
         
         
    for(i = 0; i < g_Tasks->size(); ++i)
    {
        //printf("Releasing thread (%d)\n", i);
        if(WaitForSingleObject(threadArr[i], INFINITE) == WAIT_FAILED)
        {
            TerminateThread(threadArr[i], false);
            result = false;
        }
        CloseHandle(threadArr[i]);
    }
    return result;
 
}
 
const unsigned RANGE_SIZE = 1000000000;
const unsigned THREADS_COUNT = 6000;
const bool USER_MODE_SCHEDULING = true;
 
int _tmain(int argc, _TCHAR* argv[])
{
    if(USER_MODE_SCHEDULING && !InitializeUmsScheduling())
    {
        return -1;
    }
 
    unsigned chunkSize = RANGE_SIZE / THREADS_COUNT;
 
    std::vector<TTask> tasks;
    for(unsigned i = 0; i < THREADS_COUNT; ++i)
    {
        tasks.push_back([=]
        {
            unsigned begin = i*chunkSize;
            // the last chunk also picks up the remainder of the range
            unsigned end = (i == THREADS_COUNT - 1) ? RANGE_SIZE : min(begin+chunkSize, RANGE_SIZE);
            // accumulate into a volatile double so the computation is not
            // optimized away (an unsigned long long would truncate every
            // sin() result in (0,1) to zero)
            volatile double result = 0;
            for(unsigned j = begin; j < end; ++j)
                result += sin(cos(sin(j)));
        }
        );
    }
 
    typedef std::chrono::high_resolution_clock clock;
    typedef std::chrono::milliseconds milliseconds;
 
    clock::time_point t0 = clock::now();
    if(!DoParallel(tasks, USER_MODE_SCHEDULING))
    {
        printf("DoParallel failed\n");
        if(USER_MODE_SCHEDULING) ReleaseUmsScheduling();
        return -2;
    }
    clock::time_point t1 = clock::now();
    milliseconds totalMs = std::chrono::duration_cast<milliseconds>(t1 - t0);
    printf("Elapsed (%lld)", totalMs.count());
 
    if(USER_MODE_SCHEDULING) ReleaseUmsScheduling();
    return 0;
}

The results of the benchmark are surprising – no matter how many worker threads are used, the system scheduler always performs slightly better. Process Monitor reveals the reason – when using our basic User Mode Scheduler, not all cores are utilized uniformly. This is because each scheduler thread dequeues some number of ready threads from the global completion list. The number of dequeued threads is not always the same, which means the work does not split uniformly across cores, and this lowers throughput. To fix it, we would have to complicate the program and maintain a global queue of ready threads or implement a work-stealing strategy. Ain’t no free lunch this time. A minimal sketch of the global-queue idea follows.
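The sketch below is my own assumption about how the fix could look, not code from the benchmark (PushReady/PopReady are hypothetical helpers): every scheduler thread moves the workers it dequeues into one shared ready queue and then pops them one at a time, so the load spreads evenly. The lock is touched only by scheduler threads, never by workers, so the no-shared-locks rule still holds; note that std::deque may allocate, so a real scheduler would pre-reserve capacity or use a lock-free structure to stay clear of the heap lock.

#include <Windows.h>
#include <deque>

CRITICAL_SECTION g_ReadyLock;          // used only by scheduler threads
std::deque<PUMS_CONTEXT> g_ReadyQueue; // workers ready to run, shared by all cores

void PushReady(PUMS_CONTEXT ctx)
{
    EnterCriticalSection(&g_ReadyLock);
    g_ReadyQueue.push_back(ctx);
    LeaveCriticalSection(&g_ReadyLock);
}

PUMS_CONTEXT PopReady()
{
    EnterCriticalSection(&g_ReadyLock);
    PUMS_CONTEXT ctx = nullptr;
    if(!g_ReadyQueue.empty())
    {
        ctx = g_ReadyQueue.front();
        g_ReadyQueue.pop_front();
    }
    LeaveCriticalSection(&g_ReadyLock);
    return ctx;
}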

Conclusions
It seems that the folks at Microsoft also realized that User Mode Scheduling is not the best way to achieve greater performance in simple cases. UMS used to be the default scheduling policy in ConcRT (the Microsoft library for code parallelization). This is no longer the case – the default scheduling policy was changed to the Win32 system scheduler. ConcRT developer Don McCrady explains:

UMS support in Concrt is being retired because although it’s super fast at context-switching, well-structured parallel programs shouldn’t do a lot of context switching. In other words, in most real PPL scenarios, the UMS scheduler wasn’t any faster than the regular Win32 thread scheduler.

Although User Mode Scheduling does not improve performance in common cases, it can still be useful when there is a need for tighter control over thread execution (e.g. SQL Server uses it to achieve better scalability). It is also a good replacement for Fibers – an old Windows mechanism that provides cooperative multitasking. Fibers have several limitations, such as the lack of support for multiple cores and constraints on the functions that can be called during execution (when a function blocks, all fibers in the thread block). None of these limitations apply to User Mode Scheduling.
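For contrast, here is a minimal fiber example (the standard Win32 Fiber API; the FiberProc body is just my illustration). Everything runs on a single thread, which is exactly why a blocking call inside any fiber stalls all of them:

#include <Windows.h>
#include <cstdio>

LPVOID g_MainFiber;

VOID CALLBACK FiberProc(LPVOID param)
{
    printf("fiber running\n");
    // Cooperative multitasking: nothing else runs until we switch back.
    SwitchToFiber(g_MainFiber);
}

int main()
{
    g_MainFiber = ConvertThreadToFiber(nullptr);
    LPVOID worker = CreateFiber(0, FiberProc, nullptr);
    SwitchToFiber(worker); // runs FiberProc until it switches back
    DeleteFiber(worker);
    ConvertFiberToThread();
    return 0;
}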