When C++ meets SAL annotations

SAL annotations, introduced by Microsoft, are an interesting attempt to improve the security and reliability of C/C++ code. They fill the gap between human-readable documentation and documentation that the compiler is able to understand. "Understand" means that the compiler can apply static analysis and check the code against common mistakes like null pointer dereferences, buffer overflows, etc. To be fair, the compiler can also analyze code without SAL, but the accuracy of such analysis is questionable at best. SAL annotations allow the compiler to make the analysis more accurate without tons of false positives. I recently had an occasion to check for myself how code analysis works in VS2012 on real-world C code. It seems to perform pretty well; for example, it was able to detect a deadlock in code similar to (but more complicated than) this:

BOOL func()
{
	EnterCriticalSection(&cs);
	if(var)
	{
		return FALSE;
	}
	LeaveCriticalSection(&cs);
	return TRUE;
}

The code analysis tool knows that EnterCriticalSection acquires a lock, so it gives a warning: C26115: Failing to release lock 'cs' in function 'func'.

But this is old-school C coding style. It is the 21st century and we are using C++11 in the brave new world of information technology. Let's see how VS2012 code analysis copes with the new C++ standard.

Deadlock detection
In the first test I will modify the previous code to use std::mutex instead of a CRITICAL_SECTION. A pretty innocent change.

BOOL func()
{
	mtx.lock();
	if(var)
	{
		return false;
	}
	mtx.unlock();
	return true;
}

After running code analysis, VS2012 does not detect any flaws. It seems that the analyzer does not understand C++ locking primitives, even though they use WinAPI under the hood. Fortunately, the std::mutex implementation performs runtime checks that can detect the mistake. But trading a compile-time warning for a runtime error is not a good deal (especially in production code).
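As a side note, the idiomatic C++11 way to avoid this whole class of bug is RAII. A minimal sketch, assuming the same mtx and var globals as above and <mutex> included:

bool func()
{
	std::lock_guard<std::mutex> lock(mtx);
	if(var)
	{
		return false;
	}
	return true;
}

The lock is released on every return path, so there is nothing left for the analyzer (or the runtime checks) to catch.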

NULL dereference
Another simple example is a NULL pointer dereference. VS2012 code analysis is able to catch this kind of bug even without SAL annotations (if the pointer is a local variable):

	char* p = 0;
	if(var) p = (char*)malloc(10);
	strcpy_s(p, 10, "hello");

As expected, the compiler shows a warning: C6387: 'p' could be '0'.
This code is of course in poor taste given that we are coding in C++11. I will use std::unique_ptr instead of a raw pointer, which will also help to avoid memory leaks:

	std::unique_ptr<char[]>  p = 0;
	if(var) p.reset(new (std::nothrow) char[10]);
	strcpy_s(p.get(), 10, "hello");

After running code analysis, again no flaws are detected. std::unique_ptr does not use SAL annotations, so the compiler cannot determine whether p.get() can return NULL. To avoid false positives this situation is simply ignored.
But what if the get() method were declared with the _Maybenull_ annotation? Well, in that case you will get false positives all the time, even when p is a valid non-null pointer (yes, I have checked it). So this is definitely not a solution.
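For reference, the kind of declaration I mean is something like this hypothetical sketch (not the real <memory> header):

	//hypothetical annotated smart pointer wrapper, not the real std::unique_ptr
	_Maybenull_ char* get() const;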

New overload
It seems that VS2012 at least deals correctly with the new operator:

  • dereferencing memory allocated by new (std::nothrow) will raise a warning
  • dereferencing memory allocated by the default (throwing) new will pass

But what if we have to use an overloaded new that does not throw? I know it is not a very good idea to overload new in this way (it will cause problems with STL containers), but unfortunately, in legacy code it is sometimes a must. So let's check how an overloaded new is treated:

_Maybenull_ void* operator new(size_t size)
{
	return malloc(size);
}

int _tmain(int argc, _TCHAR* argv[])
{
	char* p = (char*)new char;
	*p = 1;
	return 0;
}

Guess what? No warning at all. Visual Studio code analysis assumes that new always returns a valid pointer. It totally ignores the overloaded new definition.
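For completeness, SAL also has a dedicated return-value annotation, _Ret_maybenull_. A variant of the overload one might try is sketched below; whether the analyzer honors it on operator new is not something I have verified:

_Ret_maybenull_ void* operator new(size_t size)
{
	return malloc(size);
}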

C++ Paradox
Some wise computer scientist said that all problems in computer science can be solved by another level of indirection. Unfortunately, in the case of static code analysis this is not true: the extra level of indirection introduced by C++ abstractions (smart pointers, portable mutexes, etc.) makes static analysis much harder. There is a paradox here: you use the C++ standard library to make your code safer, only to realize that your code analysis tools have become useless. Maybe in some distant future Visual Studio will be able to verify C++ code more or less correctly. But for now all you can do is choose the lesser evil: give up automatic code analysis, or give up the C++ standard library classes.

Parallel Counters

Suppose you are writing a concurrent structure (like a thread-safe hash map) and you need to update and query the number of its elements. Or you must count events that occur concurrently in your system (like tracking the number of calls to a function). Sounds easy, but the concurrent nature of the problem complicates it a little. How do you write thread-safe counters that are as efficient as possible?
Let's see what options we have:

  1. Use locks to achieve thread safety
  2. Use atomic operations to achieve thread safety
  3. Give each thread a separate counter and sum all counters to compute the final value

Everyone knows that locks are slow and won't be the best solution for such a simple problem. So the first option goes down.
Atomic operations are easy to implement and should be faster than heavy locks. Most people will choose this option.
But what about option number 3? It requires more effort (because we have to keep data per thread), but it does not require synchronization at all! Can we gain additional performance by choosing this option?

Three Counters
Let's implement each option and compare the performance results. The code for the three counters is pretty straightforward. I will use OpenMP to simplify the implementation even more.
Option 1:

class CLockedEventCounter
{
public:
	CLockedEventCounter() : m_Count(0) { }

	void Event()
	{
		#pragma omp critical (countLock) 
		m_Count++;
	}

	uint64_t GetCount() const
	{
		uint64_t res;
		#pragma omp critical (countLock) 
		res = m_Count;
		return res;
	}

private:
	uint64_t m_Count;
};

Option 2:

class CInterlockedEventCounter
{
public:
	CInterlockedEventCounter() : m_Count(0) { }

	void Event()
	{
		//64-bit atomic increment (the 32-bit _InterlockedIncrement does not take a 64-bit counter)
		_InterlockedIncrement64(&m_Count);
	}

	uint64_t GetCount() const
	{
		return m_Count;
	}

private:
	volatile long long m_Count;
};

Option 3:

class CParallelEventCounter
{
public:
	CParallelEventCounter(uint32_t count)
	{
		m_Count.resize(count);
	}

	void Event(size_t id)
	{
		m_Count[id].count++;
	}

	uint64_t GetCount() const
	{
		uint64_t res = 0;
		for(size_t i = 0; i < m_Count.size(); ++i)
			res += m_Count[i].count;
		return res;
	}
private:
	struct Counter
	{
		Counter() : count(0) { }
		uint64_t count;
		//Pad to avoid false sharing
		char pad[512];
	};
	std::vector<Counter> m_Count;
};

With the Parallel Counter each thread accesses only its own data, so there is no need for synchronization or atomic operations. But will it improve performance?

Performance Test
In the test I will simulate parallel events and call the Event() function from multiple threads. We will see how long it takes to count 300,000,000 generated events. We will also vary the number of threads to see how each solution scales.
First, a comparison between the Interlocked Counter and the Parallel Counter:

[Figure: time to count 300,000,000 events vs. number of threads, Interlocked Counter and Parallel Counter]

Without a doubt the Parallel Counter wins. It is about 12 times faster than the Interlocked Counter. It also scales much better with a growing number of threads.
To get the whole picture, let's see how the lock-based counter performs (represented by the red line):

[Figure: the same comparison with the lock-based counter added (red line)]

As you can see, this solution is a real performance killer: it is about 80x slower than the previous methods!
The full code of the benchmark is below:

#include "stdafx.h"
#include <omp.h>
#include <vector>
#include <string>
#include <cstdint>
#include <chrono>
#include <Windows.h>

class CLockedEventCounter
{
public:
	CLockedEventCounter() : m_Count(0) { }

	void Event()
	{
		#pragma omp critical (countLock) 
		m_Count++;
	}

	uint64_t GetCount() const
	{
		uint64_t res;
		#pragma omp critical (countLock) 
		res = m_Count;
		return res;
	}

private:
	uint64_t m_Count;
};

class CInterlockedEventCounter
{
public:
	CInterlockedEventCounter() : m_Count(0) { }

	void Event()
	{
		//64-bit atomic increment (the 32-bit _InterlockedIncrement does not take a 64-bit counter)
		_InterlockedIncrement64(&m_Count);
	}

	uint64_t GetCount() const
	{
		return m_Count;
	}

private:
	volatile long long m_Count;
};

class CParallelEventCounter
{
public:
	CParallelEventCounter(uint32_t count)
	{
		m_Count.resize(count);
	}

	void Event(size_t id)
	{
		m_Count[id].count++;
	}

	uint64_t GetCount() const
	{
		uint64_t res = 0;
		for(size_t i = 0; i < m_Count.size(); ++i)
			res += m_Count[i].count;
		return res;
	}
private:
	struct Counter
	{
		Counter() : count(0) { }
		uint64_t count;
		//Pad to avoid false sharing
		char pad[512];
	};
	std::vector<Counter> m_Count;
};

int main(int argc, char* argv[])
{
	if(argc < 4)
	{
		fprintf(stderr, "To small number of args\n");
		return -1;
	}

	uint32_t threadNum = std::stoul(argv[1]);
	int64_t countTo    = std::stoll(argv[2]);
	uint32_t algo      = std::stoul(argv[3]);


	CLockedEventCounter lockedEventCounter;
	CInterlockedEventCounter interlockedEventCounter;
	CParallelEventCounter parallelEventCounter(threadNum);

	
	typedef std::chrono::high_resolution_clock clock;
	typedef std::chrono::milliseconds milliseconds;

	clock::time_point t0 = clock::now();

	#pragma omp parallel for num_threads(threadNum)
	for(int64_t i = 0; i < countTo; ++i)
	{
		if(algo == 0) lockedEventCounter.Event();
		else if(algo == 1) interlockedEventCounter.Event();
		else if(algo == 2) parallelEventCounter.Event(omp_get_thread_num());
	}
	
	uint64_t count = 0;
	if(algo == 0) count = lockedEventCounter.GetCount();
	else if(algo == 1) count = interlockedEventCounter.GetCount();
	else if(algo == 2) count = parallelEventCounter.GetCount();

	clock::time_point t1 = clock::now();
	milliseconds totalms = std::chrono::duration_cast<milliseconds>(t1 - t0);

	if(count != countTo)
	{
		fprintf(stderr, "counter mismatch (%llx) != (%llx)\n", count, countTo);
		return -1;
	}

	printf("%u,%llu\n", threadNum, totalms);
	return 0;
}
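For reference, the program takes three arguments in this order: the number of threads, the number of events to count, and the algorithm index (0 = locked, 1 = interlocked, 2 = parallel). So an invocation like counters.exe 8 300000000 2 (the executable name is just an example) runs the Parallel Counter on 8 threads.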

Real world case
But this is only a benchmark. Let's see how counting is done in a real-life system that has to deal with this issue. A good example is Wireshark: it has to count, in real time, the huge number of incoming and outgoing packets that flood your network card. In the Windows version the counters are implemented in the WinPcap driver. The creators of WinPcap also chose parallel counters, but with a neat modification that is not available in user mode.
In the Windows kernel it is possible to hold off the thread scheduler and make sure that execution of your code won't be rescheduled to a different processor. This is done by raising the IRQL to DISPATCH_LEVEL (e.g. by calling the KeRaiseIrql function). With this ability we can modify the counting scenario:

  1. When a packet arrives, raise the IRQL to DISPATCH_LEVEL
  2. Get the ID of the current logical processor
  3. Increment the counter for the current logical processor with a regular non-atomic operation
  4. Lower the IRQL

To compute the final value, just sum over the array of counters.

A snippet from the WinPcap code:
The increment:

Cpu = KeGetCurrentProcessorNumber();
LocalData = &Open->CpuData[Cpu];
LocalData->Received++;

And the computation of the final values:

for(i = 0 ; i < g_NCpu ; i++)
{
	pStats[3] += Open->CpuData[i].Accepted;
	pStats[0] += Open->CpuData[i].Received;
	pStats[1] += Open->CpuData[i].Dropped;
	pStats[2] += 0;		// Not yet supported
}

The full code of WinPcap is available here: http://www.winpcap.org/devel.htm
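In user mode we cannot raise the IRQL, but a similar striping trick still works: keep one counter per logical processor, pick the slot with GetCurrentProcessorNumber(), and keep the increment atomic because the thread may migrate to another CPU at any point. Below is a minimal sketch of that idea (my own illustration, not WinPcap code; it assumes a single processor group and the same headers as the benchmark above):

class CStripedEventCounter
{
public:
	CStripedEventCounter()
	{
		SYSTEM_INFO si;
		GetSystemInfo(&si);
		m_Count.resize(si.dwNumberOfProcessors);
	}

	void Event()
	{
		//the processor number is only a hint (the thread may migrate mid-call),
		//so the increment itself must stay atomic
		_InterlockedIncrement64(&m_Count[GetCurrentProcessorNumber()].count);
	}

	uint64_t GetCount() const
	{
		uint64_t res = 0;
		for(size_t i = 0; i < m_Count.size(); ++i)
			res += m_Count[i].count;
		return res;
	}
private:
	struct Counter
	{
		Counter() : count(0) { }
		long long count;
		//Pad to avoid false sharing
		char pad[512];
	};
	std::vector<Counter> m_Count;
};

Writes mostly hit a cache line owned by the current core, so contention stays low, while correctness does not depend on the slot choice being exact.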

Why so slow?
As you can see, writing to shared locations from multiple cores can drain your performance drastically. The question is: why is this such a big deal? Couldn't the writes be faster? The answer lies in the cache coherency protocols implemented in modern processors. Most of them (like the MOESI protocol used in the x86 family) guarantee fast reads, but slow, non-scalable writes to shared locations. Why that is so comes down to the details of how cache coherency is achieved. But for the moment one thing should be clear: to build fast, scalable applications we have to use data structures and algorithms that avoid writes to shared memory regions.

Curious case of User Mode Scheduling

In Windows 7 x64 Microsoft introduced an interesting mechanism called User Mode Scheduling. This mechanism allows us to opt out of the kernel-mode thread scheduler and write our own scheduler in user mode. In this video, Windows kernel architect Dave Probert explains the motivation behind creating such a mechanism. In a nutshell:

  • context switching in user mode can be faster because it does not require going back and forth between user mode and kernel mode.
  • User Mode Scheduling introduces cooperative multitasking, which can lead to performance improvements. Threads can now assume that they won't be preempted by other threads (from the same UMS group) unless they block in the kernel (e.g. waiting for an IO operation to complete). This assumption could be used, for example, to create more efficient locks.
  • having a customized scheduler gives more control over thread execution. For example, we can write a round-robin scheduler which selects threads one by one in a fair manner (this could be useful in applications that have real-time constraints). Or a scheduler that reacts to application-specific events, events that the kernel-mode scheduler knows nothing about.

It all sounds good enough to check how User Mode Scheduling works in reality. And that's exactly what I did.

The Rules of the Game
A detailed description of the User Mode Scheduling API can be found on MSDN. To briefly sum up how the feature is used:

  • The application creates Worker Threads that perform the actual work. Worker threads can be preempted only if they block in kernel mode, for example when a page fault occurs or when they are waiting for an IO operation to complete. They can also give up their processor time by calling UmsThreadYield.
  • The application creates User Mode Scheduler threads for the cores on which the Worker Threads will run. The User Mode Schedulers are notified every time a Worker Thread blocks in the kernel or yields its execution. When control is passed to the scheduler, it is responsible for selecting the next thread to run.
  • The code executed by the scheduler is pretty limited. During notifications it cannot touch any locks acquired by worker threads, because that would result in a deadlock (thread A waits to be scheduled and the scheduler waits for a lock held by thread A; see the deadlock condition?). This means the scheduler cannot call any function which acquires locks (like global heap allocation) unless you know for sure that the worker threads won't acquire the same locks.

The notification about blocking in the kernel seems to be a double-edged sword.
The good side of the notification is that it gives more control and allows more sophisticated scheduling strategies. For example, we can create a pool of threads assigned to one logical processor. When the currently executing thread blocks, the scheduler can pass execution to the next thread assigned to the same processor to improve throughput. In such a scenario processor cache pollution is minimized because one task is done entirely on one core, which can lead to better performance.
The bad side of the notification is that a transition has to be made from kernel to user mode, which of course takes time. With the original system scheduler this step is not needed because scheduling takes place in kernel mode. So the notifications can actually decrease performance.
Whether overall performance increases or decreases depends on the nature of the tasks. If the tasks block a lot in the kernel, we can expect performance to drop.

Basic UMS
To test the performance of User Mode Scheduling I have written a benchmark which compares the traditional system scheduler with a round-robin User Mode Scheduler. An embarrassingly parallel job is performed: computing a mathematical function over a range of values. Each chunk of the range is assigned to a new thread which performs the computation. You can read the code here:

#include "stdafx.h"
#include <Windows.h>
#include <new>
#include <memory>
#include <vector>
#include <functional>
#include <algorithm>
#include <chrono>
#include <cmath>
 
 
typedef std::function<void()> TTask;
 
unsigned g_ProcCount;
HANDLE* g_ThreadArray;
PUMS_COMPLETION_LIST g_CompletionList;
HANDLE g_CompletionListEvent;
HANDLE g_ShutdownEvent;
__declspec(thread) PUMS_CONTEXT g_UmsContext;
std::vector<TTask>* g_Tasks;
 
 
DWORD WINAPI WorkerThreadProc(LPVOID lpThreadParameter)
{
    size_t taskNum = reinterpret_cast<size_t>(lpThreadParameter);
    auto& task = (*g_Tasks)[taskNum];
    task();
 
    return 0;
}
 
HANDLE UtilsCreateUmsThread(LPVOID lpThreadParameter)
{
    SIZE_T size = 0;
    InitializeProcThreadAttributeList(nullptr, 1, 0, &size);
    std::unique_ptr<char[]> mem(new(std::nothrow) char[size]);
    if(!mem)
    {
        return nullptr;
    }
 
    LPPROC_THREAD_ATTRIBUTE_LIST attrLst = reinterpret_cast<LPPROC_THREAD_ATTRIBUTE_LIST>(mem.get());
 
 
    if(!InitializeProcThreadAttributeList(attrLst, 1, 0, &size))
    {
        return nullptr;
    }
 
    PUMS_CONTEXT umsCtx;
    if(!CreateUmsThreadContext(&umsCtx))
    {
        DeleteProcThreadAttributeList(attrLst);
        return nullptr;
    }
 
    UMS_CREATE_THREAD_ATTRIBUTES umsAttrs;
    umsAttrs.UmsCompletionList = g_CompletionList;
    umsAttrs.UmsContext = umsCtx;
    umsAttrs.UmsVersion = UMS_VERSION;
 
    if(!UpdateProcThreadAttribute(attrLst, 0, PROC_THREAD_ATTRIBUTE_UMS_THREAD, &umsAttrs, sizeof(umsAttrs), nullptr, nullptr))
    {
        DeleteProcThreadAttributeList(attrLst);
        return nullptr;
    }
 
 
    HANDLE hUmsThread = CreateRemoteThreadEx(GetCurrentProcess(), nullptr, 0, WorkerThreadProc, lpThreadParameter, 0, attrLst, nullptr);
    if(!hUmsThread)
    {
        DeleteUmsThreadContext(umsCtx);
        DeleteProcThreadAttributeList(attrLst);
        return nullptr;
    }
 
    DeleteProcThreadAttributeList(attrLst);
    return hUmsThread;
}
 
 
VOID NTAPI UmsSchedulerEntryProc(RTL_UMS_SCHEDULER_REASON reason, ULONG_PTR activationPayload, PVOID schedulerParam)
{
    while(true)
    {
        if(!g_UmsContext)
        {
            HANDLE eventArr[] = {g_ShutdownEvent, g_CompletionListEvent};
            DWORD res = WaitForMultipleObjects(2, eventArr, FALSE, INFINITE);
            if(res == WAIT_FAILED) abort();
            if(res == WAIT_OBJECT_0) return;
            if(!DequeueUmsCompletionListItems(g_CompletionList, 0, &g_UmsContext)) abort();
        }
 
        PUMS_CONTEXT ctx = g_UmsContext;
        g_UmsContext = GetNextUmsListItem(ctx);
 
        BOOLEAN isTerminated = TRUE;
        QueryUmsThreadInformation(ctx, UmsThreadIsTerminated, &isTerminated, sizeof(isTerminated), nullptr);
        if(isTerminated)
        {
            DeleteUmsThreadContext(ctx);
        }
        else
        {
            while(!ExecuteUmsThread(ctx));
        }   
    }
}
 
DWORD WINAPI SchedulerThreadProc(LPVOID lpThreadParameter)
{   
    g_UmsContext = nullptr;
    UMS_SCHEDULER_STARTUP_INFO umsStartupInfo;
    umsStartupInfo.UmsVersion = UMS_VERSION;
    umsStartupInfo.SchedulerParam = nullptr;
    umsStartupInfo.SchedulerProc = UmsSchedulerEntryProc;
    umsStartupInfo.CompletionList = g_CompletionList;
    if(!EnterUmsSchedulingMode(&umsStartupInfo))
    {
        return false;
    }
 
 
    return true;
}
 
unsigned UtilsGetNumberOfProcessors()
{
    SYSTEM_INFO sysInfo;
    GetSystemInfo(&sysInfo);
    return sysInfo.dwNumberOfProcessors;
}
 
 
bool InitializeUmsScheduling()
{
    g_ProcCount = UtilsGetNumberOfProcessors();
    g_ThreadArray = new(std::nothrow) HANDLE[g_ProcCount];
    if(!g_ThreadArray) return false;
    memset(g_ThreadArray, 0, g_ProcCount*sizeof(HANDLE));
 
 
    if(!CreateUmsCompletionList(&g_CompletionList))
    {
        return false;
    }
 
    if(!GetUmsCompletionListEvent(g_CompletionList, &g_CompletionListEvent))
    {
        DeleteUmsCompletionList(g_CompletionList);
        return false;
    }
 
    g_ShutdownEvent = CreateEvent(nullptr, TRUE, FALSE, nullptr);
    if(!g_ShutdownEvent)
    {
        CloseHandle(g_CompletionListEvent);
        DeleteUmsCompletionList(g_CompletionList);
        return false;
    }
 
    bool result = true;
    for(unsigned i = 0; i < g_ProcCount; ++i)
    {
        g_ThreadArray[i] = CreateThread(nullptr, 0, SchedulerThreadProc, nullptr, CREATE_SUSPENDED, nullptr); 
        if(!g_ThreadArray[i])
        {
            result = false;
            break;
        }
 
        if(!SetThreadAffinityMask(g_ThreadArray[i], static_cast<DWORD_PTR>(1) << i))
        {
            result = false;
            break;
        }
 
        if(ResumeThread(g_ThreadArray[i]) == static_cast<DWORD>(-1))
        {
            result = false;
            break;
        }
    }
 
    if(!result)
    {
        for(unsigned i = 0; i < g_ProcCount; ++i)
        {
            if(g_ThreadArray[i])
            {
                TerminateThread(g_ThreadArray[i], false);
                CloseHandle(g_ThreadArray[i]);
            }
        }
        CloseHandle(g_CompletionListEvent);
        DeleteUmsCompletionList(g_CompletionList);
        return false;
    }
 
    return result;
}
 
void ReleaseUmsScheduling()
{
    printf("Shutting down\n");
    SetEvent(g_ShutdownEvent);
 
    for(unsigned i = 0; i < g_ProcCount; ++i)
    {
        printf("Waiting for (%d)\n", i);
        if(WaitForSingleObject(g_ThreadArray[i], INFINITE) == WAIT_FAILED)
        {
            TerminateThread(g_ThreadArray[i], false);
        }
        CloseHandle(g_ThreadArray[i]);
    }
     
    CloseHandle(g_ShutdownEvent);
    CloseHandle(g_CompletionListEvent);
    DeleteUmsCompletionList(g_CompletionList);
}
 
bool DoParallel(std::vector<TTask>& tasks, bool umsScheduling)
{
    g_Tasks = &tasks;
 
 
    std::unique_ptr<HANDLE[]> threadArr(new(std::nothrow) HANDLE[g_Tasks->size()]);
    if(!threadArr)
    {
        return false;
    }
 
    bool result = true;
    size_t i;
    for(i = 0; i < g_Tasks->size(); ++i)
    {
        //printf("Creating thread (%d)\n", i);
        if(!umsScheduling)
            threadArr[i] = CreateThread(nullptr, 0, WorkerThreadProc, reinterpret_cast<LPVOID>(i), 0, nullptr);
        else
            threadArr[i] = UtilsCreateUmsThread(reinterpret_cast<LPVOID>(i));
 
        if(!threadArr[i])
        {
            result = false;
            break;
        }
    }
 
    if(!result)
    {
        for(size_t j = 0; j < i; ++j)
        {
            TerminateThread(threadArr[j], false);
            CloseHandle(threadArr[j]);
        }
        return false;
    }
         
         
    for(i = 0; i < g_Tasks->size(); ++i)
    {
        //printf("Releasing thread (%d)\n", i);
        if(WaitForSingleObject(threadArr[i], INFINITE) == WAIT_FAILED)
        {
            TerminateThread(threadArr[i], false);
            result = false;
        }
        CloseHandle(threadArr[i]);
    }
    return result;
 
}
 
const unsigned RANGE_SIZE = 1000000000;
const unsigned THREADS_COUNT = 6000;
const bool USER_MODE_SCHEDULING = true;
 
int _tmain(int argc, _TCHAR* argv[])
{
    if(USER_MODE_SCHEDULING && !InitializeUmsScheduling())
    {
        return -1;
    }
 
    unsigned chunkSize = RANGE_SIZE / THREADS_COUNT;
 
    std::vector<TTask> tasks;
    for(unsigned i = 0; i < THREADS_COUNT; ++i)
    {
        tasks.push_back([=]
        {
            unsigned begin = i*chunkSize;
            unsigned end = min(begin+chunkSize, RANGE_SIZE);
            volatile unsigned long long result = 0;
            for(unsigned i = begin; i < end; ++i)
                result += sin(cos(sin(i))); 
        }
        );
    }
 
    typedef std::chrono::high_resolution_clock clock;
    typedef std::chrono::milliseconds milliseconds;
 
    clock::time_point t0 = clock::now();
    if(!DoParallel(tasks, USER_MODE_SCHEDULING))
    {
        printf("DoParallel failed\n");
        if(USER_MODE_SCHEDULING) ReleaseUmsScheduling();
        return -2;
    }
    clock::time_point t1 = clock::now();
    milliseconds totalMs = std::chrono::duration_cast<milliseconds>(t1 - t0);
    printf("Elapsed (%lld)", totalMs.count());
 
    if(USER_MODE_SCHEDULING) ReleaseUmsScheduling();
    return 0;
}
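As a usage note, the behaviour of this benchmark is controlled by the three constants at the top of the file: USER_MODE_SCHEDULING switches between the UMS path and plain CreateThread, THREADS_COUNT sets how many worker threads the range is split into, and RANGE_SIZE is the size of the computed range. To compare the two schedulers, build and run the program twice, once with USER_MODE_SCHEDULING set to true and once with it set to false, and compare the printed elapsed times.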

The results of the benchmark are surprising: no matter how many worker threads are used, the system scheduler always has slightly better performance. Process Monitor reveals the reason: with our basic User Mode Scheduler, not all cores are utilized uniformly. This is because each scheduler thread dequeues some number of ready threads from the global completion list. The number of dequeued threads is not always the same, which means the work does not split uniformly across cores. And this causes lower throughput. To fix it, we would have to complicate the program and add a global queue of ready threads or implement a work-stealing strategy. There is no free lunch this time.

Conclusions
It seems that the folks at Microsoft also realized that User Mode Scheduling is not the best way to achieve greater performance in simple cases. UMS was used as the default scheduling policy in ConcRT (the MS library for code parallelization). Well, this is no longer the case: the default scheduling policy was changed to the Win32 system scheduler. ConcRT developer Don McCrady explains:

UMS support in Concrt is being retired because although it’s super fast at context-switching, well-structured parallel programs shouldn’t do a lot of context switching. In other words, in most real PPL scenarios, the UMS scheduler wasn’t any faster than the regular Win32 thread scheduler.

Although User Mode Scheduling does not improve performance in common cases, it can still be useful when there is a need for tighter control over thread execution (e.g. SQL Server uses it to achieve better scalability). It is also a good replacement for Fibers, an old Windows mechanism that provides cooperative multitasking. Fibers have several limitations, like the lack of support for multiple cores and constraints on the functions that can be called during execution (when a function blocks, all fibers in the thread block). None of these limitations apply to User Mode Scheduling.

Volatile: Dark Corners of C

The volatile keyword is an interesting quirk. It is used to push back against compiler optimizations when accessing memory that can be modified concurrently and unexpectedly. This includes:

  • communication with hardware via memory mapped I/O
  • access to system structures
  • access to program variables modified by multiple threads

I described the general purpose of the volatile keyword in Adventures in Parallel Universe. Today we will dive deeper into the concept and look at some confusing and counter-intuitive cases which can arise when using volatile.

Volatile Pointers
Assume you have a lock-free list with a pointer to its tail. The first thread atomically exchanges the tail with a pointer to a new node; the second waits until the tail is not NULL.

struct Node {
  struct Node* prev;
  int elem;
}*tail;

void thread1() {
  while(!__sync_bool_compare_and_swap(&tail, oldtail, newtail))
    cpu_relax();
}

void thread2() {
  while(!tail) cpu_relax();
}

Because we are spinning on tail in thread2, tail has to be volatile to prevent compiler optimizations. We have 3 variants of a volatile single pointer:
1. volatile Node* tail;
2. Node* volatile tail;
3. volatile Node* volatile tail;
Which should we choose?

Translated into English, the variants are:
1. a pointer to volatile memory, which means that dereferences of the pointer will not be optimized away by the compiler. Example: while(tail->prev) will stay, but while(tail) could be wiped out.
2. a volatile pointer to memory, which means that accesses to the pointer value itself will not be optimized away. Example: while(tail) will stay, but while(tail->prev) could be optimized away.
3. a volatile pointer to volatile memory, which is a combination of the previous two. Example: neither while(tail) nor while(tail->prev) will be optimized away.

So in our case we can choose option 2 or 3. Option 3 is less desirable: in the presented code we never dereference the memory that tail points to, so the additional volatile only suppresses optimizations we would otherwise want.
Of course, pointers with multiple levels of indirection can also be volatile; the rules are the same as for a single volatile pointer.

Const Volatile
We can declare a variable as const and volatile at the same time. This can be confusing, because const tells us that the variable is constant (so it cannot be changed) and volatile tells us that the variable can change. As an example, let's take code that communicates with an external device via memory:

volatile int* p = (volatile int*)0xC0000000;
while(*p) cpu_relax();

In this code we only read the memory at address 0xC0000000; we never write to it. And as we know, variables that are read-only should be marked as const to prevent accidental modification.

const volatile int* p = (volatile int*)0xC0000000;
while(*p) cpu_relax();

const volatile int* p means that the memory pointed to by p can be changed by the "outside world" (like hardware) but cannot be changed through a dereference of p (such an attempt will result in a compilation error).
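For example (a sketch; the exact diagnostic text depends on the compiler):

const volatile int* p = (const volatile int*)0xC0000000;
*p = 1;   /* error: assignment of read-only location '*p' */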

Volatile Casts 1
In Windows there is a function for atomically incrementing a variable:

LONG __cdecl InterlockedIncrement(
  _Inout_  LONG volatile *Addend
);
...
long var;
InterlockedIncrement(&var);

Can we pass a non-volatile variable to the InterlockedIncrement function?
The answer is yes: in this situation we are performing an implicit conversion from non-volatile to volatile memory. It simply means that InterlockedIncrement will make fewer assumptions about the passed argument and will expect that the argument can be changed by the outside world.

What about the opposite situation: can we pass volatile memory to a function that does not expect a volatile argument (e.g. to memcpy)?

void * memcpy ( void * destination, const void * source, size_t num );
...
volatile char *dst = (volatile char*)0xC0000000;
char src[40];
memcpy((void*)dst, src, sizeof(src));

In this situation what we get is undefined behavior. The C standard says:

If an attempt is made to refer to an object defined with a volatile-qualified type through use of an lvalue with non-volatile-qualified type, the behavior is undefined.

We have to use a memcpy implementation that handles volatile memory, or write our own.
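A minimal sketch of such a copy loop, assuming we only need to copy into volatile memory (this is my own illustration, not a drop-in memcpy replacement):

void copy_to_volatile(volatile char* dst, const char* src, size_t num)
{
  size_t i;
  for(i = 0; i < num; ++i)
    dst[i] = src[i];   /* every store goes through a volatile lvalue */
}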

Volatile Casts 2
Assume we have legacy code which declares a variable as non-volatile, but we have to use it in a multithreaded context:

#include <pthread.h>
#include <unistd.h>

int flag = 0;

void* thread_func(void* arg) {
    while(!(volatile int)flag);
    return 0;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, 0, thread_func, 0);
    sleep(1);
    *(volatile int*)&flag = 1;
    pthread_join(t, 0);
    return 0;
}

We have cast flag to volatile, so everything should be fine and the program should exit. Instead, the program goes into an infinite loop. In the disassembly listing you can see that the while loop was transformed into an infinite loop:

thread_func()
thread_func+38: jmp    0x400776 <thread_func+38>

The C standard says:

The properties associated with qualified types [in our case volatile] are meaningful only for expressions that are lvalues

And the expression (volatile int)flag is not an lvalue (if you try to assign a value to such an expression, you will get a compilation error). This means the expressions
while(flag)
and
while((volatile int)flag)
are exactly the same, and your volatile cast was flushed down the toilet. Nobody expects the Spanish Inquisition.
To fix it we have to turn the expression into an lvalue:

while(!*(volatile int*)&flag);

Or even better, use a volatile alias of the non-volatile variable.
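A sketch of what I mean by an alias, reusing the same flag variable:

volatile int* const vflag = &flag;   /* volatile view of the non-volatile variable */
while(!*vflag);                      /* every read now goes through a volatile lvalue */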

Volatile Ordering
Consider the following scenario: one thread copies data into a buffer and sets a flag when it's done; a second thread spins on the flag and then reads the data from the buffer:

char buffer[50];
volatile int done = 0;
void thread1() {
  for(int i = 0; i < _countof(buffer); ++i) 
    buffer[i] = next();
  done = 1;
}
void thread2() {
  while(!done) cpu_relax();
  for(int i = 0; i < _countof(buffer); ++i)
    process(buffer[i]);
}

Because done is volatile, the compiler will not cache it in a register, so the while loop will exit. But the code is still incorrect: done = 1 can be moved by the compiler before the for loop. This is because buffer is not volatile, and two consecutive accesses, one to a volatile variable and one to a non-volatile variable (or vice versa), can be reordered by the compiler. If such an optimization is performed, it will result in accessing an uninitialized buffer. This can be fixed by declaring buffer as volatile or by placing a compiler memory barrier:

void thread1() {
  for(int i = 0; i < _countof(buffer); ++i) 
    buffer[i] = next();
  KeMemoryBarrierWithoutFence();
  done = 1;
}

void thread2() {
  while(!done) cpu_relax();
  KeMemoryBarrierWithoutFence();
  for(int i = 0; i < _countof(buffer); ++i)
    process(buffer[i]);
}

KeMemoryBarrierWithoutFence prevents reordering performed by the compiler, but accesses to memory can still be reordered by the processor. The processor's reordering, in turn, can be suppressed with a hardware memory barrier (but that is a topic for another article… or several articles).
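As a side note, KeMemoryBarrierWithoutFence is a kernel-mode macro; in user-mode C++11 code the same compiler-only fence can be written like this (a sketch under that assumption):

#include <atomic>

inline void compiler_barrier()
{
    //restrains compiler reordering across this point; emits no fence instruction
    std::atomic_signal_fence(std::memory_order_seq_cst);
}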

Volatile and Atomicity
Accessing a volatile variable does not guarantee that the access will be atomic. Let's see what requirements must be met for the InterlockedIncrement function to be atomic:

The variable pointed to by the Addend parameter must be aligned on a 32-bit boundary; otherwise, this function will behave unpredictably on multiprocessor x86 systems and any non-x86 systems

So we will show that volatile does not guarantee alignment, and therefore does not guarantee atomicity. We will place a volatile variable in unaligned memory:

    #pragma pack(1)
    struct {
        char c;
        volatile long int v;
    } a ;
    #pragma pack()

    printf("%p", &a.v);

Output:

0x7fff2ba07971

It means that the volatile variable is not aligned to a 32-bit boundary, which will result in non-atomic accesses on x86 processors. Even without requesting special packing we can end up with a volatile variable in unaligned memory, e.g. by placing it at an arbitrary offset inside a buffer returned from malloc (the start of the buffer is suitably aligned, but interior offsets need not be).

Volatile C++
When classes were added to C (i.e. when C++ was created), some new rules about volatile had to be established. Volatile objects, volatile methods, function overloads with volatile parameters, volatile copy constructors, volatile partial template specializations… It's really funny to observe how C++ amplifies the complexity of C features. But in the case of volatile, which is hard to use correctly even in plain C, the level of complication in C++ becomes insane. It is tempting to just ignore the topic and not use volatile in conjunction with C++ features. Unfortunately, the new C++ standard comes with a bunch of generic classes that use volatile for multithreading. This means that if you want to understand and correctly use the C++11 threading library, you will have to dig into the gory details of C++'s volatile. Stay tuned…

Thread Stood Still

Thread suspension is a technique commonly used by Application Monitors such as:

  • debuggers
  • stop-the-world garbage collectors
  • security tools

Besides being useful from a technical point of view, having the ability to stop and resume unsuspecting threads brings me joy. Maybe I'm becoming a control freak. Or just spending too much time in front of my computer.

Anyway, let's cut the crap and see what thread-suspension options we have on different operating systems:

  • Windows: WinAPI functions SuspendThread / ResumeThread
  • Unix: pthread_suspend / pthread_resume (the name depends on the particular system)
  • Linux: pthread_suspend_np / pthread_resume_np

The _np suffix in the Linux functions doesn't look good: it means they are non-portable and, according to the documentation, available only on RTLinux systems. If we want the ability to suspend threads on other Linuxes, we have to do it ourselves. And we do want this ability.

General idea
To implement thread suspension from user mode we have to somehow break into the thread's execution. Fortunately this is exactly what the signal mechanism provides. The general scheme looks like this:

  1. Send a signal to the victim thread
  2. In the signal handler, perform a blocking operation
  3. On resume, notify the blocking mechanism and exit the signal handler, which resumes the thread's execution

To make this mechanism safe we have to carefully choose which operations we perform inside the signal handler and which signals we block:

  • The signal used to suspend and resume will be SIGUSR1, which is left free for application use
  • Before the signal handler is executed, the signal mask of the victim thread has to be changed to block SIGUSR1. If we don't do this, multiple concurrent suspend/resume requests can cause deadlocks and races. This can be done by specifying a mask in the sigaction call.
  • The suspend function has to wait until the signal is delivered to the victim thread. If it returned asynchronously before the signal is delivered, it would be worthless as a synchronization mechanism and could trick the user into thinking the thread is already blocked (while in fact it may still be executing). Waiting for signal delivery can be done with a spinlock shared between the signal handler and the issuing thread.
  • The blocking operation in the signal handler has to be reentrant (that is, it has to be on the list of async-signal-safe functions given in man 7 signal). It also has to atomically change the signal mask when entering (so that SIGUSR1 is no longer blocked) and restore it when exiting. Fortunately such a function exists, and it is sigsuspend. It blocks until the specified signal is delivered and temporarily replaces the current signal mask.

Implementation
The code is available on GitHub: https://github.com/bit-envoy/threadmgmt

#pragma once
#include <pthread.h>

#define USED_SIG SIGUSR1

int thread_mgmt_init(void);

int thread_mgmt_release(void);

int thread_mgmt_suspend(pthread_t t);

int thread_mgmt_resume(pthread_t t);

/* implementation file */
#define _GNU_SOURCE   /* pthread_sigqueue is a GNU extension */
#include "threadmgmt.h"
#include <signal.h>
#include <string.h>
#include <pthread.h>

#define CPU_RELAX() asm("pause")
#define SMP_WB()
#define SMP_RB()

typedef struct thread_mgmt_op
{
    int op;
    volatile int done;
    volatile int res;
}thread_mgmt_op_t;

struct sigaction old_sigusr1;
static __thread volatile int thread_state = 1;

static void thread_mgmt_handler(int, siginfo_t*, void*);
static int thread_mgmt_send_op(pthread_t, int);


int thread_mgmt_init(void)
{
  struct sigaction sa;
  memset(&sa, 0, sizeof(sa));
  sa.sa_sigaction = thread_mgmt_handler;
  sa.sa_flags = SA_SIGINFO;
  sigfillset(&sa.sa_mask);
  //register the signal handler, which runs with a full signal mask,
  //and remember the old disposition so thread_mgmt_release can restore it
  return sigaction(USED_SIG, &sa, &old_sigusr1);
}

int thread_mgmt_release()
{
    //restore previous signal handler
    return sigaction(USED_SIG, &old_sigusr1, NULL);
}

int thread_mgmt_suspend(pthread_t t)
{
    return thread_mgmt_send_op(t, 0);
}

int thread_mgmt_resume(pthread_t t)
{    
    return thread_mgmt_send_op(t, 1);
}

static int thread_mgmt_send_op(pthread_t t, int opnum)
{
    thread_mgmt_op_t op = {.op = opnum, .done = 0, .res = 0};
    sigval_t val = {.sival_ptr = &op};
    if(pthread_sigqueue(t, USED_SIG, val))
        return -1;
    
    //spin wait till signal is delivered
    while(!op.done) 
        CPU_RELAX();

    SMP_RB();
    return op.res;
}

static void thread_mgmt_handler(int signum, siginfo_t* info, void* ctx)
{
    thread_mgmt_op_t *op = (thread_mgmt_op_t*)(info->si_value.sival_ptr);
    if(op->op == 0 && thread_state == 1)
    {
        //suspend
        thread_state = 0;
        op->res = 0;
        SMP_WB();
        op->done = 1;
        
        sigset_t mask;
        sigfillset(&mask);
        sigdelset(&mask, USED_SIG);

        //wait till SIGUSR1
        sigsuspend(&mask);
    }
    else if(op->op == 1 && thread_state == 0)
    {
        //resume
        thread_state = 1;
        op->res = 0;
        SMP_WB();
        op->done = 1;
    }
    else
    {
        //resume resumed thread or
        //suspend suspended thread
        op->res = -1;
        SMP_WB();
        op->done = 1;
    }
}

/* test program */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>   /* sleep() */
#include <pthread.h>

#include "threadmgmt.h"

void*  function(void*  arg)
{
    int i = 0;
    while(1) printf("thread(%ld) %d\n", (long)arg, i++);
    return 0;
}

int main( void )
{
    if(thread_mgmt_init())
        return -1;
    
   pthread_t t1, t2;
   pthread_create(&t1, NULL, function, (void*)1);
   pthread_create(&t2, NULL, function, (void*)2);

   
   sleep(1);
   if(thread_mgmt_suspend(t1))
       return -2;

   if(thread_mgmt_suspend(t2))
       return -2;
   
   sleep(2);
   if(thread_mgmt_resume(t1))
       return -3;

   if(thread_mgmt_resume(t2))
       return -3;
   
   sleep(4);
   thread_mgmt_release();
   
   return 0;
}
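As a usage note, the header, the implementation and the test program above are meant to live in separate source files; compiling the two C files together with gcc and linking with -pthread is enough to build the test (pthread_sigqueue is a GNU extension, hence the _GNU_SOURCE define at the top of the implementation).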

Additional notes

  • On processors that can reorder writes and reads to different addresses (e.g. some ARMs), the SMP_WB and SMP_RB macros should be defined with proper memory barrier instructions. On x86 and x64 the macros are empty because those processors do not reorder in the cases described here.
  • A disadvantage of using signals to implement suspension is that signals interrupt blocking operations like sleep(), so threads that may be suspended should be prepared for that. If you know a better way to implement thread suspension on Linux, feel free to let me know. The sad truth about thread suspension is that it should be implemented in the kernel (as in Windows and some Unixes), not hacked together in user mode.

Final word
Suspending threads is a handy mechanism in many situations, but it can also be dangerous. If a thread is suspended while holding a lock, and the issuing thread then tries to acquire the same lock, it will deadlock. This effect can be observed in the test application: when a thread is suspended during printf (which acquires locks) and another thread tries to printf something, it hangs on the same lock. You have to know what you are doing when you use it in monitoring / instrumentation scenarios, where you don't have full control over the monitored thread's code.

Adventures in Parallel Universe

Writing multithreaded programs is a challenging task, both in terms of programming and of debugging. Let's see what traps await the lone rider entering the Parallel Universe.

More is less
Psychological research shows that people become unhappy when they are overloaded with too many possibilities. Well, if this is true, then analyzing a concurrent program could cause a serious nervous breakdown in unprepared minds. The number of possible executions grows exponentially because of thread scheduler non-determinism. Let's assume that we operate in a sequentially consistent model (e.g. execution on a single core) and we have only two threads executing instructions:

T_1: a_{1};a_{2};...;a_{n}; \\ T_2: b_{1};b_{2};...;b_{m};

The number of possible interleaved executions is equal to the number of permutations with repetition (the number of ways to merge the two instruction sequences):

\frac{(m+n)!}{m!n!}

Assuming that m = n

\frac{(n+n)!}{n!\,n!}\sim\frac{4^n}{\sqrt{\pi n}}

which means exponential growth of the number of possibilities with the number of instructions.

Even if there are only 2 threads with 3 instructions per thread, there are 20 possible cases to analyze (enumerate them yourself if you have too much time). This is enough to make your life miserable, yet real-world programs are much more complicated than this. It follows that your mind is physically incapable of analyzing that many possibilities.
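Plugging the numbers into the formula above (2 threads, 3 instructions each):

\frac{(3+3)!}{3!\,3!} = \frac{720}{36} = 20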

To ensure correctness we have to use locks, which also reduce the number of possible execution scenarios (because several instructions become atomic and can be viewed as a single instruction). But locks also introduce new opportunities to break your program (like deadlocks or priority inversions) and to lose performance.

Your friend becomes your enemy
In single-threaded programming the compiler guarantees that even after optimizations your program will behave the same as before the optimizations. This is no longer the case in multithreaded programming. The compiler is not able to analyze all instruction interleavings when multiple threads are present, but it will still optimize the program as if it were a single-threaded application. It means that you are now fighting against the compiler to suppress certain optimizations.

The simplest optimization that can break things is dead code elimination. Consider the following code:

#include <cstdio>
#include <pthread.h>
#include <unistd.h>

bool flag = false;

void* thread_func(void*)
{
	while(!flag); //spin
	pthread_exit(0);
}

int main()
{
	pthread_t t;
	pthread_create(&t, 0, thread_func, 0);
	sleep(1);
	flag = true;

	pthread_join(t, 0);
	return 0;
}

The new thread waits for flag to become true and exits. After that, the main thread terminates. But with optimization -O3 the program runs forever. What happened? The disassembly of thread_func is the following:

_Z11thread_funcPv()
_Z11thread_funcPv+20: jmp    0x400744 <_Z11thread_funcPv+20>

It is literally an infinite loop. The compiler assumed that the program is single-threaded, so flag = true can never happen during thread_func's execution. Therefore the code can be transformed into an endless loop (notice that even the call to pthread_exit has been wiped out as dead code).

Another optimization that can be deadly is reordering of variable stores and loads. The compiler may reorganize reads and writes of independent variables to squeeze out more performance. Of course, the assumption about single-threaded execution still holds, so the compiler will not bother about dependencies between multiple threads. If such dependencies exist, this will likely result in erroneous execution.

In C/C++ both optimizations can be found relatively easily by looking at disassembly listings, and fought back with proper use of the volatile keyword (e.g. in the previous program, declaring flag as volatile bool fixes it). But if you think you have defeated the final boss, you are unfortunately wrong…

The Grey Eminence of all optimizations
The processor. He has the last word on when and in what order variables will be stored in memory. The rules are similar to the compiler's: if the program executes on a single core, optimizations performed by the CPU do not affect the results of execution (except performance). But when the program executes on multiple cores and each core can reorder instructions in its pipeline, you can watch things break.

To fight back against processor reordering, you first have to know when reordering may take place. The rules of memory operation ordering are written in the processor specification. Intel has a relatively strong consistency model, which means there are not many cases in which it can reorder memory operations. The specification says:

Reads may be reordered with older writes to different locations but not with older writes to the same location.

This can break some algorithms, like Dekker's or Peterson's mutual exclusion, while other constructs, like the Double-Checked Locking pattern, may still work correctly. Other processors, like the Alpha, are not so forgiving and almost any memory operations can be reordered (which means the Double-Checked Locking pattern won't work without explicit memory barriers).
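The reordering Intel does allow is exactly what breaks Dekker-style algorithms: a store followed by a load from a different location. A minimal sketch (my own example, not taken from the programs above; a full barrier such as _mm_mfence() between the store and the load on both sides rules the bad outcome out):

volatile int x = 0, y = 0;
int r1, r2;

void thread1() { x = 1; r1 = y; }
void thread2() { y = 1; r2 = x; }

//if both threads run concurrently from the initial state, r1 == 0 && r2 == 0
//is a possible outcome on x86, because each load may be reordered before the
//preceding store to a different location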

If you already know where reordering can cause problems, you can suppress it with a proper memory barrier instruction. In contrast with compiler optimizations, you are not able to see whether a processor runtime optimization took place; you can only see its effects. And fighting an invisible opponent can be pretty damn hard.

In the next parts of Adventures in Parallel Universe we will explore some of the concepts presented here more deeply, along with new material.

Complication overload

Every kid knows that string processing in C is a mundane job and it is pretty easy to make a mistake. That's why in C++ we have the std::string class, right? Now we can easily concatenate strings, do other stuff, concatenate strings… Concatenate strings. Let's do some concatenation.

#include <string>
#include <cstdio>
int main(int argc, char** argv)
{
 std::string text = "current app is: \"";
 text = text + argv[0] + '"';
 puts(text.c_str());
 return 0;
}

Output

current app is: "./prog"

So far everything is ok, but there is a small inefficiency: instead of text = text + … we can use text += …, which saves an extra reallocation. Besides, concatenation with std::string is easy, so what could possibly go wrong?

#include <string>
#include <cstdio>
int main(int argc, char** argv)
{
 std::string text = "current app is: \"";
 //text = text + argv[0] + '"';
 text += argv[0] + '"';
 puts(text.c_str());
 return 0;
}

Output

current app is: "r/local/bin:/usr/bin:/bin

What just happened? My application surely does not have this fancy name…
Let's take a closer look at the line:

text += argv[0] + '"';

It looks quite intuitive and harmless, and it should be the same as text = text + argv[0] + '"', which worked in the previous case. Well, it actually causes a problem.

The expression '"' is a character literal with the value 0x22 (the ASCII code of the double quote). The right-hand side is evaluated first, so we get the expression argv[0] + '"', which is equivalent to argv[0] + 0x22. That result is what finally gets appended to text. Due to pointer arithmetic, argv[0] + 0x22 is just a pointer that points past the end of argv[0], so we end up reading some random memory.
The expression text = text + argv[0] + '"' is different, because text + argv[0] is evaluated first and yields a std::string. Then operator+ with '"' is applied to that string, which gives us the correct result.
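A minimal fix is to make sure the std::string overload gets involved before the character literal meets the char* (a sketch of two equivalent options):

 text += argv[0];
 text += '"';
 //or, forcing a std::string first:
 text += std::string(argv[0]) + '"';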

As you can see, the expressions a = a + b and a += b do not necessarily give the same results, and not only because of performance.