Parallel Counters

Suppose you are writing a concurrent structure (like a thread-safe hash map) and you need to update and query the number of its elements. Or you must count events that occur concurrently in your system (like tracking the number of function calls). It sounds easy, but the concurrent nature of the problem complicates it a little. How do you write thread-safe counters in the most efficient way?
Let’s see what options we have:

  1. Use locks to achieve thread safety
  2. Use atomic operations to achieve thread safety
  3. Give each thread separate counter and sum all counters to compute final value

Everyone knows that locks are slow and won’t be the best solution for such a simple problem, so the first option is out.
Atomic operations are easy to implement and should be faster than heavyweight locks. Most people will choose this option.
But what about option 3? It requires more effort (because we have to keep data per thread) but does not require synchronization at all! Can we gain additional performance by choosing this option?

Three Counters
Let’s implement each option and compare performance results. The code for all three counters is pretty straightforward. I will use OpenMP to simplify the implementation even more.
Option 1:

class CLockedEventCounter
{
public:
	CLockedEventCounter() : m_Count(0) { }

	void Event()
	{
		#pragma omp critical (countLock) 
		m_Count++;
	}

	uint64_t GetCount() const
	{
		uint64_t res;
		#pragma omp critical (countLock) 
		res = m_Count;
		return res;
	}

private:
	uint64_t m_Count;
};
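
If you are not using OpenMP, the same locked counter can be expressed with C++11 primitives; here is a minimal sketch using std::mutex (not part of the benchmark):

#include <cstdint>
#include <mutex>

class CMutexEventCounter
{
public:
	CMutexEventCounter() : m_Count(0) { }

	void Event()
	{
		std::lock_guard<std::mutex> lock(m_Mutex);
		m_Count++;
	}

	uint64_t GetCount() const
	{
		std::lock_guard<std::mutex> lock(m_Mutex);
		return m_Count;
	}

private:
	mutable std::mutex m_Mutex;	// mutable so the const GetCount() can lock it
	uint64_t m_Count;
};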

Option 2:

class CInterlockedEventCounter
{
public:
	CInterlockedEventCounter() : m_Count(0) { }

	void Event()
	{
		_InterlockedIncrement64(&m_Count);	// 64-bit counter needs the 64-bit intrinsic
	}

	uint64_t GetCount() const
	{
		return m_Count;
	}

private:
	volatile int64_t m_Count;	// _InterlockedIncrement64 expects a signed 64-bit operand
};

Option 3:

class CParallelEventCounter
{
public:
	CParallelEventCounter(uint32_t count)
	{
		m_Count.resize(count);
	}

	void Event(size_t id)
	{
		m_Count[id].count++;
	}

	uint64_t GetCount() const
	{
		uint64_t res = 0;
		for(size_t i = 0; i < m_Count.size(); ++i)
			res += m_Count[i].count;
		return res;
	}
private:
	struct Counter
	{
		Counter() : count(0) { }
		uint64_t count;
		//Pad to avoid false sharing
		char pad[512];
	};
	std::vector<Counter> m_Count;
};

In the Parallel Counter each thread accesses only its own data, so there is no need for synchronization or atomic operations. But will it improve performance?
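
A note on the padding: the 512-byte pad is a blunt but portable way to keep each counter on its own cache line. With C++11 you could instead over-align each slot; a minimal sketch, assuming 64-byte cache lines:

struct alignas(64) Counter
{
	Counter() : count(0) { }
	uint64_t count;	// alignas pads each Counter out to a full 64-byte cache line
};

One caveat: before C++17, dynamic allocation (including std::vector’s) is not guaranteed to honor over-alignment, so the explicit pad is the safer portable choice.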

Performance Test
In the test I will simulate parallel events and call the Event() function from multiple threads. We will see how long it takes to count 300,000,000 generated events. We will also vary the number of threads to see how well each solution scales.
First, a comparison between the Interlocked Counter and the Parallel Counter:

[Chart: execution time vs. number of threads – Interlocked Counter vs. Parallel Counter]

Without a doubt the Parallel Counter wins. It is about 12 times faster than the Interlocked Counter, and it scales much better with a growing number of threads.
To get the whole picture, let’s see how the lock-based counter performs (represented by the red line):

[Chart: execution time vs. number of threads – lock-based counter added for comparison]

As you can see, this solution is a real performance killer – it is about 80x slower than the previous methods!
The full code of the benchmark follows:

#include "stdafx.h"
#include <omp.h>
#include <vector>
#include <string>
#include <cstdint>
#include <chrono>
#include <Windows.h>

class CLockedEventCounter
{
public:
	CLockedEventCounter() : m_Count(0) { }

	void Event()
	{
		#pragma omp critical (countLock) 
		m_Count++;
	}

	uint64_t GetCount() const
	{
		uint64_t res;
		#pragma omp critical (countLock) 
		res = m_Count;
		return res;
	}

private:
	uint64_t m_Count;
};

class CInterlockedEventCounter
{
public:
	CInterlockedEventCounter() : m_Count(0) { }

	void Event()
	{
		_InterlockedIncrement64(&m_Count);	// 64-bit counter needs the 64-bit intrinsic
	}

	uint64_t GetCount() const
	{
		return m_Count;
	}

private:
	volatile int64_t m_Count;	// _InterlockedIncrement64 expects a signed 64-bit operand
};

class CParallelEventCounter
{
public:
	CParallelEventCounter(uint32_t count)
	{
		m_Count.resize(count);
	}

	void Event(size_t id)
	{
		m_Count[id].count++;
	}

	uint64_t GetCount() const
	{
		uint64_t res = 0;
		for(size_t i = 0; i < m_Count.size(); ++i)
			res += m_Count[i].count;
		return res;
	}
private:
	struct Counter
	{
		Counter() : count(0) { }
		uint64_t count;
		//Pad to avoid false sharing
		char pad[512];
	};
	std::vector<Counter> m_Count;
};

int main(int argc, char* argv[])
{
	if(argc < 4)
	{
		fprintf(stderr, "To small number of args\n");
		return -1;
	}

	uint32_t threadNum = std::stoul(argv[1]);
	int64_t countTo    = std::stoll(argv[2]);
	uint32_t algo      = std::stoul(argv[3]);


	CLockedEventCounter lockedEventCounter;
	CInterlockedEventCounter interlockedEventCounter;
	CParallelEventCounter parallelEventCounter(threadNum);

	
	typedef std::chrono::high_resolution_clock clock;
	typedef std::chrono::milliseconds milliseconds;

	clock::time_point t0 = clock::now();

	#pragma omp parallel for num_threads(threadNum)
	for(int64_t i = 0; i < countTo; ++i)
	{
		if(algo == 0) lockedEventCounter.Event();
		else if(algo == 1) interlockedEventCounter.Event();
		else if(algo == 2) parallelEventCounter.Event(omp_get_thread_num());
	}
	
	uint64_t count = 0;
	if(algo == 0) count = lockedEventCounter.GetCount();
	else if(algo == 1) count = interlockedEventCounter.GetCount();
	else if(algo == 2) count = parallelEventCounter.GetCount();

	clock::time_point t1 = clock::now();
	milliseconds totalms = std::chrono::duration_cast<milliseconds>(t1 - t0);

	if(count != static_cast<uint64_t>(countTo))
	{
		fprintf(stderr, "counter mismatch (%llx) != (%llx)\n", count, static_cast<uint64_t>(countTo));
		return -1;
	}

	printf("%u,%llu\n", threadNum, totalms);
	return 0;
}
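
The program takes the thread count, the number of events and the algorithm index as arguments. Assuming the binary is named counters.exe (the name is hypothetical), running the parallel counter on 4 threads would look like:

counters.exe 4 300000000 2

and prints a CSV line of the form <threads>,<milliseconds>, which makes it easy to collect data points for the charts above.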

Real world case
But this is only a benchmark. Let’s see how counting is done in a real-life system that has to deal with this issue. A good example is Wireshark – it has to count, in real time, the huge number of incoming and outgoing packets that flood your network card. In the Windows version the counters are implemented in the WinPcap driver. The creators of WinPcap also chose parallel counters – but with a neat modification that is not available in user mode.
In the Windows kernel it is possible to preempt the thread scheduler and make sure that the execution of your code won’t be rescheduled to a different processor. This is done by raising the IRQL to DISPATCH_LEVEL (e.g. by calling the KeRaiseIrql function). With this possibility we can modify the counting scenario:

  1. When a packet arrives, raise the IRQL to DISPATCH_LEVEL
  2. Get the id of the current logical processor
  3. Increment the counter for the current logical processor with a regular non-atomic operation
  4. Lower the IRQL

To compute the final value, just sum over the array of counters.
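
A minimal sketch of those four steps, assuming a Windows driver context (KeRaiseIrql, KeGetCurrentProcessorNumber and KeLowerIrql are real kernel APIs; the counters array is hypothetical, sized to the number of logical processors at init time):

KIRQL oldIrql;
KeRaiseIrql(DISPATCH_LEVEL, &oldIrql);     // 1. from here on we cannot be moved to another CPU
ULONG cpu = KeGetCurrentProcessorNumber(); // 2. which logical processor are we on?
counters[cpu]++;                           // 3. plain non-atomic increment of our own slot
KeLowerIrql(oldIrql);                      // 4. restore the previous IRQL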

A snippet from the WinPcap code:
Incrementing:

Cpu = KeGetCurrentProcessorNumber();
LocalData = &Open->CpuData[Cpu];
LocalData->Received++;

And the computation of the final result:

for(i = 0 ; i < g_NCpu ; i++)
{
	pStats[3] += Open->CpuData[i].Accepted;
	pStats[0] += Open->CpuData[i].Received;
	pStats[1] += Open->CpuData[i].Dropped;
	pStats[2] += 0;		// Not yet supported
}

The full code of WinPcap is available here: http://www.winpcap.org/devel.htm

Why so slow?
As you can see, writing to shared locations from multiple cores can drain your performance drastically. The question is: why is this such a big deal? Couldn’t writes be faster? The answer lies in the cache coherency protocols implemented in modern processors. Most of them (like the MOESI protocol implemented in AMD’s x86 processors) guarantee fast reads but slow, non-scalable writes to shared locations. Why that is so lies in the details of how cache coherency is achieved. But for the moment one thing should be clear – to implement fast, scalable applications we have to use data structures and algorithms that avoid writes to shared memory regions.


Volatile: Dark Corners of C

The volatile keyword is an interesting quirk. It is used to fight back compiler optimizations when accessing memory that could be modified concurrently and unexpectedly. This includes:

  • communication with hardware via memory mapped I/O
  • access to system structures
  • access to program variables modified by multiple threads

I described the general purpose of the volatile keyword in Adventures in Parallel Universe. Today we will dive more deeply into the concept and look at some confusing and counter-intuitive cases that can arise when using volatile.

Volatile Pointers
Assume you have a lock-free list with a pointer to its tail. The first thread atomically exchanges the tail with a pointer to a new node; the second waits until the tail is not NULL.

struct Node {
  struct Node* prev;
  int elem;
} *tail;

/* oldtail/newtail are prepared elsewhere; cpu_relax() is a spin-wait
   hint (e.g. the x86 pause instruction) */
void thread1() {
  while(!__sync_bool_compare_and_swap(&tail, oldtail, newtail))
    cpu_relax();
}

void thread2() {
  while(!tail) cpu_relax();
}

Because we are spinning on tail in thread2, tail has to be volatile to prevent compiler optimizations. We have three variants of a single volatile pointer:

  1. volatile Node* tail;
  2. Node* volatile tail;
  3. volatile Node* volatile tail;

Which one should we choose?

Translating them into English:

  1. A pointer to volatile memory, which means that dereferences of the pointer will not be optimized by the compiler. Example: while(tail->prev) will stay, but while(tail) could be wiped out.
  2. A volatile pointer to memory, which means that accesses to the pointer value will not be optimized. Example: while(tail) will stay, but while(tail->prev) could be optimized.
  3. A volatile pointer to volatile memory, which is a combination of the previous two. Example: neither while(tail) nor while(tail->prev) will be optimized.

So in our case we can choose option 2 or 3. Option 3 is less desirable because the presented code never dereferences the memory tail points to, and the additional volatile only prevents the compiler from performing optimizations.
Of course multi-level volatile pointers also exist – the rules are the same as with a single volatile pointer.
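
To make the difference concrete, here are the three variants side by side (the names are made up for this sketch), with comments marking which spin loop each one keeps alive:

volatile struct Node* tail_a;          /* 1. while(tail_a->prev) preserved, while(tail_a) may be hoisted */
struct Node* volatile tail_b;          /* 2. while(tail_b) preserved, while(tail_b->prev) may be optimized */
volatile struct Node* volatile tail_c; /* 3. both loops preserved */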

Const Volatile
We can declare a variable as const and volatile at the same time. This can be confusing, because const tells us that the variable is constant (so it cannot be changed) while volatile tells us that the variable can change. As an example, let’s take code that communicates with an external device via memory:

volatile int* p = (volatile int*)0xC0000000;
while(*p) cpu_relax();

In this code we only read the memory at address 0xC0000000 – we never write to it. And as we know, variables that are read-only should be marked const to prevent accidental modification:

const volatile int* p = (volatile int*)0xC0000000;
while(*p) cpu_relax();

const volatile int* p means that the memory pointed to by p can be changed by the “outside world” (like hardware) but cannot be changed through a dereference of p (such an attempt will result in a compilation error).
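
To see the protection in action, here is the same pointer with a write added (the write is hypothetical and exists only to show the error):

const volatile int* p = (const volatile int*)0xC0000000;
int value = *p;	/* OK: reads are allowed and never optimized away */
*p = 1;        	/* compilation error: assignment of read-only location */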

Volatile Casts 1
In Windows there is a function for atomically incrementing a variable:

LONG __cdecl InterlockedIncrement(
  _Inout_  LONG volatile *Addend
);
...
long var;
InterlockedIncrement(&var);

Can we pass a non-volatile variable to the InterlockedIncrement function?
The answer is yes – in this situation we perform an implicit cast from non-volatile memory to volatile memory. It simply means that InterlockedIncrement makes fewer assumptions about the passed argument and expects that it can be changed by the outside world.

What about the opposite situation: can we pass volatile memory to a function that does not expect a volatile argument (e.g. to memcpy)?

void * memcpy ( void * destination, const void * source, size_t num );
...
volatile char *dst = (volatile char*)0xC0000000;
char src[40];
memcpy((void*)dst, src, sizeof(src));

In this situation what we get is undefined behavior. The C standard says:

If an attempt is made to refer to an object defined with a volatile-qualified type through use of an lvalue with non-volatile-qualified type, the behavior is undefined.

We have to use a memcpy implementation that handles volatile memory, or write our own.
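
A minimal sketch of such an implementation (the name volatile_memcpy is made up): copy byte by byte through volatile-qualified pointers, so every access actually happens:

#include <stddef.h>

volatile void* volatile_memcpy(volatile void* dst, const volatile void* src, size_t num)
{
	volatile char* d = (volatile char*)dst;
	const volatile char* s = (const volatile char*)src;
	while(num--)
		*d++ = *s++;	/* each byte access is a volatile access */
	return dst;
}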

Volatile Casts 2
Assume we have legacy code which declares a variable as non-volatile, but we have to use it in a multithreaded context:

#include <pthread.h>
#include <unistd.h>

int flag = 0;

void* thread_func(void* arg) {
    while(!(volatile int)flag);
    return 0;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, 0, thread_func, 0);
    sleep(1);
    *(volatile int*)&flag = 1;
    pthread_join(t, 0);
    return 0;
}

We have cast flag to volatile, so everything should be fine and the program should exit. Instead, the program goes into an infinite loop. In the disassembly listing you can see that the while loop was transformed into an infinite loop:

thread_func()
thread_func+38: jmp    0x400776 <thread_func+38>

The C standard says:

The properties associated with qualified types [in our case volatile] are meaningful only for expressions that are lvalues

And the expression (volatile int)flag is not an lvalue (if you try to assign a value to such an expression, you will get a compilation error). This means that the expressions
while(flag)
and
while((volatile int)flag)
are exactly the same, and your volatile cast was flushed down the toilet. Nobody expects the Spanish Inquisition.
To fix it we have to transform the expression into an lvalue:

while(!*(volatile int*)&flag);

Or even better – use a volatile alias of the non-volatile variable.
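
The alias variant could look like this (a sketch of the same thread_func): take the address once into a pointer to volatile and spin through that:

void* thread_func(void* arg) {
    volatile int* vflag = &flag;	/* volatile alias of the non-volatile flag */
    while(!*vflag);              	/* every read goes through a volatile lvalue */
    return 0;
}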

Volatile Ordering
Consider the following scenario: one thread copies data to a buffer and sets a flag when it’s done; a second thread spins on the flag and then reads the data from the buffer:

// next() produces the data and process() consumes it (definitions omitted)
char buffer[50];
volatile int done = 0;
void thread1() {
  for(int i = 0; i < _countof(buffer); ++i) 
    buffer[i] = next();
  done = 1;
}
void thread2() {
  while(!done) cpu_relax();
  for(int i = 0; i < _countof(buffer); ++i)
    process(buffer[i]);
}

Because the flag is volatile, the compiler will not cache it in a register, so the while loop will exit. But the code is still incorrect: done = 1 can be moved by the compiler before the for loop. This is because buffer is not volatile, and two consecutive accesses – the first to a volatile variable and the second to a non-volatile variable (or vice versa) – can be reordered by the compiler. If such an optimization is performed, it will result in accessing an uninitialized buffer. This can be fixed by declaring buffer volatile or by placing a compiler memory barrier:

void thread1() {
  for(int i = 0; i < _countof(buffer); ++i) 
    buffer[i] = next();
  KeMemoryBarrierWithoutFence();
  done = 1;
}

void thread2() {
  while(!done) cpu_relax();
  KeMemoryBarrierWithoutFence();
  for(int i = 0; i < _countof(buffer); ++i)
    process(buffer[i]);
}

KeMemoryBarrierWithoutFence prevents reordering performed by the compiler, but accesses to memory can still be reordered by the processor. The processor’s reordering, in turn, can be suppressed with a hardware memory barrier (but that is a topic for another article… or several articles).

Volatile and Atomicity
Accessing a volatile variable does not guarantee that the access will be atomic. Let’s see what requirements must be met for the InterlockedIncrement function to be atomic:

The variable pointed to by the Addend parameter must be aligned on a 32-bit boundary; otherwise, this function will behave unpredictably on multiprocessor x86 systems and any non-x86 systems

So we will show that volatile does not guarantee alignment, and therefore does not guarantee atomicity. We will place a volatile variable at an unaligned address:

#pragma pack(1)
struct {
	char c;
	volatile long int v;	/* squeezed right after the char, so v is misaligned */
} a;
#pragma pack()

printf("%p\n", (void*)&a.v);

Output:

0x7fff2ba07971

It means that the volatile variable is not aligned to a 32-bit boundary, which will result in non-atomic accesses on x86 processors. Even without requesting special packing we can end up with a volatile variable at an unaligned address – e.g. by placing it at an odd byte offset inside a buffer returned from malloc (malloc aligns only the start of the allocation, not arbitrary offsets within it):
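
For example (the buffer and the offset are made up for illustration):

char* buf = malloc(64);	/* only the start of the allocation is aligned */
volatile long* v = (volatile long*)(buf + 1);	/* v is misaligned by one byte */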

Volatile C++
When classes were added to C (i.e. when C++ was created), some new rules about volatile had to be established. Volatile objects, volatile methods, function overloads with volatile parameters, volatile copy constructors, volatile partial template specializations… It’s really funny to observe how C++ amplifies the complexity of C features. But in the case of volatile, which is hard to use correctly even in plain C, the level of complication in C++ becomes insane. It is tempting to just ignore the topic and not use volatile in conjunction with C++ at all. Unfortunately, the new C++ standard comes with a bunch of generic classes that use volatile for multithreading. This means that if you want to understand and correctly use the C++11 threading library, you will have to dig into the gory details of C++’s volatile. Stay tuned…