We recently had a nasty concurrency bug in our server software: a data corruption bug. I’ll try to share the experience of debugging it:
1. First of all, you’ll need to reproduce the bug. With concurrency bugs that’s not easy: you’ll have to write test scripts that try to trigger it. Ideally it should be a one-click script that reproduces the bug 100% of the time, or close to it. If you can’t reproduce it reliably, try more threads or fewer, more data, or running your code for longer or more times.
2. Localize the bug. Here you’ll need a good logging solution that can log events from concurrent threads. Tag each log entry with its thread; if the thread has some human-readable identifier, even better (we had ResoMail identifiers for each thread).
3. Review the code you’ve localized and look for any static or shared non-final data; ideally there should be none. Be careful with final objects that have internal state: the reference is final, but the object’s internal state might not be. If you’re in doubt, try unsharing the object, for example by changing it from static to dynamically created or by cloning it, and see how the code behaves.
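To make the three steps a bit more concrete, here is a minimal Java sketch. It is not our actual server code: the class RaceRepro, the Counter holder and the run method are hypothetical names made up for illustration. It assumes the bug is a lost update on an object whose reference is final but whose internal state is not. It starts many threads at once (step 1), tags its log output with a human-readable thread name (step 2), and lets you switch between a shared and an unshared object (step 3).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

class RaceRepro {

    /** Hypothetical holder: a reference to it can be final while its count is still mutable. */
    static final class Counter {
        private int count;
        void increment() { count++; } // read-modify-write, not atomic
        int get() { return count; }
    }

    /**
     * Runs `threads` workers, each incrementing a counter `iterations` times,
     * and returns the sum of all counters once every worker has finished.
     * share == true:  all workers mutate one counter (the suspected bug).
     * share == false: each worker gets its own counter (the "unsharing" experiment).
     */
    static int run(int threads, int iterations, boolean share) {
        final Counter shared = new Counter(); // final reference, mutable internal state
        final CountDownLatch start = new CountDownLatch(1);
        List<Thread> workers = new ArrayList<>();
        List<Counter> counters = new ArrayList<>();

        for (int i = 0; i < threads; i++) {
            final Counter counter = share ? shared : new Counter();
            if (!share || i == 0) {
                counters.add(counter); // count the shared instance only once
            }
            Thread t = new Thread(() -> {
                try {
                    start.await(); // release all workers together to increase contention
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                for (int n = 0; n < iterations; n++) {
                    counter.increment();
                }
                // Tag every log entry with the name of the thread that wrote it.
                System.out.println("[" + Thread.currentThread().getName()
                        + "] done, counter=" + counter.get());
            }, "worker-" + i); // human-readable thread identifier
            workers.add(t);
            t.start();
        }

        start.countDown();
        try {
            for (Thread t : workers) {
                t.join(); // join also makes the workers' writes visible to this thread
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }

        int total = 0;
        for (Counter c : counters) {
            total += c.get();
        }
        return total;
    }

    public static void main(String[] args) {
        int threads = 8;
        int iterations = 100_000;
        System.out.println("expected          = " + threads * iterations);
        System.out.println("shared counter    = " + run(threads, iterations, true));
        System.out.println("unshared counters = " + run(threads, iterations, false));
    }
}
```

The shared run will usually report fewer increments than expected, although a race is never guaranteed to show up on any single run; that is why the thread count and iteration count are parameters you can turn up. The unshared run always reports the full count, and that difference is the hint that the shared object is the culprit.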
Recommended reading: Java Concurrency in Practice