Page 1 of 1

Race condition in wand.c

Posted: 2009-03-18T06:37:07-07:00
by egb
I'm using Magick++ in a multithreaded C++ Program, and noticed a race condition between AcquireWandId and RelinquishWandId, that sometimes leads to a segfault:
  • Thread1 holds a Wand.
  • Thread2 wants to acquire one, enters AcquireWandId, and runs past the "wand_ids==NULL"-Check (its still valid).
  • context switch.
  • Thread1 releases his wand, and RelinquishWandId deallocates the SplayTree and the mutex, since this was the last wand in use.
  • Thread2 now runs AcquireSemaphoreInfo, thus creating a fresh semaphore, increments "i", and then calls AddValueToSplayTree with a null-pointer as argument.
I think AcquireWandId shouldn't unlock the mutex between the check for a valid "wand_ids" splay-tree, and the time it actually uses it:

Code: Select all

WandExport unsigned long AcquireWandId(void)
{
  static unsigned long
    id = 0;

  AcquireSemaphoreInfo(&wand_semaphore);

  if ((wand_ids == (SplayTreeInfo *) NULL) &&
      (instantiate_wand == MagickFalse))
    {
          wand_ids=NewSplayTree((int (*)(const void *,const void *)) NULL,
            (void *(*)(void *)) NULL,(void *(*)(void *)) NULL);
          instantiate_wand=MagickTrue;
    }
  id++;
  (void) AddValueToSplayTree(wand_ids,(const void *) id,(const void *) id);
  RelinquishSemaphoreInfo(wand_semaphore);
  return(id);
}
The version I'm using is 6.5.0.1.

Re: Race condition in wand.c

Posted: 2009-03-18T07:40:43-07:00
by magick
Let us know if this works in your threaded application:

Code: Select all

WandExport unsigned long AcquireWandId(void)
{
  static unsigned long
    id = 0;

  AcquireSemaphoreInfo(&wand_semaphore);
  if ((wand_ids == (SplayTreeInfo *) NULL) &&  (instantiate_wand == MagickFalse))
    {
      wand_ids=NewSplayTree((int (*)(const void *,const void *)) NULL,
        (void *(*)(void *)) NULL,(void *(*)(void *)) NULL);
      instantiate_wand=MagickTrue;
    }
  id++;
  (void) AddValueToSplayTree(wand_ids,(const void *) id,(const void *) id);
  RelinquishSemaphoreInfo(wand_semaphore);
  return(id);
}

Re: Race condition in wand.c

Posted: 2009-03-18T07:47:52-07:00
by egb
magick wrote:Let us know if this works in your threaded application:
The version you posted looks quite similar to my version in the initial post.

I tested that one, and couldn't reproduce the problem with it, so I think it solves the problem.
However there were still crashes, probably related to the image pixel cache. I'll post a seperate report once I have found the exact locations.

Thanks for the fast response,
/Ernst

Re: Race condition in wand.c

Posted: 2009-03-18T07:59:10-07:00
by magick
Doh! You're right, our code is exactly your code. We think alike.

We are interested in problems you may find in the threaded environment. We run a number of thread unit tests and they all pass but of course as you know threading problems are sometimes difficult to detect.

Re: Race condition in wand.c

Posted: 2009-03-18T10:42:43-07:00
by egb
Same file, another (very rare) Problem:

The end of RelinquishWandId reads:

Code: Select all

  
  RelinquishSemaphoreInfo(wand_semaphore);
  DestroySemaphoreInfo(&wand_semaphore);
Only once in several thousand test calls, a second thread managed to squeeze in between those calls. There it could Lock the wand_semaphore again, but when trying to unlock it, wand_semaphore was already a NULL Pointer.

I don't see an easy fix for that, a possible solution would require a change to the semantics of DestroySemaphoreInfo and quite some changes to the rest of the code:

DestroySemaphoreInfo takes a locked semaphore as argument, and would first set *semaphore_info=NULL and then pthread_unlock + pthread_mutex_destoy + free.
This way each semaphore_info would also be used protect its own pointer, and DestroySemaphoreInfo would work inverse to AcquireSemaphoreInfo, which always returns a locked semaphore.

Maybe there's an easier way to work arround this problem...

Re: Race condition in wand.c

Posted: 2009-03-18T11:22:24-07:00
by magick
Download http://magick.imagemagick.org/semaphore.c and change
  • RelinquishSemaphoreInfo(wand_semaphore);
    DestroySemaphoreInfo(&wand_semaphore);
to
  • DestroySemaphoreInfo(&wand_semaphore);
in wand/wand.c. Does that resolve the problem or is there still a race condition?

Re: Race condition in wand.c

Posted: 2009-03-18T15:34:09-07:00
by egb
magick wrote: Does that resolve the problem or is there still a race condition?
Have not tested yet, but will now.

At the first glance it looks like you changed the semaphore_info struct to always contain a reference count, and in DestroySemaphoreInfo you copy the semaphore_info pointer, null the original, and then unlock it as often as neccessary before deleting.
so, if the mutex was locked before by the same thread calling DestroySemaphoreInfo, the unlocking should work ok, but its unfortunatly still not safe against race conditions with AcquireSemaphoreInfo:
here the global semaphore_mutex is unlocked before the call to LockSemaphoreInfo, which then might bail out with semaphore_info==NULL...
Also, any other thread currently waiting on the semaphore_info (inside pthread_mutex_lock(&semaphore_info->mutex)) might relock the mutex in the same moment the last iteration of the unlocking-while-loop finishes, or have its semaphore deleted while its still waiting on it.

So it looks like I didn't think my initial proposal through to the end :(

Having LockSemaphoreInfo grab the global semaphore_mutex to work arround this seems like a very bad idea, you'd end up with some sort of "Big Kernel Lock", preventing most parallel execution or creating deadlocks.

Back to the drawing board:

The root cause of the problem are the calls to DestroySemaphoreInfo while there might be still threads using it.
So, would it be an option to simply NOT destroy the wand_semaphore each time the count of active wands reaches 0?
Deleting and re-creating the wand_ids SplayTree each time would still be possible, but depending on the application it might be faster to simply keep it alive as well.

On program termination, there'd now be a leak of a few bytes for the semaphore_info and an (hopefully empty) Splay Tree, as well as a pthread_mutex_t.

To clean them up, an approach like, for example, the DestroyPixelCacheResources function from magick/cache.c, called from MagickCoreTerminus could be used, or would the extra exported function break binary compatibility?

Re: Race condition in wand.c

Posted: 2009-03-18T17:47:31-07:00
by magick
We had considered your idea in an earlier version of MagickWand. We will call DestroyWandSemaphore() when MagickWandTerminus() is called. Look for a patch by sometime tomorrow.

Re: Race condition in wand.c

Posted: 2009-03-19T06:08:21-07:00
by magick
Do you call MagickWandGenesis() in your program? It acquires a wand id which ensures the wand id list is not destroyed until MagickWandTerminus() is called. Give that a try or download ImageMagick 6.5.0-3 Beta from ftp://ftp.imagemagick.org/pub/ImageMagick/beta where we removed the need to call MagickWandGenesis().

Re: Race condition in wand.c

Posted: 2009-03-19T07:46:11-07:00
by egb
magick wrote:Do you call MagickWandGenesis() in your program?
No, I was using the Magick++ Interface, where Magick::InitializeMagick only seems to call MagickCoreGenesis.