Behind the scenes
This article details some "behind the scenes" information on how LabVIEW works. The information may be incomplete, out of date, or just plain wrong, so don't take it as gospel.
How the execution system runs VIs
When a caller VI loads, it is "patched" with its subVIs. This means it holds onto a pointer that it can use to efficiently call that VI. The caller also allocates a parameter list for each call. When the call executes, the only thing the caller has to do is fill out this parameter list with the locations of its data and pass the pointer to this table to the subVI's code.
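The patching and parameter-list mechanism described above can be pictured with a minimal C sketch. Every name in it (`ParamList`, `SubVICode`, `CallSite`, `add_subvi`, and so on) is a hypothetical illustration and not a LabVIEW internal; it only shows the idea of resolving the callee pointer once at load time and then, for each call, filling a table with data locations and passing a single pointer to it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical parameter list: a table of pointers to the caller's data. */
typedef struct {
    void *slots[4];
    size_t count;
} ParamList;

/* Hypothetical signature of compiled subVI code: it receives only the table. */
typedef void (*SubVICode)(ParamList *params);

/* One call site in the caller. */
typedef struct {
    SubVICode code;   /* pointer resolved once, when the caller loads */
    ParamList params; /* allocated once per call site */
} CallSite;

/* Stand-in for a subVI: slots 0 and 1 are inputs, slot 2 is the output. */
void add_subvi(ParamList *p)
{
    assert(p->count == 3);
    const double *a = p->slots[0];
    const double *b = p->slots[1];
    double *sum = p->slots[2];
    *sum = *a + *b;
}

/* "Patching": done once at load time, not on every call. */
void patch_call_site(CallSite *site, SubVICode code, size_t count)
{
    site->code = code;
    site->params.count = count;
}

/* The call itself: fill in the data locations, then pass the table pointer. */
double call_add(CallSite *site, double a, double b)
{
    double sum = 0.0;
    site->params.slots[0] = &a;
    site->params.slots[1] = &b;
    site->params.slots[2] = &sum;
    site->code(&site->params);
    return sum;
}

/* Usage: patch once, then call as often as needed. */
double demo_direct_call(void)
{
    CallSite site;
    patch_call_site(&site, add_subvi, 3);
    return call_add(&site, 1.5, 2.5);
}
```

The point of the sketch is that the per-call work is only a few pointer stores followed by an indirect call; nothing is looked up or converted at call time.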
When you call a VI that is built into a DLL, the Call Library Function Node must first put all the parameters into the form expected by the DLL calling convention. This is different from LabVIEW's parameter list approach and typically means putting each parameter on the CPU's stack. The exported function in the LabVIEW-built DLL must first get a connection to the VI to execute. It accomplishes this by getting a VI reference from a cache of VI references created when the DLL loaded. Each VI reference has a parameter list, which must be filled in with the parameters for this call. It can then call the code. When the code is complete, there may be additional work to put any outputs into the form expected by the caller. The VI reference must then be returned to the cache.
So calling through a DLL involves more parameter manipulation, plus a couple of cache accesses that can add overhead and, with parallel calls, additional jitter. Even a call through a strict VI reference would be faster: you are in direct control of the VI reference, so no cache is involved, and the call puts the parameters directly into the parameter list the subVI expects, so there is less manipulation.
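The extra steps on the DLL path can be sketched in C as follows. This is an assumption-laden illustration, not the code LabVIEW generates: `VIRef`, `ref_cache`, `acquire_vi_ref`, `release_vi_ref`, and `ExportedAdd` are all invented names, and the cache is a plain single-threaded array. A real cache shared by parallel callers would need synchronization, which is where the additional jitter would come from.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical parameter list, as used by a direct subVI call. */
typedef struct {
    void *slots[4];
    size_t count;
} ParamList;

/* Hypothetical cached VI reference: each one carries its own parameter list. */
typedef struct {
    bool in_use;
    ParamList params;
} VIRef;

#define CACHE_SIZE 2
VIRef ref_cache[CACHE_SIZE]; /* stands in for the cache built when the DLL loads */

/* Cache access 1: find a free VI reference (no locking in this sketch). */
VIRef *acquire_vi_ref(void)
{
    for (size_t i = 0; i < CACHE_SIZE; i++) {
        if (!ref_cache[i].in_use) {
            ref_cache[i].in_use = true;
            return &ref_cache[i];
        }
    }
    return NULL; /* every reference is busy */
}

/* Cache access 2: hand the reference back. */
void release_vi_ref(VIRef *ref)
{
    assert(ref != NULL && ref->in_use);
    ref->in_use = false;
}

/* Stand-in for the VI's compiled code: slots 0 and 1 in, slot 2 out. */
void add_vi_code(ParamList *p)
{
    const double *a = p->slots[0];
    const double *b = p->slots[1];
    double *sum = p->slots[2];
    *sum = *a + *b;
}

/* Exported function: C calling convention outside, parameter list inside. */
int ExportedAdd(double a, double b, double *sum)
{
    VIRef *ref = acquire_vi_ref();
    if (ref == NULL) {
        return -1;
    }
    double result = 0.0;
    ref->params.slots[0] = &a;      /* repack the C arguments */
    ref->params.slots[1] = &b;
    ref->params.slots[2] = &result;
    ref->params.count = 3;
    add_vi_code(&ref->params);      /* run the VI */
    *sum = result;                  /* put the output in the caller's form */
    release_vi_ref(ref);
    return 0;
}

/* Usage: returns 1 when the call succeeded and produced the expected sum. */
int demo_dll_call(void)
{
    double sum = 0.0;
    int status = ExportedAdd(1.5, 2.5, &sum);
    return status == 0 && sum == 4.0;
}
```

Compared with a direct subVI call, every invocation here pays for two cache operations and for repacking the arguments in both directions.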
There are also several other limitations introduced by calling a VI through a DLL.
If you pass data to the DLL that contains handles, then you must make sure that the DLL is only ever called from VIs in the same version of LabVIEW. If you don't do this, then you will be passing handles allocated in one memory manager to a different instance and LabVIEW will abort.
If you avoid passing data that contains handles, then LabVIEW will have to make copies of any strings or arrays that are passed to the DLL, further hindering performance.
Preallocated reentrancy is not available through DLLs. This is the kind of reentrancy that allows the subVI to maintain state specific to each caller. When called through a DLL, we have no idea where we got called from so we can't give you back a specific instance. Each cached VI reference will have a specific VI instance but you don't know which one you will get. In essence this makes all reentrant VIs in DLLs act like shared reentrant. (This could explain why it didn't seem like your uninitialized shift register was working.)[1]
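The difference between per-caller state and pooled instances can be illustrated with a small C sketch. All names (`Instance`, `CallSite`, `pool`, `call_through_pool`, and the rest) are hypothetical, and the "VI" is just a counter that keeps its value in a field standing in for an uninitialized shift register. The pool here is sequential and always hands out the first free instance; with real parallel calls, which instance you get would not be predictable.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical VI instance: its only state is an "uninitialized shift register". */
typedef struct {
    int shift_register;
    int in_use;
} Instance;

/* The VI's code: increment the retained value and return it. */
int run_counter(Instance *inst)
{
    inst->shift_register += 1;
    return inst->shift_register;
}

/* Preallocated reentrancy: every call site owns its own instance. */
typedef struct {
    Instance own;
} CallSite;

int call_preallocated(CallSite *site)
{
    return run_counter(&site->own);
}

/* Calls through the DLL: instances come from a shared pool. */
#define POOL_SIZE 2
Instance pool[POOL_SIZE]; /* zero-initialized at program start */

Instance *pool_acquire(void)
{
    for (size_t i = 0; i < POOL_SIZE; i++) {
        if (!pool[i].in_use) {
            pool[i].in_use = 1;
            return &pool[i];
        }
    }
    return NULL;
}

void pool_release(Instance *inst)
{
    inst->in_use = 0;
}

int call_through_pool(void)
{
    Instance *inst = pool_acquire();
    assert(inst != NULL);
    int out = run_counter(inst);
    pool_release(inst);
    return out;
}

/* Two calls from site A, then the first call from site B. */
int demo_preallocated(void)
{
    CallSite a = {{0, 0}};
    CallSite b = {{0, 0}};
    call_preallocated(&a);
    call_preallocated(&a);
    return call_preallocated(&b); /* B sees only its own state */
}

/* The same three calls routed through the pool. */
int demo_pool(void)
{
    call_through_pool();
    call_through_pool();
    return call_through_pool(); /* sees state left behind by earlier callers */
}
```

In the preallocated case, the first call from site B starts from fresh state. In the pooled case, the third caller inherits whatever the earlier callers left in the instance it happens to receive, so state retained between calls cannot be relied on to belong to one caller.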