Remember one of our earlier blogs where statistics on columns with gradually increasing data would get out of date very fast and need frequent update stats.
Assume SQL Server is being used as a backend for highly OLTP (Transaction based system) and there were around 10K + transactions per second. You thought SQL Server could not handle those or never heard of ;).. So it is here. We were facing issue in our company where there was 10K transactions and only about 7 K or 8 K transactions / second we would have heavy blocking and response time would come down to more than 30 seconds and queries would time out.
To give more background about SQL Server usage and our application it was an ERP system where SQL Server was backend. And we had an audit and permission tables where every user activity is monitored and logged (in case someone deletes it we would want to know how deleted and when was it deleted) and also all deletes were only soft delete, ie. we never deleted data but just updated flag as deleted and would transfer this data asynchronously to a different table during off peak hours.
To enable such auditing we had a table which had column somewhat as below
AuditUserActivity (TimeOfActivity DateTime, UserID int, TypeOfAction varchar(10), Action Varchar(10), MachineIP varchar(10))
And there was a clustered index in TimeOfActivity.
Definition of Clustered Index was Create Clustered Index ClustActivity on AuditUserActivity(TimeOFActivity)
And from this table there were no reads only inserts or deletes. At end of every Quarter this data would be removed from table.
All of a sudden one fine day I got a call saying SQL Server CPU utilization is more than normal and also heavy blocking at same time. I initial thought may be I had a wrong problem description as till date I always believed that if there is heavy blocking on server, CPU would be lower as all threads would be in sleep state while blocked.
And soon all my understanding was to be shattered when I ran our beloved SQL command “Select * from Sysprocesses where blocked <> 0 and SPID <> Blocked”.
Earlier I used to only “Select * from Sysprocesses where blocked <> 0” but post SP4 SQL Server 2000 provides latch information as well, to filter I modified as above. Refer this link for self blocking or latch information in SP4 (http://support.microsoft.com/kb/906344)
Lo behold, see below what I have got.. take a moment and observe below results do you find anything odd there..
There is lot of stuff here that initially did not make sense to me.
See SPID being Blocked like 55 is blocked by 159 and 234 is again blocked by 159. Is 159 culprit or head blocker , no it is not see just down below we have even have 224, 232, 201, 60 and there are lot of head blockers and each of head blocker is not head for more than couple of seconds. It was like there are lot of head blockers each was only blocking for couple of second. So conclusion there is no one big bad statement executed from one session that is blocking this.
See WaitType Columns and LastWaitType columns both are PageLatch_Ex which is quite odd. That means all of the sessions are trying to hold of that page only and no other page in that entire database. Notice Wait Resource 19 Database , 4th File and 1980641 what is so important about that page that every session wants to read / write there.
Finally see type, it is either insert or rollback transactions. So, for all inserts type may be they still waiting to insert or those in rollback type have timedout and all rolling their transaction back.
To understand waittypes, refer this KB (http://support.microsoft.com/kb/822101) or get this from here (http://www.sqlmag.com/Articles/Index.cfm?ArticleID=40925). from KB PageLatch_EX means a Buff Latch and below is excerpt from KB
“The buffer latches are used to synchronize access to BUF structures and their associated database pages. The typical buffer latching occurs during operations that require serialization on a buffer page, (during a page split or during the allocation of a new page, for example). These latches are not held for the duration of a transaction.” And from SQLMag link
“Page latch exclusive:Contention can be caused by issues other than IO or memory performance, for example, heavy concurrent inserts into the same index range can cause this type of contention. If a lot of inserts need to be placed on the same page they are serialized using the latch. A lot of inserts into the same range can also cause page splits in the index which will hold onto the latch while allocating a new page (this can take a while). Any read accesses to the same range as the inserts would also conflict on the latches. The solution in these cases is to distribute the inserts using a more appropriate”
Marked red line is interesting enough.. So we went ahead read trying to get input buffer for each statement that was blocked or blocking and all of them were running same insert statement sample as below..
Insert into AuditUserActivity (getdate(),@UserID, @ActionType, @Action,@IP)
Every session was running this single statement if looked at plan this is one of most simplest query and nothing to optimize for. But since clustered index is on TimeOfActivity column data is ordered by that column and always inserts are going to come to last page of table as getdate() would always get latest date which should come in last. So in essence due to our design we had created a bottleneck where all of sessions are waiting to insert into last page. Interestingly they would also have to wait for page to be split as it is always last page that would split.
Learning from this:
1. Do not create clustered indexes on columns that are gradually increasing like GetDate(), Identity or (Auto Increment) or anything column were data insertion would be sequential in nature. They would create a bottleneck like our server had.
This is not to say do not have clustered index at all have clustered index on a different column but intention should be that every insert should be as much random as possible rather than sequential in nature.
Our solution to this problem was
1. Moved Clustered index to UserID and ActionType
2. Since users logging into system and their actions would be very random we would not end up going to same page all the time.
3. Create index with fill factor as low as 60% to avoid page splits.
HTH and have a Nice Week End.
As always, if you have any questions or needs to be addressed please feel free to shoot me mail at firstname.lastname@example.org