Monday, June 8, 2009

Batch Updates in ADO.Net 2.0 - How to find the optimum batch size?

Before I explain how to go about estimating the optimum batch size to be used with ADO.NET 2.0 batch operations, I want to briefly touch upon the basics. if you want to skip the basics, click here

One the new features in ADO.NET 2.0 is the "Batch Updates" operation. It promises to improve performance of the application by reducing the number of round trips to the database. Prior to ADO.NET 2.0, if we made any changes to the DataSet and then saved it using the Update method of the SqlDataAdapter class, it made round trips to the database for each modified row in the DataSet. This was a major performance hindrance. So how does reducing database roundtrip improves perfromance?


Let's look at the following example:

Assume we are building a 3-tier application (client-webserver-database). The web-servers are located in Boston while the database servers are in California (an extreme example, but a practical one never the less). Also, in the application, we have a datagrid (associated with an underlying datatable) that displays records and through some specific operations (and user interaction) the records in the datatable are updated and we need to persist these changes back to the database. We are using ADO.NET, SQLDataAdapter and its Update method. Let us say there were 50 records were modified. Each individual Update operation takes 1 second. And each roudtrip to SQL server takes 1 seconds. So with ADO.NET 1.1, where we need to make 50 roundtrips to the database server, the overall operation will take 50 * (1+1) = 100 seconds

So how does the above scenario change with ADO.NET 2.0 ?

In ADO.NET 2.0, we now have a new "UpdateBatchSize" property which indicates the number of rows that are processed in each round-trip to the server. It can take the following values:

n=0 - There is no limit on the batch size
n=1 - Disables batch updating.
n>1 - Changes are sent using batches of 'n' operations at a time

So in our above example, let us set batch size(n) = 5. That means, In each round-trip to server 5 records will be processed. So now the overall operation will take (50/5 * 1) + (50 *1) = 60 seconds.
Let n=10 ==> (50/10 * 1) + (50 * 1) = 55 seconds
Similarly if n=50 ==> (50/50 * 1) + (50 * 1) = 50 seconds

So we see that in this simple contrived example, we easily get a performance boost of 40% - 50% with Batch Update mode. So the key question is what should be the "optimum batch size" that will provide us the maximum performance gain. Surprisingly, this is not a simple question to answer.There is no automated way to figure this out. Moreover, there are not much information out there to help us make an informed decission. This is what MSDN says - Executing an extremely large batch could decrease performance. Therefore, you should test for the optimum batch size setting before implementing your application..

So how to test and 'estimate' the optimum batch size for your own application? Note the word 'estimate'. That is exactly what we are going to compute. In order to do that we need to start with some raw data.

Let say an update operation (take the example I described above) takes 't' seconds to complete (includng the roundtrip time to database server and the web service call alogn with the actual update operation). Execute the operation multiple times, each time with different batch size 'S' and note down the time taken 't' in each case. Here's the pseudo code -


maxbatchsize = M;
batchincrement = 1;
for(S=0; S < M; j=j+batchincrement)
{
if (S!=1) {
t1 = starttimer();
ExecuteBatchUpdate(S, dt)//where S=batch size, dt=datatable with 'R' rows updated
t2 = endtimer();
t' = t2-t1;
LogInTextFile(S, t')//log the batch size and the correspondign time taken t'
}
}

Q: Why do we skip S=1?

Let say for first run, R=500, M=500 and batchincrement =1. That means in our simulation, we will have batch size from 0 to 500 (except 1) and the log file would look something like this:

Batch Time
----- ----
0 t1
2 t2
3 t3
.. ..
500 t500

For the next run R=500, M=500 and batchincrement = 10
Batch Time
----- ----
0 t1
10 t2
20 t3
.. ..
500 t50

For the next run R=500, M=500 and batchincrement = 25
Batch Time
----- ----
0 t1
25 t2
50 t3
.. ..
500 t21

The 'batchincrement' value decides the batchsize for the update operation and the total number of roundtrip to the database server. Ideally you would want to chnage the value of 'batchincrement' (from low to high i.e from 1 to 500) to control the batch size and execute as many runs as you can. Once you have collected the raw data, the exciting part begins.

Now for each run data do the following:

a. Do a scatter plot for Batch Vs Time (x vs y)
b. Then plot the line-of-best fit i.e. do curve fitting and add a trend line.
e.g. you can se a quadratic function (a+bx+cx^2) or a linear function
c. Find the minimum value from the line-of-best fit.


So how to perfrom the above steps? You can use Excel do it or you can use the free open-source statistical analysis software R. With R, its easy as writing a 20 line of R-code to do the above steps.Here's the program (uses a quadartic function for curve fitting for my sample data):


setwd("C:/")
Data<-read.table(file="BatchResult.txt",sep="",header=TRUE)
#par(mfrow=c(2,1))
Batch<-Data[,1]
Time<-Data[,2]
plot(Batch,Time,type="p",col="red",cex=0.5)
## Fit Quadratic Time=a+b*Batch+c*Batch^2
Batch.2<-Batch^2
fit.quadratic<-lm(Time~Batch+Batch.2)
print(summary(fit.quadratic))
coef<-fit.quadratic$coefficients
a<-coef[1]
b<-coef[2]
c<-coef[3]
print(coef)
Time.est<-a+b*Batch+c*Batch.2
lines(Batch,Time.est)
cat("B.hat at min time = ", -(b)/(2*c),"\n")

The output of the above program is shown below:



So say for Run1 we get optimum batch size value B1. Repeat the steps for all the other runs and we have something like this(Table A):

Run1 B1 (B1/R)*100
Run2 B2 (B2/R)*100
Run3 B3 (B3/R)*100
.......
RunN BN (BN/R)*100

where BN/R*100 proivdes the the % of the the total records being modified as the estimated batch size. Sort Table A by the value BN/R*100 and you will get an esitmate of what would be the "optimum range of the batch size" to use in your application.

So now for any new functionality you intrduce in your appliation that needs to use the ADO.NET 2.0 batch operations, you can find out the estiamted range of the optimum batch size which will provide maximum perfromance boost (no more blind trial and error!)

Thanks to Sourish for helping me with the above program

No comments:

Post a Comment