Emulating Excel’s NORMDIST in T-SQL

This is something that came up at work recently.  A user wanted to migrate a report from Excel to SSRS and needed to be able to implement the NORMDIST function in T-SQL.  NORMDIST takes four parameters:

  • x, a number indicating the value of interest on the normal distribution curve
  • mu, a number indicating the sample mean
  • sigma, a number indicating the sample standard deviation
  • cumulative, a boolean indicating whether to compute the cumulative value of the normal distribution curve (in other words, the integral of the normal distribution from negative infinity to x). For what I have done here, I am assuming that cumulative is TRUE.  The results for cumulative = FALSE will be completely different.

My statistics knowledge is, to say the very least, quite rusty, but with the help of a couple of Wikipedia pages (this and this), I came up with the following T-SQL that (at least for a smattering of value I tested) comes pretty close to the Excel NORMDIST function.

-- Initialize parameters
declare @x float = 1.5;
declare @mu float = 0.0;
declare @sigma float = 1.0;
 
-- Compute NORMDIST
declare @erfx float = (@x - @mu) / sqrt(2.0 * power(@sigma, 2.0));
declare @sign float = sign(@erfx);
select @erfx = abs(@erfx);
 
declare @t float = 1.0 / (1.0 + 0.3275911 * @erfx);
declare @erf float = @sign * 
	(1.0 - (0.254829592 * @t - 0.284496736 * power(@t, 2.0) +
	1.421413741 * power(@t, 3.0) - 1.453152027 * power(@t, 4.0) +
	1.061405429 * power(@t, 5.0)) * exp(-power(@erfx, 2.0)));
declare @normDist float = 0.5 * (1.0 + @erf);
select @normDist;

Of course, the usual caveats apply.  Be sure to validate this script against Excel for any values of interest, and please let me know if you find any errors in the results.

A final tip:  Tempting as it may be, don’t implement this as a user-defined function.  Scalar UDFs really are quite bad for performance.

COUNT DISTINCT not supported in a windowing context (not documented)

As a trivial example, consider the following dataset:

CountDistinctSalesDataset

Suppose we need to determine the number of customers associated to employee, and we need this result for each row in a report in we are generating.  (The report computes other columns of interest, such as a running total, that we will ignore in this post.)

We could write this as follows:

select	EmployeeName,
	count(CustomerName) over (partition by EmployeeName) CustomerCount
from	Sales;

This works fine for Bob, but for Alice it incorrectly returns a value of 3.  This, of course, happens because Alice is associated with Eve on two different sales.

CountDistinctIncorrectResult

It seems simple and logical to add the distinct keyword:

select	EmployeeName,
	count(distinct CustomerName) over (partition by EmployeeName) CustomerCount
from	Sales;

The first time that I tried this was on a SQL Server 2008 system running at SP2, and it generated the following error:

Msg 102, Level 15, State 1, Line 2
Incorrect syntax near 'distinct'.

On a SQL Server 2008 system at SP3, the following improved error is generated instead:

Msg 10759, Level 15, State 1, Line 2
Use of DISTINCT is not allowed with the OVER clause.

The same error happens with SQL Server 2008 R2 (SP1) and with SQL Server 2012 (RTM).

As the second error message makes clear, this particular usage is not supported.  This is fine and well, except that this limitation doesn’t seem to documented in Books Online, at least not under the COUNT or OVER topics for these three versions of SQL Server.