Creating Realistic Test Data

There are a number of tools out there to create test data, but a few months ago I wanted to try my hand at rolling my own. My initial attempt was a combination of T-SQL and C#, and it works well, but in this blog post I want to demonstrate a pure T-SQL solution. We’ll create random customer data, including first and last names, street address, city and state. This is intended to create addresses that are United States-centric, but the same principles should be applicable anywhere.

Source Data

In order the data to appear realistic, we need to have some basis for creating that data. The U.S. Census Bureau publishes lists of first and last names, and we’ll start with that as a basis. ┬áTo keep things simple, imagine that there are only 5 last names in the world:

NameFrequency
Smith1.006
Johnson0.810
Williams0.699
Brown0.621
Jones0.621

We need to “normalize” the frequencies so that they sum to 1. We will also keep a running total on the normalized frequency for all previous rows (there are rounding anomalies in the results).

NameFrequencyNormalized FrequencyRunning Total
Smith1.0060.2678.0000
Johnson0.8100.2156.2678
Williams0.6990.1861.4834
Brown0.6210.1653.6694
Jones0.6210.1653.8347

Now we can generate a random number from 0 to 1 (more on that in a bit) and use that to select a record in the table. For instance, if the random number is 0.7421, we find the first record where the running total is less than or equal to 0.7421, or “Brown.”

The raw data sources I used were:

Male first names (U.S. Census Bureau)
Female first names (U.S. Census Bureau)
Last names (U.S. Census Bureau)
Cities and states (U.S. Census Bureau)
Street names (I am having a hard time finding where I originally obtained this data. If I figure it out later, I will update this post.)

I have somewhat filtered and modified the data sets, mostly to eliminate frequency information that is rounded to 0 in the raw data. I have also included StreetPrefix and StreetSuffix data sets that I came up with myself.

Let’s create a few tables to store the data:

CREATE TABLE [dbo].[Name]
(
	[ID] int NOT NULL IDENTITY(1,1) PRIMARY KEY,
	[Name] nvarchar (40) NULL,
	[NameType] char (6) NULL,
	[Fraction] float NULL
);
 
CREATE TABLE [dbo].[City]
(
	[ID] int NOT NULL IDENTITY(1,1) PRIMARY KEY,
	[City] nvarchar (150) NULL,
	[State] char (2) NULL,
	[Population] int NULL
);
 
CREATE TABLE [dbo].[StreetPrefix]
(
	[ID] int NOT NULL IDENTITY(1,1) PRIMARY KEY,
	[Prefix] nvarchar (20) NULL,
	[Weight] float NULL
);
 
CREATE TABLE [dbo].[StreetSuffix]
(
	[ID] int NOT NULL IDENTITY(1,1) PRIMARY KEY,
	[Suffix] nvarchar (20) NULL,
	[Weight] float NULL
);
 
CREATE TABLE [dbo].[Street]
(
	[ID] int NOT NULL IDENTITY(1,1) PRIMARY KEY,
	[StreetName] nvarchar (150) NULL
);

Note that each of the data sets has some sort of weighting factor except for the Street data. I don’t have any sort of frequency information, so I will just make the assumption that each street is equally likely.

Once the tables are created, download the data file and run it. The data file is a SQL script with a bunch of inserts into the tables. It’s about 200,000 lines long, but it should run fine in SSMS; it just may take a minute or two. Also note that I have combined all of the first and last names into a single table.

Finally, let’s normalize the data sets. It’s easiest to use windowing functions, so this requires at least SQL Server 2012, but with some rewriting it can be done in earlier versions as well. Note that for the names data set we separate out last names from first names (male and female combined). Since the street data does not contain any weighting information, we just use the COUNT function rather than sum. Finally, we’ll create some useful nonclustered indexes.

alter table dbo.Name add NameGroup varchar(10);
 
update dbo.Name set NameGroup = 'Last' where NameType = 'Last';
update dbo.Name set NameGroup = 'First' where NameType = 'Female';
update dbo.Name set NameGroup = 'First' where NameType = 'Male';
 
alter table dbo.Name add FractionStart float;
alter table dbo.City add FractionStart float;
alter table dbo.StreetPrefix add FractionStart float;
alter table dbo.StreetSuffix add FractionStart float;
alter table dbo.Street add FractionStart float;
 
with NameData as
(
	select FractionStart,
		1.0 * isnull(sum(Fraction) over (partition by NameGroup order by (select null) rows between unbounded preceding and 1 preceding), 0) / sum(Fraction) over (partition by NameGroup) NewFractionStart
	from dbo.Name
)
update NameData
set FractionStart = NewFractionStart;
 
with CityData as
(
	select FractionStart,
		1.0 * isnull(sum(Population) over (order by (select null) rows between unbounded preceding and 1 preceding), 0) / sum(Population) over () NewFractionStart
	from dbo.City
)
update CityData
set FractionStart = NewFractionStart;
 
with PrefixData as
(
	select FractionStart,
		1.0 * isnull(sum(Weight) over (order by (select null) rows between unbounded preceding and 1 preceding), 0) / sum(Weight) over () NewFractionStart
	from dbo.StreetPrefix
)
update PrefixData
set FractionStart = NewFractionStart;
 
with SuffixData as
(
	select FractionStart,
		1.0 * isnull(sum(Weight) over (order by (select null) rows between unbounded preceding and 1 preceding), 0) / sum(Weight) over () NewFractionStart
	from dbo.StreetSuffix
)
update SuffixData
set FractionStart = NewFractionStart;
 
with StreetData as
(
	select FractionStart,
		1.0 * isnull(count(*) over (order by (select null) rows between unbounded preceding and 1 preceding), 0) / count(*) over () NewFractionStart
	from dbo.Street
)
update StreetData
set FractionStart = NewFractionStart;
 
create nonclustered index IX_Name_NormalizedFraction on dbo.Name (NameGroup, FractionStart);
create nonclustered index IX_City_NormalizedFraction on dbo.City (FractionStart);
create nonclustered index IX_StreetPrefix_NormalizedFraction on dbo.StreetPrefix (FractionStart);
create nonclustered index IX_StreetSuffix_NormalizedFraction on dbo.StreetSuffix (FractionStart);
create nonclustered index IX_Street_NormalizedFraction on dbo.Street (FractionStart);

Now that we have data, let’s take a look at the idea of randomness.

Randomness

Unfortunately, the built-in SQL Server rand() function has a lot of limitations, not the least of which is that the function returns the same value when called in set-based operations. For instance, try running

select ID, Suffix, Weight, FractionStart, rand() RandValue from StreetSuffix;

You’ll get a different value for the RandValue column every time you run this statement, but the you’ll also get the same value in each row. This makes rand() OK for row-by-row operations, but fairly useless in the set-based world.

Fortunately, the newid() function comes to the rescue. For example, consider

select ID, Suffix, Weight, FractionStart, newid() RandGuid from StreetSuffix;

Every time you run this, you will get a different GUID in each row. To turn this into a random number, we can compute the binary_checksum() of the GUID. Since binary_checksum() returns a value in the range of int, we can normalize the random value by adding a number to it to ensure the result is non-negative and then dividing by the range of int values.

select ID, Suffix, Weight, FractionStart,
	(binary_checksum(newid()) + 2147483648.) / 4294967296. RandValue
from StreetSuffix;

It’s ugly, but we now get a random value that is greater than or equal to zero and less than one.

Now we can generate a random city:

select top 1 City, State
from dbo.City
where FractionStart <= (binary_checksum(newid()) + 2147483648.) / 4294967296.
order by FractionStart desc;

If you execute this statement a number of times, you find that it tends to favor larger cities. This is exactly what we expect. All other things being equal, realistic data will show far more customer from Los Angelos, CA than from a small midwestern town.

Random Values over Ranges of Data

So far, we’ve covered how to select a random item out of a finite list of data, but what about generating random values over ranges. In our situation, we need to create a house number as part of the address. My initial arbitrary rules were:

  • The house number should contain between 1 and 5 digits. The specific number of digits is randomly determined.
  • If the house number has 5 digits, the first digit should be 1. (This is to avoid fairly unrealistic addresses such as 87369 Main St.)
  • There should be no leading zeros in the house number.

We can generate a single random digit from 0 to 9 with this statement:

select cast((binary_checksum(newid()) + 2147483648.) / 4294967296. * 10. as int);

Now, we could string together some code to follows the above rules exactly, but what they really come down is that we need a number from 1 to 19,999:

select 1 + cast((binary_checksum(newid()) + 2147483648.) / 4294967296. * 19999. as int);

Certainly a case could be made that house numbers will tend toward smaller values (and even better, could be correlated with the size of the associated city) but we’ll keep things simple here and stick with this simple requirement.

Putting It Together

Now we have the data that we need, as well as a means to produce data randomly. Using the same strategy that we used before to get a random city, we can get all the different bits of pieces of the customer record:

with FirstName as
(
	select top 1 Name FirstName
	from dbo.Name
	where NameGroup = 'First'
	and FractionStart <= (binary_checksum(newid()) + 2147483648.) / 4294967296.
	order by NameGroup, FractionStart desc
), LastName as
(
	select top 1 Name LastName
	from dbo.Name
	where NameGroup = 'Last'
	and FractionStart <= (binary_checksum(newid()) + 2147483648.) / 4294967296.
	order by FractionStart desc
), HouseNumber as
(
	select 1 + cast((binary_checksum(newid()) + 2147483648.) / 4294967296. * 19999. as int) HouseNumber
), StreetPrefix as
(
	select top 1 Prefix
	from dbo.StreetPrefix
	where FractionStart <= (binary_checksum(newid()) + 2147483648.) / 4294967296.
	order by FractionStart desc
), StreetName as
(
	select top 1 StreetName
	from dbo.Street
	where FractionStart <= (binary_checksum(newid()) + 2147483648.) / 4294967296.
	order by FractionStart desc
), StreetSuffix as
(
	select top 1 Suffix
	from dbo.StreetSuffix
	where FractionStart <= (binary_checksum(newid()) + 2147483648.) / 4294967296.
	order by FractionStart desc
), CityState as
(
	select top 1 City, State
	from dbo.City
	where FractionStart <= (binary_checksum(newid()) + 2147483648.) / 4294967296.
	order by FractionStart desc
), RandomCustomer as
(
	select	FirstName.FirstName, 
			LastName.LastName, 
			cast(HouseNumber.HouseNumber as varchar(5)) + ' ' + StreetPrefix.Prefix + ' ' + StreetName.StreetName + ' ' + StreetSuffix.Suffix Address,
			CityState.City,
			CityState.State
	from	FirstName, LastName, HouseNumber, StreetPrefix, StreetName, StreetSuffix, CityState
)
select *
from RandomCustomer;

It would be nice if we could generate multiple customers at once, but here we run into some trouble. If we try joining the above to a tally table, we might hope that we get a whole bunch of customers. But, alas, the optimizer outsmarts us and all we get is repeated rows (well, mostly repeated; the house number does in fact change from row to row, but otherwise it’s all the same output). Nor can we wrap the above statement into a table-valued function because the newid() function is not allowed.

There is a solution, however. It’s a bit clunky, but it does work. Let’s create a TVF where the random values are passed in as parameters:

if exists (select * from sys.objects where name = 'fnGenerateRandomCustomer' and type = 'IF')
	drop function fnGenerateRandomCustomer;
go
create function fnGenerateRandomCustomer(@r1 float, @r2 float, @r3 float, @r4 float, @r5 float, @r6 float, @r7 float)
returns table
as
return
with FirstName as
(
	select top 1 Name FirstName
	from dbo.Name
	where NameGroup = 'First'
	and FractionStart <= @r1
	order by NameGroup, FractionStart desc
), LastName as
(
	select top 1 Name LastName
	from dbo.Name
	where NameGroup = 'Last'
	and FractionStart <= @r2
	order by FractionStart desc
), HouseNumber as
(
	select 1 + cast(@r3 * 19999. as int) HouseNumber
), StreetPrefix as
(
	select top 1 Prefix
	from dbo.StreetPrefix
	where FractionStart <= @r4
	order by FractionStart desc
), StreetName as
(
	select top 1 StreetName
	from dbo.Street
	where FractionStart <= @r5
	order by FractionStart desc
), StreetSuffix as
(
	select top 1 Suffix
	from dbo.StreetSuffix
	where FractionStart <= @r6
	order by FractionStart desc
), CityState as
(
	select top 1 City, State
	from dbo.City
	where FractionStart <= @r7
	order by FractionStart desc
), RandomCustomer as
(
	select	FirstName.FirstName, 
			LastName.LastName, 
			cast(HouseNumber.HouseNumber as varchar(5)) + ' ' + StreetPrefix.Prefix + ' ' + StreetName.StreetName + ' ' + StreetSuffix.Suffix Address,
			CityState.City,
			CityState.State
	from	FirstName, LastName, HouseNumber, StreetPrefix, StreetName, StreetSuffix, CityState
)
select *
from RandomCustomer;

Then we invoke the function by generating the random values and passing them to the function. We’ll generate a virtual Nums table to control the number of customers produced, and then generate seven random numbers to be passed to the TVF we just created.

declare @customersToGenerate int = 10;
 
with l0 as (select 1 v union all select 1), l1 as (select a.v from l0 a, l0), l2 as (select a.v from l1 a, l1),
l3 as (select a.v from l2 a, l2), l4 as (select a.v from l3 a, l3), l5 as (select a.v from l4 a, l4),
Nums as (select row_number() over (order by (select null)) n from l5),
RandomValues as
(
	select	(binary_checksum(newid()) + 2147483648.) / 4294967296. r1,
			(binary_checksum(newid()) + 2147483648.) / 4294967296. r2,
			(binary_checksum(newid()) + 2147483648.) / 4294967296. r3,
			(binary_checksum(newid()) + 2147483648.) / 4294967296. r4,
			(binary_checksum(newid()) + 2147483648.) / 4294967296. r5,
			(binary_checksum(newid()) + 2147483648.) / 4294967296. r6,
			(binary_checksum(newid()) + 2147483648.) / 4294967296. r7
	from	Nums n
	where	n.n <= @customersToGenerate
)
select cust.FirstName, cust.LastName, cust.Address, cust.City, cust.State
from RandomValues
cross apply dbo.fnGenerateRandomCustomer(r1, r2, r3, r4, r5, r6, r7) cust;

Here is the output from running this script. As you would expect, the results are going to be different every time.

RandomCustomerData

If you look at the actual query plan, you can see that the system is doing an index seek and key lookup on each of the data tables for each customer generated, which is not surprising given the random nature of the query. This means that as the number of customers generated increases, there is a lot overhead doing those key lookups. However, it is still reasonably efficient. On my machine I was able to insert 1,000,000 randomly generated customers into a table in about 30 seconds.

Summary

I have put together this post merely to present an example of how randomized test data can be generated. The focus has been on generating customer names and addresses, but the same principles can be applied to creating a wide variety of data types. It’s not perfect, but it can suffice nicely for a lot of situations.

Convert Local Time to UTC in SQL Server

I have occasionally had to convert from local time to UTC in SQL Server, but unfortunately there is no built-in functionality to do this. I will share one solution in this post to accomplish this conversion.

By far, the hardest part of the conversion has been accounting for daylight saving time. The rules behind DST vary greatly in different parts of the world and change periodically. The solution that I present here works only for the United States, as accounting for every possible combination would be far more complex. Even for the United States, this code doesn’t account for everything; however, for my purposes, this has been more than adequate.

A little research shows that in the United States, there was no uniform DST rule prior to 1966, and I am making the simplification to ignore DST from 1965 and earlier. One set of rules applied from 1966 to 1986 (last Sunday in April to last Sunday in October). There were some exceptions in 1974 and 1975, but I am ignoring that here. From 1987 to 2006, DST was in effect from the first Sunday in April to the last Sunday in October, and since 2007 it is in effect from the second Sunday in March until the first Sunday in November. The change always happens at 2:00 am local time.

In order to implement this code, I first create a small code table that tells us when when DST starts and ends. This table, as well as all of the functions described in this post, should be created in an administrative database. The table structure is:

create table DstRanges
(
	RangeID int not null identity(1,1) primary key clustered,
	RangeStart datetime null,
	RangeEnd datetime null,
	IsDst bit not null
);
 
create nonclustered index ix_DstRanges__RangeStart_RangeEnd on DstRanges (RangeStart, RangeEnd);
go

Next, I create a helper function that will be used to populate the DstRanges tables. This function can dropped after the load is completed.

create function fn_GetDSTBeginEnd (@date datetime)
returns @dates table (dst_begin datetime, dst_end datetime)
as
begin
	declare	@year int,
			@dst_begin datetime,
			@dst_end datetime;
 
	-- determine if daylight saving time is in effect (all of the following rules based in the US)
 
	select	@year = datepart(year, @date);
 
	-- From 1966 to 1986, daylight saving in effect from:
	-- last Sunday in April until the last Sunday in October
	-- (some exception for 1974-1975, but those will not be accounted for here)
 
	if	@year >= 1966 and @year <= 1986
	begin
		-- start by getting May 1 and November 1
		select	@dst_begin = cast(cast(@year as varchar(20)) + '-05-01' as datetime),
				@dst_end = cast(cast(@year as varchar(20)) + '-11-01' as datetime);
		-- set to first Sunday in May and first Sunday in November
		select	@dst_begin = dateadd(day, 7 - ((datepart(weekday, @dst_begin) + @@datefirst - 1) % 7), @dst_begin),
				@dst_end = dateadd(day, 7 - ((datepart(weekday, @dst_end) + @@datefirst - 1) % 7), @dst_end);
		-- back one week to get last Sunday in April and last Sunday in October
		select	@dst_begin = dateadd(day, -7, @dst_begin),
				@dst_end = dateadd(day, -7, @dst_end);
		-- add two hours because the clock change happens at 2:00 am
		select	@dst_begin = dateadd(hour, 2, @dst_begin),
				@dst_end = dateadd(hour, 2, @dst_end);
	end;
 
	-- From 1987 to 2006, daylight saving in effect from:
	-- first Sunday in April until the last Sunday in October
 
	else if	@year >= 1987 and @year <= 2006
	begin
		-- start by getting April 1 and November 1
		select	@dst_begin = cast(cast(@year as varchar(20)) + '-04-01' as datetime),
				@dst_end = cast(cast(@year as varchar(20)) + '-11-01' as datetime);
		-- set to first Sunday in April and first Sunday in November
		select	@dst_begin = dateadd(day, 7 - ((datepart(weekday, @dst_begin) + @@datefirst - 1) % 7), @dst_begin),
				@dst_end = dateadd(day, 7 - ((datepart(weekday, @dst_end) + @@datefirst - 1) % 7), @dst_end);
		-- back off end date by one week to get last Sunday in October
		select	@dst_end = dateadd(day, -7, @dst_end);
		-- add two hours because the clock change happens at 2:00 am
		select	@dst_begin = dateadd(hour, 2, @dst_begin),
				@dst_end = dateadd(hour, 2, @dst_end);
	end;
 
	-- From 2007 onward, daylight saving in effect from:
	-- second Sunday in March until the first Sunday in November
 
	else if	@year >= 2007
	begin
		-- start by getting March 1 and November 1
		select	@dst_begin = cast(cast(@year as varchar(20)) + '-03-01' as datetime),
				@dst_end = cast(cast(@year as varchar(20)) + '-11-01' as datetime);
		-- set to first Sunday in March and first Sunday in November
		select	@dst_begin = dateadd(day, (7 - ((datepart(weekday, @dst_begin) + @@datefirst - 1) % 7))%7, @dst_begin),
				@dst_end = dateadd(day, (7 - ((datepart(weekday, @dst_end) + @@datefirst - 1) % 7))%7, @dst_end);
		-- forward end date by one week to get second Sunday in March
		select	@dst_begin = dateadd(day, 7, @dst_begin);
		-- add two hours because the clock change happens at 2:00 am
		select	@dst_begin = dateadd(hour, 2, @dst_begin),
				@dst_end = dateadd(hour, 2, @dst_end);
	end;
 
	if  @date >= @dst_begin
	and	@date < @dst_end
	begin
		-- spring forward
		select	@date = dateadd(hour, 1, @date);
	end;
 
	insert	@dates (dst_begin, dst_end) values (@dst_begin, @dst_end);
	return;
end
go

And next, the table can be populated with this code. In this example the table will be loaded through the year 2099, assuming that the current rules are not changed.

create table #year
(
	YearNbr int not null,
	DstStart datetime null,
	DstEnd datetime null
);
 
with l0 as (select 1 v union all select 1), l1 as (select a.v from l0 a, l0), l2 as (select a.v from l1 a, l1),
l3 as (select a.v from l2 a, l2), l4 as (select a.v from l3 a, l3), l5 as (select a.v from l4 a, l4),
nums as (select row_number() over (order by (select null)) n from l5)
insert #year (YearNbr)
select n.n
from nums n
where n.n >= 1966
and n.n <= 2099;
 
update y
set y.DstStart = dst.dst_begin,
	y.DstEnd = dst.dst_end
from #year y cross apply (select * from fn_GetDSTBeginEnd(datefromparts(y.YearNbr, 1, 1))) dst;
 
with Dst as
(
	select	y.YearNbr,
			lag(y.DstEnd, 1, null) over (order by y.YearNbr) PrevDstEnd,
			y.DstStart,
			y.DstEnd
	from	#year y
)
insert DstRanges (RangeStart, RangeEnd, IsDst)
select Dst.PrevDstEnd RangeStart, DstStart RangeEnd, 0
from Dst
union all
select DstStart RangeStart, DstEnd RangeEnd, 1
from Dst
order by RangeStart;

Finally, we define a pair of table-valued functions (one helper function, and one intended to be called by the user). Be sure to change the name of your Admin database in the two places it is referenced (lines 11 and 27).

if exists (select * from sys.objects where name = 'fn_GetDSTInfo')
	drop function fn_GetDSTInfo
go
create function fn_GetDSTInfo (@TargetDate datetime2)
returns table
as
return
	select		td.TargetDate,
				isnull(r.IsDst, 0) IsDst
	from		(select @TargetDate TargetDate) td
	left join	Admin.dbo.DstRanges r
	on			r.RangeStart <= td.TargetDate
	and			r.RangeEnd > td.TargetDate;
go
if exists (select * from sys.objects where name = 'fn_LocalTimeToUTC')
	drop function fn_LocalTimeToUTC;
go
create function fn_LocalTimeToUTC (@LocalTime datetime2, @UtcOffsetHours int, @UsesDst bit)
returns table
as
return
	with OffsetInfo as
	(
		select		@LocalTime LocalTime,
					@UtcOffsetHours UtcOffsetHours,
					case when @UsesDst = 1 and dst.IsDst = 1 then 1 else 0 end DstOffsetHours
		from		Admin.dbo.fn_GetDSTInfo(@LocalTime) dst
	)
	select	OffsetInfo.LocalTime,
            OffsetInfo.UtcOffsetHours,
            OffsetInfo.DstOffsetHours,
			dateadd(hour, -(OffsetInfo.UtcOffsetHours + OffsetInfo.DstOffsetHours), OffsetInfo.LocalTime) UtcTime
	from	OffsetInfo;
go

To use this, we call the fn_LocalTimeToUTC function. It has three parameters.

  • The local time, expressed as a datetime2
  • The offset of local time (ignoring daylight saving) from UTC
  • A bit indicating if daylight saving time is observed locally

  • For example, I am in the Central Time Zone (-6 hours from UTC) and daylight saving is observed, so I would call:

    select * from fn_LocalTimeToUTC(sysdatetime(), -6, 1);

    And the output is:

    LocalTimeToUTC

    Note that there is ambiguity between 1:00 am and 2:00 am when DST ends in the fall, and there is no good way to resolve that ambiguity. This code presumes that DST is still in effect.

    Because fn_LocalTimeToUTC is written as a TVF, it can be efficiently used in set-based operations, usually using an OUTER APPLY operator. For example:

    select oh.OrderId, oh.OrderDate, utc.UtcTime
    from OrderHeader oh
    outer apply Admin.dbo.fn_LocalTimeToUTC(oh.OrderDate, -6, 1) utc
    where oh.OrderId >= 50
    and oh.OrderId <= 59;

    LocalTimeToUTCSetBased