Optimizing a Terabyte-Scale Azure SQL Database

This article is contributed. See the original author and article here.

Question Description:
A customer has designed a database schema for a Terabyte-level Azure SQL database, but the original design lacks a unique key. As the data volume increases, performance issues become more pronounced.

To address these performance issues, Azure SQL uses a hidden index with uniqueifier identifiers, which have a maximum limit of 2,147,483,648.

When the number of data rows in a table exceeds 2.1 billion, any insert or update operations will result in the following error:

Msg 666, Level 16, State 2, Line 25

This error can lead to partial or complete database outages, severely impacting business operations.

It isn’t a good idea to create a clustered index on a uniqueidentifier column and generate your GUIDs with NEWID(). The reason for this is that NEWID() generates GUIDs in non-sequential order and SQL Server orders a clustered index sequentially. It will work – SQL Server will let you build a clustered index around a uniqueidentifier column, however it will cause the SQL Server to do unnecessary work and cause performance slowdowns. The reason for this is that to insert data into the middle of a clustered index (out of sequential order) causes SQL Server to make room for the data by rearranging the cluster.

Mitigation Strategies:

1. Quick Workaround:

Make index creation online and resumable by specifying ONLINE=ON and RESUMABLE=ON. This allows the operation to use smaller transactions, making it possible to resume from the failure point if it fails for any reason.

2. Permanent Fix:

Rebuild the clustered index with a unique key by adding a new column. The steps are as followed:

I: Create the index with the new definition offline.

II: Use the DROP_EXISTING option for online index creation. This keeps both indexes up-to-date, and once the new index is built, it will replace the existing one.

III: Use the resumable option to break the creation into smaller transactions, reducing transaction log bloat. Note that there is some performance impact.

IV: Manually create a new copy of the table, copying all cold data, then rename the table. This involves scanning the source table, adding a small additional load to the system. Use a shorter outage and a merge statement for the new data.

V: Create a partitioned view that is a union of the old and new tables with different index schemes. Ensure that insert statements supply values for every field in the table. This method can be complex and may require taking the database offline, with the rebuilding process taking anywhere from a few hours to several days.

Best Practice:
Planning early and carefully designing the clustered index is crucial to avoid these issues.