Saturday, 10 April 2010

Validating Integers in SQL Server

Introduction

Hypothetical scenario: Data has been loaded into a SQL Server table from a CSV file; each column in the table has been defined as VARCHAR(255) to prevent the data load job from failing if one or more values are not of the expected data type. You're job is to copy data from this 'raw' table into one where the columns are strongly typed (i.e. INT, DATETIME, DECIMAL, etc). The requirement is to pull through those rows which are entirely valid, but to ignore those rows where any value in the VARCHAR column cannot be successfully converted to the target data type. Actually, if rows which are only partially valid could be copied across too, with invalid values being set to NULL in the target table, then that would be even better.

The requirement seems simple enough, but how would you do this? It's actually quite tricky to get right. So I'm going to make the scenario even simpler - you only have to support INT data types. Your job is simply to make sure that the supplied VARCHAR can be converted to an INT before you actual attempt to do so.

ISNUMERIC

You first port-of-call might be to use ISNUMERIC to establish whether a source value is numeric. This sounds reasonable enough.

SELECT ISNUMERIC('9876543210')

The above statement returns 1 - the value '9876543210' is numeric. But is it an INT? No - it's too large. The documentation for ISNUMERIC says that it returns 1 if the supplied character column can be converted to at least one of the numeric data types. So that's not particularly useful if we're trying to establish whether a value is a valid INT or not.

ARITHABORT

So you decide simply to convert the value, ignoring arithmetic overflows using SET ARITHABORT OFF. Let's try:

SET ARITHABORT OFF
SELECT CONVERT(INT, '9876543210')

Msg 248, Level 16, State 1, Line 2
The conversion of the varchar value '9876543210' overflowed an int column. Maximum integer value exceeded.

Then you remember that setting ARITHABORT OFF has no effect unless you also set ANSI_WARNINGS OFF. So you try again:

SET ARITHABORT OFF
SET ANSI_WARNINGS OFF
SELECT CONVERT(INT, '9876543210')

Perfect - this selects the value NULL, which can be easily tested for.

User-Defined Function

So we wrap this test into a user-defined function so it can be easily re-used.

CREATE FUNCTION dbo.IsINT
(
  @Input VARCHAR(255)
)
RETURNS BIT
AS
BEGIN
  SET ARITHABORT OFF
  SET ANSI_WARNINGS OFF
  RETURN CASE WHEN CONVERT(INT, @Input) IS NULL THEN 0 ELSE 1 END
END
GO

Msg 443, Level 16, State 15, Procedure IsINT, Line 8
Invalid use of side-effecting or time-dependent operator in 'SET OPTION OFF' within a function.
Msg 443, Level 16, State 15, Procedure IsINT, Line 9
Invalid use of side-effecting or time-dependent operator in 'SET OPTION OFF' within a function.

We can't use SET option OFF within a function? That's a shame. Okay, so we'll have to require that the caller does this. But the least we should do is check that they have.

CREATE FUNCTION dbo.IsINT
(
  @Input VARCHAR(255)
)
RETURNS BIT
AS
BEGIN
  IF SESSIONPROPERTY('ARITHABORT') = 1
    RAISERROR (N'This function requires that ARITHABORT be set to OFF before it is called.', 16, 1)

  IF SESSIONPROPERTY('ANSI_WARNINGS') = 1
    RAISERROR (N'This function requires that ANSI_WARNINGS be set to OFF before it is called.', 16, 1)

  RETURN CASE WHEN CONVERT(INT, @Input) IS NULL THEN 0 ELSE 1 END
END

Msg 443, Level 16, State 14, Procedure IsINT, Line 9
Invalid use of side-effecting or time-dependent operator in 'RAISERROR' within a function.
Msg 443, Level 16, State 14, Procedure IsINT, Line 12
Invalid use of side-effecting or time-dependent operator in 'RAISERROR' within a function.

Nope. Can't do that either. We'll just have to be content with returning NULL if the options haven't been set correctly and hoping the caller has read the documentation for our function.

CREATE FUNCTION dbo.IsINT
(
  @Input VARCHAR(255)
)
RETURNS BIT
AS
-- Note to callers: ARITHABORT and ANSI_WARNINGS must be set to OFF
-- before calling; if not, this function will always return NULL.
BEGIN
  IF SESSIONPROPERTY('ARITHABORT') = 1
    RETURN NULL

  IF SESSIONPROPERTY('ANSI_WARNINGS') = 1
    RETURN NULL

  RETURN CASE WHEN CONVERT(INT, @Input) IS NULL THEN 0 ELSE 1 END
END

Let's try our function with something which is clearly a valid integer:

SELECT dbo.IsINT('12345')

Ooops - forgot to set ARITHABORT and ANSI_WARNINGS off so it returned NULL. Good job it warned me. Let's try again.

SET ARITHABORT OFF
SET ANSI_WARNINGS OFF
SELECT dbo.IsINT('12345')

That worked (it returned 1). Let's throw a few more valid integers at it:

SELECT dbo.IsINT('2147483647')
SELECT dbo.IsINT('0')
SELECT dbo.IsINT('-0')
SELECT dbo.IsINT('-') -- treated as 0
SELECT dbo.IsINT('-2147483648')

Yup - they all returned 1. How about a couple which are out-of-range:

SELECT dbo.IsINT('2147483648')
SELECT dbo.IsINT('-2147483649')

Both these returned 0. Finally, let's check to see what happens when we pass a non-integer number or a non-numeric value.

SELECT dbo.IsINT('1.23')

Msg 245, Level 16, State 1, Line 1
Conversion failed when converting the varchar value '1.23' to data type int.

SELECT dbo.IsINT('N/A')

Msg 245, Level 16, State 1, Line 1
Conversion failed when converting the varchar value 'N/A' to data type int.

Not so good. It seems that ARITHABORT can only be used to suppress errors of an arithmetic nature, not all conversion errors.

Numeric, but not Integer

We can cater for non-numerics by using ISNUMERIC within our function:

  RETURN CASE WHEN ISNUMERIC(@Input) = 0 OR CONVERT(INT, @Input) IS NULL THEN 0 ELSE 1 END

But how are we going to cope with '1.23' without raising an error? I guess we could convert the input to a FLOAT first (on the assumption that all genuine numerics will convert to a FLOAT) and then check to see if there are any digits to the right of the decimal place. Or we could take a look for a '.' within the input string. We might need to be a little careful as I'm sure there are some locales which use a character other than '.' as the decimal symbol, although I'm not sure what support SQL Server has for these. A quick check in Control Panel shows that the decimal symbol is indeed configurable within Windows itself, with the default in France being ','. I can't see any support for this within SQL Server though so perhaps that's a red herring.

So let's go with the simple option of looking for a '.' within the input string; if we find one we know the input cannot be an integer. (I'm ignoring the question of whether or not "1.0" should be considered an integer for the moment.)

Our RETURN statement thus becomes:

  RETURN CASE WHEN CHARINDEX('.', @Input) != 0 OR ISNUMERIC(@Input) = 0 OR CONVERT(INT, @Input) IS NULL THEN 0 ELSE 1 END

Surely we must have tested for everything now: we only even attempt the conversion if the input is devoid of a '.' and is considered to be numeric, and then we ignore the result of any arithmetic overflow. Job done.

Exponential Notation

Hang on a moment. What about integers expressed in exponential form? What happens if we try our IsINT function against '1E5'. It doesn't contain a '.', is numeric, and shouldn't cause an overflow. Let's see:

SELECT dbo.IsINT('1E5')

Msg 245, Level 16, State 1, Line 1
Conversion failed when converting the varchar value '1E5' to data type int.

Now 1E5 is certainly an integer. It's just 100000 expressed in exponential form. But the requirement was to "to make sure that the supplied VARCHAR can be converted to an INT". So the fact that 1E5 is an integer is irrelevant - we only care about whether or not it can be converted to an INT. It cannot. In fact, this little 'Get Out of Jail Free' card allows us to ignore the issue of whether or not "1.0" is an integer or not too - CONVERT chokes on it, and that's all that counts.

So, it's looks like we're going to have to handle this as a special case. Using the same approach we used when checking for a '.' seems the right thing to do. Actually, seeing as we'll have to check for '.', 'E' and 'e' we might as well use LIKE.

Solution

The final solution we end up with is therefore:

CREATE FUNCTION dbo.IsINT
(
  @Input VARCHAR(255)
)
RETURNS BIT
AS
-- Note to callers: ARITHABORT and ANSI_WARNINGS must be set to OFF
-- before calling; if not this function will always return NULL.
BEGIN
  IF SESSIONPROPERTY('ARITHABORT') = 1
    RETURN NULL

  IF SESSIONPROPERTY('ANSI_WARNINGS') = 1
    RETURN NULL

  RETURN CASE WHEN @Input LIKE '%[.Ee]%' OR ISNUMERIC(@Input) = 0 OR CONVERT(INT, @Input) IS NULL THEN 0 ELSE 1 END
END

That's quite a lot of effort just to establish whether a VARCHAR can convert into an INT. Imagine how much more effort it would be if we needed to support DATETIME types too, with all their variations, or DECIMALs with precision and scale to account for. And even now, I don't know for sure that IsINT won't thrown an exception one day when someone passes something to it which I haven't allowed for.

So, the next time someone suggests you just load data as VARCHARs and do the conversion to a string type within SQL Server, I hope you'll at least think twice before saying what a good idea that is.

Post Script

In case any of you are thinking that the easy option would simply be to wrap the attempted conversion in a TRY CATCH block, you're out of luck. SQL Server doesn't support the use of exception handling within a user-defined function. And clearly you don't want to attempt to convert an entire set of data within a TRY CATCH block as a single conversion failure will drop you into the CATCH block, whilst we want good data to survive. The only way I can see to use the TRY CATCH approach is to CURSOR over the set of data being converted. Okay for small sets of data perhaps, but not for anything larger.

No comments:

Post a Comment