
Shortcutting/Clowning/Virtualising Azure SQL Database Bulk Loads

-3 votes
1 answer
86 views
I'm currently working on a solution to bulk load (almost) structured data into an Azure SQL Database. The long term plan is to use ADF, but the client is keen to have some static data loaded early in the process so they can start building BI stuff off it - and I'm keen to keep things agnostic so we can evaluate different bulk load methodologies based on efficiency and cost. I'm also trying to make everything as foolproof and automated as possible, but BCP into Azure SQL Database is posing some issues.

**Background**

The source data I'm working with is structured at a glance, but has some serious quality issues. My design goals are:

- 3NF data model underlying everything
- Minimal pre-load manipulation of the source data
- Dumb insert surfaces
- Trigger-based data manipulation
- Minimal staging

My on-prem prototype (SQL Server 2019 Developer) works really well and exactly as intended with some unconventional methods. For example, here's a simplified but typical simulation of the source data:

| ProjectNumber | ProjectName | Contractor | State | Cost (AU$k) | Approved |
|---------------|-------------|------------|-------|-------------|----------|
| 1234 | Project ABC | XYZ Construction | NSW | 1000 | TRUE |
| 1235 | Project DEF | TUV Civil | ACT | na | unknown |
| 1236 | Project GHI | XYZ Construction | QLD | tba | FALSE |
| 1237 | Project JKL | | NSW | 2,000 | FALSE |

Without griping too much about this (paid subscription!) source data, there are obviously some problems. I've built the underlying schema with the following model:

```sql
-- Properly typed tables
CREATE TABLE _Projects (
    ProjectNumber int IDENTITY(1,1)  -- PK
  , ProjectName nvarchar
  , ContractorID int                 -- FK to _Contractors table
  , StateID int                      -- FK to _States table
  , Cost int
  , IsApproved bit
);

-- "Exposition View"
CREATE VIEW Projects AS (
    SELECT ProjectNumber
         , ProjectName
         , _Contractors.ContractorName
         , _States.StateName
         , Cost
         , IsApproved
    FROM _Projects
    LEFT JOIN _Contractors ON [...]
    LEFT JOIN _States ON [...]
);
```

Typical, sensible stuff. Then the idea was to have INSTEAD OF INSERT triggers to catch the type violations and clean them up. But because the schema of *inserted* is determined by the target table/view, I get pinged with data type violations for the non-numeric values in the int columns. No problem though, because I can just do this:

```sql
-- View for Insert Clowning
CREATE VIEW Projects#Insert AS (
    SELECT ProjectName
         , _Contractors.ContractorName
         , _States.StateName
         , CAST(Cost AS varchar(11)) AS Cost
         , CAST(IsApproved AS varchar(5)) AS IsApproved
    FROM _Projects
    LEFT JOIN _Contractors ON [...]
    LEFT JOIN _States ON [...]
);

-- Retyping on insert over view
CREATE TRIGGER TR_Projects#Insert ON Projects#Insert
INSTEAD OF INSERT
AS
BEGIN
    -- Insert into FK tables first
    [...]

    -- Insert into main table
    INSERT INTO _Projects (ProjectName, ContractorID, StateID, Cost, IsApproved)
    SELECT ProjectName
         , _Contractors.ContractorID
         , _States.StateID
         , CASE WHEN Cost IN ('na', 'tba') THEN NULL
                ELSE CAST(REPLACE(Cost, ',', '') AS int)
           END AS Cost
         , CASE (IsApproved)
               WHEN 'TRUE' THEN 1
               WHEN 'FALSE' THEN 0
               ELSE CAST(IsApproved AS bit)
           END AS IsApproved
    FROM inserted
    LEFT JOIN _Contractors ON [...]
    LEFT JOIN _States ON [...]
END
```
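As a sanity check on the clowning, it's easy to confirm the data types the insert view actually exposes - which is also the metadata I'd expect format file generation (and the server-side bulk handling) to pick up. A minimal example, run against the objects above:

```sql
-- Report the result-set metadata the engine derives for the insert view.
-- Cost and IsApproved should come back as varchar(11) and varchar(5).
SELECT name, system_type_name, is_nullable
FROM sys.dm_exec_describe_first_result_set(
         N'SELECT * FROM Projects#Insert', NULL, 0);
```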
I'm sure there are potential pitfalls to this, but the advantages I see are:

- The normalisation/validation/remediation are all self-contained and centrally maintainable by SQL admins.
- The insert surface is dumb, meaning it should be fairly tool agnostic. Rather than rewriting the same logic for ADF, BCP, OPENROWSET, XML, JSON, etc. processes, we can just pipe the same low quality data in with a zero-logic one-to-one mapping using whatever best fits the source, and know that the target will handle it.
- Formatting for OPENROWSET and BCP is super low effort, because I can target the clowned views and get an exact schema for the insert, based on the data types and casts in the view, instead of targeting a table and then manually editing identity columns, FKs and the like.
- The staging is handled by the *inserted* pseudo-table - no need for side-by-side staging or temp tables and two-step inserts.
- In theory, it can sidestep a whole lot of Azure resourcing - no need for ADF or Blob storage when you can use BCP on a remote client, right?

It works really well with my on-prem instance. I've tried BCP and OPENROWSET bulk inserts and both work perfectly. However...

**The Problem**

Azure SQL Database gives me errors because it seems to see through the clown makeup. For example, to zero in on the bit column: I get nothing but NULLs, regardless of the inserted value. To troubleshoot, I changed the insert trigger to capture the data into a debugging table:

```sql
ALTER TRIGGER TR_Projects#Insert ON Projects#Insert
INSTEAD OF INSERT
AS
BEGIN
    -- Capture the raw and cast values for inspection
    INSERT INTO _Debug (InsertValue, CastValue)  -- varchar and bit respectively
    SELECT IsApproved, CAST(IsApproved AS bit)
    FROM inserted
END
```

When I check the data, it's all NULLs in _Debug - even though it's ostensibly inserting varchar(5) values into a varchar(5) pseudo-column, and this is completely different to the behaviour of on-prem SQL 2019. If I change the bit column in _Projects to varchar(5), it all works - so it's clearly ignoring the view definition in favour of the underlying table data type. Interestingly, this works:

```sql
INSERT INTO Projects#Insert (ProjectName, ContractorName, StateName, Cost, IsApproved)
VALUES ('Project ABC', 'XYZ Construction', 'NSW', '1000', 'TRUE')
     , ('Project DEF', 'TUV Civil', 'ACT', 'na', 'unknown')
     , ('Project GHI', 'XYZ Construction', 'QLD', 'tba', 'FALSE')
     , ('Project JKL', '', 'NSW', '2,000', 'FALSE')
```

Everything validates and casts as expected - so it does appear that the Azure SQL version of the BCP *server* handling is getting overzealous and tripping up on a false positive, rather than this being a difference between Azure SQL (2022-ish) and on-prem SQL Server 2019. I haven't looked in depth at the integer column *Cost*, but it throws errors because it can't implicitly cast character data. So again, it's favouring the underlying table schema instead of the view schema.

I guess my questions are:

- Has anyone else ever tried this sort of thing, and have they been able to work around any restrictions from BCP?
- Is anyone aware of changes to the SQL engine since 2019 that might explain this? I've found this vague allusion to enforced validation that "might cause scripts to fail", and it seems highly probable that this enforced validation is the reason my otherwise valid data load method is failing.
- Failing any of that, are there any alternate methods that meet my design goals? Staging the data is an obvious choice (roughly the two-step pattern sketched below), but before falling back to that I'd really like to completely rule out inline retyping, which seems like a more efficient and elegant way of handling the complexities of poor data quality.
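For reference, the two-step staging pattern mentioned above would look roughly like this: a raw, all-text landing table plus a typed pass-through. Table and column names here are illustrative, not part of my actual schema.

```sql
-- Side-by-side staging: land everything as text, then push it through
-- the existing insert view (or straight into _Projects with the casts inline).
CREATE TABLE Stage_Projects (
    ProjectNumber varchar(10)
  , ProjectName   nvarchar(200)
  , Contractor    nvarchar(200)
  , State         nvarchar(50)
  , Cost          varchar(20)
  , Approved      varchar(10)
);

-- Step 1: BCP/OPENROWSET loads the raw file into Stage_Projects with no logic.
-- Step 2: forward the rows, letting the trigger do the retyping.
INSERT INTO Projects#Insert (ProjectName, ContractorName, StateName, Cost, IsApproved)
SELECT ProjectName, Contractor, State, Cost, Approved
FROM Stage_Projects;
```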
And to get some of the obvious clarifying questions out of the way:

- I'm using the *FIRE_TRIGGERS* hint (see the bcp sketch after this list) and can confirm my triggers are triggering
- The CHECK_CONSTRAINTS hint has no discernible effect
- My BCP format file has the relevant fields as CharTerm, along with sensible collation and max length
- My BCP format file also has the relevant columns as SQLVARYCHAR - so I'm not doing anything silly like loading a bit value into an nvarchar and then casting it back to bit.
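For completeness, the general shape of the bcp calls I'm running is below. Server, database, credentials, and file paths are placeholders, and the # in the view name may need quoting depending on your shell:

```
rem Generate a format file from the clowned insert view...
bcp dbo.Projects#Insert format nul -c -f Projects_Insert.fmt ^
    -S myserver.database.windows.net -d MyDb -U myuser -P mypassword

rem ...then load through it, firing triggers and checking constraints on the way in.
bcp dbo.Projects#Insert in projects.csv -f Projects_Insert.fmt ^
    -S myserver.database.windows.net -d MyDb -U myuser -P mypassword ^
    -h "FIRE_TRIGGERS, CHECK_CONSTRAINTS" -e bcp_errors.log
```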
Thanks as always!

Asked by Vocoder (117 rep)
Jul 20, 2023, 04:32 PM
Last activity: Jul 22, 2023, 09:55 AM