
Database Administrators

Q&A for database professionals who wish to improve their database skills

Latest Questions

2 votes
1 answer
151 views
How to implement updates from MySQL operational DB to Azure SQL DB for reporting
We have an **operational MySQL DB running on AWS** for a transactional system and an **Azure SQL DB for reporting** with Power BI. Now I'd like to regularly (e.g. every night) update certain tables in the Azure SQL DB from the MySQL DB. I found this description of how to do incremental copies using Azure Data Factory, but the alternatives it lists don't seem feasible to me:

1. Delta data loading from a database by using a watermark requires adding watermark columns to the source DB, but I don't want to make changes to the operational DB because it is managed and regularly updated by the transactional system.
2. Delta data loading from SQL DB by using the Change Tracking technology seems to require a SQL Server DB as the source, if I understand it correctly.

The remaining two alternatives apply only to updates from files, not DBs, to my understanding.

Are there other feasible alternatives under the described conditions? They don't necessarily need to involve Azure Data Factory, but the updates should run completely automated in the cloud. Maybe a non-incremental update (i.e. a full replacement of the target DB tables every time) would be an option too, but I'm afraid that would lead to high costs on the Azure SQL side - please share any experience with that as well, if available.
Jörg Brenninkmeyer (121 rep)
May 2, 2019, 07:26 AM • Last activity: Jul 21, 2025, 06:05 AM
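For the first alternative the question above rules out: a watermark pull does not necessarily require adding columns, provided the operational tables already expose a reliable modification timestamp. A minimal sketch of the per-table query an ADF copy activity (or any scheduled job) could issue against MySQL - the table, column, and parameter names are placeholders, and the existence of an `updated_at` column is an assumption:

```sql
-- Incremental extract: pull only rows changed since the last successful load.
-- @last_watermark is kept on the reporting side (e.g. a small control table
-- in the Azure SQL DB) and advanced only after the load succeeds;
-- @current_watermark is captured at the start of the run.
SELECT o.*
FROM   orders AS o
WHERE  o.updated_at >  @last_watermark
  AND  o.updated_at <= @current_watermark;
```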
0 votes
1 answer
569 views
ADF: TempDb to user table very slow
I have an Azure Data Factory Data Flow that writes ~250 million rows (~100 GB) to a SQL Server 2019 instance. Currently, I have the 'Use TempDB' option set to true (screenshot: https://i.sstatic.net/DK5dD.png).

The first part of the job, i.e. writing all the data to the temporary table, is fairly quick (~2 h). As far as I can see in the activity monitor, it then basically copies the data from the temp table into the target table. However, this part is incredibly slow, taking several days. Checking the activity monitor and the logs, the responsible task seems to get suspended all the time due to PAGEIOLATCH_SH waits. Additionally, to avoid a bottleneck from log writes, the recovery model is set to SIMPLE. The SQL Server instance and database were freshly set up just for this task, so there are no other running or interfering tasks. How come it constantly runs into these PAGEIOLATCH_SH waits?
guid (1 rep)
Aug 31, 2022, 07:36 AM • Last activity: Jun 5, 2025, 04:05 AM
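One way to check, from the database side, whether the PAGEIOLATCH_SH waits line up with slow storage is to snapshot per-file I/O latency while the temp-table-to-target phase is running; a minimal sketch, with no assumptions beyond standard DMV access:

```sql
-- Rough I/O latency check while the temp-table -> target copy is running.
-- The stall figures are cumulative since instance start, so compare two
-- snapshots taken a few minutes apart to see latency during the copy itself.
SELECT DB_NAME(vfs.database_id)                               AS database_name,
       mf.physical_name,
       vfs.num_of_reads,
       vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)    AS avg_read_latency_ms,
       vfs.num_of_writes,
       vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0)   AS avg_write_latency_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id AND mf.file_id = vfs.file_id
ORDER BY avg_read_latency_ms DESC;
```

Sustained read latencies far above normal on the data files during the second phase would point more at the storage tier than at the data flow settings.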
1 vote
2 answers
594 views
Migrating Azure SQL Database to Azure SQL Managed Instance
What are the different options to move a database from Azure SQL Database to Azure SQL Managed Instance? The options below don't seem to be possible:

1) Azure Database Migration Service - does not support Azure SQL Database as a source.
2) Export a bacpac and import it with sqlpackage - this is not working and gets stuck with no result.

The only option I see is Azure Data Factory with a self-hosted integration runtime. Are there better options to move from Azure SQL Database to Azure SQL Managed Instance?
Dyaneshwaran S (11 rep)
Feb 3, 2020, 06:23 PM • Last activity: May 15, 2025, 03:24 PM
0 votes
1 answer
431 views
Replace Managed Instance replication with CDC and ADF or Azure Function?
The company I work for has put everything on Azure. We use SQL Server replication to move data from one big collection DB server (a Managed Instance) to our other database servers (say 20 in total). Every day we publish around 10 million new/updated rows (some days more, say 100 million plus) across various databases. We can only have one publisher (my understanding), and fairly often we see replication commands build up, things slow down, and our DBAs firefight to get things moving again.

On the database server where all the data is collected, we have enabled CDC (change data capture). I am wondering whether I should create 10 or 20 Azure Functions (C# code) to periodically pull changes from CDC. These functions would then copy the changes to our 20 database servers (assume all of these servers need all of this data). Would this be a reasonable alternative to replication? To my mind, each Azure Function acts like a distributor, so we would suddenly have 10 or 20 distributors instead of just one. I could use Azure Data Factory to do it, but in my case it is far more expensive than Azure Functions. Is this a good idea, or would we run into any big issues?
jerry xu (63 rep)
Jan 5, 2023, 11:22 PM • Last activity: Mar 5, 2025, 07:36 AM
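For context on what each Azure Function would execute, here is a minimal sketch of polling CDC for one table; the capture instance name (`dbo_Orders`), the watermark table, and the consumer naming scheme are placeholders rather than anything from the original setup:

```sql
-- Pull all changes for one capture instance since this consumer's last run.
-- (First run: seed the watermark from sys.fn_cdc_get_min_lsn(N'dbo_Orders').)
DECLARE @from_lsn binary(10), @to_lsn binary(10);

SELECT @from_lsn = last_processed_lsn
FROM   dbo.CdcWatermarks
WHERE  consumer = N'target-server-01'
  AND  capture_instance = N'dbo_Orders';

-- Start just after the last LSN already applied, up to the newest available LSN.
SET @from_lsn = sys.fn_cdc_increment_lsn(@from_lsn);
SET @to_lsn   = sys.fn_cdc_get_max_lsn();

SELECT *
FROM   cdc.fn_cdc_get_all_changes_dbo_Orders(@from_lsn, @to_lsn, N'all');

-- After the rows are applied on the target, persist @to_lsn as the new watermark.
UPDATE dbo.CdcWatermarks
SET    last_processed_lsn = @to_lsn
WHERE  consumer = N'target-server-01'
  AND  capture_instance = N'dbo_Orders';
```

With 10-20 independent consumers, the main operational concern is keeping the CDC cleanup retention longer than the slowest consumer's lag, otherwise a lagging consumer's starting LSN can fall outside the valid CDC range.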
0 votes
1 answer
987 views
Azure Data Factory - copy data activity into a PostgreSQL database (on an IaaS Windows Azure VM, not Azure Database for PostgreSQL (PaaS))
I have a PostgreSQL database on a Windows VM (not a PaaS Azure Database), and I need to import CSV files into it daily. ADF seems like a perfect fit, but I see that a standard PostgreSQL DB is not a dataset option for the sink in the Copy Data activity. Is there any way to get this done in ADF? If not, could I please get some recommendations on other tools that include scheduling and alerting? Thanks!
Bobogator (95 rep)
Sep 19, 2022, 06:53 PM • Last activity: Jan 23, 2025, 08:00 PM
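If ADF (or any other scheduler) can land the daily CSV files somewhere the VM can read, a server-side import is one fallback worth weighing alongside ADF-native options; a minimal sketch using PostgreSQL's COPY, with schema, column, and path names as placeholders:

```sql
-- Bulk-load one daily CSV into a staging table on the VM-hosted PostgreSQL.
-- COPY ... FROM 'path' is read by the server process, so the file must be
-- local to the VM; from a remote client, psql's \copy streams it instead.
COPY staging.daily_extract (id, event_time, amount)
FROM 'C:\data\incoming\daily_extract.csv'
WITH (FORMAT csv, HEADER true);
```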
0 votes
1 answer
1231 views
In ADF, are copy data activities wrapped in transactions?
I have a copy data activity that moves data from a managed instance to a SQL database. The flow of the process is:

- truncate a staging table on the SQL database as a distinct activity
- call a stored procedure as the source in the copy activity
- land the data in the staging table on the SQL database in the copy activity

There is a retry on the copy activity because we are having transient issues, and this is the guidance from Microsoft for handling those errors. My question then is: if the data is being copied to the staging table, the copy is interrupted by a transient error, and the retry is called, will the staging table be empty because a transaction is rolled back? Or will some of the data from the first try still be there, leaving me with duplicate data? I have spent some time digging around, including https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-overview#resume-from-last-failed-run , but cannot find anything that clarifies this.
Schmocken (101 rep)
May 9, 2022, 03:04 PM • Last activity: Jan 20, 2025, 07:02 AM
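Whatever the answer on transactions, one way to make the retry safe is to make the load idempotent: move the truncate out of the separate activity and into the copy activity's pre-copy script on the sink, so each attempt starts against an empty staging table. A minimal sketch, with the table name as a placeholder:

```sql
-- Hypothetical pre-copy script for the copy activity's Azure SQL sink.
-- It runs at the start of every attempt, so rows left behind by an
-- interrupted attempt cannot turn into duplicates on the retry.
TRUNCATE TABLE stg.OrdersStaging;
```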
0 votes
1 answer
19 views
Local SQL Server -> MS Managed SQL warehouse via ADF. Compare UniqueID column in both to delete/insert the changes?
I have a local SQL Server that maintains a replica of our ERP production DB. I use SQL Agent to run SSIS packages that truncate the replica and do a full copy of the production DB. I am adding a uniqueID column that is a HASHBYTES of all columns after casting them to varchar and concatenating. I want to use this uniqueID to find rows in our company's cloud warehouse that need to be deleted or added.

I have tried pulling the tables into a data flow as sources, but data flows don't work with self-hosted integration runtime (SHIR) sources. So I used a copy activity to write just this uniqueID column to a CSV in Azure Blob Storage, then opened the CSV in the data flow. When I try joining the two CSV files in the data flow, the hashes (now strings, because CSV) don't seem to line up. Even if the join did work, it still feels very inefficient to save the column to blob storage before processing. Is there a better way to do this in a pipeline?
funkyman50 (1 rep)
Apr 5, 2024, 10:41 PM • Last activity: Dec 31, 2024, 05:13 PM
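For reference, the row hash described above might be computed as in the sketch below. The table and column names are placeholders, CONCAT_WS assumes SQL Server 2017 or later, and rendering the hash as a fixed-length hex string is one way to sidestep the string-comparison mismatches seen after the CSV hop:

```sql
-- Deterministic per-row hash over the business columns of the replica table.
-- CONCAT_WS keeps a separator between values so ('ab','c') and ('a','bc')
-- hash differently; NULLs become empty strings, which is usually acceptable
-- for change detection but worth being aware of.
SELECT OrderId,
       CONVERT(char(64),
               HASHBYTES('SHA2_256',
                         CONCAT_WS('|',
                                   CAST(OrderId     AS nvarchar(50)),
                                   CAST(CustomerId  AS nvarchar(50)),
                                   CAST(OrderDate   AS nvarchar(50)),
                                   CAST(TotalAmount AS nvarchar(50)))),
               2) AS RowHash   -- style 2 = hex digits with no '0x' prefix
FROM dbo.ErpOrdersReplica;
```

Producing the hex string on the SQL side means both sides of the later comparison see plain strings of identical length and casing, which removes one common source of the "hashes not lining up" symptom.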
0 votes
0 answers
29 views
Local SQL to Managed Cloud SQL via ADF. Copy Data throughput fell off a cliff
I have an on-site mini PC acting as a data extraction and replication SQL Server that copies data nightly from the ERP production database, which runs on some legacy Linux architecture. The copying is done with SSIS jobs run by SQL Agent, which truncate the SQL tables and SELECT * from the production tables. This mini PC has the self-hosted integration runtime installed and connected to our ADF environment. Similarly, ADF runs nightly Copy Data pipelines that truncate our managed cloud SQL tables and SELECT * from the on-site replication tables.

This whole project is only about a month old. For the first two weeks these ADF pipelines saw throughput between 6 and 8 MB/s. But suddenly, over a weekend, they started reaching only ~550 KB/s at best. Pipelines that took 23 seconds on average now take 2 to 3 minutes. The longest pipeline, which averaged 11 minutes for ~13 million rows, now takes about three hours. The local Prod -> Replication SSIS jobs are still as fast as ever. I have my IT friends looking at the problem from a networking perspective. Is there anything I can troubleshoot from a DB optimization angle?
funkyman50 (1 rep)
Feb 14, 2024, 02:42 PM
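From the database side, a quick first check is to see what the copy sessions are actually waiting on while a slow pipeline runs; a minimal sketch to run on the source or target instance during the slow window:

```sql
-- Snapshot of active requests and their current waits while a pipeline runs.
SELECT r.session_id,
       s.program_name,                       -- ADF/IR copy sessions are usually identifiable here
       r.status,
       r.command,
       r.wait_type,
       r.wait_time           AS wait_time_ms,
       r.total_elapsed_time  AS elapsed_ms
FROM sys.dm_exec_requests AS r
JOIN sys.dm_exec_sessions AS s
  ON s.session_id = r.session_id
WHERE s.is_user_process = 1
ORDER BY r.total_elapsed_time DESC;
```

If ASYNC_NETWORK_IO dominates on the source, SQL Server is waiting for the consumer to pull rows, which usually means the bottleneck is downstream (network or the integration runtime) and would fit the "local jobs are still fast" observation.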
0 votes
1 answer
136 views
How to archive an on-premises SQL Server database to Azure
We have an on-premises database of about eight terabytes that we are trying to archive to Azure. How can we go about it? The archive needs to remain accessible because we might need to pull data from it in the future. I have read about Azure Data Lake, Blob Storage and Data Factory, but I still don't know how to approach this.
David ONAJOBI
May 16, 2023, 04:45 PM • Last activity: May 17, 2023, 03:08 AM
2 votes
1 answer
838 views
What permissions are needed for a user running Ola Hallengren's IndexOptimize stored procedure?
I'm planning to orchestrate my index maintenance with other jobs in my Azure SQL Database (serverless) using Azure Data Factory. The job will be run by the managed identity of my ADF service, and the MI has been added to the db_datareader, db_datawriter and db_ddladmin roles, as well as being granted EXECUTE rights. The statistics update worked, but after that I got this error:

> Sql error number: 50000. Error Message: Msg 297, The user does not have permission to perform this action.

What permission(s) is my MI missing? I would like to avoid making it db_owner if possible.
Widforss (23 rep)
Oct 18, 2022, 12:47 PM • Last activity: Oct 18, 2022, 01:14 PM
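Purely as a hedged pointer: IndexOptimize reads fragmentation through sys.dm_db_index_physical_stats, which is not covered by the roles listed above, so granting VIEW DATABASE STATE to the managed identity's database user is a common least-privilege candidate to test before resorting to db_owner. The user name below is a placeholder, and whether this resolves Msg 297 in this exact setup is an assumption:

```sql
-- Hedged guess: fragmentation checks via sys.dm_db_index_physical_stats
-- generally need VIEW DATABASE STATE in Azure SQL Database.
GRANT VIEW DATABASE STATE TO [adf-managed-identity];
```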
3 votes
3 answers
1273 views
ADF patterns to minimise batch run elapsed time
We use Azure Data Factory (ADF) to pull a number of source tables from an on-prem SQL Server DB into Azure Data Lake (DL). We've made this data-driven using the Lookup-ForEach pattern. There is one big table, a couple of large-ish ones and several small ones, ranging from 400 GB down to 1 MB.

*fig 1: Tables' sizes. The distribution is very skewed.*

The degree of run-time parallelism is controlled by the ADF ForEach activity's Batch Count parameter. I think of this as the number of work queues, or "workers", available. The standard implementation distributes items over workers in a round-robin fashion in advance of execution. This means there will always be tasks queued behind the largest table, which needlessly increases the overall elapsed time. Manual investigation suggests that if the largest table is given its own worker, all the other tables fit into three other workers and the elapsed times come out fairly uniform.

*fig 2: Arranging observed elapsed times so the largest table (dark blue) has its own worker yields fairly even end times.*

What techniques or patterns for allocating work to workers can I implement so that the overall elapsed time is minimized? We can vary ADF and the DL as we like. The corresponding integration runtime (IR) is finite, however; we cannot just scale out to resolve the situation. The source system is third-party: small modifications can be accommodated, but major source table refactoring is not possible. We will be adding further source tables, so a solution that requires minimal re-coding as sources are added would be preferable. The system is under active maintenance, so necessary changes can be implemented. Each source table's size varies from day to day, but not hugely: if a table holds 20 GB today it may be 19 GB or 21 GB tomorrow, but it will not be terabytes. So yes, this is a scheduling problem with the number of items, their relative sizes and the number of queues fairly stable.

### Related ###

Select data divided in groups evenly distributed by value
Michael Green (25255 rep)
Jul 26, 2021, 01:40 PM • Last activity: May 19, 2022, 04:27 AM
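One pattern that fits these constraints is to pre-assign a bucket in the Lookup query itself and run one inner pipeline per bucket, rather than letting ForEach round-robin the raw table list. A minimal sketch, assuming a hypothetical metadata table dbo.SourceTableSizes that the Lookup already reads; the serpentine assignment spreads the heaviest tables across buckets and keeps bucket totals roughly even:

```sql
-- Spread tables over @BucketCount workers in serpentine (boustrophedon) order:
-- the heaviest tables land in different buckets, and each pass of assignments
-- reverses direction so bucket totals stay roughly balanced.
DECLARE @BucketCount int = 4;

WITH ranked AS (
    SELECT TableName,
           SizeGB,
           ROW_NUMBER() OVER (ORDER BY SizeGB DESC) - 1 AS rn
    FROM dbo.SourceTableSizes
)
SELECT TableName,
       SizeGB,
       CASE WHEN (rn / @BucketCount) % 2 = 0
            THEN rn % @BucketCount                        -- forward pass
            ELSE @BucketCount - 1 - (rn % @BucketCount)   -- reverse pass
       END AS BucketId
FROM ranked
ORDER BY BucketId, SizeGB DESC;
```

A ForEach over the distinct BucketId values (Batch Count equal to @BucketCount) can then hand each bucket's list to an inner pipeline, so the 400 GB table effectively gets a worker to itself while the small tables share the remainder; adding a new source table only means adding a row to the metadata table.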
0 votes
0 answers
369 views
Replicating on-prem MS SQL databases with Azure Data Factory
Is it possible to replicate a database from an on-prem SQL Server to a SQL Server running on an Azure VM (SQL on IaaS, **not** a SQL Managed Instance or an Azure SQL database) using Azure Data Factory? Most of the documentation refers to cloud migration, i.e. from on-prem to Azure SQL. I'm hoping to use Azure Data Factory rather than log shipping or distributed AGs, and ideally something more near-real-time than hourly backup/FTP/restore.
Marcel (111 rep)
Mar 14, 2022, 12:48 PM
2 votes
3 answers
1971 views
How to run legacy SSIS packages with Azure Synapse
I would like to know how to run an SSIS package from Azure Synapse Studio, or whether that is possible at all. Apparently there is no support for the SSIS integration runtime in Synapse, but I would expect there to be a way to run legacy packages, since there is one in Azure Data Factory (the SSIS integration runtime). The only documentation I found is this page highlighting the differences between ADF and Azure Synapse Analytics: https://learn.microsoft.com/en-us/azure/synapse-analytics/data-integration/concepts-data-factory-differences , but I couldn't find anything explaining the reasons, or how to run legacy packages with Synapse.
Jayvee (121 rep)
Jul 19, 2021, 02:46 PM • Last activity: Mar 1, 2022, 10:18 PM
1 vote
1 answer
2567 views
How to call a Python file in a Databricks repo from Data Factory, outside DBFS?
In Azure Databricks I have a repo cloned which contains Python files, not notebooks. In Azure Data Factory I want to configure a step to run a Databricks Python file. However, when I enter /Repos/..../myfile.py (which works for Databricks notebooks) it gives me the error "DBFS URI must starts with 'dbfs:'". How can I reference a Python file from a repo which is not in DBFS?

NOTE: I see a duplicate question here, but the answer was just to wrap it in a Databricks notebook - an OK workaround, but when I do that I get "No module named 'my_python_file'": https://stackoverflow.com/questions/70096408/how-to-create-a-databricks-job-using-a-python-file-outside-of-dbfs
Brendan Hill (301 rep)
Dec 1, 2021, 08:14 AM • Last activity: Jan 7, 2022, 07:51 AM
-1 votes
2 answers
766 views
Accessing SQL Server on-prem database views from a remote server through Azure Data Factory
We need to copy data from a client's remote SQL Server (on-prem) database to our Azure SQL server through Azure Data Factory, so we can automate the data pull on a regular basis. The client has offered to create a domain login for us to access their database views. I would like to know whether that works for making a connection via Azure Data Factory pipelines and linked service connections. Please advise.
user237624 (1 rep)
Aug 31, 2021, 10:26 AM • Last activity: Sep 1, 2021, 05:20 AM
2 votes
0 answers
283 views
Exactly-once FIFO queue in Synapse
Creating a queue table in SQL Server is a much-studied problem. However, I would like to implement one in Azure Synapse, where many of the building blocks do not exist. Specifically:

* no table hints (READPAST etc.)
* no OUTPUT clause
* sp_getapplock is not available

Our Synapse instance is configured READ UNCOMMITTED. Each evening a batch job will run. Fewer than 100 items will be placed in the queue, in the desired order. These will then be processed by between 4 and 10 concurrent consumers until the queue is drained. The cycle repeats the following day. The number of items and consumers can change from day to day. The elapsed time per item is quite skewed, from under a minute to over an hour.

The design of the queue and the consumer code is completely open. The process is orchestrated by Azure Data Factory (ADF); the solution can rely on ADF if needed. We would rather avoid additional Azure services to limit costs. When a run fails we do not want to re-process completed items, but those in flight can be abandoned and restarted from scratch, i.e. checkpoint/restart at the item level is desired.
Michael Green (25255 rep)
Aug 2, 2021, 01:12 PM
2 votes
1 answer
539 views
How are tasks assigned to ForEach iterations
The Lookup-ForEach pattern is common in Azure Data Factory (ADF). How are items produced by the Lookup allocated to the ForEach's workers, the number of which is controlled by Batch Count?
Michael Green (25255 rep)
Jun 3, 2021, 01:23 PM • Last activity: Jun 3, 2021, 05:30 PM
-1 votes
1 answer
863 views
Data Factory keeps getting "untrusted connection" error when logging into on-prem SQL Server via IR on VM
We have:

1. A local SQL Server database
2. A VM with the integration runtime installed on it
3. A Data Factory sharing this IR, working (according to the ADF interface)
4. Logins stored in Key Vault for SQL Server auth, used by Data Factory to log into the SQL Server via the IR

The connection string properties appear to be what I want; note encryption is set to off and trustservercertificate=true. The IR reports as connected and working fine. I have tested the SQL Server auth on the VM with the IR installed via the IR's Diagnostics tab - works OK. From Data Factory, I test using the same login parameters on the linked service connected to the IR, which works. Note we use a lot of parameters to define the connections; these work fine as well. I set up a dataset using the same properties, and its test connection works. I then set up a simple copy activity using the same parameters, the same dataset and the same linked service. I can also preview the data I want, which works, bringing data back from the server as expected.

This makes no sense to me: I have tested using exactly the same parameters at each step, but a simple copy activity fails when all the other connection tests have succeeded. Anyone have any idea?

NOTE: I have tried setting the linked service connection parameter Encryption = True, which *sometimes* means the connection goes through OK. Again, this makes little sense.
blobbles (1621 rep)
Feb 4, 2021, 11:34 PM • Last activity: Feb 12, 2021, 01:27 AM
0 votes
1 answer
244 views
SQL Server 2016 COMPRESS function details - external decompression
According to the SQL Server 2016 docs, the COMPRESS and DECOMPRESS functions are just a black box - you put data in and, after some magic, it comes out compressed or decompressed. The problem is that I need to find a way to decompress this data outside SQL Server - preferably using Snowflake or Azure Data Factory. Does anyone have a clue how to approach this? It is difficult in particular due to the lack of detailed docs on the compression method/algorithm used.
Paweł Sopel (111 rep)
Oct 29, 2020, 12:54 PM • Last activity: Oct 29, 2020, 01:18 PM
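One thing that can be verified from T-SQL itself is what the COMPRESS payload looks like on the wire; the sketch below round-trips a value and shows the first two bytes, which for a GZIP stream are 0x1F8B - useful when deciding what an external tool would need in order to decompress the column:

```sql
-- Round-trip a value through COMPRESS/DECOMPRESS and peek at the header bytes.
DECLARE @blob varbinary(max) = COMPRESS(N'hello world');

SELECT SUBSTRING(@blob, 1, 2)                   AS first_two_bytes,  -- 0x1F8B for GZIP streams
       CAST(DECOMPRESS(@blob) AS nvarchar(max)) AS round_trip;       -- cast back to the original nvarchar input
```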
0 votes
1 answer
261 views
Alternative methods for client data integration (Azure SQL Database)
I have an application that ingests data from clients on a daily/weekly basis (two different data sets, one daily and the other weekly) into an Azure SQL Database. The clients' data source depends on what software they use, so it can vary from client to client. I currently have two methods of integration, depending on the client:

1. Azure Data Factory and a self-hosted integration runtime. In this method, the client is required to provide (within their network) a VM on which I set up the integration runtime, and a SQL Server database with just two tables where they dump the two datasets as required. In ADF, I create pipelines to pull the data directly from their SQL Server into my Azure SQL Database, then run the necessary import procedures.

2. Azure Data Factory and blob storage. In this method, I provide the client with a set of PowerShell scripts, run on a schedule (Windows Task Scheduler), that copy their exported files (.CSV) to our blob storage. The ADF pipelines then copy from blob storage to the Azure SQL DB and run the necessary import procedures.

The first method is much simpler, but in terms of infrastructure at the client end it seems like overkill to set up a mostly blank Windows VM and a database with just a couple of data-dump tables. Obviously, this can be costly if the client is themselves cloud-hosted - firing up a new VM is not cheap - which could make them think twice about using our product. The second method requires me to set up a storage container for each client, which I feel could make administration difficult as we scale up. Also, providing scripts to run with Windows Task Scheduler doesn't feel overly elegant.

Does anybody have any alternative solutions for this scenario? Or am I on the right track? Any insights would be greatly appreciated. Thanks.
brad (11 rep)
Feb 24, 2020, 12:15 AM • Last activity: Aug 31, 2020, 05:20 AM