We use Azure Data Factory (ADF) to pull a number of source tables from an on-prem SQL Server DB into Azure Data Lake (DL). We've made this data-driven using the Lookup-ForEach pattern.
There is one big table, a couple of large-ish ones and several small ones. They range from 400GB to 1MB.
*fig 1: Tables' sizes. The distribution is very skewed.*
The degree of run-time parallelism is controlled by the ADF ForEach activity's Batch Count parameter. I think of this as the number of work queues, or "workers" available.
The standard implementation distributes items over workers in a round-robin fashion in advance of execution. This means there will always be tasks queued behind the largest table, which needlessly increases the overall elapsed time. Manual investigation suggests that if the largest table is given its own worker, all the other tables fit into three other workers and the elapsed times come out roughly uniform.
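To illustrate the effect, here is a minimal Python sketch comparing a round-robin pre-assignment against giving the largest item its own worker. The sizes are made up for illustration (they are not my real table sizes), and the sketch assumes elapsed time is proportional to table size, which is only an approximation.

```python
from typing import List


def round_robin(sizes: List[float], workers: int) -> List[List[float]]:
    """Pre-assign items to workers in list order, rotating through the workers."""
    queues: List[List[float]] = [[] for _ in range(workers)]
    for i, size in enumerate(sizes):
        queues[i % workers].append(size)
    return queues


def makespan(queues: List[List[float]]) -> float:
    """Overall elapsed time: the load of the busiest worker."""
    return max(sum(q) for q in queues)


# Hypothetical sizes in GB, very skewed like the real distribution.
sizes = [400, 60, 45, 30, 20, 10, 5, 5, 2, 1, 1, 1]

# Plain round-robin over four workers: items queue behind the 400 GB table.
rr = round_robin(sizes, 4)

# Largest table alone on one worker, everything else round-robin over three.
dedicated = [[sizes[0]]] + round_robin(sizes[1:], 3)

print("round-robin makespan:", makespan(rr))
print("dedicated makespan:  ", makespan(dedicated))
```

With these made-up numbers the dedicated arrangement finishes when the largest table finishes, while round-robin finishes later because smaller tables wait behind it.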
*fig 2: Arranging observed elapsed times so the largest table (dark blue) has its own worker yields fairly even end times.*
What techniques or patterns can I implement to allocate work to workers so that the overall elapsed time is minimized?
We can vary ADF and the DL as we like. The corresponding Integration Runtime (IR) is finite, however. We cannot just scale out to resolve the situation.
The source system is third-party. Small modifications can be accommodated but major source table refactoring would not be possible.
We will be adding further source tables. A solution which requires minimal re-coding as sources are added would be preferable. The system is under active maintenance so necessary changes will be implemented.
Each source table's size will vary from day-to-day but not hugely. If the table holds 20GB today it may be 19GB or 21GB tomorrow, but it will not be terabytes.
So yes, this is a scheduling problem with the number of items, their relative sizes and the number of queues fairly stable.
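For reference, one well-known heuristic for this kind of problem is longest-processing-time-first (LPT): sort the items by size, largest first, and always hand the next item to the least-loaded worker. The sketch below is only an illustration of that heuristic, not something I have running in ADF. The table names and sizes are hypothetical, and it again assumes elapsed time tracks table size.

```python
import heapq
from typing import Dict, List, Tuple


def lpt_assign(sizes: Dict[str, float], workers: int) -> List[List[str]]:
    """Greedy LPT: largest item first, each to the currently least-loaded worker."""
    # Min-heap of (current load, worker index).
    heap: List[Tuple[float, int]] = [(0.0, w) for w in range(workers)]
    heapq.heapify(heap)
    buckets: List[List[str]] = [[] for _ in range(workers)]
    for name, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
        load, w = heapq.heappop(heap)
        buckets[w].append(name)
        heapq.heappush(heap, (load + size, w))
    return buckets


# Hypothetical table names and sizes in GB.
tables = {
    "big": 400, "t1": 60, "t2": 45, "t3": 30, "t4": 20,
    "t5": 10, "t6": 5, "t7": 5, "t8": 2, "t9": 1,
}

buckets = lpt_assign(tables, 4)
loads = [sum(tables[name] for name in bucket) for bucket in buckets]

for w, (bucket, load) in enumerate(zip(buckets, loads)):
    print(f"worker {w}: {bucket} (total {load} GB)")
```

LPT is a heuristic rather than an exact optimizer, but it naturally isolates a dominant item on its own worker, and new tables need no code change since the assignment is recomputed from whatever sizes are supplied. The resulting worker number could, for example, be stored in the metadata that the Lookup activity reads; how best to wire that into the pipeline is a separate question.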
### Related ###
Select data divided in groups evenly distributed by value


Asked by Michael Green
(25265 rep)
Jul 26, 2021, 01:40 PM
Last activity: May 19, 2022, 04:27 AM