Database Administrators
Q&A for database professionals who wish to improve their database skills
Latest Questions
0
votes
0
answers
19
views
ODI-26211 "Mapping name is not unique" error — even with fresh repository and no agent
I'm working with Oracle Data Integrator 12.2.1.4.0, and I'm encountering a persistent repository-related issue.
❗ Problem:
When I try to save changes to a mapping or generate a scenario, I receive the following errors:
ODI-26211: Mapping name is not unique
ODI-20088: Error while generating the scenario
ODI-10182: Uncategorized exception during repository access
This happens even though:
It's a fresh repository.
The mapping I created is the very first and only one.
Tables such as SNP_POP, SNP_SCEN, and SNP_STEP are completely empty.
🔍 Additional context:
Initially, I was able to save the mapping.
Later, the error started appearing.
I suspect this may have happened after I tried to run the mapping (it failed due to a missing agent), but I can't be sure.
Renaming the mapping doesn't help.
No agents are installed yet — I'm just using Standalone ODI Studio locally.
🔄 Environment:
Oracle Data Integrator 12.2.1.4.0 (also tried 14.1.2.0.0 with same result)
Oracle XE 21c, PDB: XEPDB1
Repositories freshly created
Windows Server 2019 Standard
No agent, just testing in local development
🛠️ Background:
Originally I started in ODI 14c, encountered the same issue, then switched to 12c and recreated the repositories from scratch, but the issue persists in both versions.
Any insight into what might be wrong with the repository setup? Could something be corrupted after running a mapping without an agent? Any metadata cleanup suggestions?
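A quick way to confirm whether the work repository really contains a duplicate name is to query it directly. A minimal diagnostic sketch, assuming the ODI 12c work repository stores mapping headers in SNP_MAPPING with a NAME column and scenarios in SNP_SCEN with SCEN_NAME/SCEN_VERSION; verify the exact table and column names in your own repository before relying on it:
```
-- Look for mapping names that occur more than once (hypothetical table/column names).
SELECT name, COUNT(*) AS copies
FROM   snp_mapping
GROUP  BY name
HAVING COUNT(*) > 1;

-- Cross-check against the scenario table the question already mentions.
SELECT scen_name, scen_version
FROM   snp_scen;
```
If both queries come back empty while ODI Studio still raises ODI-26211, that would suggest the problem is not duplicate rows in the repository itself.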
Karlen Simonyan
(1 rep)
Jul 22, 2025, 11:05 AM
• Last activity: Jul 22, 2025, 11:09 AM
0
votes
0
answers
19
views
SSIS Data Quality Services Issue - Object reference not set to an instance of an object
I have tried to load data using my SSIS pipelines, and there is no issue when the source table has around 90k rows; the Data Quality Services component works well. But once the data reaches 100k+ rows (132,812), it suddenly breaks and produces this error:
> [DQS Cleansing] Error: An unexpected error occurred in the DQS infrastructure.
System.NullReferenceException: Object reference not set to an instance of an object.
at Microsoft.Ssdqs.Component.DataCorrection.Logic.DataCorrectionComponent.ProcessChunk(ReadOnlyCollection`1 fieldMappings, ReadOnlyCollection`1 records, CorrectedRecordsStatusStatistics& correctedRecordsTotalStatusStatistics)
And this is the same pipeline, but after the third load I made:
> [DQS Cleansing] Error: Microsoft.Ssdqs.Infra.Exceptions.EntryPointException: The attempt to update or delete a DAO object of type 'AKnowledgebase' with id 1000175 has failed because the object is not up to date or is being deleted from the database.
at Microsoft.Ssdqs.Proxy.Database.DBAccessClient.Exec()
at Microsoft.Ssdqs.Proxy.EntryPoint.KnowledgebaseManagementEntryPointClient.DQProjectDelete(DQProject dqProject)
at Microsoft.Ssdqs.Component.DataCorrection.Logic.DataCorrectionComponent.ProcessInput(Int32 inputID, PipelineBuffer buffer)
at Microsoft.SqlServer.Dts.Pipeline.ManagedComponentHost.HostProcessInput(IDTSManagedComponentWrapper100 wrapper, Int32 inputID, IDTSBuffer100 pDTSBuffer, IntPtr bufferWirePacket)
> [SSIS.Pipeline] Error: SSIS Error Code DTS_E_PROCESSINPUTFAILED. The ProcessInput method on component "DQS Cleansing" (369) failed with error code 0x80131500 while processing input "DQS Cleansing Input" (382). The identified component returned an error from the ProcessInput method. The error is specific to the component, but the error is fatal and will cause the Data Flow task to stop running. There may be error messages posted before this with more information about the failure.
This is my current pipeline that I came up with:
(screenshots of the data flow omitted)
I have tried googling this error but it does not help resolve it. Is there any way I can fix this issue? I would also appreciate any resources you can provide.
Amir Hamzah
(11 rep)
Jul 16, 2025, 01:34 AM
• Last activity: Jul 16, 2025, 02:12 AM
0
votes
1
answers
189
views
Handling delayed key info in real time data warehouse
Our data warehouse is currently loaded with a traditional daily ETL batch job, but we're looking to soon implement a few stars in real-time (fed by streaming Kafka messages, FWIW). So only a couple fact tables and dimensions will be in real-time, while the rest will remain loaded daily. I was wondering how others have dealt with the issue of a real-time fact table with FKs into daily batch-loaded dimensions.
The scenario we have is that our business is pushing for our Sales Workflow star schema to be real-time so we can run intraday workflow analytics, but there are keys in the fact table, such as those into customers and sales reps (and several others), that will still be populated daily. Without turning the entire DW into real time, what are some best practices for solving the "delayed" key problem?
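One widely used answer to this is the "inferred member" (late-arriving dimension) pattern: the real-time fact load inserts a stub dimension row keyed on the natural key, and the nightly batch later overwrites the stub with the real attributes. A T-SQL-style sketch, with the table and column names (dim_customer, customer_id, is_inferred, staging_sales, daily_customer_feed) invented for illustration:
```
-- 1. When a real-time fact arrives for a customer the dimension has never seen,
--    insert a stub row so the fact gets a valid surrogate key immediately.
INSERT INTO dim_customer (customer_id, customer_name, is_inferred)
SELECT DISTINCT s.customer_id, 'Unknown', 1
FROM   staging_sales AS s
WHERE  NOT EXISTS (SELECT 1 FROM dim_customer AS d
                   WHERE d.customer_id = s.customer_id);

-- 2. The regular daily dimension load later fills in the real attributes
--    and clears the flag; the fact rows never need to be touched.
UPDATE d
SET    d.customer_name = src.customer_name,
       d.is_inferred   = 0
FROM   dim_customer AS d
JOIN   daily_customer_feed AS src ON src.customer_id = d.customer_id
WHERE  d.is_inferred = 1;
```
The intraday reports simply see "Unknown" attributes for brand-new customers until the next batch run, which is usually an acceptable trade-off.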
romanpilot
(101 rep)
Oct 2, 2017, 05:57 PM
• Last activity: Jun 28, 2025, 07:01 PM
0
votes
1
answers
298
views
Selectively load data using Oracle 11g r2 External Table Preprocessor
I am using the Preprocessor feature in Oracle 11g r2 to load several data files in a table. However, the raw files seem to have certain rows that shouldn't be imported with the rest. For example:
col1 col2 col3 col4
A 1 2 3
A 2 3 4
B ab bc cd
So I want to load only the rows with col1 = 'A' into this table, and load the rows with col1 = 'B' into another table. How can I do this inside the preprocessor?
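If the routing only needs to happen at load time, a multi-table INSERT over the external table can split the rows without touching the preprocessor script at all. A sketch, assuming the external table is called ext_raw_data and the two targets are target_a and target_b (target_b's columns would need character types, since its rows carry non-numeric values):
```
INSERT ALL
  WHEN col1 = 'A' THEN
    INTO target_a (col1, col2, col3, col4) VALUES (col1, col2, col3, col4)
  WHEN col1 = 'B' THEN
    INTO target_b (col1, col2, col3, col4) VALUES (col1, col2, col3, col4)
SELECT col1, col2, col3, col4
FROM   ext_raw_data;
```
Rows with any other col1 value are simply skipped, which also gives you a convenient place to filter out junk lines.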
sfactor
(111 rep)
Sep 15, 2014, 10:14 AM
• Last activity: May 23, 2025, 06:04 AM
1
votes
1
answers
293
views
MYSQL Huge Data migration (5TB) from old server to new server
We have an old MySQL server containing a huge amount of data (~5 TB) and we want to migrate to a new server in order to minimize costs and get rid of very old hardware.
The one and only idea in my mind is using mysqldump for the migration, but I'm pretty sure that's a poor and risky option with data that huge.
Then someone on my team came up with the idea of using ETL tools, but we haven't gone into the deeper details and are not really sure whether an ETL approach can really help us.
Any ideas are always welcome.
thanks
BTH.S3
(111 rep)
Sep 18, 2020, 07:52 AM
• Last activity: May 23, 2025, 03:06 AM
0
votes
1
answers
251
views
Extracting multiple similar tables in different schemas
I am attempting to extract tables from multiple schemas in the same database. The tables are similar enough that they are easy to identify and work with using meta data queries like this one:
SELECT DISTINCT TABLE_NAME
FROM information_schema.columns
WHERE TABLE_NAME LIKE 'form_entries%'
The wildcard stands in for the only distinct part of the table name. Each of these tables has a similar format, but a differing number of columns.
I can write a Python script to loop the schemas, dump each of the tables I need, then put them all in a single schema. That's the only approach I have at the moment.
I am somewhat familiar with ETL tools like Kettle, but I don't have the experience to know if there is an ETL workflow, or just a better way of accomplishing this.
**Purpose**
Each of the tables involved contains form submission data. Each table is a different form, so the column names are variable. The ultimate goal is data analysis of the form data.
There are a lot of steps to get to the analysis we want, but we are stuck at the first hurdle of how widely spread out the data is.
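For what it's worth, the loop itself can also be generated from metadata: a query against information_schema can emit one extract statement per matching table, which the Python script (or a client) then executes. A sketch assuming MySQL-style schemas, a consolidated schema named combined_forms, and that only the columns common to every form (here a hypothetical entry_id and date_created) are copied:
```
SELECT CONCAT(
         'INSERT INTO combined_forms.form_entries_all ',
         '(source_schema, source_table, entry_id, date_created) ',
         'SELECT ''', table_schema, ''', ''', table_name, ''', entry_id, date_created ',
         'FROM `', table_schema, '`.`', table_name, '`;'
       ) AS extract_sql
FROM   information_schema.tables
WHERE  table_name LIKE 'form_entries%';
```
Because the forms have differing column sets, anything beyond the shared columns is easier to land as a key/value or JSON payload than as a single wide table.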
David Hamilton
(101 rep)
Jan 11, 2022, 02:32 PM
• Last activity: May 22, 2025, 07:08 PM
0
votes
1
answers
286
views
Oracle - Exchange partitions between two List partitioned tables
Need some help here.
I have an ETL process which loads data into a target table A. We have created another table B, which has the same structure as the target table; this table is accessed by the reporting team to generate reports. This is done to minimize downtime for report generation, so the reports are always accessing the latest data.
Target table B is a list-partitioned table, partitioned on Client ID. The ETL team loads data for each client into the respective partition in table A.
I tried doing it with exchange partition: using the exchange partition mechanism to swap the segments of A and the partitioned table B. But I couldn't, because both tables are list partitioned and Oracle doesn't allow an exchange between two partitioned tables. I created partitions in the first place to avoid creating a separate table (TableA_ClientId) for each client.
My other option: whenever data is loaded into target table A, rename table B to Temp, table A to B, and Temp to A.
Can you please suggest a better approach?
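One workaround to consider before the rename shuffle: EXCHANGE PARTITION must pair a partitioned table with a non-partitioned one, so the swap can be done in two steps through an intermediate "swap" table of matching structure, one client partition at a time. A sketch, with table_a, table_b, p_client_100 and swap_client standing in for your own names:
```
-- Non-partitioned intermediate table with the same column structure.
CREATE TABLE swap_client AS
SELECT * FROM table_a WHERE 1 = 0;

-- Step 1: move the freshly loaded client data out of A into the swap table.
ALTER TABLE table_a EXCHANGE PARTITION p_client_100 WITH TABLE swap_client;

-- Step 2: move it into the matching partition of the reporting table B.
-- (swap_client now receives B's previous data for that client, which can be
--  exchanged back into A or simply truncated.)
ALTER TABLE table_b EXCHANGE PARTITION p_client_100 WITH TABLE swap_client;
```
Compared with renaming whole tables, this touches only the affected client's segment and leaves the rest of table B available for reporting throughout.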
Manish Mishra
(1 rep)
Apr 7, 2021, 03:42 PM
• Last activity: May 11, 2025, 09:09 AM
5
votes
2
answers
10690
views
MySQL failing and very slow on importing a large file
I have a 47 GB MySQL dump of a single table:
http://dumps.wikimedia.org/commonswiki/latest/commonswiki-latest-image.sql.gz
I ultimately want it in PostgreSQL, but since I didn't find an easy way to transform the MySQL SQL into PostgreSQL SQL, I figured I would get it into MySQL first and then write a small ETL script to do the conversion.
I initially tried to just load it using MySQL WorkBench Data Import/Restore, but it was failing.
Now I've run
split -l 2000 commonswiki-latest-image.sql
and have 20 files of roughly 2 GB each, but it still fails with:
23:12:28 Restoring C:\temp\commonswiki\xaa.sql
Running: mysql.exe --defaults-extra-file="c:\users\jmurdoch\appdata\local\temp\tmpi_ltz8.cnf" --host=127.0.0.1 --user=root --port=3306 --default-character-set=utf8 --comments --database=mediawikiimages < "C:\\temp\\commonswiki\\xaa.sql"
ERROR 2013 (HY000) at line 331: Lost connection to MySQL server during query
Operation failed with exitcode 1
00:54:17 Import of C:\temp\commonswiki\xaa.sql has finished with 1 errors
It's also terribly slow: it only imported 209,236 rows in nearly 2 hours, and I think there are about 20M rows to import, so at that rate it would take around 200 hours.
I'm using Windows 7, MySQL 5.6.14 and MySQL Workbench 6.
My main questions:
* Is there a script which would feed MySQL with the split files and perform automatic error recovery in case it times out?
* Is there a conversion tool which would convert MySQL SQL to PostgreSQL SQL without first loading it into a database?
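"Lost connection to MySQL server during query" on a large restore is usually a timeout or packet-size limit rather than bad data, so before scripting retries it is worth loading from the mysql client with relaxed limits. A sketch of the usual knobs; the values are illustrative and the SOURCE path assumes one of the split files:
```
-- Server-side limits that commonly abort big dump loads.
SET GLOBAL max_allowed_packet = 1073741824;   -- 1 GB; extended-insert rows can be large
SET GLOBAL net_read_timeout   = 600;
SET GLOBAL net_write_timeout  = 600;

-- Session-side settings that speed up bulk INSERTs while loading.
SET foreign_key_checks = 0;
SET unique_checks      = 0;
SET autocommit         = 0;
SOURCE C:/temp/commonswiki/xaa.sql;
COMMIT;
```
For the second question, a tool such as pgloader can connect to a MySQL database and load it straight into PostgreSQL, which may let you skip the intermediate MySQL-then-ETL step entirely.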
John M
(181 rep)
Feb 8, 2014, 07:03 AM
• Last activity: Apr 30, 2025, 07:19 AM
1
votes
2
answers
720
views
ODI remove auto lock when object / datastore is opened
I was being stupid and enabled the automatic object lock during edit/open when I was asked the first time.
Now I want to remove the automatic lock from the GUI, but I can't seem to find the setting.
I googled and found topics such as these:
https://www.appservgrid.com/documentation111/docs/fmw11g1114documentation/integrate.1111/e12643/appendix_b.htm
https://odielt.wordpress.com/tag/odi-studio-settings/
So I searched the hard drive for userpref.xml and did find one.
I opened the file, and inside it's just:
false
false
false
None
None
None
What can I do to stop it from locking objects each time I open one? If it can be done through the GUI, that's perfect; if not, what should I change inside the XML file?
Thanks in advance for any help and advice.
Dora
(135 rep)
Feb 24, 2021, 12:00 AM
• Last activity: Mar 26, 2025, 04:46 PM
0
votes
1
answers
320
views
Snowflake/S3 Pipeline: ETL architecture Questions
I am trying to build a pipeline which sends data from Snowflake to S3 and then from S3 back into Snowflake (after running it through a production ML model on SageMaker). I am new to data engineering, so I would love to hear from the community what the recommended path is. The pipeline requirements are the following:
1. I am looking to schedule a monthly job. Do I specify this in AWS or on the Snowflake side? Each monthly pull should cover the last full month (since this should be a monthly pipeline).
2. Each monthly data pull should be stored in its own S3 subfolder, e.g. query_01012020, query_01022020, query_01032020, etc.
3. The data load from S3 (query_01012020, query_01022020, query_01032020) back into a specified Snowflake table should be triggered after the ML model has successfully scored the data in SageMaker.
4. I want to monitor the performance of the ML model in production over time to catch whether the model's accuracy is degrading (some calibration-like graph, perhaps).
5. I want to get error notifications in real time when issues in the pipeline occur.
I hope you are able to guide me on what components the pipeline should include. Any relevant documentation/tutorials for this effort are truly appreciated.
Thank you very much.
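On point 1, Snowflake can own the monthly schedule itself: a task with a CRON schedule runs a COPY INTO an external stage that points at the S3 bucket. A minimal sketch, assuming a stage my_s3_stage, a warehouse etl_wh and a source table sales_features already exist; a month-stamped subfolder name like query_01012020 would normally be built inside a stored procedure that the task calls:
```
CREATE OR REPLACE TASK monthly_unload
  WAREHOUSE = etl_wh
  SCHEDULE  = 'USING CRON 0 2 1 * * UTC'   -- 02:00 UTC on the 1st of each month
AS
  COPY INTO @my_s3_stage/monthly/          -- lands in the S3 bucket behind the stage
  FROM (
      SELECT *
      FROM   sales_features
      WHERE  event_date >= DATE_TRUNC('month', DATEADD(month, -1, CURRENT_DATE))
      AND    event_date <  DATE_TRUNC('month', CURRENT_DATE)   -- last full month
  )
  FILE_FORMAT = (TYPE = CSV)
  HEADER = TRUE;

-- Tasks are created suspended; start it explicitly.
ALTER TASK monthly_unload RESUME;
```
For point 3, the return trip is another COPY INTO (this time into a table) from the stage, or Snowpipe driven by an S3 event notification if you want the load to fire automatically when SageMaker writes its output.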
cocoo84hh
(101 rep)
Jun 14, 2020, 06:54 PM
• Last activity: Mar 13, 2025, 06:02 AM
0
votes
1
answers
946
views
Parallel insert and delete into staging tables
I work on a small system that receives quite a lot of data from multiple sources (hundreds of them). Depending on a few factors I can get anywhere from a few rows to a couple of thousand (the smallest message contains a single row and the largest at most five thousand; if more data needs to be loaded, I receive it in a few batches). Currently the data is processed by a few identical services. Each service loads data into one of five staging tables and executes a procedure that does the right thing; each staging table has a different procedure associated with it, but it always boils down to changing unique identifiers (e.g. UUIDs) to proper database identifiers, inserting the data into the destination table and removing it from the staging table. All procedures have this form:
INSERT INTO TARGET (A, B, C)
SELECT T1.ID, B, C FROM STAGING_TABLE
JOIN T1 ON A = T1.ID
DELETE FROM STAGING_TABLE
It is possible for all services to work on a single staging (and destination) table in parallel. Currently this is done using snapshot isolation, but it is painfully obvious that for one reason or another we are losing data. What I mean is that there are messages that are properly processed by the services, yet all information from them is lost: we don't see the records in the database. I can't prove that snapshot isolation is responsible, but such incidents started to happen after snapshot isolation was introduced, which in turn was introduced after the services started loading data in parallel. Databases are currently far from my main field of expertise and I don't know why it happens, but snapshot isolation seems to be the main culprit.
My question is: what is the lowest isolation level that can support this scenario? Is there a better way to do it? I'm not fully aware of what transaction isolation level was used earlier (when data was loaded by a single service), but we never observed data loss. I tried (blindly) using "serializable" and "repeatable read": "serializable" results in dropped messages due to deadlocks, and "repeatable read", while it seems to do the right thing (no data loss), degrades performance to the level of serial writes.
EDIT:
Is it viable to load data using snapshot isolation, insert into temporary table (or table variable), THEN switch to some very permissive isolation level, insert data from temporary table into target table, revert to snapshot and delete data from staging table? If I read this correctly:
https://learn.microsoft.com/en-us/previous-versions/sql/sql-server-2005/ms173763(v=sql.90)
It should be possible, but I don't yet understand whether any of this would have an effect in the discussed case: the target table is not read, only written, in this scenario, and I *think* this means that the write won't be any "more parallel" than under snapshot isolation. But maybe I'm wrong?
Note that we can't wait and, for example, batch up data from multiple sources in a single staging table before moving it into the target table. We're not aiming for real time, but we need the data inserted ASAP.
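One pattern that sidesteps the isolation-level question is to have each service consume only the rows it is about to move, instead of running a blanket DELETE afterwards: the DELETE runs first with an OUTPUT clause, so rows inserted by another service in the meantime can never be thrown away unseen. A T-SQL-style sketch with placeholder column types, mirroring the original procedure's lookup join:
```
BEGIN TRAN;

CREATE TABLE #moved (A uniqueidentifier, B int, C int);   -- placeholder types

-- Capture exactly the rows this call removes from the staging table ...
DELETE FROM STAGING_TABLE
OUTPUT deleted.A, deleted.B, deleted.C INTO #moved (A, B, C);

-- ... and insert only those rows into the target, using the same lookup join
-- as the original procedure.
INSERT INTO TARGET (A, B, C)
SELECT T1.ID, m.B, m.C
FROM   #moved AS m
JOIN   T1 ON m.A = T1.ID;

COMMIT;
```
If the current procedures run under READ COMMITTED with the READ_COMMITTED_SNAPSHOT option rather than a single SNAPSHOT transaction, the unqualified DELETE can remove rows committed by another service after the INSERT ... SELECT took its statement-level snapshot, which would match the symptom of messages being processed but never showing up in the database.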
Jędrzej Dudkiewicz
(173 rep)
Jul 8, 2019, 05:17 PM
• Last activity: Feb 12, 2025, 10:38 AM
1
votes
1
answers
4148
views
How can I control how Pentaho DI formats integers coming from a text file?
I have a simple pipe-delimited text file with integers recorded as a sequence of numbers:
012345|00678 |various|text|fields
There are some leading zeros, and some trailing white space, but no decimals, commas, internal spaces, or other formatting. Should be simple, right?
Pentaho's field analysis concludes that the fields are of type "Number", format "#.#", and precision "0". That format string makes me wonder if Pentaho thinks there might be decimals in the columns, but when I run Preview Rows, I get nice clean integers, without decimals, scientific notation, or leading zeros, exactly as I want it.
When I try to consume the data with a PostgreSQL Bulk Loader step, I get error messages like these:
Import Table.0 - ERROR {0} ERROR: invalid input syntax for integer: "5.017645082E9"
Import Table.0 - ERROR {0} ERROR: invalid input syntax for integer: "1.0"
It seems that Pentaho has taken it upon itself to format the integers with scientific notation and/or fixed decimals, and naturally Postgres balks at importing them.
How can I stop Pentaho from formatting these integers?
I'm using Pentaho Data Integration v6.1, and Postgres 9.4.9. The file uses Windows line breaks and ANSI text. Changing the field format to "#" or "0" had no effect. Using a Table Output instead of a PostgreSQL Bulk Load works, but it is much slower.
Jon of All Trades
(5987 rep)
Mar 31, 2017, 10:50 PM
• Last activity: Feb 8, 2025, 02:01 PM
0
votes
0
answers
58
views
How can I replicate data from a SQL Server VM in Azure to avoid recovery mode for reporting?
Currently, I'm using SQL Server on an Azure VM (DB B) as a read source for Power BI. DB B is updated via log shipping from a primary SQL Server (DB A), and this locks me out of reading DB B twice an hour. I've considered caching strategies with Power BI to help, but I'm not certain that will solve the problem in the long term, and the transition from DirectQuery to Import mode can be a pain.
Requirements/Notes for Suggested Solution(s):
- Not an Enterprise user.
- Changing the log shipping method to another method for updates from DB A is not an option.
- Migrating DB B to Azure SQL Database/managed DB and eliminating SQL Server on Azure VM is not an option.
- The log shipping updates happen at the same times each hour.
- Near(ish) real-time replication would be ideal.
- Transactional replication may not be a solution because each table does not have a primary key.
- Minimizing cost would be ideal.
- Reading from DB B should always be available regardless of data consistency.
- Standing up a third DB, DB C, is an option.
- I'm hesitant about enabling CDC on DB B to use as the basis for an ETL solution, both because of memory on the DB B VM and so as not to cause an issue with the log-shipping process.
- 5 people may send requests to DB B from time-to-time but never all at once.
I'm thinking about just standing up DuckDB on a VM loaded with Linux and writing some Python scripts to update the data the few times an hour when DB B is updated via log shipping.
What are some of your recommended solutions?
IamTrying
(11 rep)
Feb 4, 2025, 03:20 AM
• Last activity: Feb 4, 2025, 11:33 AM
0
votes
1
answers
987
views
Azure Data Factory - copy data activity into a Postgresql database (on an IaaS Windows Azure vm, not Azure Database for Postgresql (PaaS))
I have a PostgreSQL database on a Windows VM (not a PaaS Azure database), and I need to import CSV files into it daily. ADF seems like a perfect fit, but I see that a standard PostgreSQL db is not a dataset option for the Sink in the Copy Data activity. Is there any way to get this done in ADF? If not, could I please get some recommendations on other tools that include scheduling and alerting? Thanks!
Bobogator
(95 rep)
Sep 19, 2022, 06:53 PM
• Last activity: Jan 23, 2025, 08:00 PM
-1
votes
1
answers
66
views
How can I set up real-time synchronization between the main server and the target server?
I have run into a serious problem. I load data from a table on the main server into a table on the local server through an SSIS ETL package, and everything works fine up to and including the scheduling. The main problem is that if a row is deleted or updated against some ID in the main server's table, that change is not deleted or updated on the local server. How can we solve this? Whatever happens on the main server should also happen on the local server once the scheduled job runs there. How can I achieve that? Apologies for the grammatical errors; I'm a newbie here.
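If the local table can be fully reconciled against the extract from the main server on each scheduled run, a MERGE handles all three cases (insert, update, delete) in one statement; SSIS can stage the extract and then run it from an Execute SQL Task. A sketch with illustrative table and column names:
```
MERGE INTO dbo.LocalCustomer AS tgt
USING dbo.StagedExtract      AS src
   ON tgt.CustomerID = src.CustomerID
WHEN MATCHED THEN
    UPDATE SET tgt.Name = src.Name,
               tgt.City = src.City
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerID, Name, City)
    VALUES (src.CustomerID, src.Name, src.City)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;
```
If only incremental files arrive and deleted rows cannot be re-extracted, you would instead need something like Change Tracking or CDC on the source so that deletes are shipped explicitly.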
Toriqul Mahal
(3 rep)
Jan 15, 2025, 11:09 AM
• Last activity: Jan 15, 2025, 01:20 PM
25
votes
3
answers
13961
views
What are the arguments in favor of using ELT process over ETL?
I realized that my company uses an ELT (extract-load-transform) process instead of using an ETL (extract-transform-load) process.
What are the differences in the two approaches and in which situations would one be "better" than the other? It would be great if you could provide some examples.
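To make the distinction concrete: in ETL the reshaping happens in an external engine (SSIS, Informatica, custom code) before the rows reach the warehouse, while in ELT the raw extract is loaded untouched and the reshaping is ordinary SQL executed by the target database. A minimal, warehouse-flavoured sketch of the "T" half of ELT, with invented table names:
```
-- raw_orders was loaded exactly as extracted; the transformation runs in-database.
INSERT INTO dw_orders (order_id, customer_id, order_total, order_month)
SELECT order_id,
       customer_id,
       CAST(order_total AS DECIMAL(12, 2)),   -- type cleanup after loading
       DATE_TRUNC('month', order_date)        -- derived attribute after loading
FROM   raw_orders
WHERE  order_status <> 'CANCELLED';
```
The practical consequence is that ELT leans on the target platform's horsepower and SQL skills, while ETL leans on a separate transformation tool and its developers.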
HelloWorld1
(797 rep)
Jun 14, 2012, 08:24 AM
• Last activity: Jan 9, 2025, 12:26 AM
1
votes
1
answers
672
views
SSIS comparison load between new, updated and deleted records and populate a staging database
So far,
My control flow looks like:
Execute SQL Task
- This executes a statement that creates a staging table. If one exists, it is dropped, so every time the package is run the staging table only tracks the updates that occurred in that run. My code for this is:
IF OBJECT_ID('CDC_Staging', 'U') IS NOT NULL
DROP TABLE CDC_Staging;
CREATE TABLE CDC_Staging
(
[Employee_ID] [int] PRIMARY KEY NOT NULL,
[FirstName] [nvarchar](255) NULL,
[LastName] [nvarchar](255) NULL,
[Education] [nvarchar](255) NULL,
[Occupation] [nvarchar](255) NULL,
[YearlyIncome] [float] NULL,
[Sales] [float] NULL
);
Data Flow Task
- The OLE DB Source is the Source Database. The Lookup uses the results of this query:
SELECT [Employee_ID] FROM [CDC_Target]
Then I use a Conditional Split that sends inserts to the target database and updates to staging.
My final control flow task is an Execute SQL Task; it uses this query to update the target database:
UPDATE [dbo].[CDC_Target]
SET [FirstName] = Staging.[FirstName]
,[LastName] = Staging.[LastName]
,[Education] = Staging.[Education]
,[Occupation] = Staging.[Occupation]
,[YearlyIncome] = Staging.[YearlyIncome]
,[Sales] = Staging.[Sales]
FROM [CDC_Target]
INNER JOIN
[CDC_Staging] AS Staging
ON [CDC_Target].Employee_ID = Staging.Employee_ID
I just inserted 4 more records into my source table. After I ran the package, the source and target tables had the correct data, but the staging table did not contain the 4 inserts that I committed.
Am I doing something wrong here? Any suggestions on how to do this better?
PS: I followed a tutorial explaining how to do this. I'm a Jr. DBA, but my team lead wants me to learn SSIS.
Patrick
(11 rep)
Mar 29, 2019, 07:24 PM
• Last activity: Dec 4, 2024, 10:07 AM
1
votes
3
answers
94
views
Make A Table Ignore Trivial Updates
The scenario:
I own an ETL process. That process involves ingesting data from disparate sources and making my database sync with the latest.
Let's say that I have a table called [catalog]. It contains my company's catalog of products. The source of truth for this table sends me a CSV every four hours. I ingest that CSV and make sure my [catalog] table is sync'd with it. I use an update like this:
UPDATE mc
SET
[field1] = tc.[field1]
, [field2] = tc.[field2]
FROM [my].[catalog] as mc
INNER JOIN [their].[catalog] as tc
ON mc.id = tc.id
WHERE mc.[field1] <> tc.[field1]
OR mc.[field2] <> tc.[field2]
This command compares every matched row, filters down to only those that will see a meaningful change, and updates just those rows. I don't want to touch any rows unnecessarily (the perennial task of ETL developers).
No trivial updates will be executed (by trivial update, I mean an update to a row that changes *none* of the fields in that row).
The rows without changes remain as they were, unmolested.
Unmolested is a good word to use here because updating a row while not changing any values comes at an unnecessary cost. In RAM and on disk, an update to a row physically deletes the row with the old values and inserts a new row with the new values. All of these updates have to be logged! So, in the process of doing nothing, we have fragmented and split the table's index and data pages in both RAM and on disk, and we're putting unnecessary traffic through the log.
If we're not changing anything, then a trivial update is still causing churn in ram and on disk. Giving a table an *automatic* way to ignore trivial updates would be a big win for many scenarios.
Now, let's say that some ham-fisted developer issues an update through some client, or makes a procedure change, that does something similar against a large table but doesn't include any measure to skip rows that don't really need the update.
Is there something that I can do to the table to make it act as if it knows to pass on trivial updates?
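One table-level safety net is an INSTEAD OF UPDATE trigger that discards no-op updates before they reach the base table, so even a blanket UPDATE from a careless client only writes the rows that genuinely change. A sketch against the example table, using EXCEPT as a NULL-safe "did anything change?" test; the column list would need to cover every updatable column:
```
CREATE TRIGGER trg_catalog_skip_trivial
ON [my].[catalog]
INSTEAD OF UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    UPDATE mc
    SET    mc.[field1] = i.[field1],
           mc.[field2] = i.[field2]
    FROM   [my].[catalog] AS mc
    JOIN   inserted AS i
      ON   i.id = mc.id
    WHERE  EXISTS (SELECT i.[field1], i.[field2]
                   EXCEPT
                   SELECT mc.[field1], mc.[field2]);   -- row kept only if something differs
END;
```
The trade-off is that every update now pays the comparison cost inside the trigger, and the column list has to be maintained whenever the table changes.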
Doug Hills
(107 rep)
Oct 16, 2024, 10:55 PM
• Last activity: Oct 17, 2024, 05:47 PM
7
votes
2
answers
2735
views
SQL Staging Tables: Primary Key Clustered or Heap
We are taking legacy flat txt files and inserting them into stage tables with SSIS. The question arose whether the table should have a clustered primary key index or not. This is a direct flat-file import with no transformation.
create table dbo.CustomerTransaction
(
CustomerName varchar(255),
PurchaseLocation varchar(255),
Productid int,
AmountSold float,
CustomerAddress varchar(50)
)
create table dbo.CustomerTransaction
(
-- the discussion is about adding this column
CustomerTransactionId int identity(1,1) primary key clustered,
CustomerName varchar(255),
PurchaseLocation varchar(255),
Productid int,
AmountSold float,
CustomerAddress varchar(50)
)
-- both tables have nonclustered indexes
create nonclustered index idx_ProductId on dbo.CustomerTransaction(ProductId)
create nonclustered index idx_CustomerAddress on dbo.CustomerTransaction(CustomerAddress)
-- We actually have more indexes; the tables above are just a sample
1) Before ETL, the staging tables are truncated. There are No Deletes and No Updates. Only Inserts.
truncate table dbo.[CustomerTransaction]
2) Then disable all indexes before ETL.
alter index all on dbo.[CustomerTransaction] DISABLE
3) We conduct SSIS data flow with default fast load, which I read is equivalent to bulk insert. No transformations occur here.
4) Then reenable all indexes after import is done.
alter index all on dbo.[CustomerTransaction] REBUILD
5) The staging tables are then queried with join and WHERE clauses and the results placed into the data warehouse; this is why we have the nonclustered indexes. After the data warehouse is loaded, we truncate the staging tables.
We are hearing that ETL staging tables are good as heaps; however, we are also learning about fragmentation and performance issues with heaps. Reading all the articles below,
I find conflicting opinions. One says clustered B-tree indexes are a maintenance headache for import ETL; another says heaps have performance issues due to fragmentation. Our performance testing does not show much difference, but our data may change later, so we need to make a good design decision.
https://sqlsunday.com/2016/09/01/compelling-case-for-heaps/
https://www.mssqltips.com/sqlservertip/4961/sql-server-insert-performance-for-clustered-indexes-vs-heap-tables/
http://kejser.org/clustered-indexes-vs-heaps/
https://www.red-gate.com/simple-talk/sql/database-administration/sql-server-heaps-and-their-fragmentation/
We know a good reason to have an identity column is row labelling; however, the question is mostly about internals and performance.
user162241
Nov 10, 2018, 09:29 AM
• Last activity: Jun 3, 2024, 05:31 PM
9
votes
5
answers
9066
views
Copying a table (and all of its data) from one server to another?
I have a massive table, let's say 500,000 rows. I want to copy it (schema and data) from one server to another. This is not an upsert or any kind of update; it's a one-off straight copy and paste. What are the idiomatic approaches to this?
I've tried:
- Restoring a backup from one server onto another. This is impractical, because SQL Server notoriously cannot restore individual tables from a backup; it can only restore databases. And my database is huge!
- Using SSMS to script the table's data as a sequence of INSERT statements. This is impractical, because the inserts have to be done row by agonising row. I suspect that this also does awful things to the transaction log, but nobody has attacked me for this yet (I'm running such a script right now; it's going to take hours).
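For a one-off copy of a single table, the usual middle ground between a full restore and row-by-row INSERTs is bulk copy: the SSMS Import and Export Wizard (which builds a small SSIS package), bcp out and in, or, if a linked server to the source exists, a SELECT ... INTO across it. A sketch of the latter, with SourceServer and SourceDb as placeholders for your linked server and database names:
```
-- Creates dbo.BigTable locally (columns and data only; indexes, constraints and
-- defaults must be scripted separately) and bulk-copies the rows across.
SELECT *
INTO   dbo.BigTable
FROM   [SourceServer].[SourceDb].dbo.BigTable;
```
Under the simple or bulk-logged recovery model the SELECT ... INTO is minimally logged, which avoids the transaction-log pain of the scripted INSERT approach.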
J. Mini
(1225 rep)
Nov 23, 2023, 07:00 PM
• Last activity: Apr 9, 2024, 07:49 PM
Showing page 1 of 20 total questions