Best practices for tracking ingested data sources
0 votes · 1 answer · 181 views
I am building an ingestion pipeline in which one step periodically reads new .csv files and stores them in a PostgreSQL database. This step works, but it is currently impractical and time-consuming to verify whether the data from any given file has been fully and correctly ingested. I am essentially trusting blindly that the database is the point of truth, and I would like to be a bit more certain.
The first step I plan to take is to store runtime metadata from each ingestion job (e.g. filename, time of ingestion, job result) in its own table in the database. While this won't speak to the _integrity_ of the data, it would at least give some insight into what has been processed.
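For context, this is roughly what I have in mind: a minimal sketch of such an ingestion-log table and of a job writing to it, assuming Python with psycopg2 and hypothetical table/column names.

```python
import hashlib
from datetime import datetime, timezone

import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS ingestion_log (
    id          bigserial   PRIMARY KEY,
    filename    text        NOT NULL,
    file_sha256 text        NOT NULL,
    row_count   integer     NOT NULL,
    ingested_at timestamptz NOT NULL,
    status      text        NOT NULL  -- e.g. 'success' / 'failed'
);
"""

def sha256_of(path):
    """Hash the raw file so re-runs and silent edits are detectable."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def log_job(conn, path, row_count, status):
    """Record one ingestion run in the ingestion_log table."""
    with conn.cursor() as cur:
        cur.execute(DDL)
        cur.execute(
            "INSERT INTO ingestion_log "
            "(filename, file_sha256, row_count, ingested_at, status) "
            "VALUES (%s, %s, %s, %s, %s)",
            (path, sha256_of(path), row_count, datetime.now(timezone.utc), status),
        )
    conn.commit()
```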
Any guidance on best practices and what else I can do re: data validation for a setup like this would be greatly appreciated!
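One concrete check I could imagine adding is a post-load reconciliation step that compares the number of data rows in the file against the number of rows the database actually holds for that file. This is a rough sketch only, assuming the target table keeps a `source_file` column and a psycopg2-style connection as above (all names hypothetical):

```python
import csv

def validate_file(conn, path, target_table="measurements"):
    """Compare the number of data rows in the CSV against rows loaded from it."""
    with open(path, newline="") as f:
        # Assumes the file has a single header row.
        csv_rows = sum(1 for _ in csv.reader(f)) - 1

    with conn.cursor() as cur:
        # The table name cannot be a bind parameter; it is hard-coded/trusted here.
        cur.execute(
            f"SELECT count(*) FROM {target_table} WHERE source_file = %s",
            (path,),
        )
        db_rows = cur.fetchone()[0]

    return csv_rows == db_rows, csv_rows, db_rows
```

A row-count match obviously doesn't prove the values themselves are correct, which is part of why I'm asking what else people do in practice.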
Asked by SIRCAVEC
(101 rep)
Oct 31, 2022, 07:31 PM
Last activity: Nov 1, 2022, 02:11 AM