Snowflake/S3 Pipeline: ETL Architecture Questions
0
votes
1
answer
320
views
I am trying to build a pipeline that sends data from Snowflake to S3 and then loads it from S3 back into Snowflake (after running it through a production ML model on SageMaker). I am new to data engineering, so I would love to hear from the community what the recommended path is. The pipeline requirements are the following:
1. I want to schedule a monthly job. Should I set the schedule up in AWS or on the Snowflake side? Each monthly pull should cover the last full calendar month.
2. Each monthly data pull should be stored in its own S3 subfolder, named like query_01012020, query_01022020, query_01032020, etc. (the first sketch after this list shows what I have in mind).
3. The load from those S3 subfolders back into a specified Snowflake table should be triggered only after the ML model has successfully scored the data in SageMaker (second sketch below).
4. I want to monitor the performance of the ML model in production over time, to catch it if the model's accuracy degrades (some calibration-like graph, perhaps).
5. I want real-time error notifications whenever issues occur anywhere in the pipeline (third sketch below).
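To make requirements 1 and 2 concrete, here is a rough sketch of the monthly unload I have in mind, using the Snowflake Python connector. Every name in it (the credentials, MY_DB.MY_SCHEMA.SOURCE_TABLE, the my-ml-pipeline bucket, the event_date column, the my_s3_integration storage integration) is a placeholder I made up, not an existing setup:

```python
# Sketch: unload the last full month from Snowflake into a month-stamped S3 folder.
from datetime import date, timedelta

import snowflake.connector  # pip install snowflake-connector-python


def unload_last_full_month() -> None:
    today = date.today()
    first_of_this_month = today.replace(day=1)
    last_month_start = (first_of_this_month - timedelta(days=1)).replace(day=1)
    # Folder name in the query_DDMMYYYY style from requirement 2.
    folder = f"query_{last_month_start.strftime('%d%m%Y')}"

    conn = snowflake.connector.connect(
        account="my_account",   # placeholder
        user="my_user",         # placeholder
        password="...",         # pull from a secrets manager in practice
        warehouse="MY_WH",
        database="MY_DB",
        schema="MY_SCHEMA",
    )
    try:
        conn.cursor().execute(
            f"""
            COPY INTO 's3://my-ml-pipeline/{folder}/'
            FROM (
                SELECT *
                FROM MY_DB.MY_SCHEMA.SOURCE_TABLE
                WHERE event_date >= '{last_month_start:%Y-%m-%d}'
                  AND event_date <  '{first_of_this_month:%Y-%m-%d}'
            )
            STORAGE_INTEGRATION = my_s3_integration
            FILE_FORMAT = (TYPE = CSV)
            HEADER = TRUE
            OVERWRITE = TRUE
            """
        )
    finally:
        conn.close()


if __name__ == "__main__":
    unload_last_full_month()
```

I could presumably run this from a scheduler on either side (an EventBridge/cron rule in AWS, or a Snowflake task), which is exactly what I am unsure about in requirement 1.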
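For requirement 3, I imagine something like an AWS Lambda function subscribed to s3:ObjectCreated events on the prefix where SageMaker writes its scored output, which then issues a COPY INTO back into Snowflake. Again, all names here (the scored/ prefix, the SCORED_RESULTS table, the environment variables) are placeholders:

```python
# Sketch: Lambda triggered by S3 events loads scored files back into Snowflake.
import os
from urllib.parse import unquote_plus

import snowflake.connector  # bundled in the Lambda deployment package


def handler(event, context):
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # S3 event keys arrive URL-encoded.
        key = unquote_plus(record["s3"]["object"]["key"])
        folder = key.rsplit("/", 1)[0]  # e.g. scored/query_01012020

        conn = snowflake.connector.connect(
            account=os.environ["SF_ACCOUNT"],
            user=os.environ["SF_USER"],
            password=os.environ["SF_PASSWORD"],
            warehouse="MY_WH",
            database="MY_DB",
            schema="MY_SCHEMA",
        )
        try:
            conn.cursor().execute(
                f"""
                COPY INTO MY_DB.MY_SCHEMA.SCORED_RESULTS
                FROM 's3://{bucket}/{folder}/'
                STORAGE_INTEGRATION = my_s3_integration
                FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
                """
            )
        finally:
            conn.close()
```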
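For requirement 5, my rough idea is to wrap each pipeline step so that any failure publishes to an SNS topic, whose email/SMS subscribers then get alerted in near real time. The topic ARN below is made up:

```python
# Sketch: publish pipeline failures to an SNS topic for real-time alerting.
import traceback

import boto3

sns = boto3.client("sns")
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:pipeline-alerts"  # placeholder


def notify_on_failure(step_name, fn, *args, **kwargs):
    """Run one pipeline step; publish the traceback to SNS if it fails."""
    try:
        return fn(*args, **kwargs)
    except Exception:
        sns.publish(
            TopicArn=TOPIC_ARN,
            Subject=f"Pipeline step failed: {step_name}",
            Message=traceback.format_exc(),
        )
        raise


# Example usage: notify_on_failure("monthly_unload", unload_last_full_month)
```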
I hope you can guide me on which components the pipeline should include. Any relevant documentation or tutorials for this effort would be truly appreciated.
Thank you very much.
Asked by cocoo84hh
(101 rep)
Jun 14, 2020, 06:54 PM
Last activity: Mar 13, 2025, 06:02 AM