
Database Administrators

Q&A for database professionals who wish to improve their database skills

Latest Questions

1 vote
1 answer
375 views
Cassandra pool warning displaying continuously
I am using the Cassandra driver for Python in Spyder. I am trying to fetch some data from a Cassandra table. Here is my code:

from cassandra.cluster import Cluster
cluster = Cluster(['some_ip'])
session = cluster.connect('some_key_space')
df_filtered_10m = session.execute("some query")

This is all working fine and I am getting the desired results. The problem is that this message keeps popping up in the console:

> WARNING:cassandra.pool:Error attempting to reconnect to 10.0.10.91, scheduling retry in 256.0 seconds: [Errno None] Tried connecting to [('10.0.10.91', 9042)]. Last error: timed out

I have tried cluster.shutdown but it is not working either. How do I get rid of it?
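A minimal, hedged sketch of one way to quiet the reconnect noise, assuming the message comes from Python's standard logging module (as the WARNING:cassandra.pool prefix suggests): raise that logger's level so the retries are hidden without changing driver behaviour.

import logging

# Hide the reconnect warnings emitted by the driver's connection pool logger.
logging.getLogger('cassandra.pool').setLevel(logging.ERROR)

# Alternatively, shut the cluster down once the data has been fetched so the driver
# stops scheduling reconnect attempts (note that shutdown is a method call):
# cluster.shutdown()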
Osama Dar (111 rep)
Sep 12, 2018, 05:19 AM • Last activity: Aug 2, 2025, 01:04 AM
0 votes
1 answer
141 views
Booleans, CONSTANTS or mapping table for 'status'-like fields?
I am modelling a User table which needs to hold the following information about the users:

- `is_active?`
- `is_detained?`
- `has_voluntarily_deactivated?`
- `is_temporarily_suspended?`

and so on... Basically, these are `boolean` flags with `true` or `false`. So, I am considering a few approaches other than boolean flags, which are as follows (a sketch of approach 3 follows this list):

1. Create a single varchar field with values like 'active', 'detained', 'deactivated', 'suspended', etc.
2. Create a tinyint field and map the integers to another table containing status strings
3. Create a tinyint field and map the integers in code itself using constants, such as ACTIVE = 1, DETAINED = 2, etc. Is Python's enum type the best solution to this?
4. Create a tinyint field and map the integers to status strings in an XML or JSON file

Which of the above four, or the original boolean-flag approach, is preferable? If there is a completely different approach or a modified version of the above, please let me know. Also, in my code, how should I refer to these fields:

- if (user.status == 1), or something like
- if (user.status == STATUS.ACTIVE), or
- if (user.status == 'active')

(I think this will depend on which approach I follow.) These status values are not fixed and may be added, edited or removed in future. Please answer in a database-agnostic way; the programming language I am using is Python. Thank you for your answers.
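A hedged sketch of approach 3 using Python's standard enum module; the UserStatus class and its member names below are illustrative, not taken from the question.

from enum import IntEnum

class UserStatus(IntEnum):
    # Illustrative codes; the real set would mirror the tinyint values in the table.
    ACTIVE = 1
    DETAINED = 2
    DEACTIVATED = 3
    SUSPENDED = 4

# The comparison then reads like the second style in the question, while the database
# stores a compact integer:
status_from_db = 1
if UserStatus(status_from_db) == UserStatus.ACTIVE:
    print("user is active")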
Forthaction (21 rep)
Sep 18, 2016, 01:16 PM • Last activity: Jul 27, 2025, 10:01 PM
0 votes
2 answers
6413 views
Psycopg2 Errors on SQL statement when trying to copy data from CSV file into PostgreSQL database
I am not a developer or PostgreSQL DB admin, so these may be basic questions. Logistics: Windows 10 server / pgAdmin 4 / Postgres 10 / Python 2.7.13.

I'm using a Python script to ingest external data, create a CSV file and copy that into Postgres 10. I keep getting the following error:

**Psycopg2.ProgrammingError: syntax error at or near "VALUES"**

I have a two-part question.

1) I can not see the syntax error in the following SQL statement:

def insert_csv_data(sqlstmt):
    with get_conn('pg') as db:
        cur = db.cursor()
        sqlcopy = "COPY irwin (fire_id,name,type,acres,date_time,state,county,admin_unit,land_cat,commander,perc_cntnd,cont_date,gacc,lat,long,geom,updated,imo) VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,ST_SetSRID(ST_MakePoint(%s, %s),4326)%s,%s) FROM STIN DELIMITER ',' CSV HEADER"
        with open(csv_file, 'r') as f:
            #next(f)# Skipping the first line header row
            cur.copy_expert(sqlcopy, f, size=8000)
        db.commit()
        cur.close()

And 2) Once that is resolved, I'm expecting to get an error about the geometry column in Postgres. If someone would also peek at the code snippets and let me know if anything jumps out, I would SO APPRECIATE IT! This snippet pulls the external data in order, but I don't think I've coded this correctly to pull the lat/long into the geom field.

# Lat 15 - double
if not attributes['InitialLatitude'] is None:
    lat = str(attributes['InitialLatitude']).replace('\n', '')
else:
    lat = '0'
#Long 16 - double
if not attributes['InitialLongitude'] is None:
    long = str(attributes['InitialLongitude']).replace('\n', '')
else:
    long = '0'
# geom is not defined - script is dumping the geometry into the IMO field
geom = str(attributes['InitialLatitude']) + ' ' + str(attributes['InitialLongitude'])

I added a Geom header to the csv data. Please help - thanks!
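A hedged sketch of what a working statement might look like, for illustration only: PostgreSQL's COPY has no VALUES clause and cannot evaluate expressions such as ST_SetSRID(ST_MakePoint(...)), so the statement lists only the columns present in the CSV and reads FROM STDIN, with the geometry built in a second step. The cur cursor and csv_file path are reused from the question's snippet, and the lat/long column handling is an assumption.

# Sketch, not the asker's final code: COPY only moves raw column data.
copy_sql = (
    "COPY irwin (fire_id, name, type, acres, date_time, state, county, admin_unit, "
    "land_cat, commander, perc_cntnd, cont_date, gacc, lat, long, geom, updated, imo) "
    "FROM STDIN WITH (FORMAT csv, HEADER true, DELIMITER ',')"
)
with open(csv_file, 'r') as f:
    cur.copy_expert(copy_sql, f)

# The point geometry would then be set afterwards (assuming lat/long load as text):
# cur.execute("UPDATE irwin SET geom = ST_SetSRID(ST_MakePoint(long::float8, lat::float8), 4326)")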
Cödingers Cat (1 rep)
Jun 10, 2019, 06:59 PM • Last activity: Jul 25, 2025, 02:04 PM
0 votes
1 answer
160 views
InnoDB Concurrent writes being ignored
Sorry if this isn't the right place; this is my first time posting anything here. Anyway, I know I'm not supposed to, but I can think of no other way: I need to use a small MySQL table (70 rows x 6 columns) as a queue.

I'm writing a Python application that requires a work queue to be shared between process windows (not sure what the proper name for them is). Each job is repeatable, and each use must be recorded and cleared at regular intervals (so each job is "weighted" and usage is evenly and fairly distributed). I attempted to base it on another work queue where each job is NOT repeatable, but it seems that multiple writes to the database (up to 19 at once per 12 seconds) are not properly being counted up.

Is there an alternative to doing something like this? Perhaps some kind of cache sitting between Python and MySQL that would convert many single "job + 1" updates into a singular "job + 19"? I assumed that being on a shiny new NVMe drive with a more than sufficient buffer_pool_size would make it plenty fast enough to handle that, but instead of counting up properly, over the course of 60 seconds the counter may reach 9 instead of the 100+ it should be.
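A hedged illustration of the usual cause of lost increments: if the counter is read into Python, incremented, and written back, concurrent sessions overwrite each other, whereas a single atomic UPDATE inside the database does not. Table and column names below are hypothetical, not from the question.

import MySQLdb

conn = MySQLdb.connect(host="localhost", user="worker", passwd="secret", db="queue_db")
cur = conn.cursor()

job_id = 42  # placeholder for whichever job this process just used
# The increment happens inside MySQL, so 19 concurrent calls add 19, not 1.
cur.execute("UPDATE jobs SET uses = uses + 1 WHERE job_id = %s", (job_id,))
conn.commit()  # without the commit, other sessions never see the increment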
Calvin DiBartolo (1 rep)
Mar 7, 2017, 06:06 PM • Last activity: Jul 24, 2025, 02:06 AM
0 votes
1 answer
150 views
Airflow to BigQuery data load taking forever
I'm currently working as a junior data engineer. My main job right now is to move data from a MySQL DB (which gets updated every few minutes via webhooks) and send it to BigQuery as frequently as possible using Airflow, as this is our main DB for later analyzing data with Power BI.

The problem is that the bigger tables (which only have ~1000 rows) take about 2 hours to load to BQ, making this impossible to scale; I can't imagine what will happen in the future when the deltas alone are 10000 rows each. This works using pandas and SQLAlchemy by extracting data as a dataframe and using the "to_sql" method, passing all the BQ connection parameters. I am already uploading only incrementals/deltas, so that is not the problem.

Do you have any advice? Is Airflow the right tool for this? I've been searching for solutions for weeks but couldn't find anything.
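A hedged sketch of one common fix, under the assumption that the slowness comes from to_sql issuing row-by-row INSERTs against BigQuery: use BigQuery's bulk load API instead. The project, dataset and table IDs and the df_delta dataframe are placeholders.

from google.cloud import bigquery

client = bigquery.Client(project="my-project")            # assumes default credentials
job_config = bigquery.LoadJobConfig(write_disposition="WRITE_APPEND")

load_job = client.load_table_from_dataframe(
    df_delta,                                              # the incremental dataframe
    "my-project.my_dataset.my_table",
    job_config=job_config,
)
load_job.result()                                          # block until the load finishes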
Ayrton (1 rep)
Aug 28, 2022, 11:11 PM • Last activity: Jul 17, 2025, 11:03 AM
0 votes
1 answer
175 views
Advice for applying locks in my data processing pipeline
I currently have a Python program that enters rows into a Postgres table that essentially works as a list of data I need to process. These processes create on-disk files and trigger other behavior, so I only want each row to be processed once. I then have another script that takes the rows from that table and begins the processing. So, for example, there might be 100 rows, each row might take 10-20 minutes to complete, and each produces a few output files.

I currently run into the problem that I can only run this script one at a time, for fear that running two in parallel might end up with them processing the same data twice. If I create a boolean field that I flip within the application when it's 'busy', I fear having a stale lock due to an abruptly killed process that doesn't end gracefully. If I use locks as built into Postgres, it seems they disappear upon the connection/session ending. But if I'm on an unstable connection, I'm not quite sure what the behavior would be or how I can get the behavior I want. Given these are 10-20 minute processes, I foresee the connection being lost within that time frame and thus the lock being lost.

Thanks for any advice on where to go. I'm using a Python library called psycopg2 to connect to the Postgres database.
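A hedged sketch of the row-claiming pattern that usually fits this shape of problem (assuming PostgreSQL 9.5+ and a hypothetical work_items table with a processed flag): each worker claims one row with FOR UPDATE SKIP LOCKED, so two parallel workers never pick the same row, and a killed process or dropped connection releases its lock automatically.

import psycopg2

conn = psycopg2.connect("dbname=pipeline user=worker")
with conn:                                     # commits or rolls back the transaction
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, payload FROM work_items "
            "WHERE processed = false "
            "ORDER BY id LIMIT 1 "
            "FOR UPDATE SKIP LOCKED"
        )
        row = cur.fetchone()
        if row is not None:
            # ... the 10-20 minute processing would happen here ...
            cur.execute("UPDATE work_items SET processed = true WHERE id = %s", (row[0],))

Because the lock only lives as long as the transaction, a connection lost mid-job simply returns the row to the pool; if re-running a half-finished job is expensive, a claimed_at timestamp column (a lease that other workers treat as stale after some cutoff) is the usual complement.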
Pensw (1 rep)
May 21, 2022, 06:06 PM • Last activity: Jul 6, 2025, 08:07 PM
0 votes
1 answer
164 views
MySQL 'generator' not closing connections
I have the following function which is leaving connections open but I can't figure out why. The Cursor class creates a Python-generator type function, allowing me to iterate over millions of rows.

connection = MySQLdb.connect("", "", "", "", cursorclass = MySQLdb.cursors.SSCursor)
cursor = connection.cursor()
cursor.execute("select id, ...", (profile_id, ...))
try:
    for row in cursor.fetchall():
        yield row
except:
    pass
finally:
    connection.close()

Where am I going wrong?
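A hedged sketch of the most likely mechanism (standard Python generator semantics, not anything specific to MySQLdb): the finally block inside a generator only runs when the generator is exhausted, explicitly closed, or garbage-collected, so a caller that stops iterating early leaves the connection open. Wrapping the generator guarantees close() is called; row_generator and some_condition are hypothetical names.

from contextlib import closing

rows = row_generator(profile_id)       # the generator function from the question
with closing(rows):                    # guarantees rows.close(), which runs its finally
    for row in rows:
        if some_condition(row):        # hypothetical early exit
            break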
Adders (175 rep)
Aug 30, 2018, 01:40 PM • Last activity: Jul 5, 2025, 04:08 PM
1 vote
1 answer
179 views
storing histogram in sqlite database
I am not a database administrator, but a scientist who would appreciate your help in solving my issue with storing histogram data in an SQLite database. I have several of them to be stored and to be later analysed with pandas (Python). Each histogram is made of two arrays:

1. one for the bins or buckets, which are regularly spaced, let's say from min to max with a given step.
2. one for the values.

First question: how would you store the two arrays? They are rather long, up to 65k. I don't need to store the bin values; I can in principle recalculate them having the min, max and step. The value array may have several zeros, so it may be convenient to store them sparsely.

Second question: I would like to retrieve them with a select returning something like:
bin1, value1
bin2, value2
...
binN, valueN
Sorry if my question looks too basic to you, but I've been scratching my head over this problem for too long without finding any way out. Thanks in advance for your help!

## Update

As a preliminary, not really disk-space-effective solution, I have implemented something like the suggestion of @Whitel Owl. Instead of storing the two arrays as text, I'm storing them as binary BLOBs. Here is my code:
CREATE TABLE HistogramTable (
  HistogramID INTEGER PRIMARY KEY,
  ImageID INTEGER,
  Bins BLOB,
  Histo BLOB,
  FOREIGN KEY (ImageID) REFERENCES ImageTable(ImageID)
);
To get the two blobs I'm using pickle.
import pickle
import sqlite3
import numpy as np

db = sqlite3.connect('mydb.db')      # keep the connection object sqlite3 returns

data = np.random.normal(size=1000)   # placeholder for the real measurement data

histo, bins = np.histogram(data)

histo_blob = sqlite3.Binary(pickle.dumps(histo))
bins_blob = sqlite3.Binary(pickle.dumps(bins))

db.execute(
    "INSERT INTO HistogramTable (ImageID, Bins, Histo) VALUES (?, ?, ?)",
    (1, bins_blob, histo_blob),      # ImageID 1 is a placeholder
)
db.commit()
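A hedged sketch of the read-back side, continuing from the snippet above and assuming the BLOB layout is kept, which addresses the second question: unpickle the two arrays and expand them into bin/value rows (np.histogram returns one more bin edge than counts, so the left edges are paired with the counts here).

import pandas as pd

row = db.execute("SELECT Bins, Histo FROM HistogramTable WHERE ImageID = ?", (1,)).fetchone()
bins_back = pickle.loads(row[0])
histo_back = pickle.loads(row[1])

df = pd.DataFrame({"bin": bins_back[:-1], "value": histo_back})
print(df.head())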
user41796 (111 rep)
Jun 28, 2023, 08:35 PM • Last activity: Jul 5, 2025, 07:06 AM
0 votes
1 answer
32 views
How to resolve an access issue while running SQL Server Agent to execute a Python script?
I am trying to use SQL Server Agent to execute a Python file. I am executing the Python file using a PowerShell script. When I ran the SQL Server Agent job, the error message says:

Message Executed as user: NT Service\SQLAgent$SQL2022. Start-Process : This command cannot be run due to the error: "Access is denied
At F:\xx.ps1:5 char:1
+ Start-Process C:\Users\jdoe\AppData\Local\Programs\Python\Python312\p ...
+

I am thinking this happens because I originally installed Python using my own local account (username: jdoe). Today, I asked our IT to install it as Admin. Currently, Python is installed in two locations:

(when I installed it myself originally): C:\Users\jdoe\AppData\Local\Programs\Python\Python312\python.exe
(when IT installed it today): C:\Program Files\Python313\python.exe

What do I need to do to run Python code from SQL Server Agent? Where do I have to configure it?
Java (253 rep)
Jun 27, 2025, 09:05 PM • Last activity: Jun 28, 2025, 05:23 AM
2 votes
1 answer
57 views
How to group by with similar group_name values in SQL
How can I perform a GROUP BY in *SQL* when the `group_name` values are similar but not exactly the same? In my dataset, the group_name values may differ slightly (e.g., "Apple Inc.", "AAPL", "Apple"), but conceptually they refer to the same entity. The similarity might not be obvious or consistent, so I might need to define a custom rule or function like is_similar() to cluster them. For simple cases, I can extract a common pattern using regex or string functions (e.g., strip suffixes, lowercase, take prefixes). But how should I handle more complex scenarios, like fuzzy or semantic similarity?

Case:

group_name     | val
---------------|-----
'Apple Inc.'   | 100
'AAPL'         | 50
'Apple'        | 30
'Microsoft'    | 80
'MSFT'         | 70

What I want to achieve:

new_group_name  | total_val
----------------|----------
'Apple'         | 180
'Microsoft'     | 150

What are the best approaches to achieve this in *SQL*? And how would I write a query like this:

SELECT some_characteristic(group_name) AS new_group_name, SUM(val)
FROM tb1
GROUP BY new_group_name;
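A hedged sketch of the mapping-table approach the question hints at: maintain an explicit alias-to-canonical-name table and GROUP BY the canonical name. It is shown with sqlite3 purely so the SQL is runnable end to end; the same join works in any engine, and the name_map table is an assumed addition.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE tb1 (group_name TEXT, val INTEGER);
    INSERT INTO tb1 VALUES ('Apple Inc.',100),('AAPL',50),('Apple',30),('Microsoft',80),('MSFT',70);
    CREATE TABLE name_map (alias TEXT PRIMARY KEY, canonical TEXT);
    INSERT INTO name_map VALUES ('Apple Inc.','Apple'),('AAPL','Apple'),('Apple','Apple'),
                                ('Microsoft','Microsoft'),('MSFT','Microsoft');
""")
rows = con.execute("""
    SELECT m.canonical AS new_group_name, SUM(t.val) AS total_val
    FROM tb1 t JOIN name_map m ON m.alias = t.group_name
    GROUP BY m.canonical
""").fetchall()
print(rows)   # [('Apple', 180), ('Microsoft', 150)] (row order may vary)

For fuzzy or semantic similarity, the same shape still applies: the mapping table is just populated offline by whatever is_similar() logic (string distance, embeddings, manual curation) clusters the raw names.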
Ahamad (1 rep)
May 14, 2025, 08:59 AM • Last activity: May 15, 2025, 05:31 AM
2 votes
1 answer
43 views
improve the implementation of worldquant 101 alpha factors using numpy
I was trying to implement the 101 quant trading factors published by WorldQuant (https://arxiv.org/pdf/1601.00991.pdf). A typical factor processes stocks' price and volume information along both the time dimension and the stock dimension. Take the example of alpha factor #4: (-1 * Ts_Rank(rank(low), 9)). This is a momentum alpha signal. low is a panel of stocks' low prices within a certain time period. rank is a cross-sectional process that ranks each row of the panel (a time snapshot). Ts_Rank is a time-series process that applies a moving rank to each column of the panel (a stock) with a specified window.

Intuitively, a pandas dataframe or numpy matrix should fit the implementation of the 101 alpha factors. Below is the best implementation using numpy I have got so far. However, the performance was too low. On my Intel Core i7 Windows machine, it took around 45 seconds to run the alpha #4 factor with a 5000 (trade dates) by 200 (stocks) matrix as input.

I also came across DolphinDB, a time series database with built-in analytics features (https://www.dolphindb.com/downloads.html). For the same factor Alpha#4, DolphinDB ran for a mere 0.04 seconds, 1000 times faster than the numpy version. However, DolphinDB is commercial software. Does anybody know better Python implementations? Or any tips to improve my current Python code to achieve performance comparable to DolphinDB?

Here is the Python implementation:

import numpy as np

def rankdata(a, method='average', *, axis=None):
    # this rankdata refers to scipy.stats.rankdata (https://github.com/scipy/scipy/blob/v1.9.1/scipy/stats/_stats_py.py#L9047-L9153)
    if method not in ('average', 'min', 'max', 'dense', 'ordinal'):
        raise ValueError('unknown method "{0}"'.format(method))
    if axis is not None:
        a = np.asarray(a)
        if a.size == 0:
            np.core.multiarray.normalize_axis_index(axis, a.ndim)
            dt = np.float64 if method == 'average' else np.int_
            return np.empty(a.shape, dtype=dt)
        return np.apply_along_axis(rankdata, axis, a, method)
    arr = np.ravel(np.asarray(a))
    algo = 'mergesort' if method == 'ordinal' else 'quicksort'
    sorter = np.argsort(arr, kind=algo)
    inv = np.empty(sorter.size, dtype=np.intp)
    inv[sorter] = np.arange(sorter.size, dtype=np.intp)
    if method == 'ordinal':
        return inv + 1
    arr = arr[sorter]
    obs = np.r_[True, arr[1:] != arr[:-1]]
    dense = obs.cumsum()[inv]
    if method == 'dense':
        return dense
    # cumulative counts of each unique value
    count = np.r_[np.nonzero(obs), len(obs)]
    if method == 'max':
        return count[dense]
    if method == 'min':
        return count[dense - 1] + 1
    # average method
    return .5 * (count[dense] + count[dense - 1] + 1)

def rank(x):
    return rankdata(x, method='min', axis=1) / np.size(x, 1)

def rolling_rank(na):
    return rankdata(na.transpose(), method='min', axis=0)[-1].transpose()

def ts_rank(x, window=10):
    a_rolled = np.lib.stride_tricks.sliding_window_view(x, window, axis=0)
    return np.append(np.full([window - 1, np.size(x, 1)], np.nan), rolling_rank(a_rolled), axis=0)

def alpha004(data):
    return -1 * ts_rank(rank(data), 9)

import time
# The input is a 5000 by 200 matrix, where the row index represents trade date and the column index represents security ID.
data = np.random.random((5000, 200))
start_time = time.time()
alpha004(data)
print("--- %s seconds ---" % (time.time() - start_time))

output: 44.85(s)
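A hedged alternative sketch, assuming pandas >= 1.4 (where Rolling.rank exists): the same Alpha#4 composition expressed with pandas' vectorized rank operations, which avoids the per-window Python-level rankdata calls of the NumPy version. Whether it matches DolphinDB's speed is untested here.

import numpy as np
import pandas as pd

low = pd.DataFrame(np.random.random((5000, 200)))        # trade dates x securities
cross_rank = low.rank(axis=1, method='min', pct=True)     # rank(low) within each date
alpha4 = -1 * cross_rank.rolling(9).rank(method='min')    # Ts_Rank(..., 9) per security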
Huang WeiFeng (31 rep)
May 13, 2025, 09:33 AM • Last activity: May 13, 2025, 09:39 AM
1 vote
1 answer
4026 views
MySQLdb._exceptions.OperationalError: (2006, '') when closing a SQL query. Is it because of the connection?
Hello administrators! I have a question (and an issue). When closing my SQL query in my ETL, I get a connection error even though it seemed I was able to connect:

c.execute("""
    UPDATE `{table_name}`
    SET `{column_name}` = CONCAT('hash_', {expression})
    WHERE {pk_name} IN ({ids})
""".format(
    table_name=table_name,
    column_name=column_name,
    expression=expression,
    pk_name=pk_name,
    ids=','.join(ids)
))
print('.', end='', flush=True)

Indeed, I get:

(venv) C:\Users\antoi\Documents\Programming\Work\data-tools>python -m etl.main
2021-06-29 10:59:37.814133 - Connecting to database hozana_data...
2021-06-29 10:59:37.822142 - Connecting to archive database hozana_archive...
2021-06-29 10:59:38.046134 - Start ETL main process
2021-06-29 10:59:38.046134 - users table:
2021-06-29 10:59:38.046134 - Hashing column users.email: done.
2021-06-29 10:59:38.054091 - Hashing column users.email_notification:Traceback (most recent call last):
  File "C:\Users\antoi\Documents\Programming\Work\data-tools\etl\task\anonymization.py", line 17, in hash_column
    c.execute("""
  File "C:\Users\antoi\Documents\Programming\Work\data-tools\venv\lib\site-packages\MySQLdb\cursors.py", line 183, in execute
    while self.nextset():
  File "C:\Users\antoi\Documents\Programming\Work\data-tools\venv\lib\site-packages\MySQLdb\cursors.py", line 137, in nextset
    nr = db.next_result()
MySQLdb._exceptions.OperationalError: (2006, '')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\antoi\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\antoi\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\antoi\Documents\Programming\Work\data-tools\etl\main.py", line 52, in <module>
    main()
  File "C:\Users\antoi\Documents\Programming\Work\data-tools\etl\main.py", line 24, in main
    anonymization.main()
  File "C:\Users\antoi\Documents\Programming\Work\data-tools\etl\task\anonymization.py", line 59, in main
    hash_column('users', 'email_notification', 'user_id', True)
  File "C:\Users\antoi\Documents\Programming\Work\data-tools\etl\task\anonymization.py", line 50, in hash_column
    print('.', end='', flush=True)
  File "C:\Users\antoi\Documents\Programming\Work\data-tools\venv\lib\site-packages\MySQLdb\connections.py", line 239, in __exit__
    self.close()
MySQLdb._exceptions.OperationalError: (2006, '')

I thought this error happens when the client cannot send a query to the server, most likely because the server itself has closed the connection. However, I thought there was already a connection:

(screenshot: an established MySQL connection)

What's more, there are also these lines at the beginning:

2021-06-29 10:59:37.814133 - Connecting to database hozana_data...
2021-06-29 10:59:37.822142 - Connecting to archive database hozana_archive...

So when I get a MySQLdb._exceptions.OperationalError: (2006, '') while closing a SQL query, is it because of the connection?
Revolucion for Monica (677 rep)
Jun 29, 2021, 09:23 AM • Last activity: May 3, 2025, 02:04 PM
0 votes
1 answer
1935 views
error [Microsoft][ODBC Driver Manager] Data source name not found and no default driver specified when importing Excel data to SQL Server
I work on SQL Server 2017. I need to import data from Excel 2016 into SQL Server 2017, and I am using a Python script to do that. I created an ODBC data source named Testserver and its test succeeds. The path G:\ImportExportExcel gives full-control permissions to ALL APPLICATION PACKAGES and Everyone. My instance name is AHMEDSALAHSQL, my PC name is DESKTOP-L558MLK, named pipes are enabled, and the instance allows remote connections.

When I run the script below:

declare @ImportPath NVARCHAR(MAX)='G:\ImportExportExcel'
declare @DBConnectionString NVARCHAR(MAX) = 'dsn=Testserver;Uid=sa;Pwd=321'
declare @ImportAll BIT=0
declare @CombineTarget BIT=0
declare @ExcelFileName NVARCHAR(200)='dbo.studentsdata'
declare @ExcelSheetName NVARCHAR(50)='students2'

SELECT @ImportPath = CASE WHEN RIGHT(@ImportPath,1) = '\' THEN @ImportPath ELSE CONCAT(@ImportPath,'\') END

DECLARE @Serv NVARCHAR(200) = CONCAT(CHAR(39),CHAR(39),@@SERVERNAME,CHAR(39),CHAR(39))

DECLARE @ValidPath TABLE (ValidPathCheck BIT)

INSERT @ValidPath
EXEC sp_execute_external_script
  @language =N'Python',
  @script=N'
import pandas as pd
d = os.path.isdir(ImportFilePath)
OutputDataSet = pd.DataFrame([d],columns=["Filename"])'
  ,@params = N'@ImportFilePath NVARCHAR(MAX)'
  ,@ImportFilePath = @ImportPath

DECLARE @PythonScript NVARCHAR(MAX) =CONCAT('
import pandas as pd
import os
import glob
from revoscalepy import RxSqlServerData, rx_data_step
sqlConnString = "Driver=Testserver;Server=Serv; ',@DBConnectionString,'"
Filefolderepath = ImportFilePath+"*.xlsx"
if ImportAll ==0:
    Filename =ImportFilePath+ExcelFileName+".xlsx"
    exists = os.path.isfile(Filename)
    if exists and ExcelSheetName in pd.ExcelFile(Filename).sheet_names:
        Output = pd.read_excel(Filename, sheetname=ExcelSheetName, na_filter=False).astype(str)
        if not Output.empty:
            sqlDS = RxSqlServerData(connection_string = sqlConnString,table = "".join(fl for fl in ExcelFileName if fl.isalnum())+"_"+"".join(sh for sh in ExcelSheetName if sh.isalnum()))
            rx_data_step(input_data = Output, output_file = sqlDS,overwrite = True)
    else:
        print("Invalid Excel file or sheet name")')

EXEC sp_execute_external_script
  @language = N'Python'
  ,@script = @PythonScript
  ,@params = N'@ImportFilePath NVARCHAR(MAX),@ImportAll BIT,@CombineTarget BIT,@ExcelFileName NVARCHAR(200),@ExcelSheetName NVARCHAR(50),@Serv NVARCHAR(200)'
  ,@ImportFilePath = @ImportPath
  ,@ImportAll = @ImportAll
  ,@CombineTarget = @CombineTarget
  ,@ExcelFileName = @ExcelFileName
  ,@ExcelSheetName = @ExcelSheetName
  ,@Serv = @Serv

I get this error:

Msg 39004, Level 16, State 20, Line 0
A 'Python' script error occurred during execution of 'sp_execute_external_script' with HRESULT 0x80004004.
Msg 39019, Level 16, State 2, Line 0
An external script error occurred:
[Microsoft][ODBC Driver Manager] Data source name not found and no default driver specified
Error in execution. Check the output for more information.
DataStep error: [Microsoft][ODBC Driver Manager] Data source name not found and no default driver specified

**So can anyone help me solve this issue?** I added an ODBC connection on my PC and its test succeeds.
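A hedged illustration of the naming distinction that commonly produces this exact ODBC message: Testserver is a data source name (DSN), not a driver, so a connection string saying Driver=Testserver asks the ODBC Driver Manager for a driver that does not exist. The sketch below only shows the two well-formed connection-string shapes using pyodbc; it assumes "ODBC Driver 17 for SQL Server" is installed, and the same shapes would apply to the sqlConnString built inside the script above.

import pyodbc

# Shape 1: reference the existing DSN by name.
conn = pyodbc.connect("DSN=Testserver;UID=sa;PWD=321")

# Shape 2: skip the DSN and name an installed driver explicitly.
conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=DESKTOP-L558MLK\\AHMEDSALAHSQL;"
    "Database=master;UID=sa;PWD=321"
)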
user3223372 (1 rep)
Apr 16, 2022, 02:56 AM • Last activity: Apr 16, 2025, 09:02 PM
0 votes
3 answers
1186 views
Overwriting MySQL database to only store 1 month of data
We are logging data on hardware with very little storage, only 4 GB. We only require the data to be stored for 1 month and then be overwritten in a way that overwrites the oldest data first. The storage on the hardware is very small, so we cannot continue to record indefinitely. We are using a MySQL database, and the hardware it runs on is not always powered on, as it is in a vehicle. The data will be viewed in a graph to show historical values over time.

A few options I have thought of but am not sure how to execute (let's assume I will record 1 million rows of data in a month):

1. When the table (table1) gets to 1 million rows, move it aside and start a new table (table2). When table2 reaches 1 million rows, delete table1, move table2 aside and create table3, etc. This way there is always at least 1 month of entries.
2. (Not sure if this is possible:) when the table gets to 1 million rows, it starts to overwrite from row 1 again.
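A hedged sketch of a third approach that avoids table rotation entirely, assuming the table has (or can be given) a timestamp column such as logged_at; the names here are placeholders. A purge like this can be run by the logging script itself each time the vehicle powers up, so no always-on scheduler is needed.

import MySQLdb

conn = MySQLdb.connect(host="localhost", user="logger", passwd="secret", db="vehicle_logs")
cur = conn.cursor()

# Keep one month of history; anything older is removed on each run.
cur.execute("DELETE FROM log_entries WHERE logged_at < NOW() - INTERVAL 1 MONTH")
conn.commit()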
Phil (1 rep)
Jan 29, 2020, 09:09 PM • Last activity: Apr 14, 2025, 03:00 AM
0 votes
1 answer
843 views
IIS web application access SQL DB as service account
I've set up a new Python site on IIS using the FastCGI handler. The site has Windows authentication enabled in IIS, and the app checks that the AD user belongs to an Active Directory group when they access the site. If authorisation fails, access is denied. Windows authentication uses Kerberos, but it is not a double hop.

However, the web app reads/writes to a SQL Server database, and the DB calls are made using the service account which runs the app pool. The service account has limited access to run the web app and can only access the one database that the web app uses. The app does log which user has accessed the DB.

I've read that impersonation would be better from a DB security perspective, using constrained delegation. I don't remember the URL now, but it was essentially stating that the SQL database should check that the actual AD user who is using the web app has access to the database, as opposed to the database checking that the service account has access. Is there any obvious security risk with the approach I'm using?
DeadlyDan (111 rep)
Apr 13, 2022, 11:10 AM • Last activity: Apr 6, 2025, 12:06 AM
1 vote
3 answers
176 views
ERROR: invalid byte sequence for encoding "UTF8": 0xdc 0x36
When running a \copy (either pgadmin or aws_s3.table_import_from_s3) of a 1.6GB file into an AWS Aurora Postgres-compatible database, I'm getting the following error:
ERROR:  invalid byte sequence for encoding "UTF8": 0xdc 0x36
CONTEXT:  COPY staging, line 99779: "L24000403170365 ACTIVEZONE LLC                                                                      ..."
EDIT: Here's what I could pull for table definition (but let me know if you want more):

| column_name | data_type | character_maximum_length | is_nullable | column_default |
| ----------- | --------- | ------------------------ | ----------- | -------------- |
| raw         | text      | [null]                   | YES         | [null]         |

EDIT: I also tried to change the column to bytea with no effect. The source is supposed to be ASCII, but I get the same error with explicit encodings like utf8, latin1, win1251, and win1252.

EDIT: As requested in a reply, here's more information about the import commands. In pgadmin4, I'm right-click importing into the table which shows the following under the covers:
--command " "\\copy public.staging (\"raw\") FROM 'C:/data.txt' DELIMITER '|' ENCODING 'UTF8';""
I also use pgadmin4 to trigger the s3 table import by calling the query:
SELECT aws_s3.table_import_from_s3(
   'staging',
   '', 
   '(DELIMITER ''|'')',
   aws_commons.create_s3_uri('data', 'data.txt', 'us-east-1')
);
Under the covers, table_import_from_s3 calls the command:
copy staging from '/rdsdbdata/extensions/aws_s3/{{internal filename}}' with (DELIMITER '|')
The answer to similar questions is to clean up the source data so I pulled up python and tried to find the offending character. I couldn't find any evidence of an unusual character at or around the referenced line. For the sake of argument, I believe the following will scan the entire file (and you can see the results inline):
>>> def charinfile(filename, bytechar):
...     with open(filename, 'rb') as file:
...         byteline = file.readline()
...         while byteline:  # readline returns empty string at EOF
...             if byteline.find(bytechar) != -1:
...                 print("found!")
...                 return byteline
...             byteline = file.readline()
...         else:
...             print("not found")
...
>>> charinfile(filename, b'\xdc')
not found
>>> charinfile(filename, b'\xdc36')
not found
>>> charinfile(filename, b'6') # make sure the code is working
found!
I've also tried versions where I use strings instead of bytes with the same results. I can confirm that there are no blank lines before EOF (have used line counters to verify that I reach ~1m rows). What am I missing?
claytond (123 rep)
Mar 12, 2025, 06:42 PM • Last activity: Mar 24, 2025, 05:13 PM
0 votes
1 answer
320 views
Snowflake/S3 Pipeline: ETL architecture Questions
I am trying to build a pipeline which sends data from Snowflake to S3 and then from S3 back into Snowflake (after running it through a production ML model on SageMaker). I am new to data engineering, so I would love to hear from the community what the recommended path is. The pipeline requirements are the following:

1. I am looking to schedule a monthly job. Do I specify this in AWS or on the Snowflake side? The monthly pulls should get the last full month (since this should be a monthly pipeline).
2. All monthly data pulls should be stored in their own S3 subfolder, like query_01012020, query_01022020, query_01032020, etc.
3. The data load from S3 (query_01012020, query_01022020, query_01032020) back to a specified Snowflake table should be triggered after the ML model has successfully scored the data in SageMaker.
4. I want to monitor the performance of the ML model in production over time, to catch whether the model's accuracy is decreasing (some calibration-like graph, perhaps).
5. I want to get error notifications in real time when issues in the pipeline occur.

I hope you are able to guide me on what components the pipeline should include. Any relevant documentation/tutorials for this effort are truly appreciated. Thank you very much.
cocoo84hh (101 rep)
Jun 14, 2020, 06:54 PM • Last activity: Mar 13, 2025, 06:02 AM
0 votes
1 answer
1032 views
Find and Insert Missing data in Mongodb Collection
I want to write Python 3 code to check for and insert missing data. My **MongoDB** collection documents have one field named "height", which is the **BTC** block number. I want to traverse a range from the starting block to the latest block number and check which numbers from that range are missing. The numbers that are missing, I want to insert. Can somebody help me with the logic? I have MongoDB version 4.
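A hedged sketch of the gap-finding half with pymongo; the database/collection names and the start bound are placeholders, while the field name height follows the question. The missing heights would then be fetched from whatever BTC source feeds the collection and inserted.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
col = client["btc"]["blocks"]                              # hypothetical db/collection

start = 0
latest = col.find_one(sort=[("height", -1)])["height"]     # highest stored block number

stored = set(col.distinct("height"))
missing = [h for h in range(start, latest + 1) if h not in stored]
print("missing heights:", missing[:20])
# for each missing height: fetch the block, then col.insert_one({"height": h, ...})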
Varsh (101 rep)
Jan 4, 2019, 11:53 AM • Last activity: Feb 7, 2025, 05:04 AM
1 vote
1 answer
2927 views
Connecting client application to MariaDB Galera cluster
I currently have a REST API written in Flask that connects to a MariaDB server. I'm thinking about replacing the server with a Galera cluster to improve availability and ensure continuity in case one of the nodes goes down. What I'm having trouble understanding is how client applications connect to the cluster itself. As I'm currently only using a single database server, I can connect using the following code:

engine = create_engine('mysql://username:password@hostaddress/database_name')

If I were to have three nodes in the cluster with the addresses 192.168.1.1, 192.168.1.2 and 192.168.1.3, how would I connect the application to the cluster? I assume that it would be possible to replace the current hostname value with one of the node IPs, e.g. 192.168.1.1, but I'd imagine that if I did that and that node went down, the application would no longer be able to connect to the cluster because it's specifically trying to connect to that one node. How can I ensure that the application continues to function if a node fails? I'm still new to the idea of Galera cluster, so apologies if I've misunderstood something about how it works. Any advice would be much appreciated.
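A hedged sketch of a crude client-side fallback (the more common production answer is a load balancer or proxy such as HAProxy or ProxySQL in front of the nodes, with the application pointed at that single address): try each node in turn until an engine can actually connect. The node IPs are the ones from the question; credentials and the liveness probe are placeholders.

from sqlalchemy import create_engine, text

NODES = ["192.168.1.1", "192.168.1.2", "192.168.1.3"]

def engine_for_cluster():
    for host in NODES:
        engine = create_engine("mysql://username:password@%s/database_name" % host)
        try:
            with engine.connect() as conn:
                conn.execute(text("SELECT 1"))   # cheap liveness probe
            return engine
        except Exception:
            continue                             # node unreachable, try the next one
    raise RuntimeError("no Galera node reachable")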
user3607758 (61 rep)
Jan 20, 2017, 07:56 AM • Last activity: Feb 5, 2025, 11:01 PM
0 votes
0 answers
58 views
How can I replicate data from a SQL Server VM in Azure to avoid recovery mode for reporting?
Currently, I'm using SQL Server on an Azure VM (DB B) to read from Power BI. DB B updates via log shipping from a primary SQL Server (DB A), and that locks me out of reading DB B twice an hour. I've considered caching strategies with Power BI to help, but I'm not certain that will solve the problem in the long term, and the transition from DirectQuery to Import mode can be a pain.

Requirements/notes for suggested solution(s):

- Not an Enterprise user.
- Changing the log shipping method to another method for updates from DB A is not an option.
- Migrating DB B to Azure SQL Database/managed DB and eliminating SQL Server on the Azure VM is not an option.
- The log shipping updates happen at the same times each hour.
- Near(ish) real-time replication would be ideal.
- Transactional replication may not be a solution because each table does not have a primary key.
- Minimizing cost would be ideal.
- Reading from DB B should always be available regardless of data consistency.
- Standing up a third DB, DB C, is an option.
- I'm hesitant to enable CDC on DB B as the basis for an ETL solution because of memory on the DB B VM, and so as not to cause an issue with the log-shipping processes.
- 5 people may send requests to DB B from time to time, but never all at once.

I'm thinking about just standing up DuckDB on a VM loaded with Linux and writing some Python scripts to update data the few times an hour when DB B is updated via log shipping. What are some of your recommended solutions?
IamTrying (11 rep)
Feb 4, 2025, 03:20 AM • Last activity: Feb 4, 2025, 11:33 AM