
Database Administrators

Q&A for database professionals who wish to improve their database skills

Latest Questions

1 vote
1 answer
375 views
Cassandra pool warning displaying continuously
I am using the Cassandra driver for Python in Spyder. I am trying to fetch some data from a Cassandra table. Here is my code:

from cassandra.cluster import Cluster
cluster = Cluster(['some_ip'])
session = cluster.connect('some_key_space')
df_filtered_10m = session.execute("some query")

This is all working fine and I am getting the desired results. The problem is that this message keeps popping up in the console:

> WARNING:cassandra.pool:Error attempting to reconnect to 10.0.10.91, scheduling retry in 256.0 seconds: [Errno None] Tried connecting to [('10.0.10.91', 9042)]. Last error: timed out

I have tried cluster.shutdown but it is not working either. How do I get rid of it?
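A minimal, hedged sketch of one way to quiet the reconnect noise, assuming the message comes from Python's standard logging module (as the WARNING:cassandra.pool prefix suggests): raise that logger's level so the retries are hidden without changing driver behaviour.

import logging

# Hide the reconnect warnings emitted by the driver's connection pool logger.
logging.getLogger('cassandra.pool').setLevel(logging.ERROR)

# Alternatively, shut the cluster down once the data has been fetched so the driver
# stops scheduling reconnect attempts (note that shutdown is a method call):
# cluster.shutdown()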
Osama Dar (111 rep)
Sep 12, 2018, 05:19 AM • Last activity: Aug 2, 2025, 01:04 AM
0 votes
1 answer
141 views
Booleans, CONSTANTS or mapping table for 'status'-like fields?
I am modelling a User table which needs to hold the following information about the users:

- `is_active?`
- `is_detained?`
- `has_voluntarily_deactivated?`
- `is_temporarily_suspended?`

and so on... Basically, these are `boolean` flags with `true` or `false`. So, I am considering a few approaches other than boolean flags, which are as follows (a sketch of approach 3 follows this list):

1. Create a single varchar field with values like 'active', 'detained', 'deactivated', 'suspended', etc.
2. Create a tinyint field and map the integers to another table containing status strings
3. Create a tinyint field and map the integers in code itself using constants, such as ACTIVE = 1, DETAINED = 2, etc. Is Python's enum type the best solution to this?
4. Create a tinyint field and map the integers to status strings in an XML or JSON file

Which of the above four, or the original boolean-flag approach, is preferable? If there is a completely different approach or a modified version of the above, please let me know. Also, in my code, how should I refer to these fields:

- if (user.status == 1), or something like
- if (user.status == STATUS.ACTIVE), or
- if (user.status == 'active')

(I think this will depend on which approach I follow.) These status values are not fixed and may be added, edited or removed in future. Please answer in a database-agnostic way; the programming language I am using is Python. Thank you for your answers.
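A hedged sketch of approach 3 using Python's standard enum module; the UserStatus class and its member names below are illustrative, not taken from the question.

from enum import IntEnum

class UserStatus(IntEnum):
    # Illustrative codes; the real set would mirror the tinyint values in the table.
    ACTIVE = 1
    DETAINED = 2
    DEACTIVATED = 3
    SUSPENDED = 4

# The comparison then reads like the second style in the question, while the database
# stores a compact integer:
status_from_db = 1
if UserStatus(status_from_db) == UserStatus.ACTIVE:
    print("user is active")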
Forthaction (21 rep)
Sep 18, 2016, 01:16 PM • Last activity: Jul 27, 2025, 10:01 PM
0 votes
2 answers
6413 views
Psycopg2 Errors on SQL statement when trying to copy data from CSV file into PostgreSQL database
I am not a developer or PostgreSQL DB admin, so these may be basic questions. Logistics: Windows 10 server / pgAdmin 4 / Postgres 10 / Python 2.7.13.

I'm using a Python script to ingest external data, create a CSV file and copy that into Postgres 10. I keep getting the following error:

**Psycopg2.ProgrammingError: syntax error at or near "VALUES"**

I have a two-part question.

1) I can not see the syntax error in the following SQL statement:

def insert_csv_data(sqlstmt):
    with get_conn('pg') as db:
        cur = db.cursor()
        sqlcopy = "COPY irwin (fire_id,name,type,acres,date_time,state,county,admin_unit,land_cat,commander,perc_cntnd,cont_date,gacc,lat,long,geom,updated,imo) VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,ST_SetSRID(ST_MakePoint(%s, %s),4326)%s,%s) FROM STIN DELIMITER ',' CSV HEADER"
        with open(csv_file, 'r') as f:
            #next(f)# Skipping the first line header row
            cur.copy_expert(sqlcopy, f, size=8000)
        db.commit()
        cur.close()

And 2) Once that is resolved, I'm expecting to get an error about the geometry column in Postgres. If someone would also peek at the code snippets and let me know if anything jumps out, I would SO APPRECIATE IT! This snippet pulls the external data in order, but I don't think I've coded this correctly to pull the lat/long into the geom field.

# Lat 15 - double
if not attributes['InitialLatitude'] is None:
    lat = str(attributes['InitialLatitude']).replace('\n', '')
else:
    lat = '0'
#Long 16 - double
if not attributes['InitialLongitude'] is None:
    long = str(attributes['InitialLongitude']).replace('\n', '')
else:
    long = '0'
# geom is not defined - script is dumping the geometry into the IMO field
geom = str(attributes['InitialLatitude']) + ' ' + str(attributes['InitialLongitude'])

I added a Geom header to the csv data. Please help - thanks!
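A hedged sketch of what a working statement might look like, for illustration only: PostgreSQL's COPY has no VALUES clause and cannot evaluate expressions such as ST_SetSRID(ST_MakePoint(...)), so the statement lists only the columns present in the CSV and reads FROM STDIN, with the geometry built in a second step. The cur cursor and csv_file path are reused from the question's snippet, and the lat/long column handling is an assumption.

# Sketch, not the asker's final code: COPY only moves raw column data.
copy_sql = (
    "COPY irwin (fire_id, name, type, acres, date_time, state, county, admin_unit, "
    "land_cat, commander, perc_cntnd, cont_date, gacc, lat, long, geom, updated, imo) "
    "FROM STDIN WITH (FORMAT csv, HEADER true, DELIMITER ',')"
)
with open(csv_file, 'r') as f:
    cur.copy_expert(copy_sql, f)

# The point geometry would then be set afterwards (assuming lat/long load as text):
# cur.execute("UPDATE irwin SET geom = ST_SetSRID(ST_MakePoint(long::float8, lat::float8), 4326)")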
Cödingers Cat (1 rep)
Jun 10, 2019, 06:59 PM • Last activity: Jul 25, 2025, 02:04 PM
0 votes
1 answer
160 views
InnoDB Concurrent writes being ignored
Sorry if this isn't the right place; this is my first time posting anything here. Anyway, I know I'm not supposed to, but I can think of no other way: I need to use a small MySQL table (70 rows x 6 columns) as a queue.

I'm writing a Python application that requires a work queue to be shared between process windows (not sure what the proper name for them is). Each job is repeatable, and each use must be recorded and cleared at regular intervals (so each job is "weighted" and usage is evenly and fairly distributed). I attempted to base it on another work queue where each job is NOT repeatable, but it seems that multiple writes to the database (up to 19 at once per 12 seconds) are not properly being counted up.

Is there an alternative to doing something like this? Perhaps some kind of cache sitting between Python and MySQL that would convert many single "job + 1" updates into a singular "job + 19"? I assumed that being on a shiny new NVMe drive with a more than sufficient buffer_pool_size would make it plenty fast enough to handle that, but instead of counting up properly, over the course of 60 seconds the counter may reach 9 instead of the 100+ it should be.
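A hedged illustration of the usual cause of lost increments: if the counter is read into Python, incremented, and written back, concurrent sessions overwrite each other, whereas a single atomic UPDATE inside the database does not. Table and column names below are hypothetical, not from the question.

import MySQLdb

conn = MySQLdb.connect(host="localhost", user="worker", passwd="secret", db="queue_db")
cur = conn.cursor()

job_id = 42  # placeholder for whichever job this process just used
# The increment happens inside MySQL, so 19 concurrent calls add 19, not 1.
cur.execute("UPDATE jobs SET uses = uses + 1 WHERE job_id = %s", (job_id,))
conn.commit()  # without the commit, other sessions never see the increment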
Calvin DiBartolo (1 rep)
Mar 7, 2017, 06:06 PM • Last activity: Jul 24, 2025, 02:06 AM
0 votes
1 answer
150 views
Airflow to BigQuery data load taking forever
I'm currently working as a junior data engineer. My main job right now is to move data from a MySQL DB (which gets updated every few minutes via webhooks) and send it to BigQuery as frequently as possible using Airflow, as this is our main DB for later analyzing data with Power BI.

The problem is that the bigger tables (which only have ~1000 rows) take about 2 hours to load to BQ, making this impossible to scale; I can't imagine what will happen in the future when the deltas alone are 10000 rows each. This works using pandas and SQLAlchemy by extracting data as a dataframe and using the "to_sql" method, passing all the BQ connection parameters. I am already uploading only incrementals/deltas, so that is not the problem.

Do you have any advice? Is Airflow the right tool for this? I've been searching for solutions for weeks but couldn't find anything.
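A hedged sketch of one common fix, under the assumption that the slowness comes from to_sql issuing row-by-row INSERTs against BigQuery: use BigQuery's bulk load API instead. The project, dataset and table IDs and the df_delta dataframe are placeholders.

from google.cloud import bigquery

client = bigquery.Client(project="my-project")            # assumes default credentials
job_config = bigquery.LoadJobConfig(write_disposition="WRITE_APPEND")

load_job = client.load_table_from_dataframe(
    df_delta,                                              # the incremental dataframe
    "my-project.my_dataset.my_table",
    job_config=job_config,
)
load_job.result()                                          # block until the load finishes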
Ayrton (1 rep)
Aug 28, 2022, 11:11 PM • Last activity: Jul 17, 2025, 11:03 AM
0 votes
1 answer
175 views
Advice for applying locks in my data processing pipeline
I currently have a Python program that enters rows into a Postgres table that essentially works as a list of data I need to process. These processes create on-disk files and trigger other behavior, so I only want each row to be processed once. I then have another script that takes the rows from that table and begins the processing. So, for example, there might be 100 rows, each row might take 10-20 minutes to complete, and each produces a few output files.

I currently run into the problem that I can only run this script one at a time, for fear that running two in parallel might end up with them processing the same data twice. If I create a boolean field that I flip within the application when it's 'busy', I fear having a stale lock due to an abruptly killed process that doesn't end gracefully. If I use locks as built into Postgres, it seems they disappear upon the connection/session ending. But if I'm on an unstable connection, I'm not quite sure what the behavior would be or how I can get the behavior I want. Given these are 10-20 minute processes, I foresee the connection being lost within that time frame and thus the lock being lost.

Thanks for any advice on where to go. I'm using a Python library called psycopg2 to connect to the Postgres database.
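A hedged sketch of the row-claiming pattern that usually fits this shape of problem (assuming PostgreSQL 9.5+ and a hypothetical work_items table with a processed flag): each worker claims one row with FOR UPDATE SKIP LOCKED, so two parallel workers never pick the same row, and a killed process or dropped connection releases its lock automatically.

import psycopg2

conn = psycopg2.connect("dbname=pipeline user=worker")
with conn:                                     # commits or rolls back the transaction
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, payload FROM work_items "
            "WHERE processed = false "
            "ORDER BY id LIMIT 1 "
            "FOR UPDATE SKIP LOCKED"
        )
        row = cur.fetchone()
        if row is not None:
            # ... the 10-20 minute processing would happen here ...
            cur.execute("UPDATE work_items SET processed = true WHERE id = %s", (row[0],))

Because the lock only lives as long as the transaction, a connection lost mid-job simply returns the row to the pool; if re-running a half-finished job is expensive, a claimed_at timestamp column (a lease that other workers treat as stale after some cutoff) is the usual complement.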
Pensw (1 rep)
May 21, 2022, 06:06 PM • Last activity: Jul 6, 2025, 08:07 PM
0 votes
1 answer
164 views
MySQL 'generator' not closing connections
I have the following function which is leaving connections open but I can't figure out why. The Cursor class creates a Python-generator type function, allowing me to iterate over millions of rows.

connection = MySQLdb.connect("", "", "", "", cursorclass = MySQLdb.cursors.SSCursor)
cursor = connection.cursor()
cursor.execute("select id, ...", (profile_id, ...))
try:
    for row in cursor.fetchall():
        yield row
except:
    pass
finally:
    connection.close()

Where am I going wrong?
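A hedged sketch of the most likely mechanism (standard Python generator semantics, not anything specific to MySQLdb): the finally block inside a generator only runs when the generator is exhausted, explicitly closed, or garbage-collected, so a caller that stops iterating early leaves the connection open. Wrapping the generator guarantees close() is called; row_generator and some_condition are hypothetical names.

from contextlib import closing

rows = row_generator(profile_id)       # the generator function from the question
with closing(rows):                    # guarantees rows.close(), which runs its finally
    for row in rows:
        if some_condition(row):        # hypothetical early exit
            break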
Adders (175 rep)
Aug 30, 2018, 01:40 PM • Last activity: Jul 5, 2025, 04:08 PM
1 vote
1 answer
179 views
storing histogram in sqlite database
I am not a database administrator, but a scientist who would appreciate your help in solving my issue with storing histogram data in an SQLite database. I have several of them to be stored and to be later analysed with pandas (Python). Each histogram is made of two arrays:

1. one for the bins or buckets, which are regularly spaced, let's say from min to max with a given step.
2. one for the values.

First question: how would you store the two arrays? They are rather long, up to 65k. I don't need to store the bin values; I can in principle recalculate them having the min, max and step. The value array may have several zeros, so it may be convenient to store them sparsely.

Second question: I would like to retrieve them with a select returning something like:
bin1, value1
bin2, value2
...
binN, valueN
Sorry if my question looks too basic to you, but I've been scratching my head over this problem for too long without finding any way out. Thanks in advance for your help!

## Update

As a preliminary, not really disk-space-effective solution, I have implemented something like the suggestion of @Whitel Owl. Instead of storing the two arrays as text, I'm storing them as binary BLOBs. Here is my code:
CREATE TABLE HistogramTable (
  HistogramID INTEGER PRIMARY KEY,
  ImageID INTEGER,
  Bins BLOB,
  Histo BLOB,
  FOREIGN KEY (ImageID) REFERENCES ImageTable(ImageID)
);
To get the two blobs I'm using pickle.
import pickle
import sqlite3
import numpy as np

db = sqlite3.connect('mydb.db')      # keep the connection object sqlite3 returns

data = np.random.normal(size=1000)   # placeholder for the real measurement data

histo, bins = np.histogram(data)

histo_blob = sqlite3.Binary(pickle.dumps(histo))
bins_blob = sqlite3.Binary(pickle.dumps(bins))

db.execute(
    "INSERT INTO HistogramTable (ImageID, Bins, Histo) VALUES (?, ?, ?)",
    (1, bins_blob, histo_blob),      # ImageID 1 is a placeholder
)
db.commit()
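A hedged sketch of the read-back side, continuing from the snippet above and assuming the BLOB layout is kept, which addresses the second question: unpickle the two arrays and expand them into bin/value rows (np.histogram returns one more bin edge than counts, so the left edges are paired with the counts here).

import pandas as pd

row = db.execute("SELECT Bins, Histo FROM HistogramTable WHERE ImageID = ?", (1,)).fetchone()
bins_back = pickle.loads(row[0])
histo_back = pickle.loads(row[1])

df = pd.DataFrame({"bin": bins_back[:-1], "value": histo_back})
print(df.head())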
user41796 (111 rep)
Jun 28, 2023, 08:35 PM • Last activity: Jul 5, 2025, 07:06 AM
0 votes
1 answer
32 views
How to resolve an access issue while running SQL Server Agent to execute a Python script?
I am trying to use SQL Server Agent to execute a Python file. I am executing the Python file using a PowerShell script. When I ran the SQL Server Agent job, the error message says:

Message Executed as user: NT Service\SQLAgent$SQL2022. Start-Process : This command cannot be run due to the error: "Access is denied
At F:\xx.ps1:5 char:1
+ Start-Process C:\Users\jdoe\AppData\Local\Programs\Python\Python312\p ...
+

I am thinking this happens because I originally installed Python using my own local account (username: jdoe). Today, I asked our IT to install it as Admin. Currently, Python is installed in two locations:

(when I installed it myself originally): C:\Users\jdoe\AppData\Local\Programs\Python\Python312\python.exe
(when IT installed it today): C:\Program Files\Python313\python.exe

What do I need to do to run Python code from SQL Server Agent? Where do I have to configure it?
Java (253 rep)
Jun 27, 2025, 09:05 PM • Last activity: Jun 28, 2025, 05:23 AM
2 votes
1 answer
57 views
How to group by with similar group_name values in SQL
How can I perform a GROUP BY in *SQL* when the `group_name` values are similar but not exactly the same? In my dataset, the group_name values may differ slightly (e.g., "Apple Inc.", "AAPL", "Apple"), but conceptually they refer to the same entity. The similarity might not be obvious or consistent, so I might need to define a custom rule or function like is_similar() to cluster them. For simple cases, I can extract a common pattern using regex or string functions (e.g., strip suffixes, lowercase, take prefixes). But how should I handle more complex scenarios, like fuzzy or semantic similarity?

Case:

group_name     | val
---------------|-----
'Apple Inc.'   | 100
'AAPL'         | 50
'Apple'        | 30
'Microsoft'    | 80
'MSFT'         | 70

What I want to achieve:

new_group_name  | total_val
----------------|----------
'Apple'         | 180
'Microsoft'     | 150

What are the best approaches to achieve this in *SQL*? And how would I write a query like this:

SELECT some_characteristic(group_name) AS new_group_name, SUM(val)
FROM tb1
GROUP BY new_group_name;
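A hedged sketch of the mapping-table approach the question hints at: maintain an explicit alias-to-canonical-name table and GROUP BY the canonical name. It is shown with sqlite3 purely so the SQL is runnable end to end; the same join works in any engine, and the name_map table is an assumed addition.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE tb1 (group_name TEXT, val INTEGER);
    INSERT INTO tb1 VALUES ('Apple Inc.',100),('AAPL',50),('Apple',30),('Microsoft',80),('MSFT',70);
    CREATE TABLE name_map (alias TEXT PRIMARY KEY, canonical TEXT);
    INSERT INTO name_map VALUES ('Apple Inc.','Apple'),('AAPL','Apple'),('Apple','Apple'),
                                ('Microsoft','Microsoft'),('MSFT','Microsoft');
""")
rows = con.execute("""
    SELECT m.canonical AS new_group_name, SUM(t.val) AS total_val
    FROM tb1 t JOIN name_map m ON m.alias = t.group_name
    GROUP BY m.canonical
""").fetchall()
print(rows)   # [('Apple', 180), ('Microsoft', 150)] (row order may vary)

For fuzzy or semantic similarity, the same shape still applies: the mapping table is just populated offline by whatever is_similar() logic (string distance, embeddings, manual curation) clusters the raw names.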
Ahamad (1 rep)
May 14, 2025, 08:59 AM • Last activity: May 15, 2025, 05:31 AM
2 votes
1 answer
43 views
improve the implementation of worldquant 101 alpha factors using numpy
I was trying to implement the 101 quant trading factors published by WorldQuant (https://arxiv.org/pdf/1601.00991.pdf). A typical factor processes stocks' price and volume information along both the time dimension and the stock dimension. Take the example of alpha factor #4: (-1 * Ts_Rank(rank(low), 9)). This is a momentum alpha signal. low is a panel of stocks' low prices within a certain time period. rank is a cross-sectional process that ranks each row of the panel (a time snapshot). Ts_Rank is a time-series process that applies a moving rank to each column of the panel (a stock) with a specified window.

Intuitively, a pandas dataframe or numpy matrix should fit the implementation of the 101 alpha factors. Below is the best implementation using numpy I have got so far. However, the performance was too low. On my Intel Core i7 Windows machine, it took around 45 seconds to run the alpha #4 factor with a 5000 (trade dates) by 200 (stocks) matrix as input.

I also came across DolphinDB, a time series database with built-in analytics features (https://www.dolphindb.com/downloads.html). For the same factor Alpha#4, DolphinDB ran for a mere 0.04 seconds, 1000 times faster than the numpy version. However, DolphinDB is commercial software. Does anybody know better Python implementations? Or any tips to improve my current Python code to achieve performance comparable to DolphinDB?

Here is the Python implementation:

import numpy as np

def rankdata(a, method='average', *, axis=None):
    # this rankdata refers to scipy.stats.rankdata (https://github.com/scipy/scipy/blob/v1.9.1/scipy/stats/_stats_py.py#L9047-L9153)
    if method not in ('average', 'min', 'max', 'dense', 'ordinal'):
        raise ValueError('unknown method "{0}"'.format(method))
    if axis is not None:
        a = np.asarray(a)
        if a.size == 0:
            np.core.multiarray.normalize_axis_index(axis, a.ndim)
            dt = np.float64 if method == 'average' else np.int_
            return np.empty(a.shape, dtype=dt)
        return np.apply_along_axis(rankdata, axis, a, method)
    arr = np.ravel(np.asarray(a))
    algo = 'mergesort' if method == 'ordinal' else 'quicksort'
    sorter = np.argsort(arr, kind=algo)
    inv = np.empty(sorter.size, dtype=np.intp)
    inv[sorter] = np.arange(sorter.size, dtype=np.intp)
    if method == 'ordinal':
        return inv + 1
    arr = arr[sorter]
    obs = np.r_[True, arr[1:] != arr[:-1]]
    dense = obs.cumsum()[inv]
    if method == 'dense':
        return dense
    # cumulative counts of each unique value
    count = np.r_[np.nonzero(obs), len(obs)]
    if method == 'max':
        return count[dense]
    if method == 'min':
        return count[dense - 1] + 1
    # average method
    return .5 * (count[dense] + count[dense - 1] + 1)

def rank(x):
    return rankdata(x, method='min', axis=1) / np.size(x, 1)

def rolling_rank(na):
    return rankdata(na.transpose(), method='min', axis=0)[-1].transpose()

def ts_rank(x, window=10):
    a_rolled = np.lib.stride_tricks.sliding_window_view(x, window, axis=0)
    return np.append(np.full([window - 1, np.size(x, 1)], np.nan), rolling_rank(a_rolled), axis=0)

def alpha004(data):
    return -1 * ts_rank(rank(data), 9)

import time
# The input is a 5000 by 200 matrix, where the row index represents trade date and the column index represents security ID.
data = np.random.random((5000, 200))
start_time = time.time()
alpha004(data)
print("--- %s seconds ---" % (time.time() - start_time))

output: 44.85(s)
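A hedged alternative sketch, assuming pandas >= 1.4 (where Rolling.rank exists): the same Alpha#4 composition expressed with pandas' vectorized rank operations, which avoids the per-window Python-level rankdata calls of the NumPy version. Whether it matches DolphinDB's speed is untested here.

import numpy as np
import pandas as pd

low = pd.DataFrame(np.random.random((5000, 200)))        # trade dates x securities
cross_rank = low.rank(axis=1, method='min', pct=True)     # rank(low) within each date
alpha4 = -1 * cross_rank.rolling(9).rank(method='min')    # Ts_Rank(..., 9) per security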
Huang WeiFeng (31 rep)
May 13, 2025, 09:33 AM • Last activity: May 13, 2025, 09:39 AM
1 vote
1 answer
4026 views
MySQLdb._exceptions.OperationalError: (2006, '') when closing a SQL query. Is it because of the connection?
Hello administrators! I have a question (and an issue). When closing my SQL query in my ETL, I get a connection error even though it seemed I was able to connect:

c.execute("""
    UPDATE `{table_name}`
    SET `{column_name}` = CONCAT('hash_', {expression})
    WHERE {pk_name} IN ({ids})
""".format(
    table_name=table_name,
    column_name=column_name,
    expression=expression,
    pk_name=pk_name,
    ids=','.join(ids)
))
print('.', end='', flush=True)

Indeed, I get:

(venv) C:\Users\antoi\Documents\Programming\Work\data-tools>python -m etl.main
2021-06-29 10:59:37.814133 - Connecting to database hozana_data...
2021-06-29 10:59:37.822142 - Connecting to archive database hozana_archive...
2021-06-29 10:59:38.046134 - Start ETL main process
2021-06-29 10:59:38.046134 - users table:
2021-06-29 10:59:38.046134 - Hashing column users.email: done.
2021-06-29 10:59:38.054091 - Hashing column users.email_notification:Traceback (most recent call last):
  File "C:\Users\antoi\Documents\Programming\Work\data-tools\etl\task\anonymization.py", line 17, in hash_column
    c.execute("""
  File "C:\Users\antoi\Documents\Programming\Work\data-tools\venv\lib\site-packages\MySQLdb\cursors.py", line 183, in execute
    while self.nextset():
  File "C:\Users\antoi\Documents\Programming\Work\data-tools\venv\lib\site-packages\MySQLdb\cursors.py", line 137, in nextset
    nr = db.next_result()
MySQLdb._exceptions.OperationalError: (2006, '')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\antoi\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\antoi\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\antoi\Documents\Programming\Work\data-tools\etl\main.py", line 52, in <module>
    main()
  File "C:\Users\antoi\Documents\Programming\Work\data-tools\etl\main.py", line 24, in main
    anonymization.main()
  File "C:\Users\antoi\Documents\Programming\Work\data-tools\etl\task\anonymization.py", line 59, in main
    hash_column('users', 'email_notification', 'user_id', True)
  File "C:\Users\antoi\Documents\Programming\Work\data-tools\etl\task\anonymization.py", line 50, in hash_column
    print('.', end='', flush=True)
  File "C:\Users\antoi\Documents\Programming\Work\data-tools\venv\lib\site-packages\MySQLdb\connections.py", line 239, in __exit__
    self.close()
MySQLdb._exceptions.OperationalError: (2006, '')

I thought this error happens when the client cannot send a query to the server, most likely because the server itself has closed the connection. However, I thought there was already a connection:

(screenshot: an established MySQL connection)

What's more, there are also these lines at the beginning:

2021-06-29 10:59:37.814133 - Connecting to database hozana_data...
2021-06-29 10:59:37.822142 - Connecting to archive database hozana_archive...

So when I get a MySQLdb._exceptions.OperationalError: (2006, '') while closing a SQL query, is it because of the connection?
Revolucion for Monica (677 rep)
Jun 29, 2021, 09:23 AM • Last activity: May 3, 2025, 02:04 PM
0 votes
1 answer
1935 views
error [Microsoft][ODBC Driver Manager] Data source name not found and no default driver specified when importing Excel data to SQL Server
I work on SQL Server 2017. I need to import data from Excel 2016 into SQL Server 2017, and I am using a Python script to do that. I created an ODBC data source named Testserver and its test succeeds. The path G:\ImportExportExcel gives full-control permissions to ALL APPLICATION PACKAGES and Everyone. My instance name is AHMEDSALAHSQL, my PC name is DESKTOP-L558MLK, named pipes are enabled, and the instance allows remote connections.

When I run the script below:

declare @ImportPath NVARCHAR(MAX)='G:\ImportExportExcel'
declare @DBConnectionString NVARCHAR(MAX) = 'dsn=Testserver;Uid=sa;Pwd=321'
declare @ImportAll BIT=0
declare @CombineTarget BIT=0
declare @ExcelFileName NVARCHAR(200)='dbo.studentsdata'
declare @ExcelSheetName NVARCHAR(50)='students2'

SELECT @ImportPath = CASE WHEN RIGHT(@ImportPath,1) = '\' THEN @ImportPath ELSE CONCAT(@ImportPath,'\') END

DECLARE @Serv NVARCHAR(200) = CONCAT(CHAR(39),CHAR(39),@@SERVERNAME,CHAR(39),CHAR(39))

DECLARE @ValidPath TABLE (ValidPathCheck BIT)

INSERT @ValidPath
EXEC sp_execute_external_script
  @language =N'Python',
  @script=N'
import pandas as pd
d = os.path.isdir(ImportFilePath)
OutputDataSet = pd.DataFrame([d],columns=["Filename"])'
  ,@params = N'@ImportFilePath NVARCHAR(MAX)'
  ,@ImportFilePath = @ImportPath

DECLARE @PythonScript NVARCHAR(MAX) =CONCAT('
import pandas as pd
import os
import glob
from revoscalepy import RxSqlServerData, rx_data_step
sqlConnString = "Driver=Testserver;Server=Serv; ',@DBConnectionString,'"
Filefolderepath = ImportFilePath+"*.xlsx"
if ImportAll ==0:
    Filename =ImportFilePath+ExcelFileName+".xlsx"
    exists = os.path.isfile(Filename)
    if exists and ExcelSheetName in pd.ExcelFile(Filename).sheet_names:
        Output = pd.read_excel(Filename, sheetname=ExcelSheetName, na_filter=False).astype(str)
        if not Output.empty:
            sqlDS = RxSqlServerData(connection_string = sqlConnString,table = "".join(fl for fl in ExcelFileName if fl.isalnum())+"_"+"".join(sh for sh in ExcelSheetName if sh.isalnum()))
            rx_data_step(input_data = Output, output_file = sqlDS,overwrite = True)
    else:
        print("Invalid Excel file or sheet name")')

EXEC sp_execute_external_script
  @language = N'Python'
  ,@script = @PythonScript
  ,@params = N'@ImportFilePath NVARCHAR(MAX),@ImportAll BIT,@CombineTarget BIT,@ExcelFileName NVARCHAR(200),@ExcelSheetName NVARCHAR(50),@Serv NVARCHAR(200)'
  ,@ImportFilePath = @ImportPath
  ,@ImportAll = @ImportAll
  ,@CombineTarget = @CombineTarget
  ,@ExcelFileName = @ExcelFileName
  ,@ExcelSheetName = @ExcelSheetName
  ,@Serv = @Serv

I get this error:

Msg 39004, Level 16, State 20, Line 0
A 'Python' script error occurred during execution of 'sp_execute_external_script' with HRESULT 0x80004004.
Msg 39019, Level 16, State 2, Line 0
An external script error occurred:
[Microsoft][ODBC Driver Manager] Data source name not found and no default driver specified
Error in execution. Check the output for more information.
DataStep error: [Microsoft][ODBC Driver Manager] Data source name not found and no default driver specified

**So can anyone help me solve this issue?** I added an ODBC connection on my PC and its test succeeds.
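A hedged illustration of the naming distinction that commonly produces this exact ODBC message: Testserver is a data source name (DSN), not a driver, so a connection string saying Driver=Testserver asks the ODBC Driver Manager for a driver that does not exist. The sketch below only shows the two well-formed connection-string shapes using pyodbc; it assumes "ODBC Driver 17 for SQL Server" is installed, and the same shapes would apply to the sqlConnString built inside the script above.

import pyodbc

# Shape 1: reference the existing DSN by name.
conn = pyodbc.connect("DSN=Testserver;UID=sa;PWD=321")

# Shape 2: skip the DSN and name an installed driver explicitly.
conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=DESKTOP-L558MLK\\AHMEDSALAHSQL;"
    "Database=master;UID=sa;PWD=321"
)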
user3223372 (1 rep)
Apr 16, 2022, 02:56 AM • Last activity: Apr 16, 2025, 09:02 PM
0 votes
3 answers
1186 views
Overwriting MySQL database to only store 1 month of data
We are logging data on hardware with very little storage, only 4 GB. We only require the data to be stored for 1 month and then be overwritten in a way that overwrites the oldest data first. The storage on the hardware is very small, so we cannot continue to record indefinitely. We are using a MySQL database, and the hardware it runs on is not always powered on, as it is in a vehicle. The data will be viewed in a graph to show historical values over time.

A few options I have thought of but am not sure how to execute (let's assume I will record 1 million rows of data in a month):

1. When the table (table1) gets to 1 million rows, move it aside and start a new table (table2). When table2 reaches 1 million rows, delete table1, move table2 aside and create table3, etc. This way there is always at least 1 month of entries.
2. (Not sure if this is possible:) when the table gets to 1 million rows, it starts to overwrite from row 1 again.
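A hedged sketch of a third approach that avoids table rotation entirely, assuming the table has (or can be given) a timestamp column such as logged_at; the names here are placeholders. A purge like this can be run by the logging script itself each time the vehicle powers up, so no always-on scheduler is needed.

import MySQLdb

conn = MySQLdb.connect(host="localhost", user="logger", passwd="secret", db="vehicle_logs")
cur = conn.cursor()

# Keep one month of history; anything older is removed on each run.
cur.execute("DELETE FROM log_entries WHERE logged_at < NOW() - INTERVAL 1 MONTH")
conn.commit()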
Phil (1 rep)
Jan 29, 2020, 09:09 PM • Last activity: Apr 14, 2025, 03:00 AM
0 votes
1 answer
843 views
IIS web application access SQL DB as service account
I've set up a new Python site on IIS using the FastCGI handler. The site has Windows authentication enabled in IIS, and the app checks that the AD user belongs to an Active Directory group when they access the site. If authorisation fails, access is denied. Windows authentication uses Kerberos, but it is not a double hop.

However, the web app reads/writes to a SQL Server database, and the DB calls are made using the service account which runs the app pool. The service account has limited access to run the web app and can only access the one database that the web app uses. The app does log which user has accessed the DB.

I've read that impersonation would be better from a DB security perspective, using constrained delegation. I don't remember the URL now, but it was essentially stating that the SQL database should check that the actual AD user who is using the web app has access to the database, as opposed to the database checking that the service account has access. Is there any obvious security risk with the approach I'm using?
DeadlyDan (111 rep)
Apr 13, 2022, 11:10 AM • Last activity: Apr 6, 2025, 12:06 AM
1 vote
3 answers
176 views
ERROR: invalid byte sequence for encoding "UTF8": 0xdc 0x36
When running a \copy (either pgadmin or aws_s3.table_import_from_s3) of a 1.6GB file into an AWS Aurora Postgres-compatible database, I'm getting the following error:
ERROR:  invalid byte sequence for encoding "UTF8": 0xdc 0x36
CONTEXT:  COPY staging, line 99779: "L24000403170365 ACTIVEZONE LLC                                                                      ..."
EDIT: Here's what I could pull for table definition (but let me know if you want more):

| column_name | data_type | character_maximum_length | is_nullable | column_default |
| ----------- | --------- | ------------------------ | ----------- | -------------- |
| raw         | text      | [null]                   | YES         | [null]         |

EDIT: I also tried to change the column to bytea with no effect. The source is supposed to be ASCII, but I get the same error with explicit encodings like utf8, latin1, win1251, and win1252.

EDIT: As requested in a reply, here's more information about the import commands. In pgadmin4, I'm right-click importing into the table which shows the following under the covers:
--command " "\\copy public.staging (\"raw\") FROM 'C:/data.txt' DELIMITER '|' ENCODING 'UTF8';""
I also use pgadmin4 to trigger the s3 table import by calling the query:
SELECT aws_s3.table_import_from_s3(
   'staging',
   '', 
   '(DELIMITER ''|'')',
   aws_commons.create_s3_uri('data', 'data.txt', 'us-east-1')
);
Under the covers, table_import_from_s3 calls the command:
copy staging from '/rdsdbdata/extensions/aws_s3/{{internal filename}}' with (DELIMITER '|')
The answer to similar questions is to clean up the source data so I pulled up python and tried to find the offending character. I couldn't find any evidence of an unusual character at or around the referenced line. For the sake of argument, I believe the following will scan the entire file (and you can see the results inline):
>>> def charinfile(filename, bytechar):
...     with open(filename, 'rb') as file:
...         byteline = file.readline()
...         while byteline:  # readline returns empty string at EOF
...             if byteline.find(bytechar) != -1:
...                 print("found!")
...                 return byteline
...             byteline = file.readline()
...         else:
...             print("not found")
...
>>> charinfile(filename, b'\xdc')
not found
>>> charinfile(filename, b'\xdc36')
not found
>>> charinfile(filename, b'6') # make sure the code is working
found!
I've also tried versions where I use strings instead of bytes with the same results. I can confirm that there are no blank lines before EOF (have used line counters to verify that I reach ~1m rows). What am I missing?
claytond (123 rep)
Mar 12, 2025, 06:42 PM • Last activity: Mar 24, 2025, 05:13 PM
0 votes
1 answer
320 views
Snowflake/S3 Pipeline: ETL architecture Questions
I am trying to build a pipeline which sends data from Snowflake to S3 and then from S3 back into Snowflake (after running it through a production ML model on SageMaker). I am new to data engineering, so I would love to hear from the community what the recommended path is. The pipeline requirements are the following:

1. I am looking to schedule a monthly job. Do I specify this in AWS or on the Snowflake side? The monthly pulls should get the last full month (since this should be a monthly pipeline).
2. All monthly data pulls should be stored in their own S3 subfolder, like query_01012020, query_01022020, query_01032020, etc.
3. The data load from S3 (query_01012020, query_01022020, query_01032020) back to a specified Snowflake table should be triggered after the ML model has successfully scored the data in SageMaker.
4. I want to monitor the performance of the ML model in production over time, to catch whether the model's accuracy is decreasing (some calibration-like graph, perhaps).
5. I want to get error notifications in real time when issues in the pipeline occur.

I hope you are able to guide me on what components the pipeline should include. Any relevant documentation/tutorials for this effort are truly appreciated. Thank you very much.
cocoo84hh (101 rep)
Jun 14, 2020, 06:54 PM • Last activity: Mar 13, 2025, 06:02 AM
0 votes
1 answer
1032 views
Find and Insert Missing data in Mongodb Collection
I want to write Python 3 code to check for and insert missing data. My **MongoDB** collection documents have one field named "height", which is the **BTC** block number. I want to traverse a range from the starting block to the latest block number and check which numbers from that range are missing. The numbers that are missing, I want to insert. Can somebody help me with the logic? I have MongoDB version 4.
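A hedged sketch of the gap-finding half with pymongo; the database/collection names and the start bound are placeholders, while the field name height follows the question. The missing heights would then be fetched from whatever BTC source feeds the collection and inserted.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
col = client["btc"]["blocks"]                              # hypothetical db/collection

start = 0
latest = col.find_one(sort=[("height", -1)])["height"]     # highest stored block number

stored = set(col.distinct("height"))
missing = [h for h in range(start, latest + 1) if h not in stored]
print("missing heights:", missing[:20])
# for each missing height: fetch the block, then col.insert_one({"height": h, ...})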
Varsh (101 rep)
Jan 4, 2019, 11:53 AM • Last activity: Feb 7, 2025, 05:04 AM
1 vote
1 answer
2927 views
Connecting client application to MariaDB Galera cluster
I currently have a REST API written in Flask that connects to a MariaDB server. I'm thinking about replacing the server with a Galera cluster to improve availability and ensure continuity in case one of the nodes goes down. What I'm having trouble understanding is how client applications connect to the cluster itself. As I'm currently only using a single database server, I can connect using the following code:

engine = create_engine('mysql://username:password@hostaddress/database_name')

If I were to have three nodes in the cluster with the addresses 192.168.1.1, 192.168.1.2 and 192.168.1.3, how would I connect the application to the cluster? I assume that it would be possible to replace the current hostname value with one of the node IPs, e.g. 192.168.1.1, but I'd imagine that if I did that and that node went down, the application would no longer be able to connect to the cluster because it's specifically trying to connect to that one node. How can I ensure that the application continues to function if a node fails? I'm still new to the idea of Galera cluster, so apologies if I've misunderstood something about how it works. Any advice would be much appreciated.
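A hedged sketch of a crude client-side fallback (the more common production answer is a load balancer or proxy such as HAProxy or ProxySQL in front of the nodes, with the application pointed at that single address): try each node in turn until an engine can actually connect. The node IPs are the ones from the question; credentials and the liveness probe are placeholders.

from sqlalchemy import create_engine, text

NODES = ["192.168.1.1", "192.168.1.2", "192.168.1.3"]

def engine_for_cluster():
    for host in NODES:
        engine = create_engine("mysql://username:password@%s/database_name" % host)
        try:
            with engine.connect() as conn:
                conn.execute(text("SELECT 1"))   # cheap liveness probe
            return engine
        except Exception:
            continue                             # node unreachable, try the next one
    raise RuntimeError("no Galera node reachable")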
user3607758 (61 rep)
Jan 20, 2017, 07:56 AM • Last activity: Feb 5, 2025, 11:01 PM
0 votes
0 answers
58 views
How can I replicate data from a SQL Server VM in Azure to avoid recovery mode for reporting?
Currently, I'm using SQL Server on an Azure VM (DB B) to read from Power BI. DB B updates via log shipping from a primary SQL Server (DB A), and that locks me out of reading DB B twice an hour. I've considered caching strategies with Power BI to help, but I'm not certain that will solve the problem in the long term, and the transition from DirectQuery to Import mode can be a pain.

Requirements/notes for suggested solution(s):

- Not an Enterprise user.
- Changing the log shipping method to another method for updates from DB A is not an option.
- Migrating DB B to Azure SQL Database/managed DB and eliminating SQL Server on the Azure VM is not an option.
- The log shipping updates happen at the same times each hour.
- Near(ish) real-time replication would be ideal.
- Transactional replication may not be a solution because each table does not have a primary key.
- Minimizing cost would be ideal.
- Reading from DB B should always be available regardless of data consistency.
- Standing up a third DB, DB C, is an option.
- I'm hesitant to enable CDC on DB B as the basis for an ETL solution because of memory on the DB B VM, and so as not to cause an issue with the log-shipping processes.
- 5 people may send requests to DB B from time to time, but never all at once.

I'm thinking about just standing up DuckDB on a VM loaded with Linux and writing some Python scripts to update data the few times an hour when DB B is updated via log shipping. What are some of your recommended solutions?
IamTrying (11 rep)
Feb 4, 2025, 03:20 AM • Last activity: Feb 4, 2025, 11:33 AM