Parallel execution of a SQL Server external script
0
votes
0
answers
565
views
# TLDR for a reader with the same problem
If your external script is too slow, you may do something in your scritpt which can be implemented in TSQL, which is a big problem. If this is not the case use CLR or process your data in a external application where the functionality is present. Normally you should not rely on external scripts to achieve high performance. CLR is also more secure .
I am not allowed to do the above, because of workspace/interpersonal reasons.
# Problem
I have to write a query which must rely on regular expressions and I am not allowed to [use CLR](https://stackoverflow.com/questions/61488458/what-is-regexp-replace-equivalent-in-sql-server) or use anything outside the SQL Server environment. So I must use a R or Python external script. I have experience with Python. So I went with that.
I wrote the script it works but the input data is extremely large so the external script execution takes a long time. I already heavily optimized the script, but it is still slow. So I plan to parallelize it.
Python's [multiprocessing](https://docs.python.org/3/library/multiprocessing.html) module doesn't work inside the external script. So I got a strange idea:
What if I horizontally partition the temp table which holds the input data and execute the external script on the partitions using [asynchronous procedure execution](https://stackoverflow.com/questions/4571823/sql-server-tsql-is-it-possible-to-exec-statements-in-parallel), then I just get back their union?
Or perhaps there is a better way to parallel execution of an exterenal script?
I work on SQL Server 2019 Development version we also have an enterprise version which I guess doesn't matter here.
# Details
@J.D rightly asked for more details. I am only allowed to share a limited amount:
The temp table providing the input has different columns for the actual-, birth- and mother names for different people.
I also have a dataframe in the external script containing valid first names and associated genders parsed from a JSON which is passed as a nvarchar(max). This method described here .
My script extracts titles, surnames, given names and guess biological sex and martial status. The names are not English names.
I apply a function to the first axis of the input Pandas Dataframe which does the extraction.
I use only Pandas provided methods.
With a standard python environment I would use the [np.array_split](https://stackoverflow.com/questions/17315737/split-a-large-pandas-dataframe) function to split the dataframe and use the multiprocessing module to apply extraction function on the splits in parallel processes, then [concatenate](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) the dataframes. As you can see this is a text processing problem and cannot be vectorized as mathematical one could be. So tools like Dask [cannot be used here](https://stackoverflow.com/questions/45545110/make-pandas-dataframe-apply-use-all-cores) .
# Why not CLR?
Using CLR would be a sane choice. So a colleague of mine and I also recommended to use CLR. Our expert said using CLR is very dangerous. Yes I know about the CLR strict security option , but our expert may not. So I am forced to use external script against my better judgement.
# Why not use multi-threading
Python's Global Interpreter Lock only allows one thread to be executed by the interpreter. So multi-threading with python will not provide parallel computation, at least with the standard implementation. This is not a limitation for Jython or IronPython, but this is a different story.
Asked by atevm
(337 rep)
Jan 12, 2021, 08:32 PM
Last activity: Jan 20, 2021, 10:21 AM
Last activity: Jan 20, 2021, 10:21 AM