
Parallel PostgreSQL Queries with R

1 vote
2 answers
1387 views
I'm running a large query that, for various reasons, I've broken into a series of smaller queries. The example below just shows how the query is broken up by `id`; the actual query is much more complex:

```sql
UPDATE target_table SET col1_target = col1_source FROM source_table WHERE id >= 0 AND id < 10;
UPDATE target_table SET col1_target = col1_source FROM source_table WHERE id >= 10 AND id < 20;
UPDATE target_table SET col1_target = col1_source FROM source_table WHERE id >= 20 AND id < 30;
```

In R, I wrap each chunk in a function and run the queries in sequence:

```r
library(RPostgreSQL)

query.func <- function(num) {
  num2 <- num + 10
  drv <- dbDriver("PostgreSQL")
  con <- dbConnect(drv, dbname = "mydb")  # placeholder connection details
  q1 <- paste('UPDATE target_table SET col1_target = col1_source ',
              'FROM source_table WHERE id >= ', num, ' AND id < ', num2,
              sep = "")
  dbSendQuery(con, q1)
  dbDisconnect(con)
}

num <- seq(0, 90, by = 10)  # lower bound of each id chunk
lapply(num, query.func)
```

Rather than having this long sequence of queries run one after another, I wanted to take advantage of my server's multiple CPUs and run them in parallel. Using the R library `parallel`, I've created a cluster and sent multiple queries to it simultaneously:

```r
library(parallel)

no_cores <- detectCores() - 1
cl <- makeCluster(no_cores)
clusterExport(cl, "query.func")
clusterExport(cl, "num")
clusterEvalQ(cl, library(RPostgreSQL))
parLapply(cl, num, query.func)
stopCluster(cl)
```

where `query.func` is defined as above. When I run this R script at the terminal I receive no errors, and if I check `pg_stat_activity` I see a list of active queries, each incrementally operating on a separate block of data. However, when I use `top` to check the resource management on my system, I still only see one postgres process, and when I look at the CPU usage, I only see one CPU active at a time.

Are these queries really running simultaneously on separate CPUs? My understanding was that, prior to 9.6, a single postgres query could not be split across multiple cores, but each connection could utilize a separate core (related question). Does the process I've outlined above open multiple connections and send a query to the database via each connection?

PostgreSQL 9.3 / Ubuntu 14.04 / R 3.3.2
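One way to verify whether the workers hold separate connections is to query `pg_stat_activity` from a second session while `parLapply` is running; each connection should appear as its own backend with its own PID. A minimal sketch (the `mydb` database name and connection details are placeholders):

```r
library(RPostgreSQL)

# Run from a second R session while the parallel job is active.
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = "mydb")  # placeholder connection details

# On 9.3, pg_stat_activity exposes one row per backend, with its OS pid.
backends <- dbGetQuery(con, "
    SELECT pid, state, query
    FROM pg_stat_activity
    WHERE query LIKE 'UPDATE target_table%';
")
print(backends)  # one row per worker connection, each a separate backend

dbDisconnect(con)
```

If several distinct `pid` values show up here, that many separate postgres backend processes are doing work. In `top`, pressing `1` switches to a per-core view, and backends that are waiting on I/O or on row locks can make it look as if only one CPU is busy at a time.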
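A second check, sketched under the same placeholder connection details: have each worker return `pg_backend_pid()`, so `parLapply` itself reports which backend handled each chunk. Distinct PIDs would confirm one connection, and hence one backend process, per worker. The `query.func.pid` name is hypothetical:

```r
# Variant of query.func that also returns the backend PID serving it.
query.func.pid <- function(num) {
  num2 <- num + 10
  drv <- dbDriver("PostgreSQL")
  con <- dbConnect(drv, dbname = "mydb")  # placeholder connection details
  pid <- dbGetQuery(con, "SELECT pg_backend_pid() AS pid;")$pid
  q1 <- paste("UPDATE target_table SET col1_target = col1_source ",
              "FROM source_table WHERE id >= ", num, " AND id < ", num2,
              sep = "")
  dbSendQuery(con, q1)
  dbDisconnect(con)
  pid  # returned to the master process by parLapply
}

# unlist(parLapply(cl, num, query.func.pid)) would give one PID per chunk.
```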
Asked by Matt (291 rep)
Jan 9, 2017, 06:20 PM
Last activity: Dec 8, 2021, 08:31 AM