I am trying to find out what the state of the art is for combining databases, Python, and big data.
My starting point was SQL Server, multiprocessing with pandas, and dask. Imagine I need to maintain a database with more than 1 billion rows, keep inserting into it, and also run complex, larger-than-memory analyses on it in parallel.
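To make the setup concrete, this is roughly what my current SQL Server + pandas path looks like (a minimal sketch; the server, table, and column names are placeholders):

```python
import urllib.parse
import pandas as pd
import sqlalchemy as sa

# Hypothetical connection details; server, database, and table names are placeholders
params = urllib.parse.quote_plus(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes"
)
engine = sa.create_engine(
    f"mssql+pyodbc:///?odbc_connect={params}", fast_executemany=True
)

# Insert one batch (about 100k rows in my real case) into the big table
batch = pd.DataFrame({"user_id": range(100_000), "amount": 1.0})
batch.to_sql("events", engine, if_exists="append", index=False, chunksize=10_000)

# Read back the first 1M rows
head = pd.read_sql("SELECT TOP (1000000) * FROM events", engine)
```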
Some drawbacks: SQL Server is very slow at both inserting and extracting data. Inserting 100k rows takes about 1 second, and reading the first 1M rows takes 5s+. That speed is very unsatisfactory compared with dask on Parquet. However, with dask and Parquet I cannot keep inserting into this "more than 1 billion rows" database. Multi-indexes / non-clustered indexes are also not supported, which even makes some previously fast SQL joins slower....
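For comparison, this is the kind of dask + Parquet workflow I mean (a minimal sketch; the dataset path and columns are made up, and `append=True` is what I would rely on to keep adding rows):

```python
import dask.dataframe as dd
import pandas as pd

# Lazily open the existing (larger-than-memory) Parquet dataset
ddf = dd.read_parquet("data/events.parquet")

# A larger-than-memory aggregation, computed in parallel across partitions
totals = ddf.groupby("user_id")["amount"].sum().compute()

# Append a new batch of rows as extra partitions of the same dataset
new_batch = pd.DataFrame({"user_id": [1, 2], "amount": [3.5, 7.0]})
dd.from_pandas(new_batch, npartitions=1).to_parquet(
    "data/events.parquet",
    append=True,            # add new row groups instead of overwriting
    ignore_divisions=True,  # the new batch is not sorted against the old index
)
```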
I looked around and found Apache Spark (Spark SQL / PySpark). But I'm a bit unsure if that is the correct step forward. Any suggestions? Thanks!
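If it helps, this is roughly what I imagine the PySpark version would look like (a minimal sketch; paths and column names are placeholders, and I have not benchmarked it):

```python
from pyspark.sql import SparkSession

# Minimal local Spark session; app name and paths are placeholders
spark = SparkSession.builder.appName("billion-row-test").getOrCreate()

# Read the existing Parquet dataset (can exceed memory; Spark spills to disk)
events = spark.read.parquet("data/events.parquet")

# Run a SQL-style aggregation across all rows
events.createOrReplaceTempView("events")
summary = spark.sql(
    "SELECT user_id, SUM(amount) AS total FROM events GROUP BY user_id"
)
summary.show(10)

# Append new rows to the same dataset
new_batch = spark.createDataFrame([(1, 3.5), (2, 7.0)], ["user_id", "amount"])
new_batch.write.mode("append").parquet("data/events.parquet")
```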
Asked by thinker
(121 rep)
Aug 1, 2021, 04:23 AM
Last activity: Aug 1, 2021, 04:59 AM