
Best way to store millions of measurements with hundreds of columns

1 vote
2 answers
176 views
I work with infrared spectrometers (several hundred of them) to analyze chemical compounds, and I am looking for an efficient way to store the measured data so that I can process it later. Each time a spectrometer measures a spectrum, it produces what I call an "acquisition". An acquisition consists of the following data:

- the spectrometer ID (character array, max. 15 characters);
- the ID of the measured chemical compound (character array);
- the date and time of the measurement, as a character array (e.g. 2023-05-19 13:24:00);
- internal temperatures, voltages, etc. measured by the spectrometer (about 20 measurements);
- a vector of the 500 wavelengths we wanted to measure, e.g. 2001, 2002, …, 2499, 2500;
- a vector of the 500 wavelengths that were actually measured, e.g. 2000.9456, 2002.5498, …, 2499.7648, 2500.0195;
- a vector of the 500 light intensities measured by the spectrometer (one value per wavelength), e.g. 11614.9756, 11611.9512, …, 16084.7073, 16127.7561;
- a vector of 500 values representing the stability of each of the previous light intensity measurements, e.g. 2.6314, 3.2165, …, 2.2051, 1.9872.

Note that while the first vector of target wavelengths is constant, the other three vectors are never the same from one acquisition to the next.

All of this data is saved as .csv files, one acquisition = one .csv file. The content of each .csv typically looks like this:

| Spectrometer ID | Product ID | Date/time | Temperature | Voltage |
| --- | --- | --- | --- | --- |
| S_A01_A023 | Ethanol | 2023-05-19 13:24:00 | 23.4 °C | 35.8912 V |
| 2001 | 2002 | … | 2499 | 2500 |
| 2000.9456 | 2002.5498 | … | 2499.7648 | 2500.0195 |
| 11614.9756 | 11611.9512 | … | 16084.7073 | 16127.7561 |
| 2.6314 | 3.2165 | … | 2.2051 | 1.9872 |

I already have about 6 million such acquisition files, and this number grows by about 1 million every month. So far I have only computed statistics on one spectrometer at a time, which is easily manageable in Matlab since I have at most 50k acquisitions per spectrometer: I parse the content of each .csv into a Matlab structure and save it to a .mat file, which is a few hundred megabytes at most. On my machine, loading the .mat file takes at most 5 seconds, which is acceptable.

However, I now want to compute statistics on acquisitions coming from different scenarios. One such scenario could be, for instance, to run some calculations on every ethanol spectrum measured at a temperature above 50 °C between January and May 2023 whose light intensity at 2005 nm is below 6000. In Matlab, this would mean loading a file containing all 6+ million acquisitions and discarding most of them to keep only the ~2k samples relevant to this test case. That is not acceptable: retrieving the data would be inefficient and time-consuming, and many different scenarios will need to be tested every day.

I am therefore looking for the best way to store my data so that it can be interfaced with Matlab (or Python) to quickly extract the relevant subset. It is important to note that the data must be accessible by multiple people, not just myself; often two or three people will access it simultaneously. All statistical computations will be performed in Matlab or Python. I do not know much about databases, data warehouses, SQL, Hadoop, or other systems for storing and accessing large quantities of data. For concreteness, two short Python sketches follow: one showing how a single acquisition file can be parsed, and one expressing the example scenario as a filter.
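To make the file layout concrete, here is a minimal Python sketch of how one acquisition could be parsed. The field names come from the example table above; real files carry about 20 metadata columns, so treat this as an illustration, not my actual code (which is in Matlab).

```python
import csv
from datetime import datetime

def parse_acquisition(path):
    """Parse one acquisition .csv, assuming the layout shown above:
    a header row, a metadata row, then four rows of 500 values each."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))

    header, meta = rows[0], rows[1]
    acq = dict(zip(header, meta))  # {'Spectrometer ID': 'S_A01_A023', ...}

    # Metadata values carry their units in the file ("23.4 °C", "35.8912 V").
    acq["Date/time"] = datetime.strptime(acq["Date/time"], "%Y-%m-%d %H:%M:%S")
    acq["Temperature"] = float(acq["Temperature"].split()[0])
    acq["Voltage"] = float(acq["Voltage"].split()[0])

    # The remaining four rows hold the four 500-element vectors.
    acq["target_wavelengths"] = [float(v) for v in rows[2]]
    acq["measured_wavelengths"] = [float(v) for v in rows[3]]
    acq["intensities"] = [float(v) for v in rows[4]]
    acq["stabilities"] = [float(v) for v in rows[5]]
    return acq
```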
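And this is the kind of filter I need to run, expressed in plain Python over acquisitions parsed as above (I read "between January and May 2023" as 1 January through 31 May). Today this predicate would have to scan all 6+ million acquisitions, which is exactly what I want the storage solution to avoid.

```python
from datetime import datetime

def matches_scenario(acq):
    """Example scenario: ethanol spectra measured above 50 °C between
    January and May 2023 whose intensity at 2005 nm is below 6000."""
    if acq["Product ID"] != "Ethanol":
        return False
    if acq["Temperature"] <= 50.0:
        return False
    if not (datetime(2023, 1, 1) <= acq["Date/time"] < datetime(2023, 6, 1)):
        return False
    channel = acq["target_wavelengths"].index(2005)  # position of the 2005 nm channel
    return acq["intensities"][channel] < 6000

# e.g. selected = [a for a in acquisitions if matches_scenario(a)]
```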
However, I am willing to learn, and I would be grateful for your help with this topic. I hope my explanations are clear enough; if not, do not hesitate to ask for clarification.
Asked by Xav59130 (11 rep)
May 19, 2023, 03:24 PM
Last activity: Jul 7, 2025, 12:06 AM