Database Administrators
Q&A for database professionals who wish to improve their database skills
Latest Questions
0
votes
1
answers
150
views
Airflow to BigQuery data load taking forever
I'm currently working as a junior data engineer. My main job right now is to move data from a MySQL database (which gets updated every few minutes via webhooks) and send it to BigQuery as frequently as possible using Airflow, since this is our main DB for later analyzing data with Power BI.
The problem is that the bigger tables (which only have ~1000 rows) take about 2 hours to load to BQ, which makes this impossible to scale. I can't imagine what will happen in the future when the deltas alone are 10000 rows each...
This works using pandas and SQLAlchemy by extracting the data as a dataframe and using the "to_sql" method, passing all the BQ connection parameters.
I am already uploading only incrementals/deltas, so that is not the problem.
Do you have any advice? Is Airflow the right tool for this? I've been searching for solutions for weeks but couldn't find anything.
Ayrton
(1 rep)
Aug 28, 2022, 11:11 PM
• Last activity: Jul 17, 2025, 11:03 AM
0
votes
1
answers
678
views
Monthly Growth - BigQuery
I need your help again. I need to calculate the monthly growth for the trips.
The query that I have written is below:
SELECT EXTRACT(YEAR FROM DATE (stoptime)) AS Year, EXTRACT(MONTH FROM DATE (stoptime)) AS Month,
COUNT(bikeid) AS Trips,
ROUND(100* (COUNT(bikeid)- LAG(COUNT(bikeid),1,0) OVER (PARTITION BY 'Year' ORDER BY 'Year', 'Month'))/lag(COUNT(bikeid),1) OVER
(PARTITION BY 'Year' ORDER BY 'Year', 'Month')) ||'%' AS growth
FROM bigquery-public-data.new_york_citibike.citibike_trips
WHERE end_station_name LIKE '%University%'
GROUP BY Year, Month
ORDER BY Year DESC, Month DESC
However, the query above does not give me the right output (output screenshot omitted). I do not know which part of my query is wrong; kindly help me with this. My desired output example was attached as an image (year, month, trips, and a growth percentage per month). Thank you!
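For anyone comparing notes, a hedged sketch of one likely fix, not necessarily the asker's intended logic: in the original, `PARTITION BY 'Year' ORDER BY 'Year', 'Month'` uses string literals rather than the computed columns, so the LAG() offsets are taken over an effectively arbitrary ordering. Aggregating per month in a CTE and ordering the window by the real columns avoids that; whether growth should reset each year is left aside here.

```sql
-- Possible rewrite: aggregate per month first, then take LAG over the real
-- year/month columns instead of the string literals 'Year' and 'Month'.
WITH monthly AS (
  SELECT
    EXTRACT(YEAR FROM DATE(stoptime)) AS year,
    EXTRACT(MONTH FROM DATE(stoptime)) AS month,
    COUNT(bikeid) AS trips
  FROM `bigquery-public-data.new_york_citibike.citibike_trips`
  WHERE end_station_name LIKE '%University%'
  GROUP BY year, month
)
SELECT
  year,
  month,
  trips,
  ROUND(100 * (trips - LAG(trips) OVER (ORDER BY year, month))
        / LAG(trips) OVER (ORDER BY year, month), 1) AS growth_pct
FROM monthly
ORDER BY year DESC, month DESC;
```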

snsd94
(1 rep)
Feb 3, 2021, 05:07 PM
• Last activity: Jun 9, 2025, 07:08 AM
1
votes
1
answers
4924
views
Retrieving executed query list in BigQuery via SQL
Is there a table similar to v$sql in Oracle where I can retrieve data associated with a particular query that was run in BigQuery using SQL?
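For later readers: BigQuery has since added INFORMATION_SCHEMA job views, which are the closest analogue to Oracle's v$sql that I'm aware of. A minimal sketch (the region qualifier and the retention window depend on your project):

```sql
-- List recent queries run in the project, similar in spirit to Oracle's v$sql.
-- Adjust the region qualifier (e.g. region-us, region-eu) to where your jobs run.
SELECT
  creation_time,
  user_email,
  job_id,
  total_bytes_processed,
  query
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
  AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
ORDER BY creation_time DESC;
```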
kpod
(11 rep)
Jun 10, 2019, 12:30 AM
• Last activity: Jan 8, 2025, 07:03 AM
0
votes
0
answers
82
views
BigQuery can't connect to MySQL Cloud SQL after upgrade from 5.7 to 8.0
I have a similar problem to the one described here, but the suggestions there don't seem to work. When trying to execute a query I get this:
Invalid table-valued function EXTERNAL_QUERY Failed to connect to MySQL database. Error:
MysqlErrorCode(2059):
Authentication plugin 'caching_sha2_password' cannot be loaded:
/usr/lib/plugin/caching_sha2_password.so:
cannot open shared object file: No such file or directory at [1:15]
This happens when trying to query from BigQuery to a MySQL 8.0.37 Cloud SQL db. I tried with two different users; the same credentials work fine from Cloud SQL Studio. This used to work before upgrading to 8.0 from 5.7. I did change the auth type for the user to caching_sha2_password, and the db is running with caching_sha2_password as the default auth.
The Cloud SQL server log shows:
[Note] [MY-010914] [Server] Got an error reading communication packets
for each query attempt.
External connections (the mysql command line and IntelliJ's data grid) and Cloud Run JDBC connections work too; it is just the BigQuery external access that fails.
Any ideas?
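One workaround that is often suggested for this class of error, sketched below as an assumption rather than a confirmed fix: switch only the user that the EXTERNAL_QUERY connection uses back to mysql_native_password on the Cloud SQL instance, since the federated-query client may not support caching_sha2_password. The user name is hypothetical; the server default plugin can stay as it is.

```sql
-- Hedged workaround sketch, run against the Cloud SQL MySQL 8.0 instance.
-- 'bq_federation'@'%' is a placeholder: use the user configured in the
-- BigQuery external connection.
ALTER USER 'bq_federation'@'%' IDENTIFIED WITH mysql_native_password BY 'your-password';
```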
cputoaster
(101 rep)
Oct 29, 2024, 01:47 PM
• Last activity: Oct 29, 2024, 02:34 PM
0
votes
0
answers
49
views
How can a data warehouse like BigQuery parallelize queries that Postgres cannot?
In order to get more practice with database management and reporting tools, I have been working to load the recent SSN data leak into Postgres. Since it is such a large dataset, it has served as a really good learning opportunity so far.
The dataset is a csv file that is formatted slightly incorrectly. I decided the best route forward was to import each line as a string and use some basic regex to extract the columns. I ran a query in postgres to extract a single column from the dataset with my regex, and it took more than 12 hours to run on an M3 Max. I didn't let it complete, so I don't know how long it would have actually taken. I did some research and did what I could find to optimize postgres for large database modifications, including writing to a new table instead of editing in place, and removing all indexes. These operations always ran on a single thread. From what I found in my research, if the postgres query planner can run a query on multiple threads it will, but most queries cannot be parallelized in this way.
Then I tried another method for cleaning the data: BigQuery. It took some time to get the dataset into GCP Storage, but once I did, it was incredibly fast. The total time for BigQuery to load 350GB of csv, run 12 different regular expressions against each, and output that to a new csv was about 10 minutes.
Here is my question: how can BigQuery complete the same query so much faster than Postgres? What is architecturally different about a data warehouse that allows it to parallelize workflows that Postgres cannot? Why doesn't Postgres apply these lessons and parallelize its own queries?
Edit: Are there any projects that allow me to run a data warehouse like database locally? Or are they all tied to proprietary clouds like BigQuery/Snowflake/etc?
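For concreteness, the per-row transformation being compared looks roughly like the sketch below in BigQuery; the `raw_lines` table and the patterns are made up, while the real job ran a dozen such REGEXP_EXTRACT expressions per line. This kind of work is embarrassingly parallel, which is exactly what a distributed engine spreads across many workers.

```sql
-- Illustrative only: extract columns from CSV lines stored as a single STRING.
SELECT
  REGEXP_EXTRACT(raw_line, r'^([^,]*)') AS first_field,
  REGEXP_EXTRACT(raw_line, r'^[^,]*,([^,]*)') AS second_field
FROM `my_project.my_dataset.raw_lines`;
```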
Spenser
(1 rep)
Aug 21, 2024, 01:46 PM
• Last activity: Aug 21, 2024, 02:10 PM
0
votes
1
answers
38
views
Google Datastream errors on larger MySQL tables
I have set up a Datastream service, in order to replicate data from Cloud SQL (MySQL) to BigQuery.
Everything is set up correctly and the connection works. But the weird thing is that only tables < 10 MB in size are replicated without issues.
The larger tables (100+ MB) all fail. When checking the error status, it only says "Timed-out while waiting for the query to complete."
I have not found anything useful regarding this error. What approaches can I try? Backfilling a specific table gives the same error.
Source (MySQL) database connection flags are set to the recommended values:
net_read_timeout: 3600 seconds (1 hour)
net_write_timeout: 3600 seconds (1 hour)
wait_timeout: 86400 seconds (24 hours)
Spock
(101 rep)
Mar 30, 2024, 02:57 PM
• Last activity: Apr 3, 2024, 06:40 AM
-1
votes
1
answers
4693
views
How to return missing rows from LEFT JOIN in BigQuery
How can we get BigQuery to return the rows in a LEFT JOIN which exist in TABLE A but are NULL in TABLE B?
-- find missing users. return rows which exist in A but not in B
select a.user_id
from database_a a
left outer join database_b b on a.user_id = b.user_id
where b.user_id is null
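For comparison, a minimal equivalent sketch written as NOT EXISTS (same table and column names as above); in BigQuery both forms return the user_ids present in database_a with no match in database_b:

```sql
-- Anti-join expressed with NOT EXISTS instead of LEFT JOIN ... IS NULL.
SELECT a.user_id
FROM database_a AS a
WHERE NOT EXISTS (
  SELECT 1
  FROM database_b AS b
  WHERE b.user_id = a.user_id
);
```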
crazy8
(131 rep)
Aug 18, 2021, 11:53 PM
• Last activity: Jan 7, 2024, 12:04 AM
1
votes
1
answers
249
views
Aggregate SCD Type 2 data "as of" each day
## Problem
When working with SCD type 2 data it's easy to see the state of a table "as of" a given point in time by using your date columns, e.g. `valid_from` and `valid_to`. For example:

select * from table
where '2023-11-01' between valid_from and valid_to
You can then group by or use window functions to aggregate.
But what if I want to do that repeatedly, for each date in a range (eg: daily)? I do not need to aggregate *across* these dates, just within each one.
## Example
Let's say I have a table that tracks the `quantity` of people for each `reservation_id`. Both the `quantity` of people and the `reservation_status` can change over time, as tracked by the table. Each reservation is attached to an `event_id`. `event_date` is included to help limit scope if needed (see assumptions below).
| reservation_id | reservation_status | quantity | event_id | event_date | valid_from | valid_to |
| -------------- | ------------------ | -------- | -------- | ---------- | ---------- | -------- |
| 1 | active | 4 | 100 | 2023-05-25 | 2023-01-01 | 2023-01-02 |
| 2 | active | 2 | 200 | 2024-01-07 | 2023-01-01 | 2023-01-03 |
| 3 | active | 7 | 100 | 2023-05-25 | 2023-01-02 | 9999-12-31 |
| 4 | active | 1 | 200 | 2024-01-07 | 2023-01-02 | 9999-12-31 |
| 1 | active | 5 | 100 | 2023-05-25 | 2023-01-03 | 9999-12-31 |
| 5 | active | 8 | 100 | 2023-05-25 | 2023-01-03 | 9999-12-31 |
| 2 | cancelled | 2 | 200 | 2024-01-07 | 2023-01-04 | 9999-12-31 |
| 6 | active | 3 | 100 | 2023-05-25 | 2023-01-06 | 9999-12-31 |
PostgreSQL is used since BigQuery is more difficult to test – db-fiddle (link omitted). SQL:
CREATE TABLE Reservations (
"reservation_id" INTEGER,
"reservation_status" VARCHAR(9),
"quantity" INTEGER,
"event_id" INTEGER,
"event_date" DATE,
"valid_from" DATE,
"valid_to" DATE
);
INSERT INTO Reservations
("reservation_id", "reservation_status", "quantity", "event_id", "event_date", "valid_from", "valid_to")
VALUES
('1', 'active', '4', '100', '2023-05-25', '2023-01-01', '2023-01-02'),
('2', 'active', '2', '200', '2024-01-07', '2023-01-01', '2023-01-03'),
('3', 'active', '7', '100', '2023-05-25', '2023-01-02', '9999-12-31'),
('4', 'active', '1', '200', '2024-01-07', '2023-01-02', '9999-12-31'),
('1', 'active', '5', '100', '2023-05-25', '2023-01-03', '9999-12-31'),
('5', 'active', '8', '100', '2023-05-25', '2023-01-03', '9999-12-31'),
('2', 'cancelled', '2', '200', '2024-01-07', '2023-01-04', '9999-12-31'),
('6', 'active', '3', '100', '2023-05-25', '2023-01-06', '9999-12-31');
While this is ultimately for BigQuery, an answer will be accepted in any dialect as long as it's somewhat generic.
##### Assumptions
- "As of" dates can be a list or range based off `valid_from` min/max
- A `valid_to` of `9999-12-31` marks the most recent data
- All reservations for a given event will be between `event_date - INTERVAL '2 years'` and `event_date`. This doesn't change anything for this example, but maybe is useful for scaling(?)
### Desired output
I'd like to know the sum of `quantity` grouped by `event_id` and `reservation_status` *as of* each interval (day).
| as_of_date | event_id | reservation_status | sum_quantity |
| ---------- | -------- | ------------------ | ------------ |
| 2023-01-01 | 100 | active | 4 |
| 2023-01-01 | 200 | active | 2 |
| 2023-01-02 | 100 | active | 11 |
| 2023-01-02 | 200 | active | 2 |
| 2023-01-03 | 100 | active | 20 |
| 2023-01-03 | 200 | active | 3 |
| 2023-01-04 | 100 | active | 20 |
| 2023-01-04 | 200 | active | 1 |
| 2023-01-04 | 200 | cancelled | 2 |
| 2023-01-06 | 100 | active | 23 |
*Rough estimate of row values. Will be different if using complete range of dates.*
I essentially want to do the following:
/* Invalid SQL, just for conceptual purposes */
-- Given a list of dates, for each "date":
select
event_id,
reservation_status,
sum(quantity)
from table
where {{date}} between valid_from and valid_to
group by
event_id,
reservation_status
I believe this can be done using procedural language, such as a for loop, but I feel like I am overthinking that, and just having trouble combining simpler concepts.
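One loop-free shape this can take, sketched in BigQuery syntax against the sample table above; the as-of date range is an assumption lifted from the sample data, and the inclusive BETWEEN boundary handling should be checked against the real SCD conventions:

```sql
-- Sketch: aggregate SCD2 rows "as of" every day in a range without a loop.
WITH as_of_dates AS (
  SELECT d AS as_of_date
  FROM UNNEST(GENERATE_DATE_ARRAY('2023-01-01', '2023-01-06')) AS d
)
SELECT
  d.as_of_date,
  r.event_id,
  r.reservation_status,
  SUM(r.quantity) AS sum_quantity
FROM as_of_dates AS d
JOIN Reservations AS r
  ON d.as_of_date BETWEEN r.valid_from AND r.valid_to
GROUP BY d.as_of_date, r.event_id, r.reservation_status
ORDER BY d.as_of_date, r.event_id, r.reservation_status;
```

On a large table this relies on the engine handling the non-equi join well; limiting the Reservations scan by event_date (per the assumption above) would keep the join small.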
camtech
(143 rep)
Nov 28, 2023, 05:58 AM
• Last activity: Nov 29, 2023, 08:03 PM
0
votes
1
answers
59
views
Does the BigQuery API offer a way to retrieve info on scheduled queries?
Using the BigQuery C# API, I can retrieve a list of job IDs:
BigQueryClient _client = BigQueryClient.Create(...);
...
foreach (var page in _client.ListJobs(projectId).AsRawResponses())
if (page.Jobs != null) // This does happen occasionally
foreach (var job in page.Jobs)
yield return job.Id;
This seems to give me all jobs ever run (or at least within some significant time horizon; it's tens of thousands of records). Still, I'd like to get the details for some jobs and see if I'm at least on the right track. I can retrieve a `BigQueryJob` object using `BigQueryClient.GetJob()` (there's no C# doc, but the Java sample code is similar), but the information returned is very limited: current state, any errors encountered, some basic statistics, etc. There's nothing about schedules.
Is there a separate command to retrieve details on scheduled queries? I can't find any such methods in the client.
Jon of All Trades
(5987 rep)
Sep 13, 2023, 03:16 PM
1
votes
1
answers
1587
views
Using COUNT returns the same value when counting two different columns
I am new to SQL and am using it for a school project and would like some help.
I have a table with multiple columns about a fictional company that provides a bike sharing service. One of the columns I am trying to count is called member_casual, a string that is either casual or member and reports whether a user is a member or not. The second column I am trying to count is a string column called rideable_type that reports what kind of bike was used. I want to create a table that returns the number of each kind of bike per kind of rider, casual or member.
Here is my code:
SELECT member_casual,
COUNT(member_casual) AS member_type,
COUNT(rideable_type) AS electric_classic
FROM coursera-practice-356200.cycle_data.cycle_table1
GROUP BY member_casual
Here is my result:
| member_casual | member_type | electric_classic |
|---------------|-------------|------------------|
| casual | 21416 | 21416 |
| member | 94193 | 94193 |
I know my code is flawed, but I do not know where it is flawed. I also do not know why I got the same number when I am counting two different columns (member_casual and rideable_type), which should be very different numbers.
I guess what I am asking is for assistance on where my code is wrong and how to fix it.
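A hedged sketch of one way to get what the question describes: COUNT(column) counts non-NULL values in the group, so as long as neither column has NULLs, both counts equal the group's row count. Grouping by both columns (or using COUNTIF per bike type) separates the bike kinds per rider kind. Table and column names are copied from the question:

```sql
-- Rides broken down by rider type and bike type.
SELECT
  member_casual,
  rideable_type,
  COUNT(*) AS rides
FROM `coursera-practice-356200.cycle_data.cycle_table1`
GROUP BY member_casual, rideable_type
ORDER BY member_casual, rideable_type;
```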
YellingYeti
(11 rep)
Mar 1, 2023, 11:06 PM
• Last activity: Mar 2, 2023, 01:05 PM
0
votes
0
answers
253
views
Optimising a recursive SQL query that processes several million records in BQ
I need help optimizing a recursive SQL query in BQ.
I have hierarchical data stored in a table as parent-child relationships, i.e. a tree like the one pictured (image omitted) will be stored as
| parent_item_id | child_item_id |
| ---------------| --------------|
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
| 3 | 4 |
| 3 | 5 |
| ... | ... |
When `parent_item_id = child_item_id` it means that this is a root node. I need to produce the following resulting table from the parent-child mapping:
| root_id | parent_item_id | child_item_id | level |
| --------| ---------------| --------------| ------ |
| 1 | 1 | 1 | 0 |
| 1 | 1 | 2 | 1 |
| 1 | 1 | 3 | 1 |
| 1 | 3 | 4 | 2 |
| 1 | 3 | 5 | 2 |
| ... | ... | ... | ... |
To do that I first prepared separate tables for parents and children:
-- parents
SELECT * FROM parent_child_mapping WHERE parent_item_id = child_item_id;
-- children
SELECT * FROM parent_child_mapping WHERE parent_item_id <> child_item_id;
Then I created a table with root nodes only. By definition, a root node is a top-level node which means it can't be a child of any other node. To identify root nodes we can calculate the set difference between parents and children tables:
-- root nodes
SELECT p.* FROM parents p
LEFT JOIN children c
ON
p.parent_item_id = c.child_item_id
WHERE
c.child_item_id IS NULL
Finally, I prepared a recursive query:
SELECT
i.root_id,
i.parent_item_id,
i.child_item_id,
0 AS level
FROM roots i
UNION ALL
SELECT
t.root_id,
t.parent_item_id,
t.child_item_id,
t.level + 1
FROM children AS i
INNER JOIN objects_tree AS t
ON
t.child_item_id = i.parent_item_id
This is the final SQL query:
WITH RECURSIVE
parents AS (
SELECT * FROM parent_child_mapping WHERE parent_item_id = child_item_id
),
children AS (
SELECT * FROM parent_child_mapping WHERE parent_item_id <> child_item_id
),
roots AS (
SELECT p.* FROM parents p
LEFT JOIN children c
ON
p.parent_item_id = c.child_item_id
WHERE
c.child_item_id IS NULL
),
objects_tree AS (
SELECT
i.root_id,
i.parent_item_id,
i.child_item_id,
0 AS level
FROM roots i
UNION ALL
SELECT
t.root_id,
t.parent_item_id,
t.child_item_id,
t.level + 1
FROM children AS i
INNER JOIN objects_tree AS t
ON
t.child_item_id = i.parent_item_id
)
SELECT * FROM objects_tree
This works perfectly on a relatively small amount of data (~100K records in parent-child mapping). However, the query will take several days to run on a bigger dataset (this is shown in the execution details in BQ web UI).
The total amount of root nodes is ~45K records, whereas the children table contains ~560K entries. I assume that the issue is here:
...
FROM children AS i
INNER JOIN objects_tree AS t
ON
t.child_item_id = i.parent_item_id
Would it be possible to optimize it? Maybe there is a better way to write the SQL query? Or maybe there are some BQ built-in mechanisms?
**UPD**: I managed to grab some screenshots of the execution graph (omitted here).
Roman Dryndik
(101 rep)
Jan 24, 2023, 08:43 PM
• Last activity: Jan 25, 2023, 10:41 AM
0
votes
1
answers
211
views
Converting EAVT table into SCD type 2
After a lot of research and head picking, I'm still unable to find a good/clean solution to convert an entity-attribute-value-timestamp (EAVT) table to an SCD type 2 dimension.
Here's the issue:
I have a CRM source that stores all history in a EAVT model (Entity/Attribute/Value of the attribute/valid_from/valid_to).
So for every object (Company, product...etc) I have a table with the current state that is in a relational model, and another history table that contains all value changes to all attributes with a valid_from/valid_to column for validity of the values themselves.
I want to be able to merge these two tables into an SCD table with a Valid_To/Valid_From and a column per attribute.
To give an example:
Company has two tables:
- **Current state of the Companies:**
| company_id | name | number_of_employees | city |
| -------- | -------------- | -------------- | -------------- |
| 1 | Company 1 | 500 | Paris |
| 2 | Company 2 | 500 | Paris |
- **History Table:**
| company_id | attribute | value | valid_from | valid_to |
| -------- | -------------- | -------------- | -------------- | -------------- |
| 1 | city | New York | 01/01/2020 | 01/05/2022 |
| 1 | city | Paris | 01/05/2022 | 12/31/9999 |
| 1 | number_of_employees | 50 | 01/01/2021 | 01/01/2022 |
| 1 | number_of_employees | 100 | 01/01/2022 | 12/31/9999 |
What I want to have as a result is the following:
| company_id | name| city | number_of_employees | valid_from | valid_to | is_active |
| -------- | -------------- | -------------- | -------------- | -------------- | -------------- | -------------- |
| 1 | Company 1 | New York | null | 01/01/2020 | 01/01/2021 | false |
| 1 | Company 1 | New York | 50 | 01/01/2021 | 01/01/2022 | false |
| 1 | Company 1 | New York | 100 | 01/01/2022 | 01/05/2022 | false |
| 1 | Company 1 | Paris | 100 | 01/05/2022 | 12/31/9999 | true |
So based on this example, we have a company that started on 01/01/2020 with New York as city and number of employees wasn't populated at that time.
We then modified our company to add 50 as the number of employees, this happened on 01/01/2021.
We modified our company again on 01/01/2022 to change the number of employees to 100, only to change the city of the company from New York to Paris on 01/05/2022.
This gives us 4 states for the company, so our SCD should contain a row per state or 4 rows.
The dates should be calculated to overlap and valid_from should be set to the valid_to of the attribute that changed from the "history" table, and valid_to should be set to the valid_from of the attribute that changed from the "history" table.
To add more complexity to the task, imagine we have about 120 attributes, but also that if a company was never changed (just created and still has the same state from creation) then it won't exist in the history table. So in our example, Company 2 will not exist in the history table at all and will have to be read from the first table into the SCD (a union between the current table and the history result table). Fun, right! :)
To give you a sense of the technical environment, the CRM is hubspot, data is replicated from hubspot to BigQuery and the reporting tool is Power BI.
I have tried to use pivoting in both Power BI and BigQuery, which is the standard solution when it comes to EAV model tables, but I'm stuck at the calculation of the valid/from valid/to in the result SCD. ( example of using the pivoting here: https://dba.stackexchange.com/questions/20275/solutions-for-reporting-off-of-an-eav-structured-database )
I need one process that can be applied to multiple tables (because this example is only for company, but I have also other objects that I need to convert into SCD).
So what is the best way to convert this EAVT data into an SCD without falling into a labyrinth of hard-coding and performance issues? And how can the valid_from/valid_to be calculated dynamically?
Whether it's BigQuery or Power Query or just theoretical, any solutions, tips, ideas or plain opinions are highly appreciated, as this is the last step in the adoption of a whole data culture in the company I work for, and if I cannot make this work, well... my credibility will take a hit! So please help a fellow lost IT professional! :D
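A rough sketch of one interval-plus-pivot approach in BigQuery syntax. The `history` and `current_state` table names, the DATE typing of the validity columns, and the boundary handling are all assumptions, and only the two example attributes are pivoted to keep it short:

```sql
-- Sketch only: build one row per change date, then pivot the attribute values
-- that are valid at the start of each window.
WITH boundaries AS (
  -- every date at which any attribute of a company changed
  SELECT DISTINCT company_id, valid_from
  FROM history
),
intervals AS (
  -- turn those change dates into consecutive validity windows
  SELECT
    company_id,
    valid_from,
    IFNULL(LEAD(valid_from) OVER (PARTITION BY company_id ORDER BY valid_from),
           DATE '9999-12-31') AS valid_to
  FROM boundaries
)
SELECT
  i.company_id,
  c.name,
  MAX(IF(h.attribute = 'city', h.value, NULL)) AS city,
  MAX(IF(h.attribute = 'number_of_employees', h.value, NULL)) AS number_of_employees,
  i.valid_from,
  i.valid_to,
  i.valid_to = DATE '9999-12-31' AS is_active
FROM intervals AS i
JOIN current_state AS c
  ON c.company_id = i.company_id
LEFT JOIN history AS h
  ON  h.company_id = i.company_id
  AND h.valid_from <= i.valid_from
  AND h.valid_to > i.valid_from
GROUP BY i.company_id, c.name, i.valid_from, i.valid_to
ORDER BY i.company_id, i.valid_from;
```

Companies that never changed (like Company 2) have no rows in `history`, so they would still need to be UNIONed in from the current-state table, as the question notes; scaling to ~120 attributes would mean templating or generating the MAX(IF(...)) lines.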
Rami Ouanes
(1 rep)
May 4, 2022, 02:07 PM
• Last activity: Dec 10, 2022, 06:04 AM
0
votes
1
answers
190
views
Does truncating table refresh materialized view on BigQuery?
Just need to make sure before truncating a table: will all data on a materialized view that depends on that table be wiped too?
Also, if the materialized view is recreated (dropped and created again) after inserts happened on that table, will old rows of the source table be present in the materialized view?
Kokizzu
(1403 rep)
Oct 7, 2022, 04:30 PM
• Last activity: Oct 31, 2022, 07:12 PM
0
votes
1
answers
62
views
How to order strings in BigQuery to have lowercased characters ordered before uppercased characters?
I'm updating a reporting tool implementation so that it queries BigQuery instead of PostgreSQL. The PostgreSQL data is ingested into BigQuery. One requirement is for the report results to stay exactly the same after this update, i.e., a report query should output the exact same results when run on BigQuery as on PostgreSQL.
In particular, I need BigQuery to order strings the same way as my PostgreSQL server currently does. It's not an option to change how my PostgreSQL server orders strings.
For example, with this table:
|name|id|
|---------|--|
|activity |1 |
|Activity |2 |
|activity |3 |
|foo |4 |
and the query
SELECT name FROM table ORDER BY name, id
my PostgreSQL returns
> activity, activity, Activity, foo
while BigQuery returns
> Activity, activity, activity, foo
I've tried
SELECT name FROM table ORDER BY NORMALIZE_AND_CASEFOLD(name), id
but BigQuery returns
> activity, Activity, activity, foo
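A hedged workaround sketch, assuming the PostgreSQL collation's behaviour here amounts to "case-insensitive primary order, lowercase wins ties": sort by a lowercased key and use a case flag as the tie-breaker, since BigQuery orders FALSE before TRUE. Whether this reproduces the exact collation for all real data is worth verifying; table and column names are taken from the question.

```sql
SELECT name
FROM table
ORDER BY
  LOWER(name),          -- case-insensitive primary ordering
  name != LOWER(name),  -- FALSE (already lowercase) sorts before TRUE
  id;
```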
Florent2
(349 rep)
Oct 20, 2022, 12:22 PM
• Last activity: Oct 26, 2022, 07:49 AM
0
votes
2
answers
2195
views
How to Select The Lowest Date in one field that's higher than the date in another field
The below aim isn't actually what my query is for, but I'm using it as an analogy to explain more simply what I am trying to achieve:
I am trying to build a BigQuery script which looks at all arcade machines which have had a vandalism alarm triggered. It then identifies all of those machines which have either had no vends made since the alarm, or where the first vend was more than 28 days after the alarm was triggered, in order to identify machines where free use could have occurred after the alarm was triggered.
So far, I have highlighted all of these instances; however, where I have joined the payments table to the query, all vends made >28 days after the alarm per machine are being returned, whereas I am only interested in returning the first vend after the alarm.
I have 3 tables
alarm
| machine_num| alarm_date |
| ---------- | ---------- |
| 111 | 2022-01-20 |
| 222 | 2022-01-20 |
| 123 | 2022-01-20 |
| 456 | 2022-01-20 |
Customer
| cust_num | machine_num |
| -------- | ------------|
| 1 | 111 |
| 2 | 222 |
| 3 | 123 |
| 4 | 456 |
payments
| cust_num | vend_date |
| -------- | -------------|
| 1 | 2022-01-10 |
| 1 | 2022-01-21 |
| 1 | 2022-02-21 |
| 2 | 2022-01-11 |
| 2 | 2022-01-19 |
| 3 | 2022-01-01 |
| 3 | 2022-01-10 |
| 3 | 2022-03-01 |
| 3 | 2022-03-03 |
| 3 | 2022-03-04 |
| 4 | 2022-01-19 |
| 4 | 2022-04-20 |
| 4 | 2022-04-21 |
So in this case:
cust_num "1" wouldn't be returned, as there was a vend less than 28 days after the alarm
cust_num "2" Would be returned with the vend date as NULL since no Vends have been made since the alarm
cust_num "3" Would be returned with the vend date as "2022-03-01" since this is the first Vend after the alarm
cust_num "4" Would be returned with the vend date as "2022-04-20" since this is the first Vend after the alarm
I need to return all 3 fields, so based on the above example, my output would be
| cust_num | machine_num | alarm_date | vend_date |
| -------- | ------------| ---------- | -------------|
| 2 | 222 | 2022-01-20 | NULL |
| 3 | 123 | 2022-01-20 | 2022-03-01 |
| 4 | 456 | 2022-01-20 | 2022-04-20 |
I've tried adding a sub-query in my select statement similar to the below:
(SELECT
MIN(vend_date)
FROM payments AS paymin
WHERE paymin.vend_date > alarm_date)
AS vend_date
However, this just tends to cause BigQuery to run longer than I have the patience to wait for when I add the subquery to my existing query.
I've never asked for help on one of these sites before, so apologies if I am asking in the wrong place or in the wrong way! I'm still relatively new to BQ and work very distantly from any analysts in the business.
Any help is really appreciated!
Cheers
EDIT:
So I gave up with the sub query, it was too intensive as it was querying almost 2 million rows!
I tried using simple MIN and grouping everything together, similar to this simplified example
SELECT
payments.cust_num,
alarm.machine_num, p.PAN, alarm.alarm_date, MIN(payments.vend_date) as vend_Date
FROM alarm
LEFT JOIN customer
ON alarm.machine_num = customer.machine_num
INNER JOIN payments
ON customer.cust_num =payments.cust_num
WHERE
vend_Date > DATE_ADD(alarm.alarm_date, INTERVAL +28 DAY) OR
vend_Date IS NULL
GROUP BY payments.cust_num, alarm.machine_num, alarm.alarm_date
ORDER BY payments.cust_num
However, this isn't picking up instances where the first vend_date after the alarm_date is NULL.
It also only returns the first date that is more than 28 days after the alarm date, rather than the accounts whose first vend_date is more than 28 days after the alarm_date.
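For what it's worth, a hedged sketch of the "aggregate first, filter the aggregate afterwards" shape this usually takes (same table and column names as the question, untested against the real data): compute each customer's first vend strictly after the alarm, then keep only the NULL or >28-day cases.

```sql
-- Sketch: each customer's first vend strictly after the alarm, then keep only
-- the customers with no such vend or one more than 28 days later.
WITH first_vend AS (
  SELECT
    c.cust_num,
    a.machine_num,
    a.alarm_date,
    MIN(IF(p.vend_date > a.alarm_date, p.vend_date, NULL)) AS vend_date
  FROM alarm AS a
  JOIN customer AS c ON c.machine_num = a.machine_num
  LEFT JOIN payments AS p ON p.cust_num = c.cust_num
  GROUP BY c.cust_num, a.machine_num, a.alarm_date
)
SELECT *
FROM first_vend
WHERE vend_date IS NULL
   OR vend_date > DATE_ADD(alarm_date, INTERVAL 28 DAY)
ORDER BY cust_num;
```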
Phil Tomlinson
(1 rep)
Sep 7, 2022, 03:51 PM
• Last activity: Oct 14, 2022, 01:47 PM
0
votes
1
answers
2680
views
Counting uniques on a week by week basis
I've managed to find the unique number of session_ids within the first week of the year with this query.
How do I get each successive week's unique count up to the last week of June?
SELECT COUNT(DISTINCT session_id),
  MAX(timestamp_local)
FROM `myTable`
WHERE timestamp >= '2018-01-01'
  AND timestamp <= '2018-01-07'
Appreciate your help!
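A sketch of the usual shape, assuming `timestamp` is a TIMESTAMP column (use DATE_TRUNC instead if it is a DATE): truncate to the week and group, instead of running one query per week.

```sql
-- Sketch: distinct sessions per week for the first half of 2018, in one query.
SELECT
  TIMESTAMP_TRUNC(timestamp, WEEK) AS week_start,
  COUNT(DISTINCT session_id) AS unique_sessions
FROM `myTable`
WHERE timestamp >= '2018-01-01'
  AND timestamp < '2018-07-01'
GROUP BY week_start
ORDER BY week_start;
```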
spamdavy
(11 rep)
Jul 14, 2018, 06:52 PM
• Last activity: Oct 2, 2022, 12:04 PM
4
votes
2
answers
1374
views
How to query for certain text in log messages from Google App Engine that are exported to BigQuery?
I have used the 'Create Export' option in the Logging section of Google Cloud Platform to export log entries into BigQuery for analysis (screenshot omitted).
Here is part of the structure of the table in BigQuery that holds the log entries (screenshot omitted).
I want to search for entries that contain this user id: 14994838999176.
In BigQuery, I issue the following query (in standard SQL dialect),
SELECT * FROM gae_log.appengine_googleapis_com_request_log_20180323
where protoPayload.line.logMessage like '%14994838999176%'
LIMIT 1000
It failed with the error message:
Query Failed
Error: Cannot access field logMessage on a value with type
ARRAY<...> at [3:25]
How can I correct my sql statement?
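The error is because `protoPayload.line` is a repeated record, so its fields can't be referenced directly in the WHERE clause. A hedged sketch of the usual fix is to filter through UNNEST (table name copied from the question):

```sql
-- Sketch: keep request-log rows where any line's logMessage mentions the user id.
SELECT *
FROM `gae_log.appengine_googleapis_com_request_log_20180323` AS t
WHERE EXISTS (
  SELECT 1
  FROM UNNEST(t.protoPayload.line) AS l
  WHERE l.logMessage LIKE '%14994838999176%'
)
LIMIT 1000;
```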


Anthony Kong
(141 rep)
Mar 23, 2018, 04:28 AM
• Last activity: Sep 3, 2022, 11:27 AM
0
votes
0
answers
96
views
Generate unique ids in BigQuery data import
I am using Zapier to import WooCommerce orders into BigQuery and build a data warehouse with paid orders from multiple sites. However, each site generates an ID per site, not a globally unique id, so when orders from all the sites are imported into the same table the ids will not be unique.
I would like BigQuery to automatically assign a unique id to every new order that enters the table. Is this possible?
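BigQuery has no auto-increment column, but two common patterns are sketched below; every table and column name here is hypothetical, since the target schema isn't shown. Either generate a random UUID as rows are inserted, or derive a deterministic key from the site plus its per-site order id (which also de-duplicates re-imports):

```sql
-- Both tables and columns below are hypothetical placeholders.
-- Option 1: random, globally unique id generated at load time.
INSERT INTO `my_project.orders_dw.orders` (order_uid, site, site_order_id, total)
SELECT GENERATE_UUID(), site, site_order_id, total
FROM `my_project.orders_dw.orders_staging`;

-- Option 2: deterministic surrogate key from the site plus its local order id.
SELECT *, CONCAT(site, '-', CAST(site_order_id AS STRING)) AS order_uid
FROM `my_project.orders_dw.orders_staging`;
```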
tokyographer
(1 rep)
Aug 22, 2022, 07:26 PM
• Last activity: Aug 22, 2022, 09:09 PM
0
votes
1
answers
289
views
Oracle golden gate to BigQuery
I'm trying to set up GoldenGate to sync data to BigQuery. When I start pushing the initial load, my extractor exports all the data, and from the replicat stats I'm able to see that the records are received, but they are not pushed to BigQuery.
## Extractor
GGSCI (bhuvi-oracle) 12> info extract load1
EXTRACT LOAD1 Last Started 2020-05-07 16:52 Status STOPPED
Checkpoint Lag Not Available
Log Read Checkpoint Table HR.TEST
2020-05-07 16:52:41 Record 32768
Task SOURCEISTABLE
## Replicat stats
GGSCI (bhuvi-1) 1> stats bq
Sending STATS request to REPLICAT BQ ...
gg.handlerlist=bigquery
Start of Statistics at 2020-05-07 16:53:33.
Replicating from HR.TEST to bhuvi_orcl.test:
*** Total statistics since 2020-05-07 16:52:42 ***
Total inserts 32768.00
Total updates 0.00
Total deletes 0.00
Total discards 0.00
Total operations 32768.00
## Log file on BigData Golen Gate:
2020-05-07T16:59:04.895+0000 INFO OGG-00987 Oracle GoldenGate Command Interpreter: GGSCI command (root): start bq.
2020-05-07T16:59:04.904+0000 INFO OGG-00963 Oracle GoldenGate Manager, mgr.prm: Command received from GGSCI on host [10.18.0.61]:42130 (START REPLICAT BQ ).
2020-05-07T16:59:04.905+0000 INFO OGG-00975 Oracle GoldenGate Manager, mgr.prm: REPLICAT BQ starting.
2020-05-07T16:59:04.964+0000 INFO OGG-00995 Oracle GoldenGate Delivery, bq.prm: REPLICAT BQ starting.
2020-05-07T16:59:04.967+0000 INFO OGG-03059 Oracle GoldenGate Delivery, bq.prm: Operating system character set identified as UTF-8.
2020-05-07T16:59:04.967+0000 INFO OGG-02695 Oracle GoldenGate Delivery, bq.prm: ANSI SQL parameter syntax is used for parameter parsing.
2020-05-07T16:59:05.173+0000 INFO OGG-15052 Oracle GoldenGate Delivery, bq.prm: Using Java class path: .:ggjava/ggjava.jar:ggjava/resources/lib/optional/log4j-api-2.9.1.jar:ggjava/resources/lib/optional/log4j-core-2.9.1.jar:ggjava/resources/lib/optional/log4j-slf4j-impl-2.9.1.jar.
2020-05-07T16:59:10.243+0000 INFO OGG-01815 Oracle GoldenGate Delivery, bq.prm: Virtual Memory Facilities for: COM
anon alloc: mmap(MAP_ANON) anon free: munmap
file alloc: mmap(MAP_SHARED) file free: munmap
target directories:
/opt/ogg/dirtmp.
2020-05-07T16:59:10.243+0000 INFO OGG-00996 Oracle GoldenGate Delivery, bq.prm: REPLICAT BQ started.
2020-05-07T16:59:10.251+0000 INFO OGG-02243 Oracle GoldenGate Delivery, bq.prm: Opened trail file /opt/ogg/dirdat/oo000000 at 2020-05-07 16:59:10.249063.
2020-05-07T16:59:10.251+0000 INFO OGG-03506 Oracle GoldenGate Delivery, bq.prm: The source database character set, as determined from the trail file, is UTF-8.
2020-05-07T16:59:10.253+0000 INFO OGG-06505 Oracle GoldenGate Delivery, bq.prm: MAP resolved (entry HR.TEST): MAP "HR"."TEST", TARGET bhuvi_orcl.test.
2020-05-07T16:59:10.253+0000 INFO OGG-02756 Oracle GoldenGate Delivery, bq.prm: The definition for table HR.TEST is obtained from the trail file.
2020-05-07T16:59:10.253+0000 INFO OGG-06511 Oracle GoldenGate Delivery, bq.prm: Using following columns in default map by name: ID.
2020-05-07T16:59:10.253+0000 INFO OGG-06510 Oracle GoldenGate Delivery, bq.prm: Using the following key columns for target table bhuvi_orcl.test: ID.
TheDataGuy
(1986 rep)
May 7, 2020, 05:15 PM
• Last activity: May 6, 2022, 12:58 PM
0
votes
1
answers
841
views
Calculate Percentage with JOINS and original value
I need your help. I'm trying to calculate a percentage by dividing a count obtained from a JOIN by the original count, and I do not know how to combine both values. I am trying to find trips whose station id does not exist in the stations table, and then find the percentage of such trips.
I can only write the query to this extent:
SELECT COUNT(*)
FROM (SELECT a.start_station_id, b.station_id, a.end_station_id,a.bikeid
FROM
bigquery-public-data.new_york_citibike.citibike_trips
as a
LEFT OUTER JOIN bigquery-public-data.new_york_citibike.citibike_stations
as b
ON a.start_station_id = b.station_id
WHERE b.station_id IS NULL)
I need to divide the resulting COUNT from the query above by the COUNT(*) before the JOIN.
Kindly help me with this.
Thank you!!
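A hedged sketch of one way to get both counts and the percentage in a single pass, using conditional aggregation over the same LEFT JOIN (table names copied from the question):

```sql
-- Sketch: share of trips whose start_station_id has no match in the stations table.
SELECT
  COUNTIF(b.station_id IS NULL) AS unmatched_trips,
  COUNT(*) AS total_trips,
  ROUND(100 * COUNTIF(b.station_id IS NULL) / COUNT(*), 2) AS pct_unmatched
FROM `bigquery-public-data.new_york_citibike.citibike_trips` AS a
LEFT JOIN `bigquery-public-data.new_york_citibike.citibike_stations` AS b
  ON a.start_station_id = b.station_id;
```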
snsd94
(1 rep)
Feb 3, 2021, 04:23 PM
• Last activity: Dec 19, 2021, 04:07 PM
Showing page 1 of 20 total questions