Database Administrators
Q&A for database professionals who wish to improve their database skills
Latest Questions
0
votes
1
answers
150
views
Airflow to BigQuery data load taking forever
I'm currently working as a junior data engineer. My main job right now is to move data from a MySQL database (which gets updated every few minutes via webhooks) and send it to BigQuery as frequently as possible using Airflow, since this is our main DB for later analyzing data with Power BI.
The problem is that the bigger tables (which only have ~1000 rows) take about 2 hours to load to BQ, which makes this impossible to scale. I can't imagine what will happen in the future when the deltas alone are 10000 rows each...
This works using pandas and SQLAlchemy by extracting the data as a dataframe and using the "to_sql" method, passing all the BQ connection parameters.
I am already uploading only incrementals/deltas, so that is not the problem.
Do you have any advice? Is Airflow the right tool for this? I've been searching for solutions for weeks but couldn't find anything.
Ayrton
(1 rep)
Aug 28, 2022, 11:11 PM
• Last activity: Jul 17, 2025, 11:03 AM
0
votes
1
answers
678
views
Monthly Growth - BigQuery
I need your help again. I need to calculate the monthly growth for the trips.
The query that I have written is below:
SELECT EXTRACT(YEAR FROM DATE (stoptime)) AS Year, EXTRACT(MONTH FROM DATE (stoptime)) AS Month,
COUNT(bikeid) AS Trips,
ROUND(100* (COUNT(bikeid)- LAG(COUNT(bikeid),1,0) OVER (PARTITION BY 'Year' ORDER BY 'Year', 'Month'))/lag(COUNT(bikeid),1) OVER
(PARTITION BY 'Year' ORDER BY 'Year', 'Month')) ||'%' AS growth
FROM bigquery-public-data.new_york_citibike.citibike_trips
WHERE end_station_name LIKE '%University%'
GROUP BY Year, Month
ORDER BY Year DESC, Month DESC
However, the query above does not give me the right output (output screenshot omitted). I do not know which part of my query is wrong; kindly help me with this. My desired output example was attached as an image (year, month, trips, and a growth percentage per month). Thank you!
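For anyone comparing notes, a hedged sketch of one likely fix, not necessarily the asker's intended logic: in the original, `PARTITION BY 'Year' ORDER BY 'Year', 'Month'` uses string literals rather than the computed columns, so the LAG() offsets are taken over an effectively arbitrary ordering. Aggregating per month in a CTE and ordering the window by the real columns avoids that; whether growth should reset each year is left aside here.

```sql
-- Possible rewrite: aggregate per month first, then take LAG over the real
-- year/month columns instead of the string literals 'Year' and 'Month'.
WITH monthly AS (
  SELECT
    EXTRACT(YEAR FROM DATE(stoptime)) AS year,
    EXTRACT(MONTH FROM DATE(stoptime)) AS month,
    COUNT(bikeid) AS trips
  FROM `bigquery-public-data.new_york_citibike.citibike_trips`
  WHERE end_station_name LIKE '%University%'
  GROUP BY year, month
)
SELECT
  year,
  month,
  trips,
  ROUND(100 * (trips - LAG(trips) OVER (ORDER BY year, month))
        / LAG(trips) OVER (ORDER BY year, month), 1) AS growth_pct
FROM monthly
ORDER BY year DESC, month DESC;
```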

snsd94
(1 rep)
Feb 3, 2021, 05:07 PM
• Last activity: Jun 9, 2025, 07:08 AM
1
votes
1
answers
4924
views
Retrieving executed query list in BigQuery via SQL
Is there a table similar to v$sql in Oracle where I can retrieve data associated with a particular query that was run in BigQuery using SQL?
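For later readers: BigQuery has since added INFORMATION_SCHEMA job views, which are the closest analogue to Oracle's v$sql that I'm aware of. A minimal sketch (the region qualifier and the retention window depend on your project):

```sql
-- List recent queries run in the project, similar in spirit to Oracle's v$sql.
-- Adjust the region qualifier (e.g. region-us, region-eu) to where your jobs run.
SELECT
  creation_time,
  user_email,
  job_id,
  total_bytes_processed,
  query
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
  AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
ORDER BY creation_time DESC;
```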
kpod
(11 rep)
Jun 10, 2019, 12:30 AM
• Last activity: Jan 8, 2025, 07:03 AM
0
votes
0
answers
82
views
BigQuery can't connect to MySQL Cloud SQL after upgrade from 5.7 to 8.0
I have a similar problem to the one described here, but the suggestions there don't seem to work. When trying to execute a query I get this:
Invalid table-valued function EXTERNAL_QUERY Failed to connect to MySQL database. Error:
MysqlErrorCode(2059):
Authentication plugin 'caching_sha2_password' cannot be loaded:
/usr/lib/plugin/caching_sha2_password.so:
cannot open shared object file: No such file or directory at [1:15]
This happens when trying to query from BigQuery to a MySQL 8.0.37 Cloud SQL db. I tried with two different users; the same credentials work fine from Cloud SQL Studio. This used to work before upgrading to 8.0 from 5.7. I did change the auth type for the user to caching_sha2_password, and the db is running with caching_sha2_password as the default auth.
The Cloud SQL server log shows:
[Note] [MY-010914] [Server] Got an error reading communication packets
for each query attempt.
External connections (the mysql command line and IntelliJ's data grid) and Cloud Run JDBC connections work too; it is just the BigQuery external access that fails.
Any ideas?
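One workaround that is often suggested for this class of error, sketched below as an assumption rather than a confirmed fix: switch only the user that the EXTERNAL_QUERY connection uses back to mysql_native_password on the Cloud SQL instance, since the federated-query client may not support caching_sha2_password. The user name is hypothetical; the server default plugin can stay as it is.

```sql
-- Hedged workaround sketch, run against the Cloud SQL MySQL 8.0 instance.
-- 'bq_federation'@'%' is a placeholder: use the user configured in the
-- BigQuery external connection.
ALTER USER 'bq_federation'@'%' IDENTIFIED WITH mysql_native_password BY 'your-password';
```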
cputoaster
(101 rep)
Oct 29, 2024, 01:47 PM
• Last activity: Oct 29, 2024, 02:34 PM
0
votes
0
answers
49
views
How can a data warehouse like BigQuery parallelize queries that Postgres cannot?
In order to get more practice with database management and reporting tools, I have been working to load the recent SSN data leak into Postgres. Since it is such a large dataset, it has served as a really good learning opportunity so far.
The dataset is a csv file that is formatted slightly incorrectly. I decided the best route forward was to import each line as a string and use some basic regex to extract the columns. I ran a query in postgres to extract a single column from the dataset with my regex, and it took more than 12 hours to run on an M3 Max. I didn't let it complete, so I don't know how long it would have actually taken. I did some research and did what I could find to optimize postgres for large database modifications, including writing to a new table instead of editing in place, and removing all indexes. These operations always ran on a single thread. From what I found in my research, if the postgres query planner can run a query on multiple threads it will, but most queries cannot be parallelized in this way.
Then I tried another method for cleaning the data: BigQuery. It took some time to get the dataset into GCP Storage, but once I did, it was incredibly fast. The total time for BigQuery to load 350GB of csv, run 12 different regular expressions against each, and output that to a new csv was about 10 minutes.
Here is my question: how can BigQuery complete the same query so much faster than Postgres? What is architecturally different about a data warehouse that allows it to parallelize workflows that Postgres cannot? Why doesn't Postgres apply these lessons and parallelize its own queries?
Edit: Are there any projects that allow me to run a data warehouse like database locally? Or are they all tied to proprietary clouds like BigQuery/Snowflake/etc?
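For concreteness, the per-row transformation being compared looks roughly like the sketch below in BigQuery; the `raw_lines` table and the patterns are made up, while the real job ran a dozen such REGEXP_EXTRACT expressions per line. This kind of work is embarrassingly parallel, which is exactly what a distributed engine spreads across many workers.

```sql
-- Illustrative only: extract columns from CSV lines stored as a single STRING.
SELECT
  REGEXP_EXTRACT(raw_line, r'^([^,]*)') AS first_field,
  REGEXP_EXTRACT(raw_line, r'^[^,]*,([^,]*)') AS second_field
FROM `my_project.my_dataset.raw_lines`;
```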
Spenser
(1 rep)
Aug 21, 2024, 01:46 PM
• Last activity: Aug 21, 2024, 02:10 PM
0
votes
1
answers
38
views
Google Datastream errors on larger MySQL tables
I have set up a Datastream service, in order to replicate data from Cloud SQL (MySQL) to BigQuery.
Everything is set up correctly and the connection works. But the weird thing is that only tables < 10 MB in size are replicated without issues.
The larger tables (100+ MB) all fail. When checking the error status, it only says "Timed-out while waiting for the query to complete."
I have not found anything useful regarding this error. What approaches can I try? Backfilling a specific table gives the same error.
Source (MySQL) database connection flags are set to the recommended values:
net_read_timeout: 3600 seconds (1 hour)
net_write_timeout: 3600 seconds (1 hour)
wait_timeout: 86400 seconds (24 hours)
Spock
(101 rep)
Mar 30, 2024, 02:57 PM
• Last activity: Apr 3, 2024, 06:40 AM
-1
votes
1
answers
4693
views
How to return missing rows from LEFT JOIN in BigQuery
How can we get BigQuery to return the rows in a LEFT JOIN which exist in TABLE A but are NULL in TABLE B?
-- find missing users. return rows which exist in A but not in B
select a.user_id
from database_a a
left outer join database_b b on a.user_id = b.user_id
where b.user_id is null
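For comparison, a minimal equivalent sketch written as NOT EXISTS (same table and column names as above); in BigQuery both forms return the user_ids present in database_a with no match in database_b:

```sql
-- Anti-join expressed with NOT EXISTS instead of LEFT JOIN ... IS NULL.
SELECT a.user_id
FROM database_a AS a
WHERE NOT EXISTS (
  SELECT 1
  FROM database_b AS b
  WHERE b.user_id = a.user_id
);
```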
crazy8
(131 rep)
Aug 18, 2021, 11:53 PM
• Last activity: Jan 7, 2024, 12:04 AM
1
votes
1
answers
249
views
Aggregate SCD Type 2 data "as of" each day
## Problem
When working with SCD type 2 data it's easy to see the state of a table "as of" a given point in time by using your date columns, e.g. `valid_from` and `valid_to`. For example:

select * from table
where '2023-11-01' between valid_from and valid_to
You can then group by or use window functions to aggregate.
But what if I want to do that repeatedly, for each date in a range (eg: daily)? I do not need to aggregate *across* these dates, just within each one.
## Example
Let's say I have a table that tracks the `quantity` of people for each `reservation_id`. Both the `quantity` of people and the `reservation_status` can change over time, as tracked by the table. Each reservation is attached to an `event_id`. `event_date` is included to help limit scope if needed (see assumptions below).
| reservation_id | reservation_status | quantity | event_id | event_date | valid_from | valid_to |
| -------------- | ------------------ | -------- | -------- | ---------- | ---------- | -------- |
| 1 | active | 4 | 100 | 2023-05-25 | 2023-01-01 | 2023-01-02 |
| 2 | active | 2 | 200 | 2024-01-07 | 2023-01-01 | 2023-01-03 |
| 3 | active | 7 | 100 | 2023-05-25 | 2023-01-02 | 9999-12-31 |
| 4 | active | 1 | 200 | 2024-01-07 | 2023-01-02 | 9999-12-31 |
| 1 | active | 5 | 100 | 2023-05-25 | 2023-01-03 | 9999-12-31 |
| 5 | active | 8 | 100 | 2023-05-25 | 2023-01-03 | 9999-12-31 |
| 2 | cancelled | 2 | 200 | 2024-01-07 | 2023-01-04 | 9999-12-31 |
| 6 | active | 3 | 100 | 2023-05-25 | 2023-01-06 | 9999-12-31 |
PostgreSQL is used since BigQuery is more difficult to test – db-fiddle (link omitted). SQL:
CREATE TABLE Reservations (
"reservation_id" INTEGER,
"reservation_status" VARCHAR(9),
"quantity" INTEGER,
"event_id" INTEGER,
"event_date" DATE,
"valid_from" DATE,
"valid_to" DATE
);
INSERT INTO Reservations
("reservation_id", "reservation_status", "quantity", "event_id", "event_date", "valid_from", "valid_to")
VALUES
('1', 'active', '4', '100', '2023-05-25', '2023-01-01', '2023-01-02'),
('2', 'active', '2', '200', '2024-01-07', '2023-01-01', '2023-01-03'),
('3', 'active', '7', '100', '2023-05-25', '2023-01-02', '9999-12-31'),
('4', 'active', '1', '200', '2024-01-07', '2023-01-02', '9999-12-31'),
('1', 'active', '5', '100', '2023-05-25', '2023-01-03', '9999-12-31'),
('5', 'active', '8', '100', '2023-05-25', '2023-01-03', '9999-12-31'),
('2', 'cancelled', '2', '200', '2024-01-07', '2023-01-04', '9999-12-31'),
('6', 'active', '3', '100', '2023-05-25', '2023-01-06', '9999-12-31');
While this is ultimately for BigQuery, an answer will be accepted in any dialect as long as it's somewhat generic.
##### Assumptions
- "As of" dates can be a list or range based off `valid_from` min/max
- A `valid_to` of `9999-12-31` marks the most recent data
- All reservations for a given event will be between `event_date - INTERVAL '2 years'` and `event_date`. This doesn't change anything for this example, but maybe is useful for scaling(?)
### Desired output
I'd like to know the sum of `quantity` grouped by `event_id` and `reservation_status` *as of* each interval (day).
| as_of_date | event_id | reservation_status | sum_quantity |
| ---------- | -------- | ------------------ | ------------ |
| 2023-01-01 | 100 | active | 4 |
| 2023-01-01 | 200 | active | 2 |
| 2023-01-02 | 100 | active | 11 |
| 2023-01-02 | 200 | active | 2 |
| 2023-01-03 | 100 | active | 20 |
| 2023-01-03 | 200 | active | 3 |
| 2023-01-04 | 100 | active | 20 |
| 2023-01-04 | 200 | active | 1 |
| 2023-01-04 | 200 | cancelled | 2 |
| 2023-01-06 | 100 | active | 23 |
*Rough estimate of row values. Will be different if using complete range of dates.*
I essentially want to do the following:
/* Invalid SQL, just for conceptual purposes */
-- Given a list of dates, for each "date":
select
event_id,
reservation_status,
sum(quantity)
from table
where {{date}} between valid_from and valid_to
group by
event_id,
reservation_status
I believe this can be done using procedural language, such as a for loop, but I feel like I am overthinking that, and just having trouble combining simpler concepts.
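One loop-free shape this can take, sketched in BigQuery syntax against the sample table above; the as-of date range is an assumption lifted from the sample data, and the inclusive BETWEEN boundary handling should be checked against the real SCD conventions:

```sql
-- Sketch: aggregate SCD2 rows "as of" every day in a range without a loop.
WITH as_of_dates AS (
  SELECT d AS as_of_date
  FROM UNNEST(GENERATE_DATE_ARRAY('2023-01-01', '2023-01-06')) AS d
)
SELECT
  d.as_of_date,
  r.event_id,
  r.reservation_status,
  SUM(r.quantity) AS sum_quantity
FROM as_of_dates AS d
JOIN Reservations AS r
  ON d.as_of_date BETWEEN r.valid_from AND r.valid_to
GROUP BY d.as_of_date, r.event_id, r.reservation_status
ORDER BY d.as_of_date, r.event_id, r.reservation_status;
```

On a large table this relies on the engine handling the non-equi join well; limiting the Reservations scan by event_date (per the assumption above) would keep the join small.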
camtech
(143 rep)
Nov 28, 2023, 05:58 AM
• Last activity: Nov 29, 2023, 08:03 PM
0
votes
1
answers
59
views
Does the BigQuery API offer a way to retrieve info on scheduled queries?
Using the BigQuery C# API, I can retrieve a list of job IDs:
BigQueryClient _client = BigQueryClient.Create(...);
...
foreach (var page in _client.ListJobs(projectId).AsRawResponses())
if (page.Jobs != null) // This does happen occasionally
foreach (var job in page.Jobs)
yield return job.Id;
This seems to give me all jobs ever run (or at least within some significant time horizon; it's tens of thousands of records). Still, I'd like to get the details for some jobs and see if I'm at least on the right track. I can retrieve a `BigQueryJob` object using `BigQueryClient.GetJob()` (there's no C# doc, but the Java sample code is similar), but the information returned is very limited: current state, any errors encountered, some basic statistics, etc. There's nothing about schedules.
Is there a separate command to retrieve details on scheduled queries? I can't find any such methods in the client.
Jon of All Trades
(5987 rep)
Sep 13, 2023, 03:16 PM
1
votes
1
answers
1587
views
Using COUNT returns the same value when counting two different columns
I am new to SQL and am using it for a school project and would like some help.
I have a table with multiple columns about a fictional company that provides a bike sharing service. One of the columns I am trying to count is called member_casual, a string that is either casual or member and reports whether a user is a member or not. The second column I am trying to count is a string column called rideable_type that reports what kind of bike was used. I want to create a table that returns the number of each kind of bike per kind of rider, casual or member.
Here is my code:
SELECT member_casual,
COUNT(member_casual) AS member_type,
COUNT(rideable_type) AS electric_classic
FROM coursera-practice-356200.cycle_data.cycle_table1
GROUP BY member_casual
Here is my result:
| member_casual | member_type | electric_classic |
|---------------|-------------|------------------|
| casual | 21416 | 21416 |
| member | 94193 | 94193 |
I know my code is flawed, but I do not know where it is flawed. I also do not know why I got the same number when I am counting two different columns (member_casual and rideable_type), which should be very different numbers.
I guess what I am asking is for assistance on where my code is wrong and how to fix it.
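A hedged sketch of one way to get what the question describes: COUNT(column) counts non-NULL values in the group, so as long as neither column has NULLs, both counts equal the group's row count. Grouping by both columns (or using COUNTIF per bike type) separates the bike kinds per rider kind. Table and column names are copied from the question:

```sql
-- Rides broken down by rider type and bike type.
SELECT
  member_casual,
  rideable_type,
  COUNT(*) AS rides
FROM `coursera-practice-356200.cycle_data.cycle_table1`
GROUP BY member_casual, rideable_type
ORDER BY member_casual, rideable_type;
```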
YellingYeti
(11 rep)
Mar 1, 2023, 11:06 PM
• Last activity: Mar 2, 2023, 01:05 PM
0
votes
0
answers
253
views
Optimising a recursive SQL query that processes several million records in BQ
I need help optimizing a recursive SQL query in BQ.
I have hierarchical data stored in a table as parent-child relationships, i.e. a tree like the one pictured (image omitted) will be stored as
| parent_item_id | child_item_id |
| ---------------| --------------|
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
| 3 | 4 |
| 3 | 5 |
| ... | ... |
When `parent_item_id = child_item_id` it means that this is a root node. I need to produce the following resulting table from the parent-child mapping:
| root_id | parent_item_id | child_item_id | level |
| --------| ---------------| --------------| ------ |
| 1 | 1 | 1 | 0 |
| 1 | 1 | 2 | 1 |
| 1 | 1 | 3 | 1 |
| 1 | 3 | 4 | 2 |
| 1 | 3 | 5 | 2 |
| ... | ... | ... | ... |
To do that I first prepared separate tables for parents and children:
-- parents
SELECT * FROM parent_child_mapping WHERE parent_item_id = child_item_id;
-- children
SELECT * FROM parent_child_mapping WHERE parent_item_id <> child_item_id;
Then I created a table with root nodes only. By definition, a root node is a top-level node which means it can't be a child of any other node. To identify root nodes we can calculate the set difference between parents and children tables:
-- root nodes
SELECT p.* FROM parents p
LEFT JOIN children c
ON
p.parent_item_id = c.child_item_id
WHERE
c.child_item_id IS NULL
Finally, I prepared a recursive query:
SELECT
i.root_id,
i.parent_item_id,
i.child_item_id,
0 AS level
FROM roots i
UNION ALL
SELECT
t.root_id,
t.parent_item_id,
t.child_item_id,
t.level + 1
FROM children AS i
INNER JOIN objects_tree AS t
ON
t.child_item_id = i.parent_item_id
This is the final SQL query:
WITH RECURSIVE
parents AS (
SELECT * FROM parent_child_mapping WHERE parent_item_id = child_item_id
),
children AS (
SELECT * FROM parent_child_mapping WHERE parent_item_id <> child_item_id
),
roots AS (
SELECT p.* FROM parents p
LEFT JOIN children c
ON
p.parent_item_id = c.child_item_id
WHERE
c.child_item_id IS NULL
),
objects_tree AS (
SELECT
i.root_id,
i.parent_item_id,
i.child_item_id,
0 AS level
FROM roots i
UNION ALL
SELECT
t.root_id,
t.parent_item_id,
t.child_item_id,
t.level + 1
FROM children AS i
INNER JOIN objects_tree AS t
ON
t.child_item_id = i.parent_item_id
)
SELECT * FROM objects_tree
This works perfectly on a relatively small amount of data (~100K records in parent-child mapping). However, the query will take several days to run on a bigger dataset (this is shown in the execution details in BQ web UI).
The total amount of root nodes is ~45K records, whereas the children table contains ~560K entries. I assume that the issue is here:
...
FROM children AS i
INNER JOIN objects_tree AS t
ON
t.child_item_id = i.parent_item_id
Would it be possible to optimize it? Maybe there is a better way to write the SQL query? Or maybe there are some BQ built-in mechanisms?
**UPD**: I managed to grab some screenshots of the execution graph (omitted here).
Roman Dryndik
(101 rep)
Jan 24, 2023, 08:43 PM
• Last activity: Jan 25, 2023, 10:41 AM
0
votes
1
answers
211
views
Converting EAVT table into SCD type 2
After a lot of research and head picking, I'm still unable to find a good/clean solution to convert an entity-attribute-value-timestamp (EAVT) table to an SCD type 2 dimension.
Here's the issue:
I have a CRM source that stores all history in a EAVT model (Entity/Attribute/Value of the attribute/valid_from/valid_to).
So for every object (Company, product...etc) I have a table with the current state that is in a relational model, and another history table that contains all value changes to all attributes with a valid_from/valid_to column for validity of the values themselves.
I want to be able to merge these two tables into an SCD table with a Valid_To/Valid_From and a column per attribute.
To give an example:
Company has two tables:
- **Current state of the Companies:**
| company_id | name | number_of_employees | city |
| -------- | -------------- | -------------- | -------------- |
| 1 | Company 1 | 500 | Paris |
| 2 | Company 2 | 500 | Paris |
- **History Table:**
| company_id | attribute | value | valid_from | valid_to |
| -------- | -------------- | -------------- | -------------- | -------------- |
| 1 | city | New York | 01/01/2020 | 01/05/2022 |
| 1 | city | Paris | 01/05/2022 | 12/31/9999 |
| 1 | number_of_employees | 50 | 01/01/2021 | 01/01/2022 |
| 1 | number_of_employees | 100 | 01/01/2022 | 12/31/9999 |
What I want to have as a result is the following:
| company_id | name| city | number_of_employees | valid_from | valid_to | is_active |
| -------- | -------------- | -------------- | -------------- | -------------- | -------------- | -------------- |
| 1 | Company 1 | New York | null | 01/01/2020 | 01/01/2021 | false |
| 1 | Company 1 | New York | 50 | 01/01/2021 | 01/01/2022 | false |
| 1 | Company 1 | New York | 100 | 01/01/2022 | 01/05/2022 | false |
| 1 | Company 1 | Paris | 100 | 01/05/2022 | 12/31/9999 | true |
So based on this example, we have a company that started on 01/01/2020 with New York as city and number of employees wasn't populated at that time.
We then modified our company to add 50 as the number of employees, this happened on 01/01/2021.
We modified our company again on 01/01/2022 to change the number of employees to 100, only to change the city of the company from New York to Paris on 01/05/2022.
This gives us 4 states for the company, so our SCD should contain a row per state or 4 rows.
The dates should be calculated to overlap and valid_from should be set to the valid_to of the attribute that changed from the "history" table, and valid_to should be set to the valid_from of the attribute that changed from the "history" table.
To add more complexity to the task, imagine we have about 120 attributes, but also that if a company was never changed (just created and still has the same state from creation) then it won't exist in the history table. So in our example, Company 2 will not exist in the history table at all and will have to be read from the first table into the SCD (a union between the current table and the history result table). Fun, right! :)
To give you a sense of the technical environment, the CRM is hubspot, data is replicated from hubspot to BigQuery and the reporting tool is Power BI.
I have tried to use pivoting in both Power BI and BigQuery, which is the standard solution when it comes to EAV model tables, but I'm stuck at the calculation of the valid/from valid/to in the result SCD. ( example of using the pivoting here: https://dba.stackexchange.com/questions/20275/solutions-for-reporting-off-of-an-eav-structured-database )
I need one process that can be applied to multiple tables (because this example is only for company, but I have also other objects that I need to convert into SCD).
So what is the best way to convert this EAVT data into an SCD without falling into a labyrinth of hard-coding and performance issues? And how can the valid_from/valid_to be calculated dynamically?
Whether it's BigQuery or Power Query or just theoretical, any solutions, tips, ideas or plain opinions are highly appreciated, as this is the last step in the adoption of a whole data culture in the company I work for, and if I cannot make this work, well... my credibility will take a hit! So please help a fellow lost IT professional! :D
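A rough sketch of one interval-plus-pivot approach in BigQuery syntax. The `history` and `current_state` table names, the DATE typing of the validity columns, and the boundary handling are all assumptions, and only the two example attributes are pivoted to keep it short:

```sql
-- Sketch only: build one row per change date, then pivot the attribute values
-- that are valid at the start of each window.
WITH boundaries AS (
  -- every date at which any attribute of a company changed
  SELECT DISTINCT company_id, valid_from
  FROM history
),
intervals AS (
  -- turn those change dates into consecutive validity windows
  SELECT
    company_id,
    valid_from,
    IFNULL(LEAD(valid_from) OVER (PARTITION BY company_id ORDER BY valid_from),
           DATE '9999-12-31') AS valid_to
  FROM boundaries
)
SELECT
  i.company_id,
  c.name,
  MAX(IF(h.attribute = 'city', h.value, NULL)) AS city,
  MAX(IF(h.attribute = 'number_of_employees', h.value, NULL)) AS number_of_employees,
  i.valid_from,
  i.valid_to,
  i.valid_to = DATE '9999-12-31' AS is_active
FROM intervals AS i
JOIN current_state AS c
  ON c.company_id = i.company_id
LEFT JOIN history AS h
  ON  h.company_id = i.company_id
  AND h.valid_from <= i.valid_from
  AND h.valid_to > i.valid_from
GROUP BY i.company_id, c.name, i.valid_from, i.valid_to
ORDER BY i.company_id, i.valid_from;
```

Companies that never changed (like Company 2) have no rows in `history`, so they would still need to be UNIONed in from the current-state table, as the question notes; scaling to ~120 attributes would mean templating or generating the MAX(IF(...)) lines.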
Rami Ouanes
(1 rep)
May 4, 2022, 02:07 PM
• Last activity: Dec 10, 2022, 06:04 AM
0
votes
1
answers
190
views
Does truncating table refresh materialized view on BigQuery?
Just need to make sure before truncating a table: will all data on a materialized view that depends on that table be wiped too?
Also, if the materialized view is recreated (dropped and created again) after inserts happened on that table, will old rows of the source table be present in the materialized view?
Kokizzu
(1403 rep)
Oct 7, 2022, 04:30 PM
• Last activity: Oct 31, 2022, 07:12 PM
0
votes
1
answers
62
views
How to order strings in BigQuery to have lowercased characters ordered before uppercased characters?
I'm updating a reporting tool implementation so that it queries BigQuery instead of PostgreSQL. The PostgreSQL data is ingested into BigQuery. One requirement is for the report results to stay exactly the same after this update, i.e., a report query should output the exact same results when run on BigQuery as on PostgreSQL.
In particular, I need BigQuery to order strings the same way as my PostgreSQL server currently does. It's not an option to change how my PostgreSQL server orders strings.
For example, with this table:
|name|id|
|---------|--|
|activity |1 |
|Activity |2 |
|activity |3 |
|foo |4 |
and the query
SELECT name FROM table ORDER BY name, id
my PostgreSQL returns
> activity, activity, Activity, foo
while BigQuery returns
> Activity, activity, activity, foo
I've tried
SELECT name FROM table ORDER BY NORMALIZE_AND_CASEFOLD(name), id
but BigQuery returns
> activity, Activity, activity, foo
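A hedged workaround sketch, assuming the PostgreSQL collation's behaviour here amounts to "case-insensitive primary order, lowercase wins ties": sort by a lowercased key and use a case flag as the tie-breaker, since BigQuery orders FALSE before TRUE. Whether this reproduces the exact collation for all real data is worth verifying; table and column names are taken from the question.

```sql
SELECT name
FROM table
ORDER BY
  LOWER(name),          -- case-insensitive primary ordering
  name != LOWER(name),  -- FALSE (already lowercase) sorts before TRUE
  id;
```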
Florent2
(349 rep)
Oct 20, 2022, 12:22 PM
• Last activity: Oct 26, 2022, 07:49 AM
0
votes
2
answers
2195
views
How to Select The Lowest Date in one field that's higher than the date in another field
The below aim isn't actually what my query is for, but I'm using it as an analogy to explain more simply what I am trying to achieve:
I am trying to build a BigQuery script which looks at all arcade machines which have had a vandalism alarm triggered. It then identifies all of those machines which have either had no vends made since the alarm, or where the first vend was more than 28 days after the alarm was triggered, in order to identify machines where free use could have occurred after the alarm was triggered.
So far, I have highlighted all of these instances; however, where I have joined the payments table to the query, all vends made >28 days after the alarm per machine are being returned, whereas I am only interested in returning the first vend after the alarm.
I have 3 tables
alarm
| machine_num| alarm_date |
| ---------- | ---------- |
| 111 | 2022-01-20 |
| 222 | 2022-01-20 |
| 123 | 2022-01-20 |
| 456 | 2022-01-20 |
Customer
| cust_num | machine_num |
| -------- | ------------|
| 1 | 111 |
| 2 | 222 |
| 3 | 123 |
| 4 | 456 |
payments
| cust_num | vend_date |
| -------- | -------------|
| 1 | 2022-01-10 |
| 1 | 2022-01-21 |
| 1 | 2022-02-21 |
| 2 | 2022-01-11 |
| 2 | 2022-01-19 |
| 3 | 2022-01-01 |
| 3 | 2022-01-10 |
| 3 | 2022-03-01 |
| 3 | 2022-03-03 |
| 3 | 2022-03-04 |
| 4 | 2022-01-19 |
| 4 | 2022-04-20 |
| 4 | 2022-04-21 |
So in this case:
cust_num "1" wouldn't be returned, as there was a vend less than 28 days after the alarm
cust_num "2" Would be returned with the vend date as NULL since no Vends have been made since the alarm
cust_num "3" Would be returned with the vend date as "2022-03-01" since this is the first Vend after the alarm
cust_num "4" Would be returned with the vend date as "2022-04-20" since this is the first Vend after the alarm
I need to return all 3 fields, so based on the above example, my output would be
| cust_num | machine_num | alarm_date | vend_date |
| -------- | ------------| ---------- | -------------|
| 2 | 222 | 2022-01-20 | NULL |
| 3 | 123 | 2022-01-20 | 2022-03-01 |
| 4 | 456 | 2022-01-20 | 2022-04-20 |
I've tried adding a sub-query in my select statement similar to the below:
(SELECT
MIN(vend_date)
FROM payments AS paymin
WHERE paymin.vend_date > alarm_date)
AS vend_date
However, this just tends to cause BigQuery to run longer than I have the patience to wait for when I add the subquery to my existing query.
I've never asked for help on one of these sites before, so apologies if I am asking in the wrong place or in the wrong way! I'm still relatively new to BQ and work very distantly from any analysts in the business.
Any help is really appreciated!
Cheers
EDIT:
So I gave up with the sub query, it was too intensive as it was querying almost 2 million rows!
I tried using simple MIN and grouping everything together, similar to this simplified example
SELECT
payments.cust_num,
alarm.machine_num, p.PAN, alarm.alarm_date, MIN(payments.vend_date) as vend_Date
FROM alarm
LEFT JOIN customer
ON alarm.machine_num = customer.machine_num
INNER JOIN payments
ON customer.cust_num =payments.cust_num
WHERE
vend_Date > DATE_ADD(alarm.alarm_date, INTERVAL +28 DAY) OR
vend_Date IS NULL
GROUP BY payments.cust_num, alarm.machine_num, alarm.alarm_date
ORDER BY payments.cust_num
However, this isn't picking up instances where the first vend_date after the alarm_date is NULL.
It also only returns the first date that is more than 28 days after the alarm date, rather than the accounts whose first vend_date is more than 28 days after the alarm_date.
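For what it's worth, a hedged sketch of the "aggregate first, filter the aggregate afterwards" shape this usually takes (same table and column names as the question, untested against the real data): compute each customer's first vend strictly after the alarm, then keep only the NULL or >28-day cases.

```sql
-- Sketch: each customer's first vend strictly after the alarm, then keep only
-- the customers with no such vend or one more than 28 days later.
WITH first_vend AS (
  SELECT
    c.cust_num,
    a.machine_num,
    a.alarm_date,
    MIN(IF(p.vend_date > a.alarm_date, p.vend_date, NULL)) AS vend_date
  FROM alarm AS a
  JOIN customer AS c ON c.machine_num = a.machine_num
  LEFT JOIN payments AS p ON p.cust_num = c.cust_num
  GROUP BY c.cust_num, a.machine_num, a.alarm_date
)
SELECT *
FROM first_vend
WHERE vend_date IS NULL
   OR vend_date > DATE_ADD(alarm_date, INTERVAL 28 DAY)
ORDER BY cust_num;
```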
Phil Tomlinson
(1 rep)
Sep 7, 2022, 03:51 PM
• Last activity: Oct 14, 2022, 01:47 PM
0
votes
1
answers
2680
views
Counting uniques on a week by week basis
I've managed to find the unique number of session_ids within the first week of the year with this query.
How do I get each successive week's unique count up to the last week of June?
SELECT COUNT(DISTINCT session_id),
  MAX(timestamp_local)
FROM `myTable`
WHERE timestamp >= '2018-01-01'
  AND timestamp <= '2018-01-07'
Appreciate your help!
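A sketch of the usual shape, assuming `timestamp` is a TIMESTAMP column (use DATE_TRUNC instead if it is a DATE): truncate to the week and group, instead of running one query per week.

```sql
-- Sketch: distinct sessions per week for the first half of 2018, in one query.
SELECT
  TIMESTAMP_TRUNC(timestamp, WEEK) AS week_start,
  COUNT(DISTINCT session_id) AS unique_sessions
FROM `myTable`
WHERE timestamp >= '2018-01-01'
  AND timestamp < '2018-07-01'
GROUP BY week_start
ORDER BY week_start;
```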
spamdavy
(11 rep)
Jul 14, 2018, 06:52 PM
• Last activity: Oct 2, 2022, 12:04 PM
4
votes
2
answers
1374
views
How to query for certain text in log messages from Google App Engine that are exported to BigQuery?
I have used the 'Create Export' option in the Logging section of Google Cloud Platform to export log entries into BigQuery for analysis (screenshot omitted).
Here is part of the structure of the table in BigQuery that holds the log entries (screenshot omitted).
I want to search for entries that contain this user id: 14994838999176.
In BigQuery, I issue the following query (in standard SQL dialect),
SELECT * FROM gae_log.appengine_googleapis_com_request_log_20180323
where protoPayload.line.logMessage like '%14994838999176%'
LIMIT 1000
It failed with the error message:
Query Failed
Error: Cannot access field logMessage on a value with type
ARRAY<...> at [3:25]
How can I correct my sql statement?
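The error is because `protoPayload.line` is a repeated record, so its fields can't be referenced directly in the WHERE clause. A hedged sketch of the usual fix is to filter through UNNEST (table name copied from the question):

```sql
-- Sketch: keep request-log rows where any line's logMessage mentions the user id.
SELECT *
FROM `gae_log.appengine_googleapis_com_request_log_20180323` AS t
WHERE EXISTS (
  SELECT 1
  FROM UNNEST(t.protoPayload.line) AS l
  WHERE l.logMessage LIKE '%14994838999176%'
)
LIMIT 1000;
```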


Anthony Kong
(141 rep)
Mar 23, 2018, 04:28 AM
• Last activity: Sep 3, 2022, 11:27 AM
0
votes
0
answers
96
views
Generate unique ids in BigQuery data import
I am using Zapier to import WooCommerce orders into BigQuery and build a data warehouse with paid orders from multiple sites. However, each site generates an ID per site, not a globally unique id, so when orders from all the sites are imported into the same table the ids will not be unique.
I would like BigQuery to automatically assign a unique id to every new order that enters the table. Is this possible?
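BigQuery has no auto-increment column, but two common patterns are sketched below; every table and column name here is hypothetical, since the target schema isn't shown. Either generate a random UUID as rows are inserted, or derive a deterministic key from the site plus its per-site order id (which also de-duplicates re-imports):

```sql
-- Both tables and columns below are hypothetical placeholders.
-- Option 1: random, globally unique id generated at load time.
INSERT INTO `my_project.orders_dw.orders` (order_uid, site, site_order_id, total)
SELECT GENERATE_UUID(), site, site_order_id, total
FROM `my_project.orders_dw.orders_staging`;

-- Option 2: deterministic surrogate key from the site plus its local order id.
SELECT *, CONCAT(site, '-', CAST(site_order_id AS STRING)) AS order_uid
FROM `my_project.orders_dw.orders_staging`;
```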
tokyographer
(1 rep)
Aug 22, 2022, 07:26 PM
• Last activity: Aug 22, 2022, 09:09 PM
0
votes
1
answers
289
views
Oracle golden gate to BigQuery
I'm trying to set up GoldenGate to sync data to BigQuery. When I start pushing the initial load, my extractor exports all the data, and from the replicat stats I'm able to see that the records are received, but they are not pushed to BigQuery.
## Extractor
GGSCI (bhuvi-oracle) 12> info extract load1
EXTRACT LOAD1 Last Started 2020-05-07 16:52 Status STOPPED
Checkpoint Lag Not Available
Log Read Checkpoint Table HR.TEST
2020-05-07 16:52:41 Record 32768
Task SOURCEISTABLE
## Replicat stats
GGSCI (bhuvi-1) 1> stats bq
Sending STATS request to REPLICAT BQ ...
gg.handlerlist=bigquery
Start of Statistics at 2020-05-07 16:53:33.
Replicating from HR.TEST to bhuvi_orcl.test:
*** Total statistics since 2020-05-07 16:52:42 ***
Total inserts 32768.00
Total updates 0.00
Total deletes 0.00
Total discards 0.00
Total operations 32768.00
## Log file on BigData Golen Gate:
2020-05-07T16:59:04.895+0000 INFO OGG-00987 Oracle GoldenGate Command Interpreter: GGSCI command (root): start bq.
2020-05-07T16:59:04.904+0000 INFO OGG-00963 Oracle GoldenGate Manager, mgr.prm: Command received from GGSCI on host [10.18.0.61]:42130 (START REPLICAT BQ ).
2020-05-07T16:59:04.905+0000 INFO OGG-00975 Oracle GoldenGate Manager, mgr.prm: REPLICAT BQ starting.
2020-05-07T16:59:04.964+0000 INFO OGG-00995 Oracle GoldenGate Delivery, bq.prm: REPLICAT BQ starting.
2020-05-07T16:59:04.967+0000 INFO OGG-03059 Oracle GoldenGate Delivery, bq.prm: Operating system character set identified as UTF-8.
2020-05-07T16:59:04.967+0000 INFO OGG-02695 Oracle GoldenGate Delivery, bq.prm: ANSI SQL parameter syntax is used for parameter parsing.
2020-05-07T16:59:05.173+0000 INFO OGG-15052 Oracle GoldenGate Delivery, bq.prm: Using Java class path: .:ggjava/ggjava.jar:ggjava/resources/lib/optional/log4j-api-2.9.1.jar:ggjava/resources/lib/optional/log4j-core-2.9.1.jar:ggjava/resources/lib/optional/log4j-slf4j-impl-2.9.1.jar.
2020-05-07T16:59:10.243+0000 INFO OGG-01815 Oracle GoldenGate Delivery, bq.prm: Virtual Memory Facilities for: COM
anon alloc: mmap(MAP_ANON) anon free: munmap
file alloc: mmap(MAP_SHARED) file free: munmap
target directories:
/opt/ogg/dirtmp.
2020-05-07T16:59:10.243+0000 INFO OGG-00996 Oracle GoldenGate Delivery, bq.prm: REPLICAT BQ started.
2020-05-07T16:59:10.251+0000 INFO OGG-02243 Oracle GoldenGate Delivery, bq.prm: Opened trail file /opt/ogg/dirdat/oo000000 at 2020-05-07 16:59:10.249063.
2020-05-07T16:59:10.251+0000 INFO OGG-03506 Oracle GoldenGate Delivery, bq.prm: The source database character set, as determined from the trail file, is UTF-8.
2020-05-07T16:59:10.253+0000 INFO OGG-06505 Oracle GoldenGate Delivery, bq.prm: MAP resolved (entry HR.TEST): MAP "HR"."TEST", TARGET bhuvi_orcl.test.
2020-05-07T16:59:10.253+0000 INFO OGG-02756 Oracle GoldenGate Delivery, bq.prm: The definition for table HR.TEST is obtained from the trail file.
2020-05-07T16:59:10.253+0000 INFO OGG-06511 Oracle GoldenGate Delivery, bq.prm: Using following columns in default map by name: ID.
2020-05-07T16:59:10.253+0000 INFO OGG-06510 Oracle GoldenGate Delivery, bq.prm: Using the following key columns for target table bhuvi_orcl.test: ID.
TheDataGuy
(1986 rep)
May 7, 2020, 05:15 PM
• Last activity: May 6, 2022, 12:58 PM
0
votes
1
answers
841
views
Calculate Percentage with JOINS and original value
I need your help. I'm trying to calculate a percentage by dividing a count obtained from a JOIN by the original count, and I do not know how to combine both values. I am trying to find trips whose station id does not exist in the stations table, and then find the percentage of such trips.
I can only write the query to this extent:
SELECT COUNT(*)
FROM (SELECT a.start_station_id, b.station_id, a.end_station_id,a.bikeid
FROM
bigquery-public-data.new_york_citibike.citibike_trips
as a
LEFT OUTER JOIN bigquery-public-data.new_york_citibike.citibike_stations
as b
ON a.start_station_id = b.station_id
WHERE b.station_id IS NULL)
I need to divide the resulting COUNT from the query above by the COUNT(*) before the JOIN.
Kindly help me with this.
Thank you!!
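A hedged sketch of one way to get both counts and the percentage in a single pass, using conditional aggregation over the same LEFT JOIN (table names copied from the question):

```sql
-- Sketch: share of trips whose start_station_id has no match in the stations table.
SELECT
  COUNTIF(b.station_id IS NULL) AS unmatched_trips,
  COUNT(*) AS total_trips,
  ROUND(100 * COUNTIF(b.station_id IS NULL) / COUNT(*), 2) AS pct_unmatched
FROM `bigquery-public-data.new_york_citibike.citibike_trips` AS a
LEFT JOIN `bigquery-public-data.new_york_citibike.citibike_stations` AS b
  ON a.start_station_id = b.station_id;
```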
snsd94
(1 rep)
Feb 3, 2021, 04:23 PM
• Last activity: Dec 19, 2021, 04:07 PM
Showing page 1 of 20 total questions