
Database Administrators

Q&A for database professionals who wish to improve their database skills

Latest Questions

0 votes
1 answer
18 views
How to count the number of campaigns per day based on the start and end dates of the campaigns
I need to count the number of campaigns per day based on the start and end dates of the campaigns. Columns: Campaign Name, Start Date, End Date. How do I write the SQL command in Databricks?
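One common approach (a sketch, not the only way): expand each campaign's start–end range into one row per day, then count rows per day. In Databricks SQL the expansion is typically done with `explode(sequence(Start_Date, End_Date))` followed by a `GROUP BY` on the generated day column; the same logic in plain Python, with made-up campaign data:

```python
from collections import Counter
from datetime import date, timedelta

# Expand each campaign's [start, end] range into one row per day, then count.
# (Databricks SQL equivalent: explode(sequence(start_date, end_date)).)
campaigns = [
    ("Spring Sale", date(2025, 1, 1), date(2025, 1, 3)),
    ("New Year",    date(2025, 1, 2), date(2025, 1, 4)),
]

per_day = Counter()
for name, start, end in campaigns:
    d = start
    while d <= end:              # inclusive of both start and end dates
        per_day[d] += 1
        d += timedelta(days=1)

print(sorted(per_day.items()))
```

With these sample rows, Jan 2 and Jan 3 each fall inside both campaigns, so they count 2, while Jan 1 and Jan 4 count 1.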
Level11Data (11 rep)
Jan 7, 2025, 07:20 PM • Last activity: Jan 7, 2025, 07:21 PM
0 votes
1 answer
69 views
Databricks SQL warehouse is failing to launch saying it "cannot fetch secrets", what is going on?
I have a Databricks SQL warehouse. When I try to start it, I get the following error:

> Clusters are failing to launch. Cluster launch will be retried.
>
> Details for the latest failure: Error: Cannot fetch secrets referred in the Spark configuration. Please check that the secrets exists and the cluster's owner has read permissions. Type: CLIENT_ERROR Code: INVALID_ARGUMENT

I am not sure what's wrong; can somebody explain?
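For context on what the error is checking: Spark configurations reference secrets with the `{{secrets/<scope>/<key>}}` syntax, and the warehouse refuses to start if a referenced scope or key doesn't exist or the owner lacks read permission on the scope. A small sketch of pulling those references out of a config for inspection (the config key and values here are made up):

```python
import re

# Spark configs reference Databricks secrets as {{secrets/<scope>/<key>}}.
# Example config entry (key and values are hypothetical):
spark_conf = {
    "spark.hadoop.fs.azure.account.key.mystore": "{{secrets/storage-scope/account-key}}",
}

secret_ref = re.compile(r"\{\{secrets/([^/]+)/([^}]+)\}\}")
references = [
    (scope, key)
    for value in spark_conf.values()
    for scope, key in secret_ref.findall(value)
]
# Each (scope, key) pair must exist, and the warehouse owner needs READ on the scope.
print(references)
```

Checking each extracted scope/key against the workspace's secret scopes (and their ACLs) is the usual way to track down which reference is broken.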
Kyle Hale (216 rep)
Jan 7, 2025, 06:42 PM
0 votes
0 answers
34 views
Connect to Create a New Unity Catalog using a onprem postgres database connect
1. I have Databricks on the Azure platform with admin access.
2. I have a serverless SQL warehouse where I have imported some CSV data into a catalog.
3. Now I need to access Postgres data on an on-prem Linux box.
4. I need to connect to this DB from Databricks' Add Connection to create a new catalog.
5. I would like to use Databricks Genie to access the tables added from the Postgres DB into the catalog.

How do I proceed now?
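Assuming the goal is Lakehouse Federation (querying the on-prem Postgres through a foreign catalog in Unity Catalog), the DDL has roughly the following shape; the host, secret scope, and all names below are placeholders, and the on-prem host must be network-reachable from Databricks (e.g. via a gateway or private connectivity):

```python
# Sketch of the Lakehouse Federation DDL, held as strings to run via spark.sql
# in a notebook. All names, the host, and the secret scope are placeholders.
create_connection = """
CREATE CONNECTION pg_onprem TYPE postgresql
OPTIONS (
  host 'onprem-host.example.com',
  port '5432',
  user secret('pg-scope', 'pg-user'),
  password secret('pg-scope', 'pg-password')
)
"""

# A foreign catalog then mirrors one Postgres database into Unity Catalog:
create_catalog = """
CREATE FOREIGN CATALOG pg_catalog
USING CONNECTION pg_onprem
OPTIONS (database 'mydb')
"""
# In a notebook: spark.sql(create_connection); spark.sql(create_catalog)
```

Once the foreign catalog exists, its tables appear under Unity Catalog like any other, which is the prerequisite for pointing Genie (or any SQL tooling) at them.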
malcolm richard (1 rep)
Dec 19, 2024, 11:15 AM • Last activity: Dec 19, 2024, 06:57 PM
1 vote
0 answers
210 views
Writing large dataset from spark dataframe
We have an Azure Databricks job that retrieves a large dataset with PySpark. The DataFrame has about 11 billion rows. We are currently writing this out to a PostgreSQL DB (also in Azure), using the JDBC connector to write rows out in batches to the existing table (batch size 10,000,000). The table has a handful of indexes on it, so inserts take a while. It takes dozens of hours to complete this operation (assuming it finishes successfully at all). I feel like it would make more sense to use COPY to load the data into the database, but I don't see any well-established patterns for doing that in Databricks. I don't have a ton of Spark or Databricks experience, so any tips are appreciated.
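One pattern (a sketch, not a tested pipeline): stage each partition's rows as CSV and stream them into Postgres with `COPY ... FROM STDIN`, e.g. via psycopg2's `copy_expert`, instead of batched JDBC inserts. The CSV-staging half in plain Python, with the database call left as a comment since no DB is available here; the table and column names are hypothetical:

```python
import csv
import io

# Stage rows as CSV in memory, then stream them into Postgres with COPY.
# Table and column names below are hypothetical.
rows = [(1, "alpha"), (2, "beta")]

buf = io.StringIO()
csv.writer(buf).writerows(rows)
buf.seek(0)

copy_sql = "COPY target_table (id, label) FROM STDIN WITH (FORMAT csv)"
# With psycopg2 (not run here):
#   with conn.cursor() as cur:
#       cur.copy_expert(copy_sql, buf)
#   conn.commit()
print(buf.getvalue())
```

At 11 billion rows, dropping the indexes before the bulk load and recreating them afterwards is likely to matter at least as much as the COPY-vs-INSERT choice, since every insert otherwise maintains each index incrementally.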
Kyle Chamberlin (13 rep)
Feb 16, 2024, 12:57 AM
1 vote
1 answer
433 views
DESCRIBE TABLE in databricks piped into dataframe
Does anyone know of a method to pipe the "DESCRIBE TABLE" output in Databricks into a dataframe (or another usable format which could be used for further analysis/computation)?
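No piping should be needed: `spark.sql(...)` already returns a DataFrame, including for DESCRIBE statements. A minimal sketch (the table name is assumed, and `spark` is the session a Databricks notebook provides):

```python
# In a Databricks notebook, spark.sql returns a DataFrame directly:
describe_sql = "DESCRIBE TABLE EXTENDED my_schema.my_table"
# df = spark.sql(describe_sql)    # DataFrame with col_name / data_type / comment
# df.filter("col_name = 'Location'").show()   # e.g. pull one property out
# pdf = df.toPandas()             # or convert for further analysis
print(describe_sql)
```

For just the column portion of that output, `spark.catalog.listColumns(...)` covers the same ground through the catalog API.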
Doc (121 rep)
Dec 7, 2021, 02:04 PM • Last activity: Feb 15, 2024, 03:05 AM
0 votes
1 answer
21 views
Next Business Date Column
I have a dataset that looks like this: [dataset sample image] Where `business_day` indicates whether the `transaction_created_date` is a business day or not. I'm trying to sum the `line_amount` so that values that occurred over a holiday or weekend get added to the next business day, to look something like this: [expected output image] Essentially, if I can capture the next business day where `business_day = 0`, then I can just do a sum over partition.
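The "next business day" can be derived from the flag alone by rolling each non-business date forward to the first following date flagged as a business day, then grouping. A sketch in plain Python using the column names from the question (the sample values are made up):

```python
from collections import defaultdict
from datetime import date, timedelta

# (transaction_created_date, business_day flag, line_amount); values made up.
rows = [
    (date(2024, 1, 5), 1, 100.0),   # Friday, business day
    (date(2024, 1, 6), 0, 20.0),    # Saturday
    (date(2024, 1, 7), 0, 30.0),    # Sunday
    (date(2024, 1, 8), 1, 50.0),    # Monday, business day
]
business_days = {d for d, flag, _ in rows if flag == 1}

def next_business_day(d):
    # Roll forward until we hit a date flagged as a business day.
    while d not in business_days:
        d += timedelta(days=1)
    return d

totals = defaultdict(float)
for d, flag, amount in rows:
    totals[d if flag == 1 else next_business_day(d)] += amount

print(sorted(totals.items()))
```

In SQL the same rollup is typically a `MIN(transaction_created_date)` over a window restricted to `business_day = 1` rows on or after each date, followed by a `SUM ... GROUP BY` on that derived column.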
Lena Zheng (3 rep)
Jan 10, 2024, 12:28 AM • Last activity: Jan 10, 2024, 08:40 AM
-1 votes
1 answer
50 views
How to create "On this day in history" query
I'm using Databricks and I have a table with a list of events from various years. I want to return the event most recent to today's date from each year. For example, today's date is 6th May and my table is thus:

|Year (int)|Date (date)|Event (str)|
|----------|-----------|-----------|
|2021|2021-08-04|Ate apple|
|2021|2021-04-16|Flew plane|
|2020|2020-10-11|Swam 100 miles|
|2020|2020-03-07|Did backflip|
|2020|2020-01-01|Tidied room|
|2019|2019-09-30|Found 10 pence|
|2018|2018-02-22|Lost 10 pence|

So I would want to return:

**On this day in history your most recent achievements were:**

|Year|Date|Event|
|----|----|-----|
|2021|2021-04-16|Flew plane|
|2020|2020-03-07|Did backflip|
|2018|2018-02-22|Lost 10 pence|

Is there a neat way of doing this? ...and by neat I mean without creating extra columns or tables, i.e. by comparing CURRENT_DATE to my Date field.
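One way, sketched in plain Python with the table from the question: keep, per year, the latest event whose (month, day) falls on or before today's (month, day) — a SQL version would compare `(MONTH(Date), DAY(Date))` against `CURRENT_DATE` the same way and take the max date per year:

```python
from datetime import date

# Events table from the question.
events = [
    (2021, date(2021, 8, 4),   "Ate apple"),
    (2021, date(2021, 4, 16),  "Flew plane"),
    (2020, date(2020, 10, 11), "Swam 100 miles"),
    (2020, date(2020, 3, 7),   "Did backflip"),
    (2020, date(2020, 1, 1),   "Tidied room"),
    (2019, date(2019, 9, 30),  "Found 10 pence"),
    (2018, date(2018, 2, 22),  "Lost 10 pence"),
]

today = (5, 6)  # 6th May, as in the example; normally derive from date.today()
latest = {}
for year, d, event in events:
    if (d.month, d.day) <= today:            # on or before "this day" in that year
        if year not in latest or d > latest[year][0]:
            latest[year] = (d, event)

for year in sorted(latest, reverse=True):
    print(year, *latest[year])
```

Note 2019 drops out, matching the expected output: its only event (30th September) falls after 6th May.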
ben_al (1 rep)
May 6, 2022, 12:01 PM • Last activity: May 6, 2022, 03:00 PM
1 vote
1 answer
2569 views
How to call python file in repo in databricks from data factory outside DBFS?
In Azure Databricks I have a repo cloned which contains Python files, not notebooks. In Azure Data Factory I want to configure a step to run a Databricks Python file. However, when I enter /Repos/..../myfile.py (which works for Databricks notebooks) it gives me the error "DBFS URI must starts with 'dbfs:'". How can I reference a Python file from a repo which is not in DBFS? NOTE: I see a duplicate question here, but the answer was just to wrap it in a Databricks notebook - an OK workaround, but when I do it I get "No module named 'my_python_file'": https://stackoverflow.com/questions/70096408/how-to-create-a-databricks-job-using-a-python-file-outside-of-dbfs
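On the "No module named" error from the notebook-wrapper workaround: when the .py file lives in a repo rather than DBFS, the repo directory usually has to be on `sys.path` before the import resolves. A sketch for the wrapper notebook (the repo path below is hypothetical):

```python
import sys

# Make modules in the cloned repo importable from the wrapper notebook.
repo_root = "/Workspace/Repos/me/my-repo"   # hypothetical path
if repo_root not in sys.path:
    sys.path.append(repo_root)
# import my_python_file   # should now resolve
```

The append is idempotent thanks to the membership check, so re-running the notebook cell doesn't pile up duplicate entries.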
Brendan Hill (301 rep)
Dec 1, 2021, 08:14 AM • Last activity: Jan 7, 2022, 07:51 AM
2 votes
0 answers
696 views
Troubleshoting slow running queries/jobs in Azure Databricks
I have an Azure Databricks workspace with a cluster configured to run the Standard 6.4 runtime (Apache Spark 2.4.5, Scala 2.11). The cluster uses a shared metastore (Azure MySQL). I'm trying to figure out a way to troubleshoot sporadically slow execution of jobs/queries - I have a test SELECT query which normally runs within 2-3 minutes, but a couple of times a day it takes 15 minutes. What would be the best way to troubleshoot this?
Mike (747 rep)
Aug 20, 2021, 01:54 PM • Last activity: Aug 23, 2021, 02:28 PM
Showing page 1 of 9 total questions