Questions tagged [databricks]
339 questions
1 vote · 1 answer · 177 views
How to delete all files from folder with Databricks dbutils
Can someone let me know how to use the Databricks dbutils to delete all files from a folder?
I have tried the following but unfortunately, Databricks doesn't support wildcards.
dbutils.fs.rm('adl://azurelake.azuredatalakestore.net/landing/stageone/*')
Thanks
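One possible approach (a hedged sketch, not from the original post): dbutils.fs.rm takes no wildcards, but it does accept a recurse flag, and dbutils.fs.ls can be looped over to delete only the folder's contents.

# `dbutils` is the object provided by the Databricks notebook environment
folder = 'adl://azurelake.azuredatalakestore.net/landing/stageone/'

# Option 1: remove the folder and everything under it, then recreate it empty
dbutils.fs.rm(folder, recurse=True)
dbutils.fs.mkdirs(folder)

# Option 2: keep the folder itself, delete only its contents
for f in dbutils.fs.ls(folder):
    dbutils.fs.rm(f.path, recurse=True)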
1 vote · 1 answer · 13 views
How to convert a type Any List to a type Double (Scala)
I am new to Scala and I would like to understand some basic stuff.
First of all, I need to calculate the average of a certain column of a DataFrame and use the result as a double type variable.
After some Internet research I was able to calculate the average and at the same time pass it into a Lis...
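A hedged sketch of the usual pattern, written in pyspark since most snippets on this page are Python (the Scala equivalent ends in .head().getDouble(0)): aggregate, then unwrap the single-row result instead of collecting it into a List[Any].

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2.0), (2, 4.0), (3, 9.0)], ['id', 'score'])

# agg() returns a one-row DataFrame; first()[0] unwraps it to a plain float
avg_score = df.agg(F.avg('score')).first()[0]
print(type(avg_score), avg_score)  # <class 'float'> 5.0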
1 vote · 1 answer · 52 views
How to convert dataframe row values to dynamic columns?
I have a dataFrame as below
-----------------------------
| A | B | C |
-----------------------------
| 1 | col_1 | val1 |
| 1 | col_2 | val2 |
| 1 | col_3 | val3 |
| 1 | col_4 | val4 |
-----------------------------
I need to convert...
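A hedged sketch of the standard answer: groupBy plus pivot turns the distinct values of B into columns. Column names are taken from the example table above.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 'col_1', 'val1'), (1, 'col_2', 'val2'), (1, 'col_3', 'val3'), (1, 'col_4', 'val4')],
    ['A', 'B', 'C'])

# pivot('B') creates one output column per distinct value of B, keyed by A
wide = df.groupBy('A').pivot('B').agg(F.first('C'))
wide.show()
# +---+-----+-----+-----+-----+
# |  A|col_1|col_2|col_3|col_4|
# +---+-----+-----+-----+-----+
# |  1| val1| val2| val3| val4|
# +---+-----+-----+-----+-----+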
1 vote · 1 answer · 781 views
Spark Will Not Load Large MySql Table: Java Communications link failure - Timing Out
I'm trying to get a pretty large table from MySQL so I can manipulate it using Spark/Databricks. I can't get it to load into Spark - I have tried taking smaller subsets, but even at the smallest reasonable unit, it still fails to load.
I have tried playing with the wait_timeout and interactive_timeout...
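One mitigation worth trying (a hedged sketch; connection details are placeholders): partition the JDBC read and lower the fetch size so no single connection has to stream the whole table before the timeout hits. The option names are the standard Spark JDBC ones.

df = (spark.read.format('jdbc')
      .option('url', 'jdbc:mysql://myhost:3306/mydb')   # placeholder host/db
      .option('dbtable', 'big_table')                   # placeholder table
      .option('user', 'user')
      .option('password', '...')
      .option('partitionColumn', 'id')   # a numeric column to split the read on
      .option('lowerBound', '1')
      .option('upperBound', '100000000')
      .option('numPartitions', '64')     # many smaller parallel queries instead of one huge one
      .option('fetchsize', '10000')      # stream rows instead of buffering the full result
      .load())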
1 vote · 0 answers · 325 views
Spark CosmosDB Sink: org.apache.spark.sql.AnalysisException: 'write' can not be called on streaming Dataset/DataFrame
I am reading a data stream from Event Hub in Spark (using Databricks). My goal is to be able to write the streamed data to a CosmosDB. However I get the following error:
org.apache.spark.sql.AnalysisException: 'write' can not be called on streaming Dataset/DataFrame.
Is this scenario not supported?...
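The batch .write API indeed cannot be called on a streaming DataFrame. A hedged workaround (Spark 2.4+): use writeStream.foreachBatch, which hands each micro-batch to the normal batch writer. The Cosmos DB format name and config keys below follow the azure-cosmosdb-spark docs but should be treated as assumptions for your connector version; stream_df stands for the Event Hub stream.

cosmos_config = {
    'Endpoint': 'https://<account>.documents.azure.com:443/',   # placeholders throughout
    'Masterkey': '<key>',
    'Database': '<db>',
    'Collection': '<collection>',
    'Upsert': 'true',
}

def write_batch(batch_df, batch_id):
    # each micro-batch is a normal (non-streaming) DataFrame, so .write is allowed here
    (batch_df.write
        .format('com.microsoft.azure.cosmosdb.spark')
        .options(**cosmos_config)
        .mode('append')
        .save())

query = (stream_df.writeStream
         .foreachBatch(write_batch)
         .option('checkpointLocation', '/mnt/checkpoints/cosmos')
         .start())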
1 vote · 0 answers · 33 views
Running logistic regression in Spark MLlib on Databricks, how to interpret the extra weight
I have 12 feature variables, but why are there 13 weights shown here for logistic regression in Spark MLlib on Databricks? How can I interpret it?
Following this link it says:
'intercept – Intercept computed for this model. (Only used in Binary Logistic Regression, the intercepts will not be a single v...
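Most likely the extra value is the intercept (bias) term rather than a 13th feature weight. A hedged sketch with the DataFrame-based pyspark.ml API, where the two are exposed separately (variable names are illustrative):

from pyspark.ml.classification import LogisticRegression

# `train` is assumed to have a 12-element 'features' vector column and a 'label' column
lr = LogisticRegression(featuresCol='features', labelCol='label')
model = lr.fit(train)

print(len(model.coefficients))  # 12, one weight per feature
print(model.intercept)          # the extra "13th" value: the intercept term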
1 vote · 0 answers · 152 views
Databricks get JSON without schema
What's the typical approach for getting JSON from REST API using databricks?
It returns nested structure, which can change over time and doesn't have any schema:
{ 'page': '1',
'total': '10',
'payload': [
{ 'param1': 'value1',
'param2': 'value2'
},
{ 'param2': 'value2',
'param3': 'value3'
}
]
}
I'm...
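One common approach (a hedged sketch; the endpoint is a placeholder): pull the response on the driver with requests and let spark.read.json infer the schema on every run, so changes in the nested structure are picked up automatically.

import json
import requests

# `spark` is the notebook-provided SparkSession; the URL is hypothetical
resp = requests.get('https://example.com/api/items')
rows = resp.json()['payload']          # the nested list from the response body

# read.json accepts an RDD of JSON strings and infers the schema from the data
df = spark.read.json(spark.sparkContext.parallelize([json.dumps(r) for r in rows]))
df.printSchema()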
1 vote · 1 answer · 306 views
spark.cores.max but per node?
I'm using an autoscaling cluster from Azure Databricks. The pyspark job has to call an external process, so I'm hoping I can leave some percentage of each node 'unused' by Spark. I have found spark.cores.max, but this is the total number of cores, not the total per node. Is there an equivalent arg...
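A hedged suggestion rather than a confirmed answer: there is no per-node variant of spark.cores.max, but the standard property spark.task.cpus makes every task claim that many cores, which caps concurrent tasks per node and leaves the remainder free for the external process. On Databricks it goes in the cluster's Spark config, for example:

spark.task.cpus 2

With 8-core workers this limits each node to 4 concurrent Spark tasks, so roughly half of each node's cores stay available to the external process.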
1 vote · 1 answer · 286 views
Loading XML file Pyspark on databricks
I am trying to use the databricks spark xml library to import the following XML file: https://s3.eu-west-2.amazonaws.com/kieranw/Badges.xml.
xml_posts = spark.read.format('xml').options(rootTag='badges').load('s3a://%s:%s@%s/Badges.xml' % (ACCESS_KEY, ENCODED_SECRET_KEY, BUCKET_NAME))
xml_posts.print...
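A hedged reading sketch: for the Stack Exchange Badges.xml dump the repeating element is <row>, so rowTag='row' is what the reader needs (rootTag only matters when writing XML). Credentials are set on the Hadoop configuration here instead of being embedded in the URL; variable names reuse the ones from the excerpt, except that the raw (not URL-encoded) secret key is used this way.

# `spark` is the notebook session; ACCESS_KEY / SECRET_KEY / BUCKET_NAME as in the excerpt
sc = spark.sparkContext
sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', ACCESS_KEY)
sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', SECRET_KEY)

badges = (spark.read.format('xml')
          .option('rowTag', 'row')       # the repeating <row .../> element inside <badges>
          .load('s3a://%s/Badges.xml' % BUCKET_NAME))
badges.printSchema()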
1 vote · 1 answer · 160 views
Why is Pyspark expecting type basestring when exporting a dataframe to csv or txt file?
I am using Pyspark in the community version of Databricks, using Python 2.7 and Spark 2.2.1. I have a Pyspark dataframe 'top100m':
In: type(movie_ratings_top100m)
Out: pyspark.sql.dataframe.DataFrame
Which has 3 numeric type columns:
In: top100m.printSchema()
Out: root
|-- userId: long (nullable = t...
1 vote · 1 answer · 63 views
How to write two Spark DataFrames to Redshift atomically?
I am using Databricks spark-redshift to write DataFrames to Redshift. I have two DataFrames that get appended to two separate tables, but I need this to happen atomically, i.e. if the second DataFrame fails to write to its table, I'll need the first one to be undone as well. Is there any way to do t...
1 vote · 2 answers · 216 views
Databricks Azure broadcast variables not serializable
So I am trying to create an extremely simple Spark notebook using Azure Databricks and would like to make use of a simple RDD map call.
This is just for messing around, so the example is a bit contrived, but I cannot get a value to work in the RDD map call unless it is a static constant value.
I hav...
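A generic hedged sketch of the usual fix: anything captured by the closure passed to map must be serializable, and sc.broadcast is the standard way to ship a read-only value to the executors (shown in pyspark; the same pattern exists in Scala).

# `spark` is the notebook session; the value here is deliberately trivial
multiplier = 3
b_multiplier = spark.sparkContext.broadcast(multiplier)

rdd = spark.sparkContext.parallelize([1, 2, 3, 4])
print(rdd.map(lambda x: x * b_multiplier.value).collect())  # [3, 6, 9, 12]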
1 vote · 0 answers · 107 views
SparkR org.apache.spark.SparkException: R worker exited unexpectedly
I am trying to execute a SparkR gapply. Essentially, when I attempt to run this with my input file limited to about 300k rows it works; however, scaling up to about 1.2m rows I get the following recurring exception in the stderr of many executor tasks - roughly 70% of tasks complete while the others...
1 vote · 0 answers · 699 views
Connection from Spark to snowflake
I am writing this not to ask a question, but to share the knowledge.
I was using Spark to connect to Snowflake, but I could not access Snowflake. It seemed like there was something wrong with the internal JDBC driver in Databricks.
Here is the error I got.
java.lang.NoClassDefFoundError:net/snowfl...
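For readers hitting the same NoClassDefFoundError, a hedged sketch of the Databricks-bundled Snowflake connector usage; the option keys follow the Snowflake Spark connector docs and the values are placeholders.

sf_options = {
    'sfUrl': '<account>.snowflakecomputing.com',   # placeholders throughout
    'sfUser': '<user>',
    'sfPassword': '<password>',
    'sfDatabase': '<database>',
    'sfSchema': '<schema>',
    'sfWarehouse': '<warehouse>',
}

df = (spark.read.format('snowflake')   # long form: 'net.snowflake.spark.snowflake'
      .options(**sf_options)
      .option('dbtable', 'MY_TABLE')
      .load())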
1 vote · 1 answer · 71 views
Use recursive globbing to extract XML documents as strings in pyspark
The goal is to extract XML documents, given an XPath expression, from a group of text files as strings. The difficulty is the variance of forms the text files may be in. Might be:
single zip / tar file with 100 files, each 1 XML document
one file, with 100 XML documents (aggregate document)
one zi...
1 vote · 0 answers · 33 views
How to preserve XML data in databricks spark-xml (Scala)
Below is my Scala code that would write the content as XML
`export.coalesce(1).select('*').write
.format('com.databricks.spark.xml')
.option('attributePrefix', '@')
.option('valueTag', '#text')
.option('rootTag', 'row')
.option('rowTag', 'content')
.save('sample.xml')`
where I am fetching the data...
1 vote · 2 answers · 792 views
Create Spark SQL tables from multiple parquet paths
I use databricks. I am trying to create a table as below
` target_table_name = 'test_table_1'
spark.sql('''
drop table if exists %s
''' % target_table_name)
spark.sql('''
create table if not exists {0}
USING org.apache.spark.sql.parquet
OPTIONS (
path ('/mnt/sparktables/ds=*/name=xyz/')
)
'''....
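A hedged alternative to the SQL-only route: the DataFrame parquet reader accepts glob patterns and multiple paths, and the result can be registered or saved as a table. The path mirrors the one in the question.

# `spark` is the notebook session
df = spark.read.parquet('/mnt/sparktables/ds=*/name=xyz/')

# either register a temporary view...
df.createOrReplaceTempView('test_table_1')
# ...or persist it as a managed table
# df.write.mode('overwrite').saveAsTable('test_table_1')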
1 vote · 0 answers · 104 views
Data loss while writing from Spark Databricks to Azure Cosmos db
I am sending dataframes from databricks to a graph collection present in azure cosmos db using azure cosmos spark connector and write configuration provided at https://github.com/Azure/azure-cosmosdb-spark . But I am facing data loss while this data transfer occurs. I tried to write 800k records but...
1 vote · 0 answers · 35 views
print topics from LDA
I have generated the following Python syntax:
# Create a new CountVectorizer model without the stopwords
cv = CountVectorizer(inputCol='filtered', outputCol='rawFeatures', vocabSize = 1000)
cvmodel = cv.fit(wordsDataFrame)
df_vect = cvmodel.transform(wordsDataFrame)
vacab=cvmodel.vocabulary
idf = IDF(inp...
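A hedged sketch of the usual way to print topics: describeTopics returns term indices, which the CountVectorizer vocabulary maps back to words. cvmodel is the fitted model from the excerpt; lda_model stands for the fitted LDA model.

vocab = cvmodel.vocabulary                        # index -> word
topics = lda_model.describeTopics(maxTermsPerTopic=5)

# each row has 'topic', 'termIndices' and 'termWeights' columns
for row in topics.collect():
    words = [vocab[i] for i in row['termIndices']]
    print('topic %d: %s' % (row['topic'], ', '.join(words)))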
1 vote · 1 answer · 481 views
NameError: name 'dbutils' is not defined in pyspark
I am running a pyspark job in Databricks cloud. I need to write some of the csv files to the Databricks filesystem (dbfs) as part of this job, and I also need to use some of the dbutils native commands like,
#mount azure blob to dbfs location
dbutils.fs.mount (source='...',mount_point='/mnt/...',extra_co...
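A hedged note: dbutils is predefined only inside Databricks notebooks; in a plain Python job running on a Databricks cluster it can be constructed from pyspark.dbutils (a module that exists only on Databricks Runtime).

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

try:
    from pyspark.dbutils import DBUtils   # available on Databricks Runtime only
    dbutils = DBUtils(spark)
except ImportError:
    dbutils = None                        # running outside Databricks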
1 vote · 1 answer · 403 views
Spark cluster OutOfMemoryError with way more memory than needed
I have a Spark cluster with a 28gb driver and 8x 56gb workers. I am attempting to process a 4gb file. I can successfully process this file without Spark using the 16gb of memory on my own laptop, so there is no memory leak causing the full 56gb to be used; it can also process smaller sample...
1 vote · 0 answers · 71 views
convert scala code to python LDA
I have created an LDA model with pyspark's ML library. I am in the final steps to review topics. I need some help in converting Scala syntax to Python.
scala code
val topicIndices = ldaModel.describeTopics(maxTermsPerTopic = 5)
val vocabList = vectorizer.vocabulary
my python equivalent syntax
topics=...
1 vote · 1 answer · 134 views
creating max date function using sql in databricks
I am writing queries in Databricks using SQL on views and would like to calculate the max of dates of an update timestamp column across multiple views. For instance, I am joining table a with table b and would like to know max(a.updt_ts, b.updt_ts). Since the max function cannot have more than one column ment...
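A hedged sketch of the usual answer: Spark SQL's greatest() gives the row-wise maximum across any number of columns, which is distinct from the aggregate max(). Table names mirror the question; the join key is a placeholder.

result = spark.sql('''
  SELECT a.id,
         greatest(a.updt_ts, b.updt_ts) AS max_updt_ts   -- row-wise max across the two columns
  FROM table_a a
  JOIN table_b b ON a.id = b.id                          -- placeholder join condition
''')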
1 vote · 1 answer · 269 views
Databricks spark-xml when reading tags ending in “/>” return values are null
I'm using the latest version of spark-xml (0.4.1) with Scala 2.11; when I read some XML that contains tags ending with '/>', the corresponding values are null. See the example:
XML:
Dataframe:
+----+------+----+--------------------+
| _ID| _name|_age| Operation|
+----+------+----+---...
1 vote · 0 answers · 143 views
customize spark csv line terminator
I am using pyspark code to generate a csv from a dataframe using the below code,
df.repartition(1).write.format('com.databricks.spark.csv').option('header','true').mode('overwrite').save('/user/test')
But when I open it and check the line terminator in Notepad++, it shows the default line terminator '\n'...
1 vote · 1 answer · 89 views
Databricks SQL: Why is subquery in left join causing error msg
I am attempting to use a subquery in a left join condition, but am getting an error message that reads: 'Error in SQL statement: AnalysisException: Table or view not found: TableD;' and points to the FROM TableD D2 statement in my subquery.
SELECT D1.Code, D1.Description, C.InstanceKey
FROM TableA A...
1 vote · 1 answer · 139 views
Create New Table from DBFS mount to Azure Data Lake
I have a directory on Azure Data Lake mounted to an Azure Databricks cluster. Browsing through the file system using the CLI tools or just running dbfs utils through a notebook, I can see that there are files and data in that directory. Further - executing queries against those files is successful...
1 vote · 1 answer · 277 views
Writing to Cosmos DB Graph API from Databricks (Apache Spark)
I have a DataFrame in Databricks which I want to use to create a graph in Cosmos, with one row in the DataFrame equating to 1 vertex in Cosmos.
When I write to Cosmos I can't see any properties on the vertices, just a generated id.
Get data:
data = spark.sql('select * from graph.testgraph')
Configu...
1 vote · 1 answer · 175 views
How to authenticate the Databricks API using a .netrc file
I have created a '.netrc' file on my machine and am trying the below Databricks REST API call, but it always gives an unauthorized error. How to create a .netrc file in Databricks?
curl -n -X GET https:///api/2.0/token/list
How to use .netrc file with databricks api?
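Two hedged pointers: with .netrc, the Databricks docs use the literal word token as the login and the personal access token as the password (machine <instance> login token password <PAT>), and curl -n only helps if the URL actually includes the host. As a cross-check that the token itself is valid, the same call can be made with an explicit bearer header:

import requests

host = 'https://<databricks-instance>'      # placeholder workspace URL
token = '<personal-access-token>'           # placeholder PAT

resp = requests.get(host + '/api/2.0/token/list',
                    headers={'Authorization': 'Bearer %s' % token})
print(resp.status_code, resp.json())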
1 vote · 0 answers · 25 views
Can we use existing AWS EC2 instance with Databricks?
We already have our AWS EC2 instance up and running.
Is there a way to use this instance with the Databricks Jobs API instead of creating a separate Databricks cluster?
1 vote · 2 answers · 203 views
Finding the difference between two timestamps in pyspark sql
Below is the table structure; you can notice the column names
cal_avg_latency = spark.sql('SELECT UnitType, ROUND(AVG(TIMESTAMP_DIFF(OnSceneDtTmTS, ReceivedDtTmTS, MINUTE)), 2) as latency, count(*) as total_count FROM `SFSC_Incident_Census_view` WHERE EXTRACT(DATE from ReceivedDtTmTS) == EXTRACT(DATE from...
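TIMESTAMP_DIFF is a BigQuery function, not a Spark SQL one, which is likely the problem. A hedged rewrite of the visible part of the query uses unix_timestamp subtraction for the minute difference (the WHERE date filter and GROUP BY are truncated in the excerpt, so the GROUP BY is assumed and the filter omitted):

cal_avg_latency = spark.sql('''
  SELECT UnitType,
         ROUND(AVG((unix_timestamp(OnSceneDtTmTS) - unix_timestamp(ReceivedDtTmTS)) / 60), 2) AS latency,
         COUNT(*) AS total_count
  FROM SFSC_Incident_Census_view
  GROUP BY UnitType
''')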
1 vote · 1 answer · 420 views
How does one authenticate with Azure Databricks through .Net?
I am trying to figure out how to send HTTP requests to Azure Databricks for my application. Currently I am stuck on authentication; every request returns a 401 Unauthorized error.
I have followed their guide and I created a Personal Access Token and retrieved its secret key. This is my code which wo...
1 vote · 2 answers · 442 views
How to load databricks package dbutils in pyspark
I was trying to run the below code in pyspark.
dbutils.widgets.text('config', '', 'config')
It was throwing me an error saying
Traceback (most recent call last):
File '', line 1, in
NameError: name 'dbutils' is not defined
so, Is there any way I can run it in pyspark by including the databricks p...
1 vote · 3 answers · 161 views
Can't Save Dataframe to Local Mac Machine
I am using a Databricks notebook and trying to export my dataframe as CSV to my local machine after querying it. However, it does not save my CSV to my local machine. Why?
Connect to Database
#SQL Connector
import pandas as pd
import psycopg2
import numpy as np
from pyspark.sql import *
#Connection...
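Worth noting as a hedged aside: a notebook runs on the cluster's driver, so any file it writes lands in DBFS, never on the local machine. One workaround, on clusters where the /dbfs FUSE mount is available, is to write into /FileStore (exposed at the workspace's /files/ URL) and download it from the browser; the path below is illustrative.

# fine for small result sets; /dbfs/ is the FUSE mount of DBFS on the driver
df.toPandas().to_csv('/dbfs/FileStore/my_export.csv', index=False)
# then download from: https://<databricks-instance>/files/my_export.csv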
1 vote · 0 answers · 165 views
spark-csv fails parsing with embedded html and quotes
I have this csv file which contains descriptions of several cities:
Cities_information_extract.csv
I can parse this file just fine using python pandas.read_csv or R read.csv methods. They both return 693 rows for 25 columns.
I am trying, unsuccessfully, to load the csv using Spark 1.6.0 and scala.
Fo...
1 vote · 1 answer · 41 views
Spark — Return identity value from SQL Server from Spark 2.3
I need to insert a row into a SQL table from Spark running on Azure Databricks and want to know if there is a way to return the identity value from the primary key that gets generated from SQL Server (@@IDENTITY) back to Databricks
1 vote · 0 answers · 19 views
Executing multiple Pyspark scripts in parallel
How can I initiate execution of multiple Pyspark scripts from one notebook, in parallel?
Note: I'm currently using Azure Databricks (enterprise edition)
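A hedged pattern for this: wrap each script in a child notebook and launch them concurrently with dbutils.notebook.run from a thread pool. The notebook paths are placeholders.

from concurrent.futures import ThreadPoolExecutor

notebooks = ['/Shared/jobs/script_a', '/Shared/jobs/script_b', '/Shared/jobs/script_c']

def run(path):
    # 0 = no timeout; returns whatever the child notebook passes to dbutils.notebook.exit
    return dbutils.notebook.run(path, 0)

with ThreadPoolExecutor(max_workers=len(notebooks)) as pool:
    results = list(pool.map(run, notebooks))

print(results)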
1 vote · 1 answer · 150 views
On Azure Databricks how can I tell which blob store is mounted
I have inherited a notebook which writes to a mounted Azure blob storage, using syntax:
instrumentDf.write.json('/mnt/blobdata/cosmosdata/instrumentjson')
How can I find the name of the Azure blob storage it has written to?
Thanks !
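dbutils.fs.mounts() answers this directly: each entry pairs a mount point with its source URI, and for blob storage the source contains the container and storage-account name.

# source looks like wasbs://<container>@<storage-account>.blob.core.windows.net/<dir>
for m in dbutils.fs.mounts():
    if m.mountPoint == '/mnt/blobdata':
        print(m.mountPoint, '->', m.source)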
1 vote · 0 answers · 61 views
Partially update Document with Pyspark in CosmosDB with MongoDB API
I'm using Azure Databricks with Pyspark and a CosmosDB with the MongoDB API.
The following Pyspark command is being used to store a data_frame in the CosmosDB which works fine:
def storeCollection(self, collection, data_frame, save_mode='append'):
data_frame.write.format(
'com.mongodb.spark.sql.Defa...
1 vote · 0 answers · 158 views
Consume Secure Kafka from databricks spark cluster
I am trying to consume from a secure Kafka topic (using SASL_PLAINTEXT, SCRAM login method).
Spark Version 2.3.1
Scala 2.11
Kafka latest
I am using the Spark Structured stream to construct the stream. For this purpose I imported the library : spark-sql-kafka-0-10_2.11-2.3.1
This imports the older v...
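A hedged sketch of how the SASL/SCRAM settings are usually passed to spark-sql-kafka: Kafka client properties get the kafka. prefix, and the JAAS string uses Kafka's ScramLoginModule. Broker, topic and credentials are placeholders, and the exact mechanism (SCRAM-SHA-256 vs SCRAM-SHA-512) depends on the cluster.

jaas = ('org.apache.kafka.common.security.scram.ScramLoginModule required '
        'username="<user>" password="<password>";')

stream = (spark.readStream.format('kafka')
          .option('kafka.bootstrap.servers', 'broker1:9092')   # placeholder broker
          .option('subscribe', 'my_topic')                     # placeholder topic
          .option('kafka.security.protocol', 'SASL_PLAINTEXT')
          .option('kafka.sasl.mechanism', 'SCRAM-SHA-256')
          .option('kafka.sasl.jaas.config', jaas)
          .load())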