Questions tagged [databricks]

1 vote · 1 answer · 177 views

How to delete all files from a folder with Databricks dbutils

Can someone let me know how to use the Databricks dbutils to delete all files from a folder? I have tried the following, but unfortunately dbutils doesn't support wildcards: dbutils.fs.rm('adl://azurelake.azuredatalakestore.net/landing/stageone/*') Thanks
Carltonp
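
A sketch of a workaround for the question above, assuming the ADLS path shown in the excerpt: dbutils.fs.rm does not expand wildcards, but you can list the folder and remove each entry, or remove the folder itself recursively.

    # list the folder and delete each entry individually
    for f in dbutils.fs.ls('adl://azurelake.azuredatalakestore.net/landing/stageone/'):
        dbutils.fs.rm(f.path, True)   # True = recurse, in case the entry is a sub-folder

    # alternatively, drop the whole folder and recreate it
    dbutils.fs.rm('adl://azurelake.azuredatalakestore.net/landing/stageone/', True)
    dbutils.fs.mkdirs('adl://azurelake.azuredatalakestore.net/landing/stageone/')
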
1 vote · 1 answer · 13 views

How to convert a type Any List to a type Double (Scala)

I am new to Scala and I would like to understand some basic stuff. First of all, I need to calculate the average of a certain column of a DataFrame and use the result as a double type variable. After some Internet research I was able to calculate the average and at the same time pass it into a Lis...
Aris Kantas
1 vote · 1 answer · 52 views

How to convert dataframe row values to dynamic columns?

I have a dataFrame as below:
-----------------------------
|  A  |   B   |   C   |
-----------------------------
|  1  | col_1 | val1  |
|  1  | col_2 | val2  |
|  1  | col_3 | val3  |
|  1  | col_4 | val4  |
-----------------------------
I need to convert...
Shyam
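
For the reshaping described above, a minimal PySpark sketch (column names A, B, C are taken from the excerpt; using first() as the aggregate is an assumption, since only one value exists per A/B pair):

    from pyspark.sql import functions as F

    # turn the values in column B into columns, keeping the matching C value
    pivoted = df.groupBy('A').pivot('B').agg(F.first('C'))
    pivoted.show()
    # +---+-----+-----+-----+-----+
    # |  A|col_1|col_2|col_3|col_4|
    # +---+-----+-----+-----+-----+
    # |  1| val1| val2| val3| val4|
    # +---+-----+-----+-----+-----+
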
1 vote · 1 answer · 781 views

Spark Will Not Load Large MySQL Table: Java Communications link failure - Timing Out

I'm trying to get a pretty large table from MySQL so I can manipulate it using Spark/Databricks. I can't get it to load into Spark - I have tried taking smaller subsets, but even at the smallest reasonable unit, it still fails to load. I have tried playing with the wait_timeout and interactive_timeout...
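
A common way to avoid this kind of timeout is to read the table in parallel JDBC partitions with a bounded fetch size, so no single connection has to stream the whole table. A hedged sketch (host, table, credentials, column and bounds are placeholders):

    df = (spark.read.format('jdbc')
          .option('url', 'jdbc:mysql://myhost:3306/mydb')   # placeholder URL
          .option('dbtable', 'big_table')                    # placeholder table
          .option('user', 'myuser')
          .option('password', 'mypassword')
          .option('partitionColumn', 'id')    # a numeric, indexed column
          .option('lowerBound', '1')
          .option('upperBound', '100000000')
          .option('numPartitions', '16')      # 16 parallel connections
          .option('fetchsize', '10000')       # rows fetched per round trip
          .load())
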
1 vote · 0 answers · 325 views

Spark CosmosDB Sink: org.apache.spark.sql.AnalysisException: 'write' can not be called on streaming Dataset/DataFrame

I am reading a data stream from Event Hub in Spark (using Databricks). My goal is to be able to write the streamed data to a CosmosDB. However I get the following error: org.apache.spark.sql.AnalysisException: 'write' can not be called on streaming Dataset/DataFrame. Is this scenario not supported?...
fbeltrao
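
The error above is expected: .write only works on batch DataFrames. On Spark 2.4+ one workaround is foreachBatch, which hands each micro-batch to a batch writer; a hedged sketch (the Cosmos DB connector format name and the cosmos_config keys are assumptions):

    cosmos_config = {                       # assumed Cosmos DB write settings (placeholders)
        'Endpoint': 'https://<account>.documents.azure.com:443/',
        'Masterkey': '<key>',
        'Database': 'mydb',
        'Collection': 'mycoll',
        'Upsert': 'true',
    }

    def write_to_cosmos(batch_df, batch_id):
        # batch_df is a normal (non-streaming) DataFrame, so .write is allowed here
        (batch_df.write
            .format('com.microsoft.azure.cosmosdb.spark')   # assumed connector name
            .mode('append')
            .options(**cosmos_config)
            .save())

    # stream_df: the streaming DataFrame read from Event Hubs (from the question)
    query = (stream_df.writeStream
             .foreachBatch(write_to_cosmos)
             .start())
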
1 vote · 0 answers · 33 views

Running Spark MLlib in Databricks, how to interpret the extra weight in logistic regression

I have 12 feature variables, but why are there 13 weights shown here for logistic regression in Spark MLlib on Databricks? How can I interpret it? Following this link it says: 'intercept – Intercept computed for this model. (Only used in Binary Logistic Regression, the intercepts will not be a single v...
Mia
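
The 13th value is typically the intercept, which Spark also reports separately from the per-feature coefficients in the DataFrame-based API. A small sketch (train_df with a 'features'/'label' schema is hypothetical):

    from pyspark.ml.classification import LogisticRegression

    lr = LogisticRegression(featuresCol='features', labelCol='label')
    model = lr.fit(train_df)            # hypothetical training DataFrame
    print(model.coefficients)           # 12 weights, one per feature
    print(model.intercept)              # the extra value: the bias term
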
1 vote · 0 answers · 152 views

Databricks get JSON without schema

What's the typical approach for getting JSON from a REST API using Databricks? It returns a nested structure, which can change over time and doesn't have any schema: { 'page': '1', 'total': '10', 'payload': [ { 'param1': 'value1', 'param2': 'value2' }, { 'param2': 'value2', 'param3': 'value3' } ] } I'm...
Kertis van Kertis
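
One common pattern for the question above is to fetch the payload with requests and let Spark infer the schema from the JSON text; a hedged sketch (the endpoint URL is a placeholder, and 'payload' is the array field shown in the excerpt):

    import requests
    from pyspark.sql import functions as F

    resp = requests.get('https://example.com/api/items')   # placeholder endpoint
    # let Spark infer the nested schema from the response body
    df = spark.read.json(sc.parallelize([resp.text]))
    df.printSchema()

    # explode the nested 'payload' array into one row per element
    payload = df.select(F.explode('payload').alias('p')).select('p.*')
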
1 vote · 1 answer · 306 views

spark.cores.max but per node?

I'm using an autoscaling cluster from Azure Databricks. The pyspark job has to call an external process, so I'm hoping I can leave some percentage of each node 'unused' by Spark. I have found spark.cores.max, but this is the total number of cores, not the total per node. Is there an equivalent arg...
rsmith54
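
There is no per-node variant of spark.cores.max, but spark.executor.cores caps the cores each executor uses, and on Databricks each worker node runs a single executor, so setting it below the node's core count leaves the remainder free for the external process. A hedged sketch of the cluster-level Spark config (an 8-core worker is assumed; this must be set when the cluster starts, not at runtime):

    spark.executor.cores 6
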
1 vote · 1 answer · 286 views

Loading XML file Pyspark on databricks

I am trying to use the databricks spark xml library to import the following XML file: https://s3.eu-west-2.amazonaws.com/kieranw/Badges.xml. xml_posts = spark.read.format('xml').options(rootTag='badges').load('s3a://%s:%s@%s/Badges.xml'% (ACCESS_KEY, ENCODED_SECRET_KEY, BUCKET_NAME)) xml_posts.print...
Kieran White
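
For spark-xml, rowTag (not rootTag) selects the element that becomes a DataFrame row; rootTag is only used when writing. For the StackExchange Badges.xml above, the rows are the <row .../> elements inside <badges>. A hedged sketch (the S3 URI is abbreviated):

    xml_posts = (spark.read.format('xml')
                 .option('rowTag', 'row')           # each <row .../> becomes a DataFrame row
                 .load('s3a://.../Badges.xml'))     # abbreviated path from the question
    xml_posts.printSchema()
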
1 vote · 1 answer · 160 views

Why is Pyspark expecting type basestring when exporting a dataframe to csv or txt file?

I am using Pyspark in the community version of Databricks, using Python 2.7 and Spark 2.2.1. I have a Pyspark dataframe 'top100m': In: type(movie_ratings_top100m) Out: pyspark.sql.dataframe.DataFrame Which has 3 numeric type columns: In: top100m.printSchema() Out: root |-- userId: long (nullable = t...
Naim
1 vote · 1 answer · 63 views

How to write two Spark DataFrames to Redshift atomically?

I am using Databricks spark-redshift to write DataFrames to Redshift. I have two DataFrames that get appended to two separate tables, but I need this to happen atomically, i.e. if the second DataFrame fails to write to its table, I'll need the first one to be undone as well. Is there any way to do t...
lfk
1 vote · 2 answers · 216 views

Databricks Azure broadcast variables not serializable

So I am trying to create an extremely simple Spark notebook using Azure Databricks and would like to make use of a simple RDD map call. This is just for messing around, so the example is a bit contrived, but I cannot get a value to work in the RDD map call unless it is a static constant value I hav...
sacha
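
For the serialization problem above, the usual pattern is to broadcast the value and read it inside the closure via .value, so Spark only ships the broadcast handle to the executors. A minimal PySpark sketch (the Scala version is analogous):

    lookup = {'a': 1, 'b': 2}            # some driver-side value
    bc = sc.broadcast(lookup)            # make it available on every executor

    rdd = sc.parallelize(['a', 'b', 'c'])
    result = rdd.map(lambda x: bc.value.get(x, 0)).collect()   # reference bc.value, not lookup
    print(result)   # [1, 2, 0]
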
1 vote · 0 answers · 107 views

SparkR org.apache.spark.SparkException: R worker exited unexpectedly

I am trying to execute a SparkR gapply; essentially, when I attempt to run this with my input file limited to about 300k rows it works, however when scaling up to about 1.2m rows I get the following recurring exception in stderr in many executor tasks - roughly 70% of tasks complete while the others...
and_apo
1 vote · 0 answers · 699 views

Connection from Spark to snowflake

I am writing this not to ask a question, but to share the knowledge. I was using Spark to connect to Snowflake, but I could not access Snowflake. It seemed like there was something wrong with the internal JDBC driver in Databricks. Here is the error I got: java.lang.NoClassDefFoundError:net/snowfl...
Chao Mu
1 vote · 1 answer · 71 views

Use recursive globbing to extract XML documents as strings in pyspark

The goal is to extract XML documents, given an XPath expression, from a group of text files as strings. The difficulty is the variety of forms the text files may be in. They might be: a single zip/tar file with 100 files, each one XML document; one file with 100 XML documents (aggregate document); one zi...
ghukill
1 vote · 0 answers · 33 views

How to preserve XML data with Databricks spark-xml (Scala)

Below is my Scala code that would write the content as XML `export.coalesce(1).select('*').write .format('com.databricks.spark.xml') .option('attributePrefix', '@') .option('valueTag', '#text') .option('rootTag', 'row') .option('rowTag', 'content') .save('sample.xml')` where I am fetching the data...
R.Illa
1 vote · 2 answers · 792 views

Create Spark SQL tables from multiple parquet paths

I use databricks. I am trying to create a table as below ` target_table_name = 'test_table_1' spark.sql(''' drop table if exists %s ''' % target_table_name) spark.sql(''' create table if not exists {0} USING org.apache.spark.sql.parquet OPTIONS ( path ('/mnt/sparktables/ds=*/name=xyz/') ) '''....
SpaceOddity
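
An alternative to the approach above is to read the parquet paths directly (spark.read.parquet accepts globs and multiple paths) and register the result as a temp view; a hedged sketch using the mount path and table name from the excerpt:

    df = spark.read.parquet('/mnt/sparktables/ds=*/name=xyz/')   # glob over the ds partitions
    df.createOrReplaceTempView('test_table_1')
    spark.sql('SELECT COUNT(*) FROM test_table_1').show()
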
1 vote · 0 answers · 104 views

Data loss while writing from Spark Databricks to Azure Cosmos db

I am sending dataframes from Databricks to a graph collection in Azure Cosmos DB using the Azure Cosmos DB Spark connector and the write configuration provided at https://github.com/Azure/azure-cosmosdb-spark. But I am facing data loss while this transfer occurs. I tried to write 800k records but...
sopho-saksham
1 vote · 0 answers · 35 views

print topics from LDA

I have generated the following Python syntax: Create a new CountVectorizer model without the stopwords cv = CountVectorizer(inputCol='filtered', outputCol='rawFeatures', vocabSize = 1000) cvmodel = cv.fit(wordsDataFrame) df_vect = cvmodel.transform(wordsDataFrame) vacab=cvmodel.vocabulary idf = IDF(inp...
lpt
1 vote · 1 answer · 481 views

NameError: name 'dbutils' is not defined in pyspark

I am running a pyspark job in Databricks cloud. I need to write some of the csv files to the Databricks filesystem (dbfs) as part of this job, and I also need to use some of the dbutils native commands like: #mount azure blob to dbfs location dbutils.fs.mount (source='...',mount_point='/mnt/...',extra_co...
Krishna Reddy
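
When the dbutils object is not injected automatically (for example in a plain Python module rather than a notebook cell), Databricks documents constructing it from the SparkSession; a hedged sketch of that pattern:

    def get_dbutils(spark):
        try:
            from pyspark.dbutils import DBUtils   # available on Databricks clusters / databricks-connect
            return DBUtils(spark)
        except ImportError:
            import IPython                        # notebook fallback: reuse the injected object
            return IPython.get_ipython().user_ns['dbutils']

    dbutils = get_dbutils(spark)
    dbutils.fs.ls('/mnt')
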
1 vote · 1 answer · 403 views

Spark cluster OutOfMemoryError with way more memory than needed

I have a Spark cluster with a 28gb driver and 8x 56gb workers. I am attempting to process a 4gb file. I can successfully process this file without Spark on my own laptop with 16gb of memory, so there is no memory leak causing the full 56gb to be used; it can also process smaller sample...
jayjay93
1 vote · 0 answers · 71 views

convert scala code to python LDA

I have created an LDA model with pyspark's ML library. I am in the final steps of reviewing topics. I need some help converting Scala syntax to Python. Scala code: val topicIndices = ldaModel.describeTopics(maxTermsPerTopic = 5) val vocabList = vectorizer.vocabulary My Python equivalent syntax: topics=...
lpt
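
A near-literal Python translation of the two Scala lines above, assuming lda_model is a pyspark.ml LDAModel and cvmodel is the fitted CountVectorizer from the earlier step:

    topic_indices = lda_model.describeTopics(maxTermsPerTopic=5)   # DataFrame with termIndices / termWeights
    vocab_list = cvmodel.vocabulary                                 # list of terms, index-aligned

    # map term indices back to words for each topic
    for row in topic_indices.collect():
        print(row.topic, [vocab_list[i] for i in row.termIndices])
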
1 vote · 1 answer · 134 views

creating max date function using sql in databricks

I am writing queries in Databricks using SQL on views and would like to calculate the max of the update timestamp columns across multiple views. For instance, I am joining table a with table b and would like to know max(a.updt_ts, b.updt_ts). Since the max function cannot have more than one column ment...
codewalker
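
Spark SQL's GREATEST takes any number of columns and returns the row-wise maximum, which fits the question above; a hedged sketch (view names and the join condition are placeholders, updt_ts is taken from the excerpt):

    result = spark.sql("""
        SELECT a.id,
               GREATEST(a.updt_ts, b.updt_ts) AS max_updt_ts   -- row-wise max across the two views
        FROM view_a a
        JOIN view_b b ON a.id = b.id                           -- placeholder join condition
    """)
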
1 vote · 1 answer · 269 views

Databricks spark-xml when reading tags ending in “/>” return values are null

I'm using the latest version of spark-xml (0.4.1) with Scala 2.11. When I read some XML that contains tags ending with '/>', the corresponding values are null. Follow the example: XML: Dataframe: +----+------+----+--------------------+ | _ID| _name|_age| Operation| +----+------+----+---...
Thiago Zolinger
1 vote · 0 answers · 143 views

customize spark csv line terminator

I am using pyspark code to generate a csv from a dataframe using the below code, df.repartition(1).write.format('com.databricks.spark.csv').option('header','true').mode('overwrite').save('/user/test') But when I open it and check the line terminator in Notepad++, it shows the default line terminator '\n'...
Krishna Reddy
1 vote · 1 answer · 89 views

Databricks SQL: Why is subquery in left join causing error msg

I am attempting to use a subquery in a left join condition, but am getting an error message that reads: 'Error in SQL statement: AnalysisException: Table or view not found: TableD;' and points to the FROM TableD D2 statement in my subquery. SELECT D1.Code, D1.Description, C.InstanceKey FROM TableA A...
Erica Shapland
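
Spark SQL is restrictive about subqueries inside ON clauses; a common rewrite is to turn the subquery into a derived table in the FROM list and join on its result. A generic, hedged sketch (the keys and the aggregate are placeholders, not the asker's actual query):

    spark.sql("""
        SELECT A.*, D2.MaxVersion
        FROM TableA A
        LEFT JOIN (
            SELECT Code, MAX(Version) AS MaxVersion   -- placeholder aggregate
            FROM TableD
            GROUP BY Code
        ) D2
          ON D2.Code = A.Code                         -- plain equality predicate in ON
    """)
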
1 vote · 1 answer · 139 views

Create New Table from DBFS mount to Azure Data Lake

I have a directory on Azure Data Lake mounted to an Azure Databricks cluster. Browsing through the file system using the CLI tools or just running dbfs utils through a notebook, I can see that there are files and data in that directory. Further - executing queries against those files is successful...
Ben Savage
1 vote · 1 answer · 277 views

Writing to Cosmos DB Graph API from Databricks (Apache Spark)

I have a DataFrame in Databricks which I want to use to create a graph in Cosmos, with one row in the DataFrame equating to 1 vertex in Cosmos. When I write to Cosmos I can't see any properties on the vertices, just a generated id. Get data: data = spark.sql('select * from graph.testgraph') Configu...
SAB
1 vote · 1 answer · 175 views

How to authenticate Databricks API using .netrc file

I have created a '.netrc' file on my machine and am trying the below Databricks REST API call, but it always gives an unauthorized error. How do I create a .netrc file for Databricks? curl -n -X GET https:///api/2.0/token/list How do I use the .netrc file with the Databricks API?
Rohi_Dev_1.0
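
For the .netrc question above, the documented pattern is a .netrc entry whose login is the literal word token and whose password is the personal access token; curl -n then picks it up. A hedged sketch (the host and the token value are placeholders):

    # ~/.netrc (restrict permissions, e.g. chmod 600)
    machine <your-instance>.cloud.databricks.com
    login token
    password dapiXXXXXXXXXXXXXXXX

    # call the API; -n tells curl to read credentials from ~/.netrc
    curl -n -X GET https://<your-instance>.cloud.databricks.com/api/2.0/token/list
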
1 vote · 0 answers · 25 views

Can we use an existing AWS EC2 instance with Databricks?

We already have our AWS EC2 instance up and running. Is there a way to use this instance with the Databricks Jobs API instead of creating a separate Databricks cluster?
Ritesh
1 vote · 2 answers · 203 views

Finding the difference between two timestamps in pyspark sql

Below is the table structure; you can notice the column names. cal_avg_latency = spark.sql('SELECT UnitType, ROUND(AVG(TIMESTAMP_DIFF(OnSceneDtTmTS, ReceivedDtTmTS, MINUTE)), 2) as latency, count(*) as total_count FROM `SFSC_Incident_Census_view` WHERE EXTRACT(DATE from ReceivedDtTmTS) == EXTRACT(DATE from...
Amar Desai
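
TIMESTAMP_DIFF is BigQuery syntax, not Spark SQL; in Spark SQL the usual substitute is the difference of unix_timestamp values (in seconds), divided to get minutes. A hedged sketch using the column names from the excerpt (the date filter from the question is omitted here):

    cal_avg_latency = spark.sql("""
        SELECT UnitType,
               ROUND(AVG((unix_timestamp(OnSceneDtTmTS) - unix_timestamp(ReceivedDtTmTS)) / 60), 2) AS latency,
               COUNT(*) AS total_count
        FROM SFSC_Incident_Census_view
        GROUP BY UnitType
    """)
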
1 vote · 1 answer · 420 views

How does one authenticate with Azure Databricks through .Net?

I am trying to figure out how to send HTTP requests to Azure Databricks for my application. Currently I am stuck on authentication; every request returns a 401 Unauthorized error. I have followed their guide and I created a Personal Access Token and retrieved its secret key. This is my code which wo...
Mingyao Xiao
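
A 401 usually means the token is not being sent as a bearer token; the REST API expects an 'Authorization: Bearer <PAT>' header. Shown here as a Python sketch for brevity (the same header applies from .NET's HttpClient); the host and token are placeholders:

    import requests

    HOST = 'https://<your-workspace>.azuredatabricks.net'   # placeholder workspace URL
    TOKEN = 'dapiXXXXXXXXXXXXXXXX'                           # placeholder personal access token

    resp = requests.get(HOST + '/api/2.0/clusters/list',
                        headers={'Authorization': 'Bearer ' + TOKEN})
    print(resp.status_code, resp.json())
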
1 vote · 2 answers · 442 views

How to load databricks package dbutils in pyspark

I was trying to run the below code in pyspark. dbutils.widgets.text('config', '', 'config') It was throwing me an error saying Traceback (most recent call last): File '', line 1, in NameError: name 'dbutils' is not defined So, is there any way I can run it in pyspark by including the databricks p...
Babu
1 vote · 3 answers · 161 views

Can't Save Dataframe to Local Mac Machine

I am using a Databricks notebook and trying to export my dataframe as CSV to my local machine after querying it. However, it does not save my CSV to my local machine. Why? Connect to Database #SQL Connector import pandas as pd import psycopg2 import numpy as np from pyspark.sql import * #Connection...
Tina
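
The notebook runs on the cluster, so pandas.to_csv writes to the driver's filesystem in the cloud, not to the Mac. One hedged workaround is to write under /dbfs/FileStore and download the file through the workspace's /files/ URL (the file name and DataFrame name below are placeholders):

    # pdf is a pandas DataFrame produced by the query (hypothetical name)
    pdf.to_csv('/dbfs/FileStore/my_query.csv', index=False)
    # then download in a browser from:
    #   https://<your-workspace-url>/files/my_query.csv
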
1 vote · 0 answers · 165 views

spark-csv fails parsing with embedded html and quotes

I have this csv file which contains descriptions of several cities: Cities_information_extract.csv I can parse this file just fine using python pandas.read_csv or R read.csv methods. They both return 693 rows and 25 columns. I am trying, unsuccessfully, to load the csv using Spark 1.6.0 and Scala. Fo...
revy
1 vote · 1 answer · 41 views

Spark — Return identity value from SQL Server from Spark 2.3

I need to insert a row into a SQL table from Spark running on Azure Databricks and want to know if there is a way to return the identity value from the primary key that gets generated from SQL Server (@@IDENTITY) back to Databricks
user3241068
1 vote · 0 answers · 19 views

Executing multiple Pyspark scripts in parallel

How can I initiate execution of multiple Pyspark scripts from one notebook, in parallel? Note: I'm currently using Azure Databricks (enterprise edition)
aneeshaasc
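
One hedged pattern for running work in parallel from a single Databricks notebook is to wrap dbutils.notebook.run calls in a thread pool; the notebook paths below are placeholders:

    from concurrent.futures import ThreadPoolExecutor

    notebooks = ['/Users/me/etl_step1', '/Users/me/etl_step2']   # placeholder notebook paths

    def run_notebook(path):
        # second argument is the timeout in seconds
        return dbutils.notebook.run(path, 3600)

    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(run_notebook, notebooks))
    print(results)
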
1 vote · 1 answer · 150 views

On Azure Databricks how can I tell which blob store is mounted

I have inherited a notebook which writes to a mounted Azure blob storage, using syntax: instrumentDf.write.json('/mnt/blobdata/cosmosdata/instrumentjson') How can I find the name of the Azure blob storage it has written to? Thanks!
Simon W
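
dbutils.fs.mounts() lists every mount point together with its source URI, which answers the question above directly; a minimal sketch:

    for m in dbutils.fs.mounts():
        print(m.mountPoint, '->', m.source)
    # look for the row whose mountPoint is /mnt/blobdata; its source is the
    # wasbs:// (or abfss://) URI naming the storage account and container
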
1 vote · 0 answers · 61 views

Partially update Document with Pyspark in CosmosDB with MongoDB API

I'm using Azure Databricks with Pyspark and a CosmosDB with the MongoDB API. The following Pyspark command is being used to store a data_frame in the CosmosDB which works fine: def storeCollection(self, collection, data_frame, save_mode='append'): data_frame.write.format( 'com.mongodb.spark.sql.Defa...
tom1991te
1 vote · 0 answers · 158 views

Consume Secure Kafka from databricks spark cluster

I am trying to consume from a secure Kafka topic (using SASL_PLAINTEXT, ScramLogin method). Spark version 2.3.1, Scala 2.11, Kafka latest. I am using Spark Structured Streaming to construct the stream. For this purpose I imported the library spark-sql-kafka-0-10_2.11-2.3.1. This imports the older v...
Sumit Baurai
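
A hedged sketch of a Structured Streaming source configured for SASL/SCRAM (the broker address, topic, credentials, and the exact SCRAM mechanism are assumptions; Kafka client settings are passed through with the kafka. prefix):

    df = (spark.readStream
          .format('kafka')
          .option('kafka.bootstrap.servers', 'broker1:9093')          # placeholder broker
          .option('subscribe', 'my_topic')                            # placeholder topic
          .option('kafka.security.protocol', 'SASL_PLAINTEXT')
          .option('kafka.sasl.mechanism', 'SCRAM-SHA-256')            # or SCRAM-SHA-512, per the broker
          .option('kafka.sasl.jaas.config',
                  'org.apache.kafka.common.security.scram.ScramLoginModule required '
                  'username="myuser" password="mypassword";')
          .load())
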
