Questions tagged [apache-spark-sql]

1 vote · 1 answer · 848 views

EXCEPT on Specific columns Spark 1.6

I'm trying to filter out rows from dfA using dfB. dfA has the columns year, cid, X, Y, Z; dfB has the columns year and cid. My goal is to filter out of dfA all (year, cid) pairs that appear in dfB. I see...
RefiPeretz
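A minimal sketch for the question above, assuming dfA and dfB are named as in the question: in Spark 1.6 an anti-join can be emulated with a left outer join followed by a null filter on the right-hand key.

  import org.apache.spark.sql.functions.col

  // Rename dfB's key columns so they can be told apart after the join.
  val dfBKeys = dfB.withColumnRenamed("year", "b_year").withColumnRenamed("cid", "b_cid")

  // Keep only dfA rows whose (year, cid) pair has no match in dfB.
  val result = dfA
    .join(dfBKeys, col("year") === col("b_year") && col("cid") === col("b_cid"), "left_outer")
    .filter(col("b_year").isNull)
    .drop("b_year")
    .drop("b_cid")
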
1 vote · 1 answer · 728 views

How do I select an ambiguous column reference? [duplicate]

This question already has an answer here: Enable case sensitivity for spark.sql globally (1 answer). Here's some sample code illustrating what I'm trying to do. There is a dataframe with columns companyid and companyId. I want to select companyId, but the reference is ambiguous. How do I unambiguously...
chris.mclennon
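A minimal sketch of the workaround pointed to by the linked duplicate, assuming a SparkSession named spark and a dataframe named df: with case-sensitive analysis enabled, companyid and companyId become distinct columns and the reference is no longer ambiguous.

  // With case sensitivity on, the two differently-cased columns no longer collide.
  spark.conf.set("spark.sql.caseSensitive", "true")
  df.select("companyId").show()
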
1 vote · 3 answers · 4k views

How to create an empty dataFrame in Spark

I have a set of Avro-based Hive tables and I need to read data from them. Since Spark SQL uses Hive SerDes to read the data from HDFS, it is much slower than reading HDFS directly, so I have used the Databricks spark-avro jar to read the Avro files from the underlying HDFS dir. Everything works fine except w...
Vinay Kumar
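A minimal sketch of one common way to build an empty DataFrame with an explicit schema (the field names below are placeholders, and a SparkSession named spark is assumed), so it can later be unioned with data read through spark-avro.

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

  // Define the schema up front, then pair it with an empty RDD of Rows.
  val schema = StructType(Seq(
    StructField("id", StringType, nullable = true),
    StructField("value", IntegerType, nullable = true)))
  val emptyDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
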
1 vote · 1 answer · 490 views

Spark dataframe calculate the row-wise minimum [duplicate]

This question already has answers here: Comparing columns in Pyspark (4 answers) and Row aggregations in Scala (1 answer). I'm trying to put the minimum value of a few columns into a separate column (creating the min column). The operation is pretty straightforward but I wasn't able to find the right f...
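A minimal sketch for the question above, using the built-in least function; the dataframe name df and column names c1, c2, c3 are placeholders.

  import org.apache.spark.sql.functions.{col, least}

  // least() returns the row-wise minimum of the listed columns.
  val withMin = df.withColumn("min", least(col("c1"), col("c2"), col("c3")))
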
1 vote · 2 answers · 374 views

How to check if key exists in spark sql map type

So I have a table with one column of map type (the key and value are both strings). I'd like to write Spark SQL like this to check if a given key exists in the map: select count(*) from my_table where map_contains_key(map_column, 'testKey') I couldn't find any existing spark sql functions that can do...
seiya
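One possible workaround for the question above, sketched with the table and column names from the question: element access on a map column returns NULL for a missing key, so an IS NOT NULL test acts as a "contains key" check.

  // Counts rows whose map column has an entry for 'testKey'.
  val counted = spark.sql(
    "SELECT count(*) FROM my_table WHERE map_column['testKey'] IS NOT NULL")
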
1 vote · 2 answers · 40 views

Sql Between Clause with multiple columns

I am trying to get the dates between Jan 2016 and March 2018. I have used the query below, but it is not returning proper results. The input columns will come at runtime, and we cannot convert them to a date because the input columns can contain a quarter. SELECT Year,Month,DayOfMonth FROM period WHERE (Y...
dassum
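A minimal sketch of one common workaround for the question above, assuming the period table from the question: fold the year and month into a single comparable number so one range check covers both columns.

  // year * 100 + month turns (2016, 1) into 201601, making the range check simple.
  val inRange = spark.sql(
    """SELECT Year, Month, DayOfMonth FROM period
      |WHERE Year * 100 + Month BETWEEN 201601 AND 201803""".stripMargin)
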
1 vote · 1 answer · 49 views

spark scala create column from Dataframe with values dependent on date time range

I'm trying to create a new column from a dataframe that, let's say, looks like: names | birthtime (datetime): joe | 2017-03-29 2:23:38, mike | 2017-03-29 3:53:38, mary | 2017-03-29 11:63:38, ... I want to add a column that gets an int based on whether the DateTime column is in a given range. Let's say in this case there ar...
Brian
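A minimal sketch for the question above using when/otherwise with between; the boundary timestamps and labels are hypothetical, and the column name birthtime is taken from the question's sample.

  import org.apache.spark.sql.functions.{col, lit, when}

  // Rows whose birthtime falls inside the window get 1, everything else gets 2.
  val labeled = df.withColumn("slot",
    when(col("birthtime").between("2017-03-29 00:00:00", "2017-03-29 06:00:00"), lit(1))
      .otherwise(lit(2)))
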
1 vote · 1 answer · 32 views

Scala Spark: How to pad a sublist inside a dataframe with extra values?

Say I have a dataframe, originalDF, with columns data_id and data_list that looks like this: 3 -> [a, b, d], 2 -> [c, a, b, e], 1 -> [g]. And I have another dataframe, extraInfoDF, which looks like...
JR3652
1 vote · 2 answers · 31 views

filter or label rows based on a Scala array

Is there a way to filter or label rows based on a Scala array? Please keep in mind that in reality the number of rows is much larger. Sample data: val clients = List(List("1", "67"), List("2", "77"), List("3", "56"), List("4", "90")).map(x => (x(0), x(1))); val df = clients.toDF("soc", "ages") +---+----+...
Brian
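A minimal sketch for the question above: Column.isin takes varargs, so a Scala collection can be splatted with `: _*` either to filter rows or to add a label column. The df and soc names come from the question; the id list is hypothetical.

  import org.apache.spark.sql.functions.col

  val socList = Seq("1", "3")
  // Label matching rows ...
  val labeled = df.withColumn("in_list", col("soc").isin(socList: _*))
  // ... or keep only matching rows.
  val filtered = df.filter(col("soc").isin(socList: _*))
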
2 votes · 0 answers · 11 views

How to modify a column based on the values in another column of a PySpark dataframe? F.when edge case

I'd like to go through each row in a pyspark dataframe, and change the value of a column based on the content of another column. The value I am changing it to is also based on the current value of the column to be changed. Specifically, I have a column which contains DenseVectors, and another column...
bmarks2010
1 vote · 2 answers · 35 views

Update a dataframe with nested fields - Spark

I have two dataframes like below Df1 +----------------------+---------+ |products |visitorId| +----------------------+---------+ |[[i1,0.68], [i2,0.42]]|v1 | |[[i1,0.78], [i3,0.11]]|v2 | +----------------------+---------+ Df2 +---+----------+ | id| name| +---+----------...
yAsH
1 vote · 2 answers · 1.1k views

How to compare two StructType sharing same contents?

It seems like StructType preserves order, so two StructTypes containing the same StructFields are not considered equivalent. For example: val st1 = StructType( StructField("ii", StringType, true) :: StructField("i", StringType, true) :: Nil) val st2 = StructType( StructField("i", StringType, true) :: Struc...
THIS USER NEEDS HELP
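A minimal sketch of one way to answer the question above: compare the two schemas after sorting their fields by name, so declaration order no longer affects equality (the helper name sameFields is hypothetical).

  import org.apache.spark.sql.types.StructType

  // Two StructTypes are treated as equivalent if they contain the same fields,
  // regardless of the order in which the fields were declared.
  def sameFields(a: StructType, b: StructType): Boolean =
    a.fields.sortBy(_.name).sameElements(b.fields.sortBy(_.name))
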
-1 votes · 0 answers · 17 views

Split ordered events into snapshots with Spark

I have a data structure called auction hits and I want to move through each event in this RDD to produce snapshots of these auctions at each event, so that I am left with a collection of those auction hits where, if an auction event is of type NEW, it is added to the current collection fo...
eboni
1 vote · 2 answers · 5.9k views

Hive tables not found in Spark SQL - spark.sql.AnalysisException in Cloudera VM

I am trying to access Hive tables through a Java program, but it looks like my program is not seeing any table in the default database. I can, however, see the same tables and query them through spark-shell. I have copied hive-site.xml into the Spark conf directory. The only difference - the spark-shell is running...
Joydeep
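A minimal sketch of the usual fix for the question above in Spark 2.x: build the session with enableHiveSupport() so the program uses the Hive metastore catalog instead of the default in-memory one (the app name below is hypothetical).

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("hive-access")          // hypothetical app name
    .enableHiveSupport()             // without this, Hive tables are not visible
    .getOrCreate()

  spark.sql("SHOW TABLES").show()
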
1 vote · 2 answers · 474 views

Split a row into two and dummy some columns

I need to split a row and create a new row by changing the date columns and setting the amt columns to zero, as in the example below: Input: +---+-----------------------+-----------------------+-----+ |KEY|START_DATE |END_DATE |Amt | +---+-----------------------+------------...
RaAm
1 vote · 1 answer · 1.4k views

Spark very large task warning

I get warnings like this a lot in SparkSQL. What is this warning telling me, and do I need to care about this? WARN TaskSetManager: Stage 1186 contains a task of very large size (4691 KB). The maximum recommended task size is 100 KB.
Carbon
1 vote · 1 answer · 445 views

Union does not remove duplicate rows in spark data frame

I have two data frames like the ones below +--------------------+--------+-----------+-------------+ |UniqueFundamentalSet|Taxonomy|FFAction|!||DataPartition| +--------------------+--------+-----------+-------------+ |192730241374 |1 |I|!| |Japan | |192730241374 |2 |I|!...
Atharv Thakur
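A minimal sketch for the question above: DataFrame union behaves like SQL UNION ALL and keeps duplicates, so follow it with distinct() or dropDuplicates(). The names df1 and df2 are placeholders; the key column name comes from the question's sample.

  // union keeps every row from both inputs; distinct removes exact duplicates.
  val combined = df1.union(df2).distinct()

  // Or deduplicate on a subset of columns:
  val dedupedByKey = df1.union(df2).dropDuplicates("UniqueFundamentalSet")
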
1 vote · 3 answers · 138 views

Extend org.apache.spark.sql.Row functionality : Spark Scala

In the Spark Row trait: /** Returns true if there are any NULL values in this row. */ def anyNull: Boolean = { val len = length; var i = 0; while (i < len) { if (isNullAt(i)) { return true }; i += 1 }; false } This can be used to check whether any value in a row is null. Similarly, I want to evaluate whether any value is 1...
user3190018
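A minimal sketch for the question above: since Row is a trait in the Spark API, a common pattern is to add the check through an implicit wrapper rather than extend Row itself. The helper names RichRow and anyEquals are hypothetical.

  import org.apache.spark.sql.Row

  implicit class RichRow(val row: Row) extends AnyVal {
    // True if any non-null value in the row equals the given target, e.g. 1.
    def anyEquals(target: Any): Boolean =
      (0 until row.length).exists(i => !row.isNullAt(i) && row.get(i) == target)
  }

  // usage: someRow.anyEquals(1)
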
1 vote · 1 answer · 52 views

How to convert dataframe rows values to dynamic columns?

I have a dataFrame as below ----------------------------- | A | B | C | ----------------------------- | 1 | col_1 | val1 | | 1 | col_2 | val2 | | 1 | col_3 | val3 | | 1 | col_4 | val4 | ----------------------------- I need to convert...
Shyam
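A minimal sketch for the question above using pivot: group by the key column A, pivot on B, and take the first C value for each pivoted column (column names are taken from the question).

  import org.apache.spark.sql.functions.first

  // Each distinct value of B (col_1, col_2, ...) becomes its own column.
  val pivoted = df.groupBy("A").pivot("B").agg(first("C"))
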
1 vote · 3 answers · 63 views

Create new dataset using existing dataset by adding null column in-between two columns

I created a dataset in Spark using Java by reading a csv file. Following is my initial dataset: +---+----------+-----+---+ |_c0| _c1| _c2|_c3| +---+----------+-----+---+ | 1|9090999999|NANDU| 22| | 2|9999999999| SANU| 21| | 3|9999909090| MANU| 22| | 4|9090909090|VEENA| 23| +---+----------...
Nandu
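A minimal sketch for the question above, shown in Scala (the Java Dataset API mirrors it): add a null-typed column, then select the columns in the desired order to place it between _c1 and _c2. The new column name is hypothetical.

  import org.apache.spark.sql.functions.lit
  import org.apache.spark.sql.types.StringType

  // lit(null) needs an explicit cast so the new column gets a concrete type.
  val withNull = ds
    .withColumn("_c_new", lit(null).cast(StringType))
    .select("_c0", "_c1", "_c_new", "_c2", "_c3")
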
1 vote · 3 answers · 142 views

Apache Spark startsWith in SQL expression

In the Apache Spark API I can use the startsWith function to test the value of a column: myDataFrame.filter(col("columnName").startsWith("PREFIX")) Is it possible to do the same in a Spark SQL expression and, if so, could you please show an example?
alexanoid
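A minimal sketch for the question above: in a SQL expression the same prefix test can be written with LIKE and a trailing wildcard.

  // Equivalent of col("columnName").startsWith("PREFIX") in SQL syntax.
  val filtered = myDataFrame.filter("columnName LIKE 'PREFIX%'")
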
1 vote · 2 answers · 60 views

A sum of typedLit columns evaluates to NULL

I am trying to create a sum column by taking the sum of the row values of a set of columns in a dataframe, so I followed the method below. val temp_data = spark.createDataFrame(Seq( (1, 5), (2, 4), (3, 7), (4, 6) )).toDF("A", "B") val cols = List(col("A"), col("B")) temp_data.withColum...
rasthiya
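A minimal sketch of one common pattern for the question above: reduce the column list with +, wrapping each column in coalesce so a single NULL does not turn the whole sum into NULL (column names A and B come from the question).

  import org.apache.spark.sql.functions.{coalesce, col, lit}

  val cols = List(col("A"), col("B"))
  // coalesce(c, 0) substitutes 0 for NULL before the columns are added up.
  val summed = temp_data.withColumn("sum",
    cols.map(c => coalesce(c, lit(0))).reduce(_ + _))
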
1 vote · 2 answers · 35 views

scala aggregate first function giving unexpected results

I am using a simple groupby query in scala spark where the objective is to get the first value in the group in a sorted dataframe. Here is my spark dataframe +---------------+------------------------------------------+ |ID |some_flag |some_type | Timestamp | +---------------+---...
muazfaiz
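A minimal sketch of a deterministic alternative for the question above: first() after groupBy is not order-aware, but a window sorted by Timestamp with row_number() reliably picks the first row per group (column names are taken from the question's sample).

  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions.{col, row_number}

  val w = Window.partitionBy("ID").orderBy("Timestamp")
  // Keep only the earliest row within each ID group.
  val firstRows = df
    .withColumn("rn", row_number().over(w))
    .filter(col("rn") === 1)
    .drop("rn")
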
1 vote · 2 answers · 211 views

How to interpolate a column within a grouped object in PySpark?

How do you interpolate a PySpark dataframe within grouped data? For example: I have a PySpark dataframe with the following columns: +--------+-------------------+--------+ |webID |timestamp |counts | +--------+-------------------+--------+ |John |2018-02-01 03:00:00|60 | |John...
penguin
2 votes · 0 answers · 30 views

Is there a way to cache on load?

Is there an option with sparksession.read() to cache on load? I'm reading xml files from s3, and it first scans the files to derive a schema. Since it's reading the files anyway, I would rather just load at that time so that it only reads all the files from s3 once. Is there any way to do this? I al...
user2661771
1 vote · 2 answers · 67 views

What is Inner Like join?

In the Spark source code for join strategies, code comments mention for Broadcast hash join (BHJ): BHJ is not supported for full outer join. For right outer join, we only can broadcast the left side. For left outer, left semi, left anti and the internal join type ExistenceJoin, we only can broadcast...
Anurag Sharma
1 vote · 2 answers · 65 views

append multiple columns to existing dataframe in spark

I need to append multiple columns to an existing Spark dataframe, where the column names are given in a List, assuming the values for the new columns are constant. For example, given the input columns and dataframe val columnsNames=List("col1","col2") val data = Seq(("one", 1), ("two", 2), ("three", 3), ("four", 4...
nat
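A minimal sketch for the question above using foldLeft: add one constant column per name in the list. The constant lit("defaultValue") is a placeholder.

  import org.apache.spark.sql.functions.lit

  val columnsNames = List("col1", "col2")
  // Each pass through the fold adds one more column to the accumulator.
  val extended = columnsNames.foldLeft(df)((acc, name) => acc.withColumn(name, lit("defaultValue")))
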
0 votes · 0 answers · 11 views

how to pass a SparkContext from pyspark file to a scala UDF?

I have a pyspark file, and my main code is written in Python in this file. I also have a Scala file that contains some functions written in Scala, which I use as UDFs in the pyspark code. Now I need to read a csv file as a Spark DataFrame inside the Scala functions. To do that I need to create a SparkSession o...
Ali AzG
0 votes · 0 answers · 3 views

How to execute DB2 sql functions in SPARKSQL

We are aware that MySQL and DB2 are relational databases, and the SQL used in MySQL differs from the SQL in DB2 (by some additional functions). When running a spark-sql job using DB2 SQL I am facing a "function not found" issue, but that function is actually available in DB2, just not in MySQL. multiply_alt is t...
Goutham ssc
1 vote · 1 answer · 724 views

Hierarchical data manipulation in Apache Spark

I have a Dataset in Spark (v2.1.1) with 3 columns (as shown below) containing hierarchical data. My objective is to assign an incremental number to each row based on the parent-child hierarchy. Graphically, the hierarchical data is a collection of trees. As per be...
Sridher
1 vote · 0 answers · 204 views

Scala reflection error while registering a Spark UDF

I'm using Spark UDFs all over my code, but there is one registration that fails intermittently with the following error: scala.reflect.internal.Symbols$CyclicReference: illegal cyclic reference involving package at scala.reflect.internal.Symbols$TypeSymbol.tpe(Symbols.scala:2768) at scala.reflect.in...
Hagai
1 vote · 0 answers · 41 views

How to make Spark Worker read data from local mongodb with mongodb-spark-connector?

I have two MongoDB instances on two computers, and there is also a Spark worker on each computer. But when I run Spark, a worker doesn't read data from its local MongoDB; instead, it reads from one of them, so I only get partial data. There is a page: https://docs.mongodb.com/spark-connector/maste...
BobXWu
1 vote · 0 answers · 41 views

Is there intermediate computation optimization when using functions.window [Spark]

I am using functions.window to create sliding window computation using Spark and Java. Example code: Column slidingWindow = functions.window(singleIPPerRow.col('timestamp'), '3 hours', '1 seconds'); Dataset aggregatedResultsForWindow = singleIPPerRow.groupBy(slidingWindow, singleIPPerRow.col('area')...
Anton.P
1 vote · 1 answer · 14 views

Use regex_extract to retrieve the score number in a string text column

I need to extract the float number after score. {'reason_desc': { 'score':'0.1', 'numOfIndicatrix':'0', 'indicatrix':[]}, 'success':true, 'id':'1555039965661065S427A2DCF5787920' } I expect an output of 0.1, or whatever number is enclosed in the quotes.
JYWQ
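A minimal sketch for the question above using regexp_extract; the text column name payload is a placeholder, and the pattern captures the number that follows 'score':' in the sample.

  import org.apache.spark.sql.functions.{col, regexp_extract}

  // Group 1 captures the digits (with an optional decimal part) inside the quotes.
  val scored = df.withColumn("score",
    regexp_extract(col("payload"), """'score':'([0-9]+\.?[0-9]*)'""", 1))
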
0 votes · 0 answers · 2 views

Scala Spark Dataset Error on Nested Object

I am trying to test dataframe (dataset) code with strongly typed nested case classes, converting them into a dataframe to then pass to my functions. The serialization/creation of the dataframe keeps failing, and I do not have enough experience to know what is going on in Scala or Spark. I think that I am trying to determi...
vfrank66
1 vote · 1 answer · 22 views

Merging records by 1+ common elements

I have a hive table with the following schema: int key1 # this is unique array key2_list Now I want to merge records here if their key2_lists have any common element(s). For example, if record A has (10, [1,3,5]) and B has (12, [1,2,4]), I want to merge it as ([10,12], [1,2,3,4,5]) or ([1,2,3,4,5])....
kee
1 vote · 1 answer · 283 views

Spark UDF written in Java Lambda raises ClassCastException

Here's the exception: java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to ... of type org.apache.spark.sql.api.java.UDF2 in instance of ... If I don't implement the UDF by Lambda expression, it's ok. Like: private UDF2 funUdf = new UDF2() { @Override public S...
secfree
1 vote · 0 answers · 483 views

How to execute Hive queries on Hive 2.1.1 on Spark 2.2.0?

Simple queries, e.g. select, work fine, but when I use aggregate functions, e.g. count, I face errors. I use beeline to connect to Hive 2.1.1 with Spark 2.2.0 and Hadoop 2.8. hive-site.xml is as follows: hive.execution.engine spark Expects one of [mr, tez, spark]. Chooses execution engine. Options a...
chaithanyaa mallamla
1 vote · 0 answers · 303 views

Spark coalesce on rdd resulting in less partitions than expected

We are running a Spark batch job which performs the following operations: create a dataframe by reading from a hive table, convert the dataframe to an rdd, and store the rdd in a list. The above steps are performed for 2 different tables, and a variable (called minNumberPartitions) is set which holds the minimum number of...
Debanjan Dhar
1 vote · 0 answers · 80 views

Spark Sql z-score calculation with given peer group in Java-8

How do I calculate z-scores in Spark SQL in Java 8? I tried understanding other posts that use window functions, but in my case the peer group will be 100.
Suresh Polisetti
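A minimal sketch for the question above in Scala (the Java API is analogous): compute the mean and standard deviation over a peer-group window, then derive the z-score. The column names peer_group and value are placeholders.

  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions.{avg, col, stddev}

  val w = Window.partitionBy("peer_group")
  val withZ = df
    .withColumn("mean", avg(col("value")).over(w))
    .withColumn("sd", stddev(col("value")).over(w))
    .withColumn("z", (col("value") - col("mean")) / col("sd"))
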
