Questions tagged [dataframe]

25654 questions
1 vote · 2 answers · 88 views

Create a new column in a dataframe if the column contains a string from a column of another dataframe

I want to create a new column in my dataframe if the column contains any of the values from a column of a second dataframe. First dataframe WXYnineZAB EFGsixHIJ QRSeightTUV GHItwoJKL YZAfiveBCD EFGsixHIJ MNOthreePQR ABConeDEF MNOthreePQR MNOthreePQR YZAfiveBCD WXYnineZAB GHItwoJKL KLMsevenNOP EFGsix...
Prasad
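A minimal sketch of one way to do this with str.contains, assuming the values live in single text columns (the column names text, keys, and has_key are made up for illustration):

```python
import re

import pandas as pd

df1 = pd.DataFrame({'text': ['WXYnineZAB', 'EFGsixHIJ', 'ABConeDEF']})
df2 = pd.DataFrame({'keys': ['nine', 'six']})

# Build a single regex alternation from the search values,
# escaping any regex metacharacters they might contain
pattern = '|'.join(re.escape(k) for k in df2['keys'])

# New boolean column: True where any search value occurs in the text
df1['has_key'] = df1['text'].str.contains(pattern)
```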
1 vote · 0 answers · 41 views

How to make Spark Worker read data from local mongodb with mongodb-spark-connector?

I have two 'mongodb' instances on two computers, and there is also a 'Spark Worker' on each computer. But when I run 'spark', it doesn't read data from its local 'mongodb'; instead, it reads from only one of them, so I only get partial data. There is a relevant page: https://docs.mongodb.com/spark-connector/maste...
BobXWu
1 vote · 0 answers · 41 views

Is there intermediate computation optimization when using functions.window [Spark]

I am using functions.window to create a sliding-window computation with Spark and Java. Example code: Column slidingWindow = functions.window(singleIPPerRow.col("timestamp"), "3 hours", "1 seconds"); Dataset aggregatedResultsForWindow = singleIPPerRow.groupBy(slidingWindow, singleIPPerRow.col("area")...
Anton.P
1 vote · 1 answer · 249 views

Writing flatten data.frame in .csv from a Shiny app

I have a problem managing a dynamic data.frame created in my Shiny app. The problem is that, since it is handled through server.R, I can't find a way to access it, flatten it into a vector, and then write it after my other variables (which are all text/numeric inputs). The working app can be accessed here:...
Samuel Tremblay
1 vote · 1 answer · 116 views

Is it necessary to convert data to a binary set to calculate similarity (Jaccard index)?

I need to calculate the Jaccard similarity for the dataframe below: df = data.frame( a=c('1', '1', '1', '1', '2', '2', '2', '3', '3', '4', '4', '4', '4'), b=c('100', '101', '111', '25841', '111', '101', '106', '101', '108', '100', '30256', '108', '112')) Is it necessary to convert the data to a binary set? How do thi...
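Converting to a binary (presence/absence) table is one common route; a sketch treating the groups in a as sets of the items in b:

```python
import pandas as pd

df = pd.DataFrame(
    {'a': ['1', '1', '1', '2', '2', '3'],
     'b': ['100', '101', '111', '111', '101', '101']})

# One row per group in `a`, one column per item in `b`, True where present
binary = pd.crosstab(df['a'], df['b']).astype(bool)

# Jaccard index of groups '1' and '2': |intersection| / |union|
x, y = binary.loc['1'], binary.loc['2']
jaccard = (x & y).sum() / (x | y).sum()
```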
1 vote · 0 answers · 42 views

Counter of previous matches

I have found a lot of material about counting a value in a table, but my goal is a little different and I haven't found any source. This is my data: ID_1 ID_2 Date RESULT 1 12 3 2011-12-21 0 2 12 13 2011-12-22 0 3 3...
Mirko Piccolo
1 vote · 0 answers · 80 views

Spark Sql z-score calculation with given peer group in Java-8

How do I calculate z-scores in Spark SQL in Java 8? I tried to understand other posts that use Window functions, but in my case the peer group will be 100.
Suresh Polisetti
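The question targets Java, but the window logic is identical across bindings; here is a minimal PySpark sketch, with a made-up peer_group column standing in for the 100-member peer groups:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("g1", 10.0), ("g1", 20.0), ("g1", 30.0), ("g2", 5.0), ("g2", 7.0)],
    ["peer_group", "score"])

# z-score of each row relative to the mean/stddev of its peer group
w = Window.partitionBy("peer_group")
df = df.withColumn(
    "zscore",
    (F.col("score") - F.avg("score").over(w)) / F.stddev("score").over(w))
```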
1 vote · 1 answer · 39 views

Pandas compare values for the same time every day

I have this data frame: date_time value 1/10/2016 0:00:00 28.4 1/10/2016 0:05:00 28.4 1/10/2016 0:10:00 28.4 1/11/2016 0:00:00 27.4 1/11/2016 0:05:00 27.4 1/11/2016 0:10:00 27.4 I want to calculate the difference between the rows at the same timestamp on each day, then add new ca...
Trần Danh Lưu
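A sketch of one way, on sample data shaped like the question's: group rows by their time of day, then diff gives each row's change from the previous day at the same time (the column name day_over_day is made up):

```python
import pandas as pd

df = pd.DataFrame({
    'date_time': pd.to_datetime(['1/10/2016 0:00', '1/10/2016 0:05',
                                 '1/11/2016 0:00', '1/11/2016 0:05']),
    'value': [28.4, 28.3, 27.4, 27.2]})

# Group rows by time of day, then take the difference from the
# previous day's row at the same time
df['day_over_day'] = df.groupby(df['date_time'].dt.time)['value'].diff()
```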
1 vote · 1 answer · 27 views

Creating pandas dataframes from a nested JSON file that has lists

A picture shows what the data looks like. I have a JSON file whose data is deeply nested; I want to take only the words and create a new dataframe for each post ID. Can anyone help with this?
penestia
1 vote · 1 answer · 216 views

Unable to fillna a column in dataframe with values from a series

I am trying to fillna in a specific column of the dataframe with the mean of the non-null values of the same type (based on the value from another column in the dataframe). Here is the code to reproduce my issue: import numpy as np import pandas as pd df = pd.DataFrame() #Create the DataFrame with a col...
Sachin Myneni
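The usual trick is groupby(...).transform('mean'), which returns a series aligned to the original index and so feeds fillna directly; a sketch with a hypothetical type column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'type': ['a', 'a', 'b', 'b'],
                   'val': [1.0, np.nan, 3.0, np.nan]})

# Mean of the non-null values per type, broadcast back to every row
group_mean = df.groupby('type')['val'].transform('mean')
df['val'] = df['val'].fillna(group_mean)
```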
1 vote · 1 answer · 505 views

Large data with pivot table using Pandas

I’m currently using a Postgres database to store survey answers. The problem I’m facing is that I need to generate pivot tables from the Postgres database. When the dataset is small, it’s easy to just read the whole data set and use Pandas to produce the pivot table. However, my current database now has a...
Dat Nguyen
1 vote · 1 answer · 310 views

Compare timestamps in subsequent records with pandas

I have a large data set of 30000 KB (saved as a pandas DataFrame) of chat conversations between experts and users. Each row represents a message sent by either the expert or the user. I want to measure the time between the second message the user sent and the second response of the expert. (notice...
Sharonio
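One plausible sketch, with hypothetical conversation, sender, and ts columns: number each sender's messages per conversation with cumcount, then compare the rows numbered 1 (each side's second message):

```python
import pandas as pd

df = pd.DataFrame({
    'conversation': [1, 1, 1, 1],
    'sender': ['user', 'expert', 'user', 'expert'],
    'ts': pd.to_datetime(['2019-01-01 10:00', '2019-01-01 10:05',
                          '2019-01-01 10:10', '2019-01-01 10:20'])})

# 0-based position of each message within (conversation, sender)
df['nth'] = df.groupby(['conversation', 'sender']).cumcount()

# Keep each side's second message (nth == 1) and compare timestamps
second = df[df['nth'] == 1].pivot(index='conversation',
                                  columns='sender', values='ts')
second['delta'] = second['expert'] - second['user']
```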
1 vote · 0 answers · 250 views

How to convert a DataFrame to a JavaRDD with distributed copy?

I'm new to Spark optimization. I'm trying to read Hive data into a DataFrame, then converting the DataFrame to a JavaRDD and running a map function on top of it. The problem I'm facing is that the transformation running on top of the JavaRDD runs as a single task. Also, the transformations running on...
Makubex
1 vote · 1 answer · 132 views

Slice a Pandas Dataframe based on the results of a function on a column

I want to slice a dataframe using a condition based on a DateTime column's month, element by element: Met_Monthly_DF = Metsite_DF.iloc[Metsite_DF['DateTime'].month == Month] I get the error: builtins.AttributeError: 'Series' object has no attribute 'month' It works on an element by element basis if...
Georgina Peach
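The missing piece is the .dt accessor, which vectorizes datetime attributes over a Series; plain boolean indexing then replaces .iloc. A sketch with made-up sample data, assuming DateTime is a datetime64 column:

```python
import pandas as pd

Metsite_DF = pd.DataFrame({
    'DateTime': pd.to_datetime(['2018-01-05', '2018-02-10']),
    'wind_speed': [3.2, 4.1]})
Month = 2

# .month on a Series needs the .dt accessor; the comparison
# then yields a boolean mask for ordinary row selection
Met_Monthly_DF = Metsite_DF[Metsite_DF['DateTime'].dt.month == Month]
```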
1 vote · 0 answers · 37 views

compute dataframe operations on selected rows only

I have time-series data. time,val 2018-04-25,30 2018-04-26,10 2018-04-27,-30 2018-04-28,0 2018-04-29,60 I need 4 columns to be added: 1. mean 2. average 3. (shifted) val shifted by 1 4. (diff) 1 if val>0 else 0 In the first go I calculate these as: df['mean'] = df[val].ewm...
ashwani
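For the first pass, the columns can be built directly; a sketch assuming val is the numeric column (restricting to selected rows would then be a matter of masking with .loc before assigning):

```python
import pandas as pd

df = pd.DataFrame({'time': pd.to_datetime(['2018-04-25', '2018-04-26',
                                           '2018-04-27', '2018-04-28',
                                           '2018-04-29']),
                   'val': [30, 10, -30, 0, 60]})

df['mean'] = df['val'].ewm(span=3).mean()   # exponentially weighted mean
df['shifted'] = df['val'].shift(1)          # previous row's val
df['diff'] = (df['val'] > 0).astype(int)    # 1 if val > 0 else 0
```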
1 vote · 1 answer · 57 views

Filtering by datetime confusion

I have a datetime object that is not the index, and when I filter it by: df=df[(df['local_time']>=datetime.date(2015,2,18))] df=df.sort_values('local_time',ascending=[True]) why does df.head(1) show 2-17-2015 as the first date when I'm using >=datetime.date(2015,2,18)?
Danny W
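A likely culprit is comparing a datetime64 column against a datetime.date; a sketch that compares against a pd.Timestamp instead, on made-up rows spanning the boundary:

```python
import pandas as pd

df = pd.DataFrame({'local_time': pd.to_datetime(['2015-02-17 23:00',
                                                 '2015-02-18 01:00'])})

# pd.Timestamp compares cleanly with a datetime64 column
df = df[df['local_time'] >= pd.Timestamp(2015, 2, 18)]
df = df.sort_values('local_time', ascending=True)
```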
1 vote · 1 answer · 431 views

Get date months with iteration over Spark dataframe

I have to iterate over the last 36 months based on an input date. Currently, using Scala, I am getting the max value of a timestamp field through a DataFrame. For example: val vGetDate = hc.read.format("filodb.spark").option("database","YYYYY").option("dataset","XXX").load().agg(max("inv_da...
Rajdip
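Once the max date has been collected to the driver, the 36-month iteration can stay outside Spark; a pandas sketch with a stand-in max date:

```python
import pandas as pd

# Stand-in for the max(inv_date) value collected from the DataFrame
max_date = pd.Timestamp('2018-03-15')

# First day of each of the 36 months ending with max_date's month
months = pd.date_range(end=max_date.replace(day=1), periods=36, freq='MS')
```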
1 vote · 0 answers · 45 views

Display NA in Excel output from R

I have a dataframe which is numeric, but I want the blanks to be NA and for this to be shown in the Excel output. My file is a CSV which reads in as numeric with 2 columns. My code is below. When I export the data now, the cells display #NUM!. Any advice would be helpful. #read in data...
user3018495
1 vote · 1 answer · 317 views

Spark | For a synchronous request/response use case

Spark newbie alert. I've been exploring ideas to design for a requirement which involves the following: building a base predictive model for linear regression (a one-off activity). Pass the data points to get the value for the response variable. Do something with the result. At regular intervals update...
user1189332
1 vote · 0 answers · 48 views

Assign NA to cells according to sum of column and row index for each cell

I am dealing with a data frame and need to assign NA to cells if the sum of the row and column index for a cell is bigger than a certain value. df 10){ df[j, i]
plotly_user
1 vote · 3 answers · 221 views

Python Pandas: How to remove all columns from a dataframe that contain the values in a list?

include_cols_path = sys.argv[5] with open(include_cols_path) as f: include_cols = f.read().splitlines() include_cols is a list of strings df1 = sqlContext.read.csv(input_path + '/' + lot_number +'.csv', header=True).toPandas() df1 is a dataframe of a large file. I would like to only retain the colum...
Cody
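If the goal is to retain exactly the listed columns, filtering the column index is enough; a sketch on toy data:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})
include_cols = ['a', 'c']

# Keep only the columns whose names appear in include_cols
df1 = df1[[c for c in df1.columns if c in include_cols]]
```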
1 vote · 1 answer · 25 views

not able to store result in hdfs when code runs for second iteration

Well, I am new to Spark and Scala and have been trying to implement data cleaning in Spark. The code below checks for the missing value in one column, stores it in outputrdd, and runs loops to calculate the missing value. The code works well when there is only one missing value in the file. Since hdfs does...
Yogesh Patel
1 vote · 1 answer · 675 views

PySpark - Calling a sub-setting function within a UDF

I have to find neighbors of a specific data point in a pyspark dataframe. a= spark.createDataFrame([('A', [0,1]), ('B', [5,9]), ('D', [13,5])],['Letter', 'distances']) I have created this function that will take in the dataframe (DB) and then check the closest data points to a fixed point (Q) using...
Bryce Ramgovind
1 vote · 0 answers · 621 views

Error in `python3': free(): invalid pointer

I'm trying to read all the csv files in 2 directories using the glob module: import os import pandas as pd import glob def get_list_of_group_df(filepath): all_group_df_list = [] groups_path = filepath for file in glob.glob(groups_path): name = os.path.basename(file) name = name.partition('_raw')...
Bella
1 vote · 0 answers · 103 views

Assigning a column to a SparseDataFrame

Consider: df = pd.DataFrame({'a':[1,2,3]}) df a 0 1 1 2 2 3 I'd like to do two things: convert the dataframe to sparse with a default fill value of False, and assign a column of all False values to this sparse dataframe. Here are two seemingly similar approaches I've come up with. First method; assig...
cs95
1 vote · 0 answers · 45 views

How to calculate connections of the node in Spark 2

I have the following DataFrame df: val df = Seq( (1, 0, 1, 0, 0), (1, 4, 1, 0, 4), (2, 2, 1, 2, 2), (4, 3, 1, 4, 4), (4, 5, 1, 4, 4) ).toDF("from", "to", "attr", "type_from", "type_to") +-----+-----+----+---------------+---------------+ |from |to |attr|type_from |type_to | +-----+-----...
Markus
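If "connections" means node degree, one reading is to stack both endpoint columns and count per node; a PySpark sketch of that interpretation:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 0), (1, 4), (2, 2), (4, 3), (4, 5)], ["from", "to"])

# Every edge touches two nodes: stack both endpoint columns, then count
nodes = df.select(F.col("from").alias("node")) \
    .union(df.select(F.col("to").alias("node")))
degrees = nodes.groupBy("node").count()
```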
1 vote · 0 answers · 747 views

Create a table in SQLite by using a dataframe

I'm new to sqlite3 and trying to understand how to create a table in a SQL environment by using my existing dataframe. I already have a database that I created as 'pythonsqlite.db'. #import my csv to python import pandas as pd my_data = pd.read_csv('my_input_file.csv') ## connect to database import sq...
Cagdas Kanar
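pandas can create the table itself via to_sql; a sketch reusing the question's database name (the table name my_table and the sample frame are made up):

```python
import sqlite3

import pandas as pd

my_data = pd.DataFrame({'id': [1, 2], 'score': [0.5, 0.9]})  # stands in for the CSV

conn = sqlite3.connect('pythonsqlite.db')
# to_sql creates the table if needed and inserts the rows
my_data.to_sql('my_table', conn, if_exists='replace', index=False)
conn.close()
```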
1 vote · 3 answers · 396 views

Uploading csv to SQL table using R shiny

I've been scratching my head trying to figure this out. I've connected to the database, but when I press the action button nothing happens to the table. The CSV is being converted to a data frame. UI library(shiny) library(RJDBC) library(dbtools) library(jsonlite) library(shinyjs) library(DBI...
TheAnalyst
1 vote · 1 answer · 289 views

“DateParseError: Unknown datetime string format, unable to parse: …” using pandas

I have some problems with my code in Python. Here is the test code: import pandas as pd dict={'Country':['USA','China','Canada'],'Capitol':['Washington DC','Beijing','Ottawa'],'2015-01':[10,20,30],'2015-02':[15,25,35],'2015-03':[20,30,40],'2015-04':[10,20,30],'2015-05':[40,50,60],'2015-06':[20,30,...
Serenity
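This error often comes from feeding non-date strings (such as 'Country') to the datetime parser; melting the month columns first and parsing only those avoids it. A sketch on a trimmed version of the data:

```python
import pandas as pd

df = pd.DataFrame({'Country': ['USA', 'China'],
                   '2015-01': [10, 20], '2015-02': [15, 25]})

# Reshape the month columns into rows, then parse only the month strings
long_df = df.melt(id_vars=['Country'], var_name='month', value_name='value')
long_df['month'] = pd.to_datetime(long_df['month'], format='%Y-%m')
```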
1 vote · 1 answer · 1.3k views

How to relationalize a JSON to flat structure in AWS Glue

I am trying to flatten input JSON data that has two map/dictionary fields (custom_event1 and custom_event2), which may contain arbitrary key-value data. In order to create an output table from the data frame, I will have to avoid flattening custom_events and store it as a JSON string in the column. Follo...
Sumit Saurabh
1 vote · 1 answer · 162 views

Button that returns dataframe

I have a Jupyter Notebook that contains a button, some dropdowns, and a dataframe. Essentially, selections are made in a dropdown, and after clicking the button, a dataframe containing values selected in the dropdowns is produced. My dropdowns work perfectly. However, my button does not return the...
Andrew Louis
1 vote · 0 answers · 44 views

Subsetting a data.frame

Probably an easy one. I have a data.frame with three columns: cluster, group and id. Each set.seed(1) df
dan
1 vote · 1 answer · 910 views

Save spark dataframe schema to hdfs

For a given data frame (df) we get the schema by df.schema, which is a StructType array. Can I save just this schema onto hdfs, while running from spark-shell? Also, what would be the best format in which the schema should be saved?
Ashwin
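df.schema.json() yields a plain JSON string, so any text file on HDFS will do; a PySpark sketch (the HDFS path is made up):

```python
import json

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a")], ["id", "name"])

# Serialize the schema and write it as a single-partition text file
schema_json = df.schema.json()
spark.sparkContext.parallelize([schema_json], 1) \
    .saveAsTextFile("hdfs:///tmp/df_schema")

# Later: rebuild the StructType from the stored JSON
restored = StructType.fromJson(json.loads(schema_json))
```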
1 vote · 0 answers · 184 views

Spark : Why some executors are having 0 active tasks and some 13 tasks?

I am trying to read from S3 and do a count on the data frame. I have a cluster of 76 r3.4xlarge (1 master and 75 slaves). I set: spark.dynamicAllocation.enabled 'true' maximizeResourceAllocation 'true' When I checked the Spark UI, I saw just 25 executors, and of those only 7 have active ta...
user3407267
1 vote · 1 answer · 150 views

How to split a dict column from a pandas data frame

Splitting dictionary/list inside a Pandas Column into Separate Columns: the above link provides some solutions, but I have the same problem with slightly different input. Here is my DF: df = pd.DataFrame({'a':[1,2,3], 'b':[[{'c':1},{'c':3}], {'d':3}, {'c':5, 'd':6}]}) My dict again contains...
Rakesh Bhagam
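For cells holding a single dict, apply(pd.Series) expands them into columns; list-valued cells would need to be exploded into separate rows first. A sketch of the dict-only case:

```python
import pandas as pd

df = pd.DataFrame({'a': [2, 3], 'b': [{'d': 3}, {'c': 5, 'd': 6}]})

# Expand each dict into its own columns, then join back onto the frame
expanded = df['b'].apply(pd.Series)
result = df.drop(columns='b').join(expanded)
```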
1 vote · 1 answer · 24 views

Get df indices as a dictionary

I am working with multi-index pandas dataframes. My dataframe: Value index1 index2 index3 1 5 3 20 It comes from some filtering, and I will always have one row. I would like to get this dict: {index_name1: index_value1, index_name2: index_value2, ...} {'index1':1,'index2':5,'i...
user9187374
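With a single-row MultiIndex frame, zipping the level names against the first index tuple does it; a sketch:

```python
import pandas as pd

df = pd.DataFrame(
    {'Value': [20]},
    index=pd.MultiIndex.from_tuples([(1, 5, 3)],
                                    names=['index1', 'index2', 'index3']))

# Pair each index level name with its value in the (only) row
result = dict(zip(df.index.names, df.index[0]))
# {'index1': 1, 'index2': 5, 'index3': 3}
```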
1 vote · 0 answers · 560 views

Spark job dataframe write to Oracle using jdbc failing

When writing a Spark dataframe to an Oracle database (Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit), the Spark job fails with the exception java.sql.SQLRecoverableException: IO Error: The Network Adapter could not establish the connection. The Scala code is dataFrame.write.mode(...
Abhishek Joshi
1 vote · 1 answer · 350 views

spark-excel datatype issues

I am using the spark-excel package for processing MS Excel files with Spark 2.2. Some of the files fail to load as a Spark dataframe with the exception below. If someone has faced this issue, can you please help fix such data type issues? After analyzing, I found that if the column name is n...
nilesh1212
1 vote · 1 answer · 56 views

Select ALL rows where Pandas DataFrame Column values in a List

I have been working on this problem for hours and I just couldn't get the right result. I have a DataFrame that looks like this: df: Col1 Col2 Col3 Col4 aaa bbb ccc 1 aaa bbb ccc 2 aaa bbb ccc 3 aaa bbb ccc 4 aaa bbb ccc 5 And a List = [1,3,5] What I want to do is to select all rows in df...
Maki
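This is what Series.isin is for; a sketch on the question's shape (the list name wanted is made up):

```python
import pandas as pd

df = pd.DataFrame({'Col1': ['aaa'] * 5,
                   'Col4': [1, 2, 3, 4, 5]})
wanted = [1, 3, 5]

# Boolean mask: True where Col4 is one of the wanted values
result = df[df['Col4'].isin(wanted)]
```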
1 vote · 0 answers · 189 views

Data manipulation in PySpark [duplicate]

This question already has answers here: How to melt Spark DataFrame? (3 answers) Unpivot in spark-sql/pyspark (1 answer) I have a dataframe A whose content is as below: site id date1 date2 A 4/14/2001 1/1/1997 B 3/04/2000 4/8/1999 I want to pivot the data down and store i...
Varun Chadha
