Questions tagged [dataframe]

17542 questions
1 vote · 1 answer · 80 views

Plotting a dataframe in R

I have the following dataframe (1:10): Date Avg.Join.Delay Min.Join.Dely Max.Join.Dely ACCOUNT STB_TYPE MARKET 1 6/5/2015 199.20000 51 396 2063207586 IPH8010 Seattle 2 6/5/2015 77.68750 50 145 2063207586 IPW8000 Seattle 3 6/5/2015 8...
A.J
1 vote · 1 answer · 553 views

R: data frame column with empty char strings turns to NA on read

The idea is to create and update a catalog by rbind-ing data frames. This includes sequentially reading and writing files. The problem appears when, for some data frames, certain character string columns don't contain any values (blank chr strings ""). Somehow R treats those columns as NULL values and apparently...
statespace
0 votes · 2 answers · 22 views

Summarizing the difference from the present row to the previous

Calculating the difference from present row to the previous, I have a simple data set and codes below: import pandas as pd data = {'Month' : [1,2,3,4,5,6,7,8,9,10,11,12], 'Rainfall': [112,118,132,129,121,135,148,148,136,119,104,118]} df = pd.DataFrame(data) Rainfall = df["Rainfall"] df['Changes'] =...
Mark K
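
A minimal pandas sketch of one common approach, using Series.diff to compute row-over-row changes (the first row has no predecessor, so its change comes out as NaN):

    import pandas as pd

    data = {'Month': list(range(1, 13)),
            'Rainfall': [112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118]}
    df = pd.DataFrame(data)
    # difference between each row and the previous row
    df['Changes'] = df['Rainfall'].diff()
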
2 votes · 2 answers · 17 views

How to keep the rows with NaN values in Python

I used to drop the rows which have one cell with a NaN value with this command: pos_data = df.iloc[:,[5,6,2]].dropna() Now I want to know how I can keep the rows with NaN and remove all other rows which do not have NaN in one of their columns. My data is a pandas DataFrame. Thanks.
CFD
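
A minimal sketch of one way to invert dropna, assuming the same column positions as in the question: keep only the rows where at least one of those columns is NaN.

    # rows where any of columns 5, 6, 2 is NaN
    nan_rows = df[df.iloc[:, [5, 6, 2]].isna().any(axis=1)]
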
0 votes · 1 answer · 18 views

How to add dummies to Pandas DataFrame?

I have a data_df that looks like: price vehicleType yearOfRegistration gearbox powerPS model kilometer fuelType brand notRepairedDamage postalCode 0 18300 coupe 2011 manuell 190 NaN 125000 diesel audi ja 66954 1 9800...
B Seven
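
A minimal sketch using pd.get_dummies; the column list here is illustrative, picked from the categorical-looking columns in the excerpt:

    import pandas as pd

    # one-hot encode selected categorical columns; dummy_na adds an indicator for NaN
    dummies_df = pd.get_dummies(data_df, columns=['vehicleType', 'gearbox', 'fuelType'],
                                dummy_na=True)
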
2 votes · 1 answer · 18 views

How do I iterate over the rows in a dataframe, swap two adjacent rows, and perform some operations on the new dataframe created after the swap?

for i, rows in df.iterrows(): x, y = df.iloc[rows].copy(), df.iloc[rows+1].copy() df.iloc[rows], df.iloc[rows+1] = y, x break I get an error on execution: positional indexers are out-of-bounds
Abdul Raheem
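
The out-of-bounds error comes from passing the row Series from iterrows to df.iloc, which expects integer positions, and from i+1 running past the last row. A minimal sketch of a positional swap, assuming i and i+1 are both valid positions:

    import pandas as pd

    def swap_adjacent(df: pd.DataFrame, i: int) -> pd.DataFrame:
        out = df.copy()
        # .to_numpy() bypasses index alignment so the values actually swap
        out.iloc[[i, i + 1]] = out.iloc[[i + 1, i]].to_numpy()
        return out
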
0 votes · 1 answer · 24 views

Re-index a pandas DataFrame in Python

I have a data frame and I dropped some part of it. Now my new data frame does not have all the rows, if we consider the data frame as a table. I want to change 1 2 3 11 . . . to 0 1 2 3 4 . . . Thanks.
CFD
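
A minimal sketch: reset_index rebuilds the row labels as 0..n-1, and drop=True discards the old labels instead of keeping them as a column.

    df = df.reset_index(drop=True)
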
0 votes · 3 answers · 20 views

Is there an R function to filter a dataframe in which columns with all matching values are kept?

I have a large data frame with 5 rows but ~100k columns. I would like to keep columns in which all values within a column match. This is a sample of the dataframe > df Mouse JAX00000010 JAX00000010r UNCHS000003 JAX00240606 JAX00240613 JAX00240636 UNCHS000005 1 407 BF BF B...
hacketju
0 votes · 2 answers · 19 views

How to handle a KeyError when working with a list inside a list

I have written two functions in Python that I intend to re-use multiple times. Together they will allow me to calculate the total travel distance of a vehicle in a warehouse collecting goods from defined locations in a single aisle. One function get_orderpick extracts two lists from input data in a...
Cerberton
1 vote · 1 answer · 1.3k views

How to get the input file name of a record in a Spark dataframe?

I am creating a dataframe in Spark by loading tab-separated files from S3. I need to get the input file name of each record in the dataframe for further processing. I tried dataframe.select(inputFileName()) but I am getting a null value for input_file_name. Somebody please help me to solv...
ab_
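
A minimal PySpark sketch; input_file_name() is most reliable when the column is added directly on the dataframe produced by the read, before other transformations (the path is a placeholder):

    from pyspark.sql.functions import input_file_name

    df = spark.read.option("sep", "\t").csv("s3://bucket/path/")  # hypothetical path
    df = df.withColumn("source_file", input_file_name())
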
1 vote · 3 answers · 2.6k views

Pyspark - how to do case-insensitive dataframe joins?

Is there any nice-looking code to perform a case-insensitive join in Pyspark? Something like: df3 = df1.join(df2, ["col1", "col2", "col3"], "left_outer", "case-insensitive") Or what are your working solutions to this?
Babu
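
There is no built-in "case-insensitive" join mode, but one common workaround is lower-casing both sides of each join key. A minimal sketch, assuming the three column names from the question:

    from pyspark.sql.functions import lower

    cond = [lower(df1["col1"]) == lower(df2["col1"]),
            lower(df1["col2"]) == lower(df2["col2"]),
            lower(df1["col3"]) == lower(df2["col3"])]
    df3 = df1.join(df2, cond, "left_outer")
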
1 vote · 1 answer · 696 views

Spark not saving the dataframe as a parquet file

Trying to save the Spark dataframe as a parquet file, but unable to achieve this due to the exception below. Kindly guide me if I am missing something. The dataframe has been constructed from the Kafka stream RDDs. dataframe.write.paraquet("/user/space") Exception stack: Exception in thread "streaming-jo...
Bindumalini KK
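
Worth noting that DataFrameWriter has no paraquet method; the call in the excerpt is misspelled. A minimal sketch of the correct call, using the path from the question:

    dataframe.write.parquet("/user/space")
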
1 vote · 1 answer · 2.5k views

Filling an empty value in Scala Spark Dataframe [duplicate]

This question already has an answer here: Convert null values to empty array in Spark DataFrame 2 answers I am currently working with a dataframe in Scala, and can't figure out how to fill a column with a Seq.empty[Row] value if the value in that row is null. I understand there is the df....
Daniel Dao
1 vote · 1 answer · 400 views

PySpark: use one column to index another (udf of two columns?)

(Edited Feb 14th) Let's say I have a Spark (PySpark) dataframe with the following schema: root |-- myarray: array (nullable = true) | |-- element: string (containsNull = true) |-- myindices: array (nullable = true) | |-- element: integer (containsNull = true) It looks like: +------------------...
xenocyon
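
A minimal PySpark sketch of a two-column UDF that uses one array to index the other; the output type must be declared, and the lambda guards against nulls:

    from pyspark.sql.functions import udf, col
    from pyspark.sql.types import ArrayType, StringType

    pick = udf(lambda arr, idx: [arr[i] for i in idx]
               if arr is not None and idx is not None else None,
               ArrayType(StringType()))
    df = df.withColumn("picked", pick(col("myarray"), col("myindices")))
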
1 vote · 1 answer · 657 views

Apply a function to a specific column for every row in a dataframe in R

I want to apply a user-defined function to a specific column for every row in a dataframe in R and save the result back into the column.
stroz
1 vote · 2 answers · 236 views

Compress pandas dataframe based on column name and last non-NaN value

I have a pandas dataframe that looks like the following: col1 col2 x_1 x_2 x_3 x_4 a b 0.3 0.2 NaN NaN c d 0.4 0.3 0.2 NaN e f 0.2 0.1 NaN NaN v x NaN 0.2 NaN NaN x r NaN NaN NaN NaN What I'd like to do is for each row find the right-most numeric value, and restructure...
mvarchar
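
A minimal pandas sketch for the "right-most numeric value" part: forward-filling along the columns pushes the last non-NaN value into the final column (rows that are all NaN stay NaN).

    x_cols = ['x_1', 'x_2', 'x_3', 'x_4']
    df['last_value'] = df[x_cols].ffill(axis=1).iloc[:, -1]
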
1 vote · 2 answers · 453 views

Merge data frames stored in list into single data frame in R [duplicate]

This question already has an answer here: Simultaneously merge multiple data.frames in a list 6 answers I start with list l1, which I want to convert to output. To do that, the data frames in l1 need to be merged by their first column (A). df1
milan
1 vote · 2 answers · 499 views

R - Find index of element from one dataframe and place in another

I have two data frames, as follows. df1 is sorted in reverse order, showing the number of times an activity has taken place. Activity # of Occurrences Walking 38 Jogging 26 Running 12 df2 shows the calories burned doing each activity, again sorted in reverse order of Calories Bur...
Anand
1 vote · 1 answer · 205 views

How to create empty columns when selecting columns from a data frame by using NA indexes?

Here is what I mean: indexes=c(1,NA,4) Now if I do the following I will obviously get an error because of the NA: mpg[,indexes] So for the NA index I need an empty column; in other words, I want to see: manufacturer unknown year audi "" 1999 audi "" 1999 ... ... ....
Mohammad
1 vote · 2 answers · 1.1k views

Calculate average daily value from large data set with R standard format date/times?

I have a dataframe of approximately 10 million rows spanning about 570 days. After using strptime to convert the dates and times, the data looks like this: date X1 1 2004-01-01 07:43:00 1.2587 2 2004-01-01 07:47:52 1.2585 3 2004-01-01 17:46:14 1.2586 4 2004-01-01 17:56:08 1.2585 5 20...
Patty
1 vote · 1 answer · 12 views

Getting min, max, and avg viewcount for each tag

My data set is like this id viewcount title answercount tags first_tag 1 78 ** 2 ** python 2 87 ** 1 ** pandas 3 87 ** 1 ** pandas 4 83 ** 0 ** Excel Now I want to get the min, max, and avg of v...
Barot Shalin
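
A minimal pandas sketch, assuming the dataset is loaded in a dataframe df with the columns shown:

    # min, max, and mean viewcount per tag
    stats = df.groupby('first_tag')['viewcount'].agg(['min', 'max', 'mean'])
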
1 vote · 2 answers · 2.9k views

Dynamically select multiple columns while joining different Dataframes in Scala Spark

I have two Spark data frames, df1 and df2. Is there a way to select output columns dynamically while joining these two dataframes? The definition below outputs all columns from df1 and df2 in case of an inner join. def joinDF (df1: DataFrame, df2: DataFrame , joinExprs: Column, joinType: String): Da...
Nish
1 vote · 2 answers · 25 views

Matplotlib histogram does not show details of distribution

I have some data and I would like to look at its distribution. But when I use this code, the histogram does not really show what is going on within the data; it gives only a very general picture. I want a more granular histogram. data['feature'].plot(kind='hist') And here is w...
James Robisnon
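
The default histogram uses only 10 bins, which hides detail; raising the bin count is the usual first step. A minimal sketch:

    # more bins -> more granular view of the distribution
    data['feature'].plot(kind='hist', bins=100)
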
1 vote · 1 answer · 234 views

Spark - avoid mutable dataframe [duplicate]

This question already has an answer here: Spark/Scala repeated calls to withColumn() using the same function on multiple columns 2 answers Assume a dataframe df with column c0. I need to add n columns by performing an operation on c0 (for example, let's say I want to add a literal(2) to the c0 which...
Vigneshwaren
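
A minimal PySpark sketch of the usual functional pattern: fold over the column list with reduce, so each withColumn returns a new immutable dataframe (n and the c0 + 2 operation mirror the question's example):

    from functools import reduce
    from pyspark.sql.functions import col, lit

    n = 3  # hypothetical number of columns to add
    df2 = reduce(lambda acc, i: acc.withColumn("c{}".format(i), col("c0") + lit(2)),
                 range(1, n + 1), df)
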
21 votes · 1 answer · 7.4k views

Defining a UDF that accepts an Array of objects in a Spark DataFrame?

When working with Spark's DataFrames, User Defined Functions (UDFs) are required for mapping data in columns. UDFs require that argument types are explicitly specified. In my case, I need to manipulate a column that is made up of arrays of objects, and I do not know what type to use. Here's an examp...
ohruunuruus
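
In PySpark, an array-of-struct column arrives in a UDF as a Python list of Row objects, so the declared argument handling only needs the return type. A minimal sketch; the column and struct field names are hypothetical:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    # 'name' is a hypothetical field of the struct elements; 'objects' a hypothetical column
    first_name = udf(lambda arr: arr[0]["name"] if arr else None, StringType())
    df = df.withColumn("first_name", first_name(df["objects"]))
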
0 votes · 0 answers · 19 views

How to get the rank of the current row compared to previous rows

I have a dataframe like: Instru Price Volume ABCD 1000 100258 ABCD 1000 100252 ABCD 1000 100168 ABCD 1000 100390 ABCD 1000 100470 ABCD 1000 100420 I want to get the rank of the current row compared to all previous rows for V...
Rohit
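
A minimal pandas sketch using an expanding window, so each row is ranked only against itself and the rows before it:

    df['vol_rank'] = (df['Volume']
                      .expanding()
                      .apply(lambda s: s.rank(method='min').iloc[-1], raw=False))
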
0 votes · 0 answers · 3 views

Filter dataframe by month

My dataframe is given below and I want to filter the whole dataframe by month 6. I have watched many YouTube videos but couldn't find the right code to filter the dataframe. city = {'City' : pd.Series(['Ithaca', 'Willingboro', 'Holyoke', 'Abilene', 'New York']), 'Shape Reported': pd.Series(['Triangle', 'Other', 'Ova...
Arsh Singh
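
A minimal pandas sketch, assuming the dataframe has a datetime column (called 'Time' here purely for illustration):

    df['Time'] = pd.to_datetime(df['Time'])   # 'Time' is a hypothetical column name
    june = df[df['Time'].dt.month == 6]
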
1 vote · 2 answers · 167 views

Select rows from a Pandas DataFrame with same values in one column but different value in the other column

Say I have the pandas DataFrame below: A B C D 1 foo one 0 0 2 foo one 2 4 3 foo two 4 8 4 cat one 8 4 5 bar four 6 12 6 bar three 7 14 7 bar four 7 14 I would like to select all the rows that have equal values in A but differing values...
tinman248
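
A minimal pandas sketch: count the distinct B values per A group, and keep the rows whose group has more than one.

    result = df[df.groupby('A')['B'].transform('nunique') > 1]
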
1 vote · 1 answer · 49 views

Implementing pairwise linear distance in Scala

Assume I have the following DataFrame in Scala Spark, where the years value is a String categorical representation, but there is an order in the data. +-----+ |years| +-----+ | 0-1| | 1-2| | 2-5| | 5-10| +-----+ I would like to create a resulting pairwise matrix, representing the "distance" for e...
Ivan
1 vote · 1 answer · 32 views

Show number of rows and columns when all rows are displayed

The option display.max_rows is by default set to 60. This means that when there are more than 60 rows present in the dataframe, upon using print(df) it will crop it to show only 60 and at the end will display the number of rows and columns, such as: [61 rows x 22 columns] However, if there are 60 or fewer rows...
kiradotee
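
A minimal sketch; pandas has a display option that controls exactly this (its default, 'truncate', prints the shape only for truncated output):

    import pandas as pd

    pd.set_option('display.show_dimensions', True)  # always append [n rows x m columns]
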
1 vote · 2 answers · 39 views

How to insert a column name in a pandas dataframe? [duplicate]

This question already has an answer here: how do I insert a column at a specific column index in pandas? 2 answers I have the following dataframe 0 0 0.164560 1 0.000000 2 0.350000 3 0.700000 ... 3778 0.350000 3779 0.000000 3780 0.137500 3781 0.253333 I want to add a doc-id co...
Samuel Mideksa
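
A minimal sketch using DataFrame.insert, which places a new column at a given position; here a sequential doc-id at position 0:

    df.insert(0, 'doc_id', range(len(df)))
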
1 vote · 2 answers · 30 views

What is a good way to prevent changes from being applied to an original data frame?

I am attempting to pass a data frame through some commands (preparing a series of arguments for a function). However, when I assign a data frame to a different data frame, this assignment seems to act as a reference rather than a copy. In other words, after the assignment of a data frame to a new one, all changes app...
arkadiy
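
Assigning a dataframe to a new name only binds another reference to the same object; to decouple them, take an explicit copy. A minimal sketch:

    df_args = df.copy()   # deep copy by default; changes to df_args no longer touch df
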
1 vote · 2 answers · 37 views

Fill with default 0's when creating a DataFrame in Pandas

I have an input dict-of-string-to-list with possibly different lengths for the lists. d = {'b': [2,3], 'a': [1]} When I do df = pd.DataFrame(data=d), I'm seeing ValueError: arrays must all be same length Question: How do I fill the missing values with a default (e.g. 0) when creating the df? The reas...
Jeff Xiao
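
A minimal sketch of one approach: wrap each list in a Series so pandas pads the shorter ones with NaN, then fill with the default:

    import pandas as pd

    d = {'b': [2, 3], 'a': [1]}
    df = pd.DataFrame({k: pd.Series(v) for k, v in d.items()}).fillna(0)
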
1 vote · 1 answer · 38 views

Convert many strings with separators to arrays in Scala

I have a dataframe like this : userId someString varA varB 1 "example1" 0,2,5 1,2,9 2 "example2" 1,20,5 9,null,6 I want to convert the data in varA and varB to arrays of String: userId someString varA varB 1 "example1" [0,2,5] [1,2,9] 2...
Radhwen KHADHRI
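
The question is in Scala, but the transformation itself is Spark's built-in split function; a minimal PySpark sketch of the same idea:

    from pyspark.sql.functions import split, col

    df = (df.withColumn("varA", split(col("varA"), ","))
            .withColumn("varB", split(col("varB"), ",")))
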
9 votes · 5 answers · 255 views

Python 3 pandas.groupby.filter

I am trying to perform a groupby filter that is very similar to the example in this documentation: pandas groupby filter >>> df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', ... 'foo', 'bar'], ... 'B' : [1, 2, 3, 4, 5, 6], ... 'C'...
FinProg
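
A minimal sketch in the spirit of the documentation example the question cites: filter keeps only the groups for which the predicate is true.

    # keep groups of A whose mean of B exceeds 3
    filtered = df.groupby('A').filter(lambda g: g['B'].mean() > 3)
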
1 vote · 1 answer · 19 views

Dropping a problematic column from a dask dataframe

I have a dask dataframe with one problematic column that (I believe) is the source of a particular error that is thrown every time I try to do anything with the dataframe (be it head, or to_csv, or even when I try to subset using a (different) column). The error is probably owing to a data type misma...
wrahool
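
A minimal dask sketch, assuming the offending column's name is known (shown here as a placeholder); dask mirrors the pandas drop API and stays lazy until compute:

    import dask.dataframe as dd

    ddf = ddf.drop('bad_col', axis=1)   # 'bad_col' is a hypothetical column name
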
0 votes · 0 answers · 4 views

Reading a CSV file written by DataFrameWriter in PySpark

I had a dataframe which I wrote to a CSV using the code below: df.write.format("csv").save(base_path+"avg.csv") As I am running Spark in client mode, the above snippet created a folder named avg.csv, and the folder contains some files with part-*.csv on my worker node or a nested folder then file part-...
Ayush Mishra
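
The avg.csv folder of part-* files is itself a valid input path for the reader; a minimal PySpark sketch:

    # Spark reads all part files in the directory back as one dataframe
    df = spark.read.format("csv").load(base_path + "avg.csv")
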
2 votes · 1 answer · 20 views

Random choice over specific values of a DF

I have a big df with 17520 rows and 1000 columns. The df has only two values [0, 0.05]. I would like to go to each cell of the df with the value 0.05 and change it to a random value. The random value can only be 0 or 0.05. I tried the following line of code: y = np.array([0,0.05]) df.replace(0.05...
Jonathan Budez
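
A minimal sketch using DataFrame.mask: generate a full random matrix of 0/0.05 values, and substitute it only where the original cell equals 0.05.

    import numpy as np

    rand = np.random.choice([0, 0.05], size=df.shape)
    df = df.mask(df == 0.05, rand)
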
1 vote · 1 answer · 18 views

How to convert a datetime column header (e.g. 2007-03-01 00:00:00) into date-only format, i.e. 2007-03-01

I tried df=pd.DataFrame({'Company Name':['3M India Ltd.','A B B India Ltd.'],'2007-03-01 00:00:00':[1571.30,710.20],'2007-04-01 00:00:00':[710.20,818.13]}) df.columns=pd.to_datetime.date(df....
Rahul
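
The attempt fails because pd.to_datetime is a function, not an object with a .date attribute to chain that way. A minimal sketch that reformats only the date-like headers, leaving 'Company Name' untouched:

    df.columns = [pd.to_datetime(c).strftime('%Y-%m-%d') if c != 'Company Name' else c
                  for c in df.columns]
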
-1 votes · 0 answers · 14 views

How can the dtype of this dataframe be an object?

I have already tried dropping np.nan and ±np.inf values dtyp = df_copy.dtypes print('dtype description') print(dtyp.describe()) print('info') print(df_copy.info()) The output: dtype description count 77 unique 3 top int64 freq 52 dtype: object info Int64Index: 32926...
aditya shourya
