# Questions tagged [data-analysis]

652 questions

1

votes

1

answer

40

Views

### Pandas: groupby and counting data in other columns

I have data with four columns: Id, CreationDate, Score and ViewCount.
CreationDate has the following format, for example: 2011-11-30 19:41:14.960.
I need to group by the year of CreationDate, count the rows per year, sum Score and ViewCount, and add the results as additional columns.
I want to use wi...
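
One way to read the request above is as a single grouped aggregation. The following is a minimal sketch, assuming the four columns named in the question; the sample rows and the output column names `Year` and `Count` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample rows; only the column names come from the question.
df = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "CreationDate": [
        "2011-11-30 19:41:14.960",
        "2011-12-01 08:00:00.000",
        "2012-01-15 10:30:00.000",
        "2012-06-20 12:00:00.000",
    ],
    "Score": [5, 3, 2, 7],
    "ViewCount": [100, 50, 30, 70],
})

# Parse the timestamp strings so the year can be extracted.
df["CreationDate"] = pd.to_datetime(df["CreationDate"])

# Group by year, then count rows and sum Score and ViewCount in one pass.
summary = (
    df.groupby(df["CreationDate"].dt.year.rename("Year"))
      .agg(
          Count=("Id", "count"),
          Score=("Score", "sum"),
          ViewCount=("ViewCount", "sum"),
      )
      .reset_index()
)
print(summary)
```

Named aggregation keeps the count and both sums in one result frame, so no separate merge step is needed.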

1

votes

1

answer

302

Views

### R neural Networks

I am playing around with the Adult dataset
https://archive.ics.uci.edu/ml/datasets/adult and R. I am trying to use the neuralnet package to train a neural network with backpropagation. I have cleaned the data. Now I am trying to run this part:
n

1

votes

0

answer

41

Views

### How to rescale signal intensity in an image in relation to spatial position?

Hi I have a 1D radial profile of a sample across a pipe (fig_1). One data point (along the orange straight line) is acquired at each 'band' from the image. The resolution (x,y,z) of each data point is 100um x 100um x 1000um.
(fig_1)
However in order to produce a quantitative image, each data point i...

1

votes

1

answer

31

Views

### Split a single pandas column (list of dicts) and append the dict keys as new columns

Input :
df = pd.DataFrame({'a':[1,2],
'b':[[{'x1':1,'x2':3},{'x1':4,'x2':1}],
[{'x1':5},{'x1':3,'x2':6}]],
'c':[5,6]})
If I apply the operation
print(df['b'].apply(pd.Series))
Output is:
0 1
0 {'x1': 1, 'x2': 3} {'x1': 4, 'x2': 1}
1 {'x1': 5} {'x1': 3, 'x2': 6}
Expe...

-1

votes

0

answer

4

Views

### How to group data on the basis of year in R?

I am working with the London Crime data set. It has a Borough, Major Category, Minor Category, Dates, and the Count. Below is my data structure.
'data.frame': 139392 obs. of 5 variables:
$ Borough : Factor w/ 32 levels 'Barking and Dagenham',..: 1 1 1 1 1 1 1 1 1 1 ...
$ Major.Category: Fac...

1

votes

1

answer

126

Views

### Issue using Tweepy to pull data from Twitter Stream: Data Analysis

from tweepy import OAuthHandler
from tweepy import StreamListener
class listener(StreamListener):
    def on_data(self, data):
        print(data)
        return(True)
    def on_error(self, status):
        print (status)
auth = OAuthHandler(ckey, csecret)
auth.set_access_token(atoken, asecret)
twitterStream = Stream(auth, listen...

1

votes

0

answer

18

Views

### Making decision depending on complicated factors without starting data

I'm currently building an automated decision-making system that depends on many different factors. The problem is that I don't have any data to analyze or train on.
My system considers factors such as whether it is a holiday, whether the system is in maintenance status, currently connected users (CCU),...

1

votes

0

answer

43

Views

### CSV spreadsheet analysis

I'm trying to complete the assignment (Quiz 21) described below for the following course:
https://classroom.udacity.com/courses/ud170/lessons/5430778793/concepts/53961386480923
The first code fragment is the one I wrote, which outputs the wrong lengths for the lists. The second code fragment is the...

1

votes

0

answer

84

Views

### Adding hover tool to datashader interactive image

I want to perform datashading on a plot created in Bokeh. I came across this Python notebook, but I want to know whether I can add a hover tool to the
resultant image after datashading. If so, how can I add tools like the hover tool and tap tool to the InteractiveImage created by datashader?

1

votes

1

answer

47

Views

### Feature selection by machine learning

The aim of my current study is to explore machine learning methods to select outcomes highly associated with treatment, which will be considered an approach for dealing with multiple testing.
My question is: which machine learning feature selection methods can I use to find the strong a...

1

votes

0

answer

43

Views

### How to make Jupyter Notebooks Sharable to your colleagues

At my organisation we currently use a SQL query tool on top of Redshift. This provides us with the ability to save our SQL queries and gives us a place where anyone can search for a query by name and look at it and its results. We can also share query links with each other.
Problem is since it is sql and comp...

1

votes

1

answer

112

Views

### Pandas: fix typos in keys within a dataframe

So, I have a large data frame with customer names. I used the phone number and email combined to create a unique ID key for each customer. But, sometimes there will be a typo in the email so it will create two keys for the same customer.
Like so:
Key | Order #
555261andymiller...
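
One possible approach to the near-duplicate keys described above is fuzzy matching with the standard library's `difflib`. This is only a sketch: the keys below are invented, the helper name `canonicalize_keys` is hypothetical, and the `cutoff` of 0.9 is an assumption that would need tuning on real data.

```python
from difflib import get_close_matches


def canonicalize_keys(keys, cutoff=0.9):
    """Map each key to the first previously seen key that is a close match."""
    canonical = []   # keys accepted as distinct customers so far
    mapping = {}     # original key -> canonical key
    for key in keys:
        match = get_close_matches(key, canonical, n=1, cutoff=cutoff)
        if match:
            mapping[key] = match[0]
        else:
            canonical.append(key)
            mapping[key] = key
    return mapping


# Hypothetical keys built as phone number + email, one with a typo.
keys = [
    "5550001alice@example.com",
    "5550001alcie@example.com",   # transposed letters in the email
    "5550002bob@example.com",
]
mapping = canonicalize_keys(keys)
print(mapping)
```

The resulting mapping could then be applied to the key column (for example with `Series.map`) before grouping orders by customer. Since this compares each key against every canonical key, it may be slow on a large frame.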

1

votes

0

answer

64

Views

### How to refresh shape data file in spotfire

I am a beginner in working with geospatial data. What I have done so far:
I created a map chart visualization in Spotfire.
I created a shapefile using QGIS.
I added the shapefile to Spotfire using Add Data Table -> File
I added a feature layer to the map chart and applied the shapefile data....

1

votes

0

answer

55

Views

### Fixed it. What is the option_description used for in the build_dict function in the dataMeta package in R?

I have a dataset with some 100,000 tweets and their sentiment scores attached. The original dataset just has two columns one for the tweets and one for their sentiment scores.
I am trying to build a data dictionary for it using the dataMeta package. Here is the code that I have written so far:
#Dat...

1

votes

0

answer

35

Views

### Getting an error while writing dataframe into csv

I am trying to write dataframe into csv file using !cat but I'm getting some errors.
Code:
data.to_csv(r'C:\Users\Downloads\pydata\pydata-book-2nd-edition\examples\out.csv')
!cat C:\Users\Pruthvish\Downloads\pydata\pydata-book-2nd-edition\examples\out.csv
,something,a,b,c,d,message
0,one,1,2,3.0,4
1...

1

votes

1

answer

348

Views

### Trying to find the most efficient way to convert SQL Query to Pandas DataFrame that has large number of records

I am trying to query an MS-SQL database view and convert the result to a pandas DataFrame.
Below are the two different ways I tried and in both cases it is taking ~439.98 seconds (~7 minutes) in order to query and convert to DataFrame that has 415076 records (This time is for converting it to the DataFra...

1

votes

0

answer

88

Views

### 'x' must be atomic for 'sort.list', using dbFD(). FD package

I am trying to run
dbFD(traits, as.matrix(abun))
but I receive this error:
Error in sort.list(y) : 'x' must be atomic for 'sort.list'
Have you called 'sort' on a list?
My data looks similar to this, but larger:
t1 t2 t3 ... sp1 sp2 sp3 sp4....
sp1 0.2 10...

1

votes

1

answer

118

Views

### Use VBA to suppress Analysis Toolpak Histogram function message

Question overview:
I am using Excel VBA histogram function from 'Analysis Toolpak' to generate approximately 25 histograms automatically. When Histogram graph is generated, it is placed on top of cells that have values in it, effectively hiding them (Which is OK with me). Therefore a following messa...

1

votes

0

answer

46

Views

### Window function for unique rows in SQL Server

I have a table like below
The main idea is to get the amount of each channel for each orderID.
If a channel repeats for an ID, it should take the amount only once and the rest should be null.
The result should look like below
I want to do the same logic for country and source as well. If I do the piv...

1

votes

0

answer

65

Views

### Music21 and D3.js for music feature extraction and visualization?

I am looking for suggestions on what tools could be used for the following scenarios about music feature extraction and visualization (on my Mac):
identify and group notes in a score (from different voices/instruments) that sound concurrently (even if they are attacked in different time offsets, tho...

1

votes

1

answer

48

Views

### How to use Python to groupby and scale values?

I would like to rescale column 'w'.
I have averaged 'w'.
aveData_set = Data_Set.groupby(['buildingid', pd.Grouper(key='reporttime',freq='15T')])['w'].mean().reset_index()
aveData_set result:
Then I would like to rescale column 'w' for each 24H period.
ScaleData_set = aveData_set.groupby(['buildingid', pd.Group...

1

votes

0

answer

53

Views

### MovieLens popularity recommender code with R

I'm currently studying R and working on a project about a movie recommendation algorithm.
I used the MovieLens 100k data with the recommenderlab library, following these tutorials.
https://mitxpro.mit.edu/asset-v1%[email protected][email protected]_CS1_Movies.pdf
https://cran.r-project.org/web/packages...

1

votes

2

answer

55

Views

### Turning textual answers into dichotomous variables

I've done research using Google Forms and now I need to prepare that data for further analysis. The point is I don't really know how to go about that.
I have variables (questionnaire questions), and each of these questions has four answers. In my data those answers are just strings, so let's say:...
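
A common way to turn string answers into 0/1 indicator columns is one-hot encoding. The sketch below assumes the responses are loaded into a pandas DataFrame; the column name `q1` and the answer strings are invented, since the excerpt does not show the real ones.

```python
import pandas as pd

# Hypothetical responses for one questionnaire item.
answers = pd.DataFrame({"q1": ["often", "never", "sometimes", "often"]})

# One indicator column per distinct answer, coded as 0/1 integers.
dummies = pd.get_dummies(answers["q1"], prefix="q1", dtype=int)
print(dummies)
```

Each resulting column is a dichotomous variable for one answer option, which can be joined back to the original frame with `pd.concat([answers, dummies], axis=1)`.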

1

votes

0

answer

47

Views

### Combining different time series in R

Let's assume that I am the owner of a burger shop. I log every time that a customer buys something from my shop, so I have records of all burgers and milk-shakes sold in the previous month. For me, it is easier and cheaper to make 20 milk-shakes at once than to make one at a time. So here is my go...

1

votes

0

answer

41

Views

### How to plot in python using Legend as a checkbox?

I have been trying to plot a graph from a dataframe with 3 columns. The first is 'Hour', the second is 'amount' in rupees, and the third consists of 'machine codes'. I need to analyze the amount of transactions a machine does on an hourly basis. There are 67 unique machine codes in total.
Kindly chec...

0

votes

1

answer

56

Views

### join two tables without losing relevant values

I have two tables representing a database for customer products and its competitors' products:
tmp_match - from_product_id and to_product_id representing matches between customer product and competitor product respectively.
tmp_price_history - shows the price of each product per date.
I am trying to...

1

votes

0

answer

174

Views

### Leaflet / Mapbox marine traffic density Map

I am currently making a marine traffic tool using Leaflet and Mapbox. For that, I have a huge amount of AIS data that I converted into a GeoJSON file. The GeoJSON file is a list of 'LineString' features defining each ship's trajectory, like this:
{
'features': [
{
'geometry': {
'coordinates': [
[-4.013451666...

1

votes

1

answer

65

Views

### How do I group data into naturally occurring “Bins”

What approach should I use to sort the following into naturally occurring 'bins'?
double[] x = { 18, 18, 18, 18, 19, 20, 20, 20, 21, 22, 22, 23, 24,
26, 27, 32, 33, 49, 52, 56,900,1200, 1200, 1500, 2000, 2000,2200,2200 };
I've looked at various code for 'outliers', 'quintiles' and not sure about...
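
One simple heuristic for data like the array above is to sort the values and start a new bin wherever the jump between neighbours is large relative to the smaller value. This is a sketch of that gap-based idea only, written in Python rather than the question's language; the function name and the `ratio` threshold of 1.4 are assumptions, and the classic named technique for this problem is Jenks natural breaks, which this does not implement.

```python
def split_into_bins(values, ratio=1.4):
    """Split sorted positive values wherever the next value exceeds
    the previous one by more than the given ratio."""
    ordered = sorted(values)
    if not ordered:
        return []
    bins = [[ordered[0]]]
    for prev, cur in zip(ordered, ordered[1:]):
        # A relative gap larger than `ratio` starts a new bin.
        if prev > 0 and cur / prev > ratio:
            bins.append([cur])
        else:
            bins[-1].append(cur)
    return bins


x = [18, 18, 18, 18, 19, 20, 20, 20, 21, 22, 22, 23, 24,
     26, 27, 32, 33, 49, 52, 56, 900, 1200, 1200, 1500, 2000, 2000, 2200, 2200]

bins = split_into_bins(x)
for b in bins:
    print(b[0], "to", b[-1], "count", len(b))
```

With this threshold the sample data falls into three groups (18 to 33, 49 to 56, 900 to 2200). A different threshold gives a different grouping, so the ratio is effectively the definition of "natural" here.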

1

votes

0

answer

54

Views

### Creating Video Watch Time Retention Plot using Plotly and Python

I have a table like this:
videoId userId viewedMintues totalMinutes
1007975 275308 10 26
1009304 304392 6 6
1009343 463588 3 23
100941 462406 1 26
100941 463199 12 26
100941...

1

votes

2

answer

65

Views

### python, matrix column extraction and sum

Say I have a matrix A = [a_1,a_2,...,a_n]. Each column a_i belongs to a class, and the classes are numbered from 1 to K. The labels of all n columns are stored in one n-dimensional vector b.
Now for each class i, I need to sum all vectors in class i together and put the result vector as the i-th column of a new matrix. So the...
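
The per-class column sum described above can be written as a single matrix product with a one-hot label matrix. This is a minimal NumPy sketch, assuming A has shape (d, n) and b holds integer labels from 1 to K; the function name and the sample values are made up.

```python
import numpy as np


def sum_columns_by_class(A, b, K):
    """Return a (d, K) matrix whose i-th column is the sum of all
    columns of A labelled with class i (labels run from 1 to K)."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b)
    n = A.shape[1]
    # onehot[j, k] is 1 when column j belongs to class k + 1.
    onehot = np.zeros((n, K))
    onehot[np.arange(n), b - 1] = 1.0
    return A @ onehot


A = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8]])
b = np.array([1, 2, 1, 2])
result = sum_columns_by_class(A, b, K=3)
print(result)
```

A class with no columns (class 3 in the sample) simply yields a zero column.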

1

votes

1

answer

138

Views

### How to Change Node's Color Based on Node's Level in CART Plot (rpart.plot) [R]

I want to change each node's color based on the node's level in a CART plot (rpart.plot) in R. The required plot is like this.
These are the steps I haven't done yet:
1. Move the values of the target variable (Setosa, Versicolor, and Virginica) to the left-side of char...

1

votes

1

answer

141

Views

### Detect significant trend changes

I would like to detect the dates where a trend curve significantly changes using R. The red dots are the points in time where I see a significant change; these should be detected. Small fluctuations should be ignored.
I have tried the breakpoints function, which finds the dates indicated by the dot...

1

votes

1

answer

31

Views

### Iterating through pandas column

I have a dataframe with following columns:
User_id PQ TGGS PAG Games_played
118399 8.536585 7.079646 10.204082 7.711443
212651 75.000000 73.684211 75.000000 46.534653
210314 60.000000 9.523810 33.333333 14.414414
The columns are actually game codes. I want...

1

votes

1

answer

27

Views

### Concatenating CSV Files using Pandas is causing Duplication

I was writing a python method on Google Colab in order to go into a folder of 84 .csv's, concatenate them and output a new .csv
Here is the method
def concatenate(indirectory = '/content/gdrive/My Drive/Folder/Folder', outfile = '/content/gdrive/My Drive/--.csv'):
os.chdir(indirectory)
fileList = gl...

1

votes

0

answer

41

Views

### Strategy to conduct regression modelling within R

I am not sure if this is the right place to ask, but let me try anyway.
I would like to conduct a structured analysis of a data set using linear regressions. I thought it would be a good idea to start off with creating a data frame / table where each row comprises one of the models that I would lik...

1

votes

1

answer

29

Views

### Analyse tables with unknown structure and fault tolerance

I want to analyse tables with similar data, that are structured differently and where the headers also may be slightly diverse.
When collecting all the data from the tables and summing them up, I face several problems.
Step 1: I look for the header keywords. Searching for if 'cars==cars' is not possible,...

1

votes

0

answer

21

Views

### Is it Necessary to De-Mean my Data before Applying PCA, or does pca(X) do that Automatically?

I am aware that a first step in performing PCA for dimensionality reduction is de-meaning the data.
I have performed PCA after de-meaning manually with X=X-mean(X) and compared with plainly applying [COEFF,score,latent,~,explained]=pca(X) on my data.
By inspecting the eigenvalues and the percentage...
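
For context, MATLAB's `pca` is documented as centering the data by default (the `'Centered'` option), which would explain identical results with and without manual de-meaning. The NumPy sketch below is only an analogue of that behaviour, not MATLAB's implementation; the helper name and the random test data are assumptions.

```python
import numpy as np


def pca_eigenvalues(X, centered=True):
    """Eigenvalues of the sample covariance via SVD, optionally
    subtracting the column means first."""
    X = np.asarray(X, dtype=float)
    if centered:
        X = X - X.mean(axis=0)
    singular_values = np.linalg.svd(X, compute_uv=False)
    return singular_values ** 2 / (X.shape[0] - 1)


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3)) + 10.0   # data with a large non-zero mean

latent_raw = pca_eigenvalues(X)                        # centering done inside
latent_demeaned = pca_eigenvalues(X - X.mean(axis=0))  # centered by hand first
latent_uncentered = pca_eigenvalues(X, centered=False)

print(np.allclose(latent_raw, latent_demeaned))
print(latent_uncentered[0] > latent_raw[0])
```

When centering happens inside the routine, de-meaning beforehand changes nothing; only switching centering off lets the mean dominate the first component.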

1

votes

1

answer

55

Views

### Expand data set row in R [duplicate]

This question already has an answer here:
Repeat rows of a data.frame
10 answers
I've got a table like this:
| Activation Month | Disabled Month | Month.Fee | Custr
| 21/4/2018 | N/A | 10 | A
| 21/3/2018 | 21/6/2018 | 20 | B
I want to transfor...

1

votes

1

answer

55

Views

### dataExplorer::create_report failed to compile

I am trying to produce a pdf report of a dataframe named 'mydata' using the DataExplorer package. Nevertheless I get the following Error: Failed to compile D:/Documents/R/R projects/ENDO/report.tex.
I have tried to see if any error occurs with tinytex using:
options(tinytex.verbose = TRUE)
devtools:...