# Questions tagged [data-science]

892 questions

0

votes

0

answer

3

Views

### After installing Docker 2.0.0.3 and IBM DSX, getting an error on Windows 10

I am having an issue installing IBM DSX on Windows 10: an error appears after installation.

1

votes

1

answer

40

Views

### Pandas, groupby and counting data in others columns

I have data with four columns, that includes: Id, CreationDate, Score and ViewCount.
CreationDate has the following format, for example: 2011-11-30 19:41:14.960.
I need to group by the year of CreationDate, count the rows, sum Score and ViewCount, and add the results as additional columns.
I want to use wi...
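
One way to approach this, sketched under assumptions: the sample rows below are made up for illustration, and the output column names (`Year`, `Count`) are hypothetical. Only the four source columns come from the question.

```python
import pandas as pd

# Hypothetical sample data using the four columns named in the question.
df = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "CreationDate": [
        "2011-11-30 19:41:14.960",
        "2011-12-01 08:00:00.000",
        "2012-01-15 10:30:00.000",
        "2012-06-20 14:45:10.120",
    ],
    "Score": [5, 3, 10, 2],
    "ViewCount": [100, 50, 400, 25],
})

# Parse the timestamps so the year can be extracted.
df["CreationDate"] = pd.to_datetime(df["CreationDate"])

# Group by year, then count rows and sum Score and ViewCount per year.
summary = (
    df.groupby(df["CreationDate"].dt.year.rename("Year"))
      .agg(Count=("Id", "count"), Score=("Score", "sum"), ViewCount=("ViewCount", "sum"))
      .reset_index()
)
print(summary)
```

Named aggregation keeps the count and the two sums in a single pass; the same result could also be built from separate `size()` and `sum()` calls.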

1

votes

1

answer

47

Views

### How to calculate the steepness of a trend in python

I am using the regression slope as follows to calculate the steepness (slope) of the trend.
Scenario 1:
For example, consider I am using sales figures (x-axis: 1, 4, 6, 8, 10, 15) for 6 days (y-axis).
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
X = [[1], [4], [6]...

1

votes

2

answer

77

Views

### Resample pandas dataframe and interpolate missing values for timeseries data

I need to resample timeseries data and interpolate missing values in 15 min intervals over the course of an hour. Each ID should have four rows of data per hour.
In:
ID Time Value
1 1/1/2019 12:17 3
1 1/1/2019 12:44 2
2 1/1/2019 12:02 5
2 1/1/2019 12:28 7
Out:...

1

votes

1

answer

61

Views

### Count of duplicates of a list in Pandas Dataframe by group

I have a Dataframe that currently looks like this:
image source label
bookshelf A [flora, jar, plant]
bookshelf B [indoor, shelf, wall]
bookshelf C [furniture, shelf, shelving]
cactus A...

1

votes

3

answer

45

Views

### Concepts to measure text “relevancy” to a subject?

I do side work writing/improving a research project web application for some political scientists. This application collects articles pertaining to the U.S. Supreme Court and runs analysis on them, and after nearly a year and half, we have a database of around 10,000 articles (and growing) to work w...

0

votes

0

answer

6

Views

### How to handle exceptions in PySpark when the data is not in proper order?

I am creating a small RDD from some unordered data, i.e. the rows don't all have the same number of columns, so I am treating each row as a tuple with the maximum line index.
The problem is that when I access tuple[4], tuple[9] and so on, some rows don't have index 9, so in...

0

votes

1

answer

14

Views

### Cannot split large .txt file into train, test and validation parts for deep text corrector

I have a single large .txt file and I want to split it into train, test and validation sets. Below are the lines of code where I want to use those files. I don't have any intuition about how to do it.
python correct_text.py --train_path
/movie_dialog_train.txt \
--val_path /movie_dialog_val.txt...
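
A minimal sketch of one possible split, using only the standard library. The function name `split_lines`, the 80/10/10 proportions, the fixed seed, and the generated sample lines are all assumptions made for illustration; none of them comes from the question.

```python
import random


def split_lines(lines, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle lines and split them into train, validation and test lists."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test


# Hypothetical stand-in for the lines read from the large .txt file.
lines = [f"dialog line {i}\n" for i in range(100)]
train, val, test = split_lines(lines)
print(len(train), len(val), len(test))
```

Each list can then be written out with `open(path, "w").writelines(...)`. Shuffling individual lines is itself an assumption: if consecutive lines of the dialog file depend on each other, splitting by contiguous blocks without shuffling may be more appropriate.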

-1

votes

0

answer

5

Views

### What type of machine learning or AI Model can I use for Factor Ranking

What type of machine learning or AI Model can I use for Factor Ranking?
I have some factors and am trying to rank them based on how well they predict in my model. What kind of machine learning, AI, or deep learning model would work for this?

1

votes

2

answer

496

Views

### ValueError: Invalid endpoint: s3-api.xxxx.objectstorage.service.networklayer.com

I'm trying to access a csv file in my Watson Data Platform catalog. I used the code generation functionality from my DSX notebook: Insert to code > Insert StreamingBody object.
The generated code was:
import os
import types
import pandas as pd
import boto3
def __iter__(self): return 0
# @hidden_ce...

1

votes

2

answer

498

Views

### Embedding in Keras

Which algorithm is used for embedding in Keras built-in function?
Word2vec? Glove? Other?
https://keras.io/layers/embeddings/

1

votes

2

answer

42

Views

### Error: too many values to unpack (expected 2) while trying to iterate over two columns in a DataFrame

for L,M in laundry1['latitude'],laundry1['longitude']:
    print('latitude:-')
    print(L)
    print('longitude:-')
    print(M)
I am trying to iterate over two columns of a DataFrame, assigning their values to L and M and printing them, but it raises the error 'too many values to unpack (expected 2)' view...
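
The loop above iterates over a tuple of two whole columns, so each step tries to unpack an entire column into `L` and `M`. A minimal sketch of pairing the values row by row with `zip` follows; the sample coordinates are hypothetical, and only the `laundry1` name and its two column names come from the question.

```python
import pandas as pd

# Hypothetical stand-in for the laundry1 DataFrame from the question.
laundry1 = pd.DataFrame({
    "latitude": [40.71, 34.05, 41.88],
    "longitude": [-74.01, -118.24, -87.63],
})

pairs = []
# zip() pairs the two columns element by element, one (lat, lon) per row.
for lat, lon in zip(laundry1["latitude"], laundry1["longitude"]):
    print("latitude:", lat, "longitude:", lon)
    pairs.append((lat, lon))
```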

1

votes

3

answer

39

Views

### Calculate the average of the rows for each group

I need to calculate the mean of a certain column in DataFrame, so that means for each row is calculated excluding the previous values of the row for which it's calculated in certain group. Lets assume we have this dataframe, this is the expected output
is there any way that like iterate each row b...

1

votes

1

answer

34

Views

### I want to create a new column in my data frame that is the crime rate of each specific row

I have a crime data set, I already calculated the crimes committed in each location. Now I want to create a new column that is the crime rate for that specific row. I already calculated the crime rate now I want to match the specific crime rate to correct row matching the same latitude value
Here I...

0

votes

0

answer

8

Views

### Filling cell data with mean for each unique name

I have been using R for the past couple of days and I have a question that I am a little stumped on. I have a dataframe with bidder names and bids where some of the bids are empty. I am having trouble implementing a dynamic way to take the average bid for each unique bidder and apply that to the empty ce...

1

votes

1

answer

50

Views

### Comparing columns of a dataset with python

I have a huge dataset (2653, 17). I have noticed two columns to be somewhat related but not exact as I have inferred from the value_counts method. What I mean is most of the corresponding entry of I is M, or of C is NaN. Is there any way to confirm this or calculate how many entries are related this...

1

votes

0

answer

361

Views

### Accessing the columns of pivot table in Python Pandas

I'm using a Python pandas pivot table. How can I access the columns of the pivot table in a new data frame?
KM_pivot_first = pd.pivot_table(read_sql_KM, values=['IMPRESSIONS','ENGAGEMENTS'],index='PLACEMENT_ID',aggfunc=np.sum)
KM_data_summary = KM_pivot_first[['PLACEMENT_ID', 'IMPRESSIONS', 'ENGAGEMENTS']]
error:...
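
The error text is cut off above, so the following is only a guess at the cause: after `pivot_table(..., index='PLACEMENT_ID')`, `PLACEMENT_ID` is the index rather than a column, so selecting it by name fails. A sketch under that assumption, with made-up sample values and `aggfunc="sum"` standing in for `np.sum`:

```python
import pandas as pd

# Hypothetical stand-in for read_sql_KM from the question.
read_sql_KM = pd.DataFrame({
    "PLACEMENT_ID": [101, 101, 202],
    "IMPRESSIONS": [1000, 500, 300],
    "ENGAGEMENTS": [10, 5, 3],
})

KM_pivot_first = pd.pivot_table(
    read_sql_KM,
    values=["IMPRESSIONS", "ENGAGEMENTS"],
    index="PLACEMENT_ID",
    aggfunc="sum",
)

# PLACEMENT_ID is the index here, not a column;
# reset_index() turns it back into a regular column before selecting.
KM_data_summary = KM_pivot_first.reset_index()[
    ["PLACEMENT_ID", "IMPRESSIONS", "ENGAGEMENTS"]
]
print(KM_data_summary)
```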

1

votes

1

answer

30

Views

### histChanging Class in R for Column Name

I have found many helpful pages on how to change a class in R but all have seemed to not work for my task.
Below is the code I'm using with output:
> mydata = read.table('Books_R_Data.csv', header=TRUE,stringsAsFactors=TRUE,sep=',')
> hist(mydata)
Error in hist.default(mydata) : 'x' must be numeric...

1

votes

1

answer

404

Views

### How to plot a subset of forecast in R?

I have a simple R script to create a forecast based on a file.
Data has been recorded since 2014 but I am having trouble trying to accomplish below two goals:
Plot only a subset of the forecast information (starting on 11/2017 onwards).
Include month and year in a specific format (i.e. Jun 17).
Here...

1

votes

2

answer

361

Views

### How to make polynomial features using sparse matrix in Scikit-learn

I am using Scikit-learn for converting my train data to polynomials features and then fit it to a linear model.
model = Pipeline([('poly', PolynomialFeatures(degree=3)),
('linear', LinearRegression(fit_intercept=False))])
model.fit(X, y)
But it throws an error
TypeError: A sparse matrix was passed,...

1

votes

1

answer

62

Views

### How to handle missing values in Python3?

A = ds.iloc[:,0:4].values
B = ds.iloc[:,-1].values
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
imp = impsqft.fit(A[:,3])
A[:,3] = imp.transform(A[:,3])
I want to replace 4th column with mean of that column for null values but it gives me below error:
array=[ 1. 2. nan 4. 1....

1

votes

0

answer

450

Views

### I am working on Sentimental analysis on twitter data got this error: Error in get_oauth_sig() : OAuth has not been registered for this session

> oauth_endpoint(authorize = 'https://api.twitter.com/oauth', access= 'https://api.twitter.com/oauth/access_tocken' )
download.file(url='http://curl.haxx.se/ca/cacert.pem', destfile='cacert.pem')
trying URL 'http://curl.haxx.se/ca/cacert.pem' Content type 'application/x-pem-file' le...

1

votes

0

answer

32

Views

### Extracting tabular data from a PDF file. The PDF file has text, images, as well as tabular data.

The PDF file has text as well as tabular data. If not then is there any way by which I can understand whether the current page of the PDF contains tables or not?
I am able to extract data from the PDF page but can't confirm whether it is tabular data or verbose (paragraph) text.

0

votes

0

answer

16

Views

### how can I write a loop in python to get the difference between first and last date for one id

opptyId field oldValue newValue updateTime
0 Stage Qualify 2014-05-27T18:50:14
0 Forecast Best Case 2014-05-27T18:50:14
0 created 2014-05-27T18:50:14
0 Amount 795.53 2014-06-17T18:54:00
0 Stage Qualify Closed - Won 2014-07-09T20:11:05
0 Forecast...

1

votes

2

answer

448

Views

### Visualizing clusters using TSNE

I have a dataset which I need to cluster and display in a way wherein elements in the same cluster should appear closer together. The dataset is based out of a research study, and has around 16 rows(entries) and about 50 features. I do agree that its not an ideal dataset to begin with, but unfortuna...

1

votes

1

answer

61

Views

### Why am I getting a different answer when both are the same?

When I try to fetch latitude and longitude using the geocode function from the ggmap library, I get a different result in each case. When I check the class of the 'dd' variable, it is a list in both cases, so why is the 2nd output not the same as the 1st?
for(i in 1:3)...

1

votes

1

answer

400

Views

### Finding the best LCA model in poLCA R package

I am applying LCA with the poLCA R package, but the analysis has not finished after three days (it has not found the best model yet), and occasionally it gives the following error: 'ALERT: iterations finished, MAXIMUM LIKELIHOOD NOT FOUND'. So I cancelled the process at 35 latent classes. I am analyzin...

1

votes

0

answer

119

Views

### Add extra layers to pre-trained model at input in tensorflow

I have facet model (ckpt and meta files) which takes input of size (batch_size,160,160,3) and gives output of size (batch_size,128).
My input is a k-dimensional vector and I have a pre-processing function(consists of some convolution and pooling layers) which takes my input and gives (batch_size,16...

1

votes

0

answer

159

Views

### What is a bottleneck in pandas.read_csv: CPU vs Storage

What is actually a bottleneck of reading csv file with pandas.read_csv()?
Is it CPU or Storage reading speed limitations?
How much of a speed increase can be obtained by using an SSD instead of an HDD?
To be more specific, let's consider the following configuration (the current cheapest server from Hetzner):
Int...

1

votes

0

answer

179

Views

### Decision Tree Categorical and Continuous Variable

I'm new to data science and currently trying to learn and understand decision tree algorithm. I have a question about how the algorithm works when we have some continuous variables in a classification problem and categorical variables in regression problems.
Usually algo works on the basis of gini i...

1

votes

1

answer

87

Views

### Code for value.counts() in columns pertaining to a specific value in one column

I'm new to data science and trying to do some data wrangling with Python 2.7 in an IPython notebook. A tutorial I was following for my first project asked me to replace all NaN inputs with 0 or 1. But I'd like to consider another approach where I can first look at the count for the rows with non-numeric...

1

votes

2

answer

174

Views

### Code for imputing values in a specific column using the particular rows Index number or unique ID?

I'd like to input a certain value in a particular column. My data looks something like:
LoanID Married ApplicantIncome CoapplicantIncome Credit_History
LP00135 NaN 33460 16000 1.0
LP00234 Yes 55000 70000 1.0
LP00432 No...

1

votes

1

answer

27

Views

### How to copy data from a column to another based on a condition in R?

I have the data frame shown below.
Funct.Area Environment ServiceType Ticket.Nature SLA.Result..4P. IRIS.Priority Func_Environment
2 FUN DCF FUN SR OK Medium FUN-DCF
3 AME - FIN DCF FUN SR Defect...

1

votes

0

answer

21

Views

### DSX desktop install NOT working (on x86 laptop)

I have tried multiple times and DSX desktop install does not work
I am trying to install on a win7 laptop
I have selected Docker, Jupyter with spark (around 6.6GB) but it always ends up installation Docker and then hangs (as in the progress bar does not proceed further and is stuck at 25% for a LONG...

1

votes

1

answer

427

Views

### sklearn partial_fit() not showing accurate results as fit()

I am training on 3 lists of data: L1, L2, L3. First I train on all of them with SGDClassifier fit() and later instance by instance with partial_fit(). I test the data with L4, L5. [The data in the lists is image data, and the L4, L5 images are the same as L2.]
The predictions with fit() are correct and it is what I a...

1

votes

0

answer

83

Views

### getting similar predictions on data while predicting using tensorflow

I am a beginner in machine learning and I am working on a simple project to predict the electricity consumption of a household using data available here.
The data consists of the global minute averaged active power of every minute for 4 years. The head of the data looked something like this.
Date...

1

votes

1

answer

152

Views

### Referring to parent attribute in pandas

This is my json
{
'fInstructions': [
{
'id': 155,
'type':'finstruction',
'ref': '/spm/finstruction/155',
'iLineItem':[
{
'id': 156,
'type':'ilineitem',
'ref': '/spm/ilineitem/156',
'creationDate': '2018-03-09',
'dueDate':'2018-02-01',
'effectiveDate':'2018-03-09',
'frequency':'01',
'coveredPeriodFro...

1

votes

2

answer

64

Views

### How to group by in pandas with multiple columns

Consider a pandas DataFrame as below
Fruit Rate Quantity
-------------------------
Apple 2 4
Apple 3 3
Apple 5 9
Mango 4 5
Mango 6 12
Banana 2 2
banana 1 2
Here is the total quantity of each fruit:
Mango: 5+12=17
Apple: 4+3+9= 16
Banana: 2+2=4
Wha...
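
The rest of the question is cut off, so this sketch only reproduces the totals listed above. It assumes that "Banana" and "banana" are meant to be the same fruit, since the stated total for Banana is 2+2=4.

```python
import pandas as pd

# The table from the question, including the lowercase "banana" row.
df = pd.DataFrame({
    "Fruit": ["Apple", "Apple", "Apple", "Mango", "Mango", "Banana", "banana"],
    "Rate": [2, 3, 5, 4, 6, 2, 1],
    "Quantity": [4, 3, 9, 5, 12, 2, 2],
})

# Normalise the capitalisation before grouping so that "Banana" and
# "banana" land in the same group (an assumption about the intent).
totals = df.groupby(df["Fruit"].str.capitalize())["Quantity"].sum()
print(totals)
```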

1

votes

0

answer

47

Views

### Run all regressors against the data in scikit

I am working on creating a framework where I can call all regressors available in scikit-learn. Relating to this I have two questions-
How to get list of all regressors programmatically?
Objective is to run regressors against the dataset and acquire the metrics such as RMSE, R-Sq, Adjusted R-Sq, etc...

1

votes

0

answer

29

Views

### How to understand the equation used for proving the transitivity property for correlation?

I am trying to understand the transitivity proof for correlation. That is, if X is highly correlated with Y and Y is highly correlated with Z, is it necessary that X and Z are also highly correlated? I found the equation which is used for proving this statement and that is:
Corr(X,Y) = Corr(Y,Z)*Cor...