Questions tagged [data-science]

1

votes
0

answer
27

Views

How to set a universal colorscale for plotly offline?

I would like to set a colorscale universally for plotly in offline mode, specifically in a Jupyter notebook, so that I don't need to pass the parameter for every plot. I wasn't able to find anything on this in their documentation on colorscales, over here. EDIT: So, I found this function in cufflinks c...
Haran Rajkumar
1

votes
0

answer
45

Views

Using the “grep” function to Create Y axis label for boxplots in ggplot2

I created a loop to make 57 boxplots that uses grep to pick out the variables I want. However, for whatever reason the Y axis is always labeled with the first variable it grabs, even though the loop grabs all 57 unique variables. I was wondering if someone could take a look at it...
Dillon Lloyd
1

votes
2

answer
49

Views

Retrieving the top 5 sentences- Algorithm if any present

I am new to Data Science. This could be a dumb question, but I just want to hear opinions and confirm whether I could improve on it. I have a question about getting the 5 most frequent sentences from the database. I know I could gather all the data (sentences) into a list and using the Counter library -...
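A minimal sketch of the Counter approach the excerpt mentions, assuming the sentences have already been loaded into a Python list. The `sentences` list and the `top_sentences` helper are hypothetical names used for illustration, not taken from the question:

```python
from collections import Counter

# Hypothetical sample data standing in for sentences fetched from the database.
sentences = [
    "the sky is blue",
    "hello world",
    "the sky is blue",
    "good morning",
    "hello world",
    "the sky is blue",
]


def top_sentences(items, n=5):
    """Return the n most frequent items as (item, count) pairs."""
    return Counter(items).most_common(n)


top5 = top_sentences(sentences)
for sentence, count in top5:
    print(count, sentence)
```

If the table is large, it may be cheaper to let the database do the counting (for example with a grouped count query) rather than loading every row into memory first.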
1

votes
1

answer
367

Views

How to get Adjusted R Square for Linear Regression

Using sklearn.metrics I can compute R square. How can I compute Adjusted R square using a Linear Regression model?
Sourav Saha
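For reference, adjusted R square is usually derived from plain R square as 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of samples and p the number of predictors. A small sketch of that formula follows; `adjusted_r2` is a hypothetical helper name, and the R² value passed in is assumed to come from something like `sklearn.metrics.r2_score`:

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    dof = n_samples - n_features - 1
    if dof <= 0:
        # The formula is undefined without spare degrees of freedom.
        raise ValueError("need n_samples > n_features + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / dof


# Example with made-up numbers: R^2 of 0.9 from 11 samples and 2 predictors.
# With a fitted sklearn model, r2 could be r2_score(y, model.predict(X))
# and the sizes could be taken from X.shape.
print(adjusted_r2(0.9, n_samples=11, n_features=2))
```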
1

votes
2

answer
42

Views

Linking 2 data frames and returning a value using lookup

I'm a beginner in coding and data in general, so any help I can get would be really appreciated. Say I have a data frame as below, where every matchup is a tuple. df1 = Team A Player 1.1 Team A Player 2.1 Team A Player 3.1 ('Max', 'Hatteberg') ('Hatteberg', 'Tejada') ('Max...
Harvey Koh
1

votes
1

answer
183

Views

Failed to Install pygraphviz through Visual Studio 2017 15.7.1

Here's the error code I'm getting ----- Installing 'graphviz' ----- Collecting graphviz Downloading https://files.pythonhosted.org/packages/84/44/21a7fdd50841aaaef224b943f7d10df87e476e181bb926ccf859bcb53d48/graphviz-0.8.3-py2.py3-none-any.whl Installing collected packages: graphviz Successfully inst...
Prashant Dey
1

votes
0

answer
47

Views

Combining different time series in R

Let's assume that I am the owner of a burger shop. I log every time that a customer buys something from my shop, so I have records of all burgers and milk-shakes sold in the previous month. For me, it is easier and cheaper to make 20 milk-shakes at once than to make 1 at a time. So here is my go...
Gabriel Bessa
1

votes
1

answer
1.2k

Views

Pandas how to place an array in a single dataframe cell?

So I currently have a dataframe that looks like: And I want to add a completely new column called 'Predictors' with only one cell that contains an array. So [0, 'Predictors'] should contain an array and everything below that cell in the same column should be empty. Here's my attempt, I tried to cr...
amadzebra
1

votes
2

answer
44

Views

How do I replace all >x values in column using Pandas?

I'm trying to replace all values higher than my limit in a Pandas column like this: df_train['IsAlone'].replace([>0],1) but this obviously is not working. I got my code working like this: for i in range(len(df_train)): if df_train.iat[i,8] > 0: df_train.iat[i,8] = 1 but I'm wondering if there is a...
AlphaX
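One vectorised way to express the loop quoted above is boolean indexing with `DataFrame.loc`. This is a sketch on made-up data; only the `df_train` and `IsAlone` names come from the question, and `limit` is a hypothetical variable:

```python
import pandas as pd

# Made-up sample frame; only the column name comes from the question.
df_train = pd.DataFrame({"IsAlone": [0, 3, 1, 0, 7]})

limit = 0  # hypothetical threshold

# Select the rows where the value exceeds the limit and overwrite them with 1.
df_train.loc[df_train["IsAlone"] > limit, "IsAlone"] = 1

print(df_train["IsAlone"].tolist())
```

Because the assignment goes through `.loc` on the original frame, it modifies `df_train` in place instead of working on a copy.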
1

votes
1

answer
67

Views

Fastest way to mask rows of a 2D numpy array given a boolean vector of same length?

I have a numpy boolean vector of shape 1 x N, and a 2d array with shape 160 x N. What is a fast way of subsetting the columns of the 2d array such that for each index of the boolean vector that has True in it, the column is kept, and for each index of the boolean vector that has False in it, the co...
Psyche
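A sketch of column selection with a boolean mask, assuming the 1 x N vector is flattened to length N first. The array contents are made up and much smaller than the 160 x N case described:

```python
import numpy as np

# Small stand-in for the 160 x N array from the question.
arr = np.arange(12).reshape(3, 4)

# Boolean vector of shape 1 x N, as described in the question.
mask = np.array([[True, False, True, False]])

# Flatten the mask to shape (N,) and index the column axis with it.
kept = arr[:, mask.ravel()]

print(kept.shape)
print(kept)
```

Boolean indexing returns a copy, so `kept` can be modified without touching `arr`.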
1

votes
1

answer
62

Views

Error in 10-fold cross validation code in Python

I was implementing 10-fold cross validation from scratch in Python. The language is Python 3.6 and I wrote this in Spyder (Anaconda). My input shape is data=(1440,390),label=(1440,1). My code: def partitions(X,y): np.random.shuffle(X) foldx=[] foldy=[] j=0 for i in range(0,10): foldx[i]=X[j:j+143,:]...
Anjali Bhavan
1

votes
0

answer
120

Views

r2pmml Error in .convert(tempfile, file, converter, converter_classpath, verbose) : 1

I want to convert an R neural network model to PMML, using the r2pmml package, which is based on JPMML: mynn
Zaja
1

votes
1

answer
84

Views

PCA prediction and errors using sklearn

I want to predict some values with PCA in Python with sklearn. I begin by taking in the relevant columns from the data and name them X for features and Y for features that need predicting. Y = DF['Predict'].values X = pd.DataFrame(data=scale(DF[X_cols]), columns=X_cols) pca = PCA(n_components=NCOMPS...
Daniel B
1

votes
1

answer
56

Views

Correct Way to Parallelize Pandas on Data Frame Slices

Let's assume I have a data frame with N multi-valued categorical columns and I want to encode them as fast as possible using Pandas. This is what I have achieved so far; I'm not sure if it is the best way to parallelize Pandas though (I would prefer a vectorized approach where possible): def encode_single_c...
1

votes
1

answer
104

Views

Iterate over complex numbers

I need to iterate over the complex refractive index = n + ik. I made two floats.Span() slices filled with evenly spaced numbers, containing every n and k that I need to iterate over. How do I 'mix' these two values now so I can make a for loop over every possible combination? I need something like: 0.1+0.1i...
Aramus
1

votes
0

answer
40

Views

Cleaning a pandas dataframe formed from 3500 different text documents - scraped articles

I have 3500+ scraped text files, taken from the same or different sources. The files are articles, but the scraped text also contains ads. I want to remove these from the start and the end, since the content lies only in between. I loaded all the text fil...
Ayman Alawin
1

votes
0

answer
81

Views

Match name within date range between two pandas dataframes

I'm trying to merge two datasets based on player name, provided the date falls within a date range. Is it possible to do this with fuzzy matching? Tournaments can last for up to two weeks. I've been able to partial match some names with fuzzymatch but matching the names closest to the date...
tomoc4
1

votes
2

answer
60

Views

Predict values in Multidimensional Scaling (MDS) in R

I’m trying to use multidimensional scaling (MDS) in R. Can I predict new values on a test set based on the values that I get from my training set? I’m looking for something similar to what I’ve done in PCA, for example: prin_comp
Ittai
1

votes
0

answer
54

Views

Convert medical data from XML to CSV or JSON format?

This is a medical record of a patient with a bunch of information about medical history, past, family etc. This is an XML file. I would like to get this in JSON or CSV format. Some records are reported in the same format with these topics and some are named a little differently in other records lik...
Faliha Zikra
1

votes
1

answer
52

Views

Gretl doesn't forecast for test data

I have a train data file in Gretl and then I append a test data file in which the SalePrice data is missing, so I want to predict the SalePrice for these rows. Annoyingly, if I add a log for one of the variables (which exists in both files without missing values, only a few 0s) then forecast doesn't predict any...
Georgi
1

votes
2

answer
46

Views

dataframe overridden within for-loop in r

I have a dataset containing a million observations, from which I'm taking 10000 observations. Here is a link to the dataset file: dataset file link itemRatingData = itemRatingData[1:10000,] #V2 is user ID, V1 is item ID, V3 is item rating from use library(plyr) countUser = count(itemRatingData, vars = 'V2')...
Saad ur Rehman
1

votes
0

answer
178

Views

KMeans Clustering on Movielens Dataset

I am working on the Movielens dataset and I want to apply the K-Means algorithm to it. I would like to know which columns to choose for this purpose and how I can proceed further, or whether I should directly use the KNN algorithm.
sandeep
1

votes
3

answer
202

Views

Creating Dummy Variables from String Column

I have a pandas dataframe (N = 1485) that looks like this: ID Intervention 1 Blood Draw, Flushed, Locked 1 Blood Draw, Port De-Accessed, Heparin-Locked, Tubing Changed 1 Blood Draw, Flushed 2 Blood return Verified, Flushed 2 Cap Changed 3...
G. Nguyen
1

votes
0

answer
44

Views

Issues in merging of training data from different samples

I have two files of training data, each of which has been split individually into train and test data. How do I extract training samples from the given files? I have four files. file1: some features file2: some other features with a few columns common to file 1 Detailing with an example file 1 c...
Jaghadish
1

votes
1

answer
70

Views

python pandas loop append dataframe

I am attempting to create a loop that will analyze time series data and average the data 'per day' in a separate pandas dataframe. For now, I make up some fake time series data to get a working program: import pandas as pd import numpy as np time = pd.date_range('6/28/2013', periods=2000, freq='5m...
HenryHub
1

votes
0

answer
37

Views

Convert JSON to Dataframe in Python 3 having an ID

I have to import a file having an ID and a JSON field into a DataFrame. I could import only the JSON into a dataframe using import pandas as pd from pandas.io.json import json_normalize import json with open('file.json', encoding='utf-8') as data_file: cob_data = json.loads(data_file.read()) json_df = p...
Puttur Kamath
1

votes
1

answer
30

Views

Generate pair wise column combination from np.array

I have an np.array of size 500 x 15. How can I generate a new np array with all possible pair-wise combinations of 2 columns from this array? arr = [ [col1],[col2],[col3],..., [col14]] I want the output to contain the combinations [[col1],[col2]] [[col1],[col3]] . . [[col13],[col14]] I can't f...
GoldenPlatinum
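One possible sketch uses `itertools.combinations` over the column indices and stacks the selected column pairs. The input is a made-up 500 x 15 array, and the output layout of (number of pairs, rows, 2) is an assumption about what the asker wants:

```python
from itertools import combinations

import numpy as np

# Made-up 500 x 15 array standing in for the asker's data.
arr = np.arange(500 * 15).reshape(500, 15)

# Every unordered pair of distinct column indices: (0, 1), (0, 2), ..., (13, 14).
index_pairs = list(combinations(range(arr.shape[1]), 2))

# Stack the two selected columns of each pair into one 3-D array.
pairs = np.stack([arr[:, [i, j]] for i, j in index_pairs])

print(len(index_pairs))
print(pairs.shape)
```

Each `pairs[k]` is a 500 x 2 block holding the two columns named by `index_pairs[k]`.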
1

votes
0

answer
33

Views

About the original raw data and the intermediate transformed data

I want to use the Cookiecutter Data Science project structure for my project. I found http://drivendata.github.io/cookiecutter-data-science/ and it looks great. I am analyzing the differences between the directories in their structure and I have some questions related to the different data stages. In the README.m...
bgarcial
1

votes
0

answer
109

Views

How to generate samples using Inverse Transform Sampling with a given customized PDF in Python?

The PDF I have is: $$ P(\varepsilon) = \frac{1}{4k} \mathrm{sech}^2 \frac{\varepsilon}{2k}$$ CDF: $$ CDF(\varepsilon) = \frac{\tanh(\frac{\varepsilon}{2k})}{2} $$ and Inverse CDF: $$ CDF^{-1}(c) = \frac{c}{2\tanh^{-1}(2k_BT)} $$ Now what should I do to generate N samples, with my PDF P(\varepsilon)...
g_tenga
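A hedged sketch of inverse transform sampling for the sech² density quoted above. It assumes the CDF is normalised to run from 0 to 1, i.e. F(ε) = 1/2 + tanh(ε/2k)/2, which gives the inverse ε = 2k·artanh(2u − 1). That inverse differs from the expression quoted in the excerpt, so treat it as an assumption rather than the asker's formula. The function name `sample_sech2` is hypothetical:

```python
import numpy as np


def sample_sech2(n, k, rng):
    """Draw n samples from P(eps) = sech^2(eps / 2k) / (4k).

    Assumes the CDF F(eps) = 1/2 + tanh(eps / 2k) / 2, whose inverse is
    eps = 2k * artanh(2u - 1) for u uniform on (0, 1).
    """
    u = rng.random(n)
    # Keep u strictly inside (0, 1) so artanh never returns +/- infinity.
    u = np.clip(u, 1e-12, 1.0 - 1e-12)
    return 2.0 * k * np.arctanh(2.0 * u - 1.0)


rng = np.random.default_rng(0)
samples = sample_sech2(200_000, k=1.0, rng=rng)
print(samples.mean(), samples.var())
```

Under this assumption the density is a logistic distribution with scale k, so the sample mean should be close to 0 and the sample variance close to k²π²/3.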
1

votes
0

answer
47

Views

How can I reduce the “zeroes” effect on a high dimensional sparse matrix?

I'm a newbie at python and data science and I'm trying to run a multilabel classification. However, I have over 2.000.000 observations and 230 categories to predict. The main problem here is that my sparse matrix will result in a lot of 'zeroes', so the accuracy will be monstrously high (classifying...
Victor Dualibi
1

votes
1

answer
117

Views

Can we apply feature scaling to “independent variable” in a dataset?

I have a dataset with 8 dependent variables (2 categorical data). I have applied ExtraTreeClassifier() to eliminate some of the dependent variables. I also feature-scaled X and y. from sklearn.preprocessing import StandardScaler sc = StandardScaler() X = sc.fit_transform(X) X = sc.transform(X) y = s...
failure_14
1

votes
1

answer
52

Views

Optimize the following section of python code for decaying variables

I'm doing the following operation on a sorted dataset 'df_pre_decay' containing time-series data for multiple IDs, and I want to decay my 'tactic' variables for each ID at different rates (coming from tactic_decay_dict). The created variable for decayed tactic variable 'xyz' will have the same value...
Tushar Khandelwal
1

votes
1

answer
15

Views

Most efficient datatype for iteratively adding to?

I have a web scraper which iteratively retrieves data from web pages, and I would like to add the attributes pulled to a pandas dataframe (eventually) for running simple statistics and analysis. The current script returns a dictionary every time a new page is scraped. I understand adding a new row o...
Caleb Pearson
1

votes
0

answer
68

Views

Combining columns without losing information of their interaction in pandas?

This is a classification task. This is the format of the data set. The first row contains the labels for the patients. I initially wanted to transpose the table to have patient IDs as index but I am not sure how to get columns region, position and gene into one column. If I just combined the columns...
Faliha Zikra
1

votes
1

answer
467

Views

How to read this ROC curve and set custom thresholds?

Using this code : from sklearn import metrics import numpy as np import matplotlib.pyplot as plt y_true = [1,0,0] y_predict = [.6,.1,.1] fpr, tpr, thresholds = metrics.roc_curve(y_true, y_predict , pos_label=1) print(fpr) print(tpr) print(thresholds) # Print ROC curve plt.plot(fpr,tpr) plt.show() y...
blue-sky
1

votes
1

answer
210

Views

Model testing on AWS sagemaker “could not convert string to float”

The XGboost model was trained on AWS sagemaker and deployed successfully but I keep getting the following error: ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (415) from model with message 'could not convert string to float: '. Any thoug...
DSexplorer
1

votes
2

answer
29

Views

How can I count repetitions of cpf in determinate day?

I have the following dataframe: cpf day startdate enddate 1234 1 08/01/2018 12:50:0 08/01/2018 15:50:0 1234 1 08/01/2018 14:30:0 08/01/2018 15:50:0 1234 1 08/01/2018 14:50:0 08/01/2018 15:50:0 1234 2 08/02/2018 20:20:0 08/02/2018 23:50:0 1234 2 08/02/2018...
Mariana
1

votes
0

answer
33

Views

Model Building on Remote Database

Our client has some data in their database and I have to build a statistical model on it. But the client is worried about data security. Is there any way I can make sure that the data is not stored on my PC at all? The client basically does not want the data to be taken out of their envi...
pankaj negi
1

votes
1

answer
81

Views

using toarray() with onehotencoding during data preprocessing

I am new to machine learning. I have one doubt: why is toarray() used with one-hot encoding but not with label encoding here? I can't figure it out. Could someone please help? from sklearn.preprocessing import LabelEncoder, OneHotEncoder label_encoder_x = LabelEncoder() x[:, 0] = label_encoder_x.fit_transfor...
aroN
