# Questions tagged [scikit-learn]

6640 questions

1

votes

1

answer

888

Views

### Feature scaling using python StandardScaler produces negative values

I am a newbie in Machine learning. I am trying to use feature scaling on my input training and test data using the python StandardScaler class. However, when I see the scaled values some of them are negative values even though the input values do not have negative values. Is this normal or am I miss...
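
A minimal sketch of why this happens, assuming a hypothetical single-feature input with only non-negative values. Standardization subtracts the column mean before dividing by the standard deviation, so any value below the mean comes out negative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical non-negative input: one feature, four samples.
X = np.array([[1.0], [2.0], [3.0], [4.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Each value becomes (x - mean) / std. The column mean is 2.5, so the
# first two samples (below the mean) map to negative numbers.
print(X_scaled.ravel())
print("mean:", X_scaled.mean(), "std:", X_scaled.std())
```

The scaled column has mean 0 and unit standard deviation, which is only possible if some values are negative.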

1

votes

3

answer

498

Views

### Get features from sklearn feature union

I have a feature union which uses some custom transformers to select text and parts of a dataframe. I would like to understand which features it's using.
The pipeline selects and transforms columns and then selects k best. I'm able to pull out the features from k best using the following code:
mask...

0

votes

1

answer

12

Views

### How to print sklearn confusion_matrix output from within a function?

confusion_matrix works properly from the command line on my notebook, but I can't seem to make it print its output when found inside a function. It is applied on the same arrays in both cases.
Am I missing something?
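
One possible explanation, sketched under the assumption that the function only evaluates `confusion_matrix(...)` without printing or returning it: a notebook echoes the last expression of a cell, but an expression inside a function body is silently discarded. The function name `report` and the label arrays are hypothetical.

```python
from sklearn.metrics import confusion_matrix

def report(y_true, y_pred):
    # Inside a function nothing is echoed automatically, so the matrix
    # has to be printed explicitly and/or returned to the caller.
    cm = confusion_matrix(y_true, y_pred)
    print(cm)
    return cm

# Hypothetical labels: rows are true classes, columns are predictions.
cm = report([0, 1, 1, 0], [0, 1, 0, 0])
```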

1

votes

2

answer

290

Views

### How to install auto-sklearn on GoogleColab?

I'd like to use auto-sklearn. I used the code from this document. All packages are installed, but I got an error like this.
!curl https://raw.githubusercontent.com/automl/auto-sklearn/master/requirements.txt | xargs -n 1 -L 1 pip install
!pip install auto-sklearn
import autosklearn.classification
---...

1

votes

2

answer

204

Views

### “ImportError: DLL load failed: The specified procedure could not be found” while doing digit recognition with a CNN in Python using Keras

I am trying to write simple character recognition code using a convolutional neural network in Python on Windows. I am following this tutorial, but I am getting the following error message and could not find the reason for it. It would be helpful for me if anyone can breakdow...

1

votes

1

answer

39

Views

### K-means on 3D matrix

I am currently learning k-means and wanted to try it on a 3D matrix; this is the link through which I am passing a 2D matrix.
from sklearn.cluster import KMeans
import numpy as np
X = np.array([[1, 2], [1, 4], [1, 0],
[10, 2], [10, 4], [10, 0]])
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
kmea...
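
A hedged sketch of one way to feed a 3D array to `KMeans`, which expects a 2D input of shape `(n_samples, n_features)`. It assumes the first axis indexes the samples, so the remaining axes can be flattened into a single feature vector; the random array `X3d` is a hypothetical stand-in for the real data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 3D data: 6 samples, each a 4 x 3 block of values.
rng = np.random.RandomState(0)
X3d = rng.rand(6, 4, 3)

# Flatten every sample into one row: (6, 4, 3) -> (6, 12).
X2d = X3d.reshape(len(X3d), -1)

# n_init is set explicitly to avoid depending on the version default.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X2d)
print(kmeans.labels_)
```

Whether flattening is appropriate depends on what the axes mean; if each cell of the matrix is itself a sample, a different reshape would be needed.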

1

votes

1

answer

43

Views

### sklearn.model_selection module not found

I am trying to fit a linear regression to some data, but when I run
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
it gives me this error
line 4, in
from sklearn.model_selection import train_test_split
KeyError: 32
sklearn, numpy and scipy are...

1

votes

2

answer

667

Views

### tf-idf with TfidfVectorizer on Japanese text

I am working with a huge collection of documents written in several languages. I want to compute cosine distance between documents from their tf-idf scores. So far I have:
from sklearn.feature_extraction.text import TfidfVectorizer
# The documents are located in the same folder as the script
text_fi...

1

votes

2

answer

2.3k

Views

### How to print the order of important features in Random Forest regression using python?

I am trying to create a Random Forest regression model on one of my datasets. I need to find the order of importance of each variable, along with their names. I have tried a few things but can't achieve what I want. Below is the sample code I tried on the Boston Housing dataset:
from sklearn.en...
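
A minimal sketch of ranking features by importance, using synthetic data from `make_regression` rather than the dataset in the question. The feature names `f0` to `f3` are hypothetical placeholders for real column names.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data with four features.
X, y = make_regression(n_samples=200, n_features=4, n_informative=2,
                       random_state=0)
feature_names = ["f0", "f1", "f2", "f3"]  # hypothetical names

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# argsort is ascending, so reverse it to list the most important first.
order = np.argsort(model.feature_importances_)[::-1]
ranked = [(feature_names[i], float(model.feature_importances_[i]))
          for i in order]

for name, score in ranked:
    print(f"{name}: {score:.3f}")
```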

0

votes

0

answer

11

Views

### Iterating through Pandas groups to create DataFrames

I have a table containing production data on parts and the variables that were recorded during their production.
FORMAT:
Part | Variable1 | Variable 2 etc
_____________________________
1-----------X---------------X
1-----------X---------------X
2-----------X---------------X
2-----------X-----------...

1

votes

1

answer

4k

Views

### Calculating accuracy scores of predicted continuous values

from sklearn.metrics import accuracy_score
accuracy_score(y_true, y_pred)
I believe this code will return the accuracy of our predictions. However, I am comparing predicted and actual values of continuous values and I believe that most of them are not going to be exactly same.
Should I fit the test...
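
For continuous targets, regression metrics are the usual alternative to `accuracy_score`, which is designed for class labels. The sketch below uses small hypothetical arrays to show three of them.

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical continuous targets and predictions.
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

mae = mean_absolute_error(y_true, y_pred)  # average absolute error
mse = mean_squared_error(y_true, y_pred)   # average squared error
r2 = r2_score(y_true, y_pred)              # 1.0 would be a perfect fit

print(mae, mse, r2)
```

These scores measure how close the predictions are instead of requiring exact equality.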

1

votes

2

answer

3.9k

Views

### How to use inverse_transform in MinMaxScaler for a column in a matrix

I scaled a matrix based on its columns, like this:
scaler = MinMaxScaler(feature_range=(-1, 1))
data = np.array([[-1, 2], [-0.5, 6], [0, 10], [1, 18]])
scaler = scaler.fit(data)
data_scaled = scaler.transform(data)
the data_scaled gave me the following:
array([[-1. , -1. ],
[-0.5, -0.5],
[ 0. , 0...
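
One way to invert a single column, sketched with the data from the excerpt. It relies on the fitted attributes `scale_` and `min_`: the forward transform is `X * scale_ + min_` per column, so the inverse for column `col` can be applied to that column alone. The variable names `col` and `restored` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.array([[-1, 2], [-0.5, 6], [0, 10], [1, 18]])
scaler = MinMaxScaler(feature_range=(-1, 1)).fit(data)
data_scaled = scaler.transform(data)

col = 1  # index of the column to bring back to the original scale

# Undo X_scaled = X * scale_ + min_ for the chosen column only.
restored = (data_scaled[:, col] - scaler.min_[col]) / scaler.scale_[col]
print(restored)
```

An alternative is to place the column into an array with the original number of columns and call `scaler.inverse_transform` on it, then read the column back out.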

1

votes

1

answer

481

Views

### Fitting in nested cross-validation with cross_val_score with pipeline and GridSearch

I am working in scikit and I am trying to tune my XGBoost.
I made an attempt to use a nested cross-validation using the pipeline for the rescaling of the training folds (to avoid data leakage and overfitting) and in parallel with GridSearchCV for param tuning and cross_val_score to get the roc_auc s...

1

votes

1

answer

309

Views

### What could be the reason for “TypeError: 'StratifiedShuffleSplit' object is not iterable”?

I have to deliver a Machine Learning project, and I received a file called tester.py. After finishing my code in another file, I have to run tester.py to see the results, but I am getting an error: TypeError: 'StratifiedShuffleSplit' object is not iterable
I have researched this error in another topics...
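
A common cause of this error, offered as an assumption about the script in question: code written for the old `sklearn.cross_validation` interface iterates over the splitter object itself, while the `sklearn.model_selection` version must be asked for splits through `split(X, y)`. A minimal sketch with hypothetical data:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical data: 10 samples, two balanced classes.
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 5 + [1] * 5)

sss = StratifiedShuffleSplit(n_splits=3, test_size=0.4, random_state=0)

# Iterate over sss.split(X, y), not over sss itself.
splits = list(sss.split(X, y))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```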

1

votes

1

answer

126

Views

### Machine learning algorithm score changes without any change in data or step

I am new to Machine learning and getting started with Titanic problem on Kaggle. I have written a simple algorithm to predict the result on test data.
My question/confusion is, every time, I execute the algorithm with the same dataset and the same steps, the score value changes (last statement in th...
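
One frequent reason for scores that vary between runs is unseeded randomness in the data split or in the estimator. The sketch below assumes that is the cause here and uses synthetic data with a random forest; the helper `run` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset.
X, y = make_classification(n_samples=300, random_state=0)

def run(seed):
    # Seed both the split and the model so the whole run is repeatable.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=seed)
    clf = RandomForestClassifier(n_estimators=20, random_state=seed)
    return clf.fit(X_tr, y_tr).score(X_te, y_te)

first = run(42)
second = run(42)
print(first, second)
```

With the same seed, both runs produce the same score; leaving `random_state` unset lets it vary.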

1

votes

2

answer

50

Views

### Merge distance matrix results and original indices with Python Pandas

I have a pandas DataFrame with a list of bus stops and their geolocations:
stop_id stop_lat stop_lon
0 1 32.183939 34.917812
1 2 31.870034 34.819541
2 3 31.984553 34.782828
3 4 31.888550 34.790904
4 6 31.956576 34.898125
stop_id isn't nece...

1

votes

2

answer

41

Views

### Scikit learn order of coefficients for multiple linear regression and polynomial features

I'm fitting a simple polynomial regression model, and I want to get the coefficients from the fitted model.
Given the prep code:
import pandas as pd
from itertools import product
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline...

1

votes

2

answer

46

Views

### How to perform SMOTE with cross validation in sklearn in python

I have a highly imbalanced dataset and would like to perform SMOTE to balance the dataset and perform cross validation to measure the accuracy. However, most of the existing tutorials use only a single training and testing iteration to perform SMOTE.
Therefore, I would like to know the correct...

1

votes

2

answer

27

Views

### One-class-only folds tested through GridSearchCV

When using GridSearchCV on a custom estimator that is a wrapper on SVC, I get the error:
'ValueError: The number of classes has to be greater than one; got 1 class'
The custom estimator is made to add gridsearch parameters to the estimator and seemed to work fine.
Using the debugger, I found that i...

1

votes

0

answer

13

Views

### Scikit-learn transformer pipeline produces different results than running individually

When I tried using the pipeline to combine a couple of transformers, the second transformer (log) appears not to be applied.
I have tried to simplify the log transformer to perform simple addition but the same problem persists.
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
f...

1

votes

0

answer

406

Views

### features selection for large dataset in python

I have a document-term matrix of dimension 3144469 x 268496 for which I need to do feature selection. I tried doing it with scikit-learn's feature selection using this code:
fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=40)
documenttermmatrix_train= fs.fit_transform(documentter...

1

votes

1

answer

118

Views

### MiniBatchSparsePCA on Text Data

Goal
I'm trying to replicate an application described in this paper (section 4.1), where Sparse Principal Component Analysis is applied to a text corpus with the output being K principal components, each displaying a 'structure that is otherwise hidden'. In other words, the principal components shou...

1

votes

0

answer

532

Views

### Stock price prediction, choosing amount of time in the future using scikit learn

I'm trying to use machine learning to predict stock prices. I'm having issues choosing how far out to predict; I want to be able to predict 100-200 days into the future. It seems like my code is cutting off the last 200 days and adding its prediction there, instead of adding an additional 200 day...

1

votes

0

answer

176

Views

### What is the expected return type for tokenizer that is passed as parameter into Tfidfvectorizer

I am looking at:
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
It just says:
tokenizer : callable or None (default) Override the string
tokenization step while preserving the preprocessing and n-grams
generation steps. Only applies if analyzer...
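
In practice, a callable that takes one string and returns a list of token strings works as a `tokenizer`. The sketch below assumes a hypothetical whitespace tokenizer and two toy documents; the vectorizer may emit a warning that `token_pattern` is unused when a tokenizer is supplied.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def whitespace_tokenizer(text):
    # Hypothetical tokenizer: receives one (already lowercased) document
    # as a str and returns a list of str tokens.
    return text.split()

docs = ["Hello world", "hello there"]

vectorizer = TfidfVectorizer(tokenizer=whitespace_tokenizer)
matrix = vectorizer.fit_transform(docs)

# vocabulary_ maps each token to its column index in the matrix.
vocab = sorted(vectorizer.vocabulary_)
print(vocab, matrix.shape)
```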

1

votes

0

answer

88

Views

### Is there a way to install a specific submodule of a package?

I'm working on a serverless project that uses sklearn.neural_network.MLPClassifier on AWS Lambda.
AWS requires that all dependencies get uploaded with the project during deploy. Is there a way to install only the files needed to use a specific classifier so I can save some bandwidth?

1

votes

1

answer

263

Views

### Zero-padding a compressed sparse matrix (for NLP)?

I am using a recurrent neural network to classify text sentiment. I used TfidfVectorizer to convert the text into counts.
My code is as follows:
vectorizer = TfidfVectorizer(max_features = 5000)
vectorizer.fit(X_train)
Xtrain = vectorizer.fit_transform(X_train)
Xtest = vectorizer.fit_transform(X...
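
Separately from the padding question, the excerpt appears to call `fit_transform` a second time. Assuming that second call is on the test split, it would relearn the vocabulary there, so train and test columns would no longer line up. A minimal sketch of the usual pattern, with hypothetical toy texts:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpora standing in for the real splits.
train_texts = ["good movie", "bad movie", "great film"]
test_texts = ["good film", "terrible plot"]

vectorizer = TfidfVectorizer(max_features=5000)

# Learn the vocabulary and idf weights on the training data only.
X_train_vec = vectorizer.fit_transform(train_texts)

# Reuse the fitted vocabulary for the test data; unseen words are dropped.
X_test_vec = vectorizer.transform(test_texts)

print(X_train_vec.shape, X_test_vec.shape)
```

Both matrices then share the same number of columns, one per term in the training vocabulary.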

1

votes

0

answer

514

Views

### Is passing sklearn tfidf matrix to train MultinomialNB model proper?

I'm doing some text classification tasks. What I have observed is that when fed a tf-idf matrix (from sklearn's TfidfVectorizer), the Logistic Regression model always outperforms the MultinomialNB model. Below is my code for training both:
X = df_new['text_content']
y = df_new['label']
X_train, X_test, y_train,...

1

votes

0

answer

475

Views

### How to access hyperparameters in case of nested cross-validation using scikit-learn

The code for hyperparameter tuning using scikit-learn looks like this:
gs = GridSearchCV(estimator=pipe_svc,
param_grid=param_grid,
scoring='accuracy',
cv=10,
n_jobs=-1)
gs = gs.fit(X_train, y_train)
clf = gs.best_estimator_
clf.fit(X_train, y_train)
where for each combination of hyperparameters K-f...
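
A hedged sketch of one way to see the hyperparameters chosen in each outer fold: pass the `GridSearchCV` object to `cross_validate` with `return_estimator=True`, then read `best_params_` from each fitted search. The iris data, the pipeline steps, and the small grid are assumptions made to keep the example self-contained.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Assumed stand-in for pipe_svc and param_grid from the question.
pipe_svc = make_pipeline(StandardScaler(), SVC())
param_grid = {"svc__C": [0.1, 1.0, 10.0]}

# Inner loop: hyperparameter search.
gs = GridSearchCV(estimator=pipe_svc, param_grid=param_grid,
                  scoring="accuracy", cv=3)

# Outer loop: keep the fitted search object from every outer fold.
outer = cross_validate(gs, X, y, cv=3, return_estimator=True)

best_params = [est.best_params_ for est in outer["estimator"]]
print(best_params)
print(outer["test_score"])
```

The selected parameters can differ between outer folds, which is itself a useful signal of how stable the tuning is.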

1

votes

2

answer

361

Views

### How to make polynomial features using sparse matrix in Scikit-learn

I am using scikit-learn to convert my training data to polynomial features and then fit a linear model to it.
model = Pipeline([('poly', PolynomialFeatures(degree=3)),
('linear', LinearRegression(fit_intercept=False))])
model.fit(X, y)
But it throws an error
TypeError: A sparse matrix was passed,...

1

votes

0

answer

329

Views

### Inferior performance of Tensorflow compared to sklearn

I'm comparing the performance of Tensorflow with sklearn on two datasets:
A toy dataset in sklearn
MNIST dataset
Here is my code (Python):
from __future__ import print_function
# Import MNIST data
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('/tmp/dat...

1

votes

1

answer

71

Views

### How to create an easy to explain regression model in Python with categorical features?

I have a dataset that looks like this, where each row is a user.
gender age_group c1 c2 c3 total_cost
F 0-10 10 F1234 3456 135.2
F 65-100 10 G5143 876 523.6
M 18-35 15 F3457 876 98.5
F 0-10 10 F1234 545 1052.1
M 35-65 2...

1

votes

1

answer

674

Views

### Grouping data by sklearn.model_selection.GroupShuffleSplit

I have a dataset in a CSV with header as
PRODUCT_ID CATEGORY_NAME PRODUCT_TYPE DISPLAY_COLOR_NAME IMAGE_ID
with the same product having multiple rows, each with a different image_id. I made the image ID the index column when reading the CSV into a pandas data frame.
I want to create test and...
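
A minimal sketch of a group-aware split, assuming the goal is to keep all rows of one product on the same side of the split. The `product_ids` array is hypothetical and stands in for the `PRODUCT_ID` column.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical product IDs: each product appears in two rows.
product_ids = np.array(["p1", "p1", "p2", "p2", "p3", "p3", "p4", "p4"])
X = np.arange(len(product_ids)).reshape(-1, 1)

# test_size is a fraction of the groups, not of the rows.
gss = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(gss.split(X, groups=product_ids))

train_groups = set(product_ids[train_idx])
test_groups = set(product_ids[test_idx])
print(train_groups, test_groups)
```

No product appears in both sets, so images of the same product cannot leak from training into testing.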

1

votes

0

answer

359

Views

### Area of a cluster with DBSCAN

I've an array of X and Y coordinates of some points spread in a field. I want to know if these points form clusters.
This coordinate array is a numpy array, where the first column is the X and the second column the Y. Here is an example:
>>> ch1_data[0:5]
[[ 11743. 17707.7]
[ 15850.9 16474. ]
[...

1

votes

0

answer

83

Views

### Why MultinomialNB outputs 0.5 when there is only one feature?

A naive Bayesian classification problem
Code
import numpy as np
X = np.array([[1], [2], [3], [4], [5], [6]], dtype=int)
Y = np.array([1, 1, 1, 0, 0, 0], dtype=int)
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.partial_fit(X, Y, [0, 1])
print(clf.predict_proba(X))
Output
[[ 0...

1

votes

0

answer

154

Views

### Role of coefficient in Multinomial NB

My MultinomialNB classifier was instantiated and trained on vectorized fake/real news articles, and now I'm trying to understand the meaning behind the coefficients.
nb_classifier = MultinomialNB()
# Extracting the class labels: ('Fake' or 'Real')
class_labels = nb_classifier.classes_
# Extract the...

1

votes

0

answer

211

Views

### How to compute custom loss for tensorflow using numpy/sklearn based on the predictions

I'm facing an issue combining numpy with tensorflow. For instance, I want to create a custom loss function and use it for the training.
Let the loss function be
loss = tf.reduce_mean(C * tf.nn.softmax_cross_entropy_with_logits(logits=self.y_logits, labels=self.y10))
where C is some value computed bas...

1

votes

0

answer

256

Views

### How to insert Training and Testing data into Tensorflow's high API WITHOUT manually splitting the CSV myself

Everyone, I found it super inefficient to literally create and prepare two CSV's - training and testing data - to then feed them separately into Tensorflow's high API (tf.estimator).
Is there a way to make this process more efficient? I want to do something like sklearn's model_selection module whe...

1

votes

2

answer

416

Views

### Overriding tokenizer of scikitlearn vectorizer with spacy

I want to implement lemmatization with Spacy package.
Here is my code :
regexp = re.compile( '(?u)\\b\\w\\w+\\b' )
en_nlp = spacy.load('en')
old_tokenizer = en_nlp.tokenizer
en_nlp.tokenizer = lambda string: old_tokenizer.tokens_from_list(regexp.findall(string))
def custom_tokenizer(document):
doc_...

1

votes

1

answer

280

Views

### sklearn PCA random_state parameter function

I am using PCA to visualize clusters and noticed that sklearn added a 'random_state' parameter to PCA (http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html). My question is: what does the random_state parameter do there?
My understanding is that PCA should return the sa...

1

votes

0

answer

374

Views

### Machine learning - PCA and KNN on rgb images is too slow (python)

I work with Python and images of tables (taken from above). My aim is to take a photo of a random table and then find the most similar tables to it in my database. Obviously, the main feature which distinguishes the tables is their shape (square, rectangular, round, oval) but there are also other d...