# Questions tagged [scikit-learn]

6620 questions

1

votes

0

answer

406

Views

### Feature selection for a large dataset in Python

I have a document-term matrix of dimension 3144469 x 268496 for which I need to do feature selection. I tried doing it with scikit-learn's feature selection, using this code:
fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=40)
documenttermmatrix_train= fs.fit_transform(documentter...

1

votes

1

answer

118

Views

### MiniBatchSparsePCA on Text Data

Goal
I'm trying to replicate an application described in this paper (section 4.1), where Sparse Principal Component Analysis is applied to a text corpus with the output being K principal components, each displaying a 'structure that is otherwise hidden'. In other words, the principal components shou...

1

votes

0

answer

532

Views

### Stock price prediction, choosing amount of time in the future using scikit learn

I'm trying to use machine learning to predict stock prices. I'm having trouble choosing how far out to predict; I want to be able to predict 100-200 days into the future. It seems like my code is cutting off the last 200 days and adding its prediction there, instead of adding an additional 200 day...

1

votes

0

answer

176

Views

### What is the expected return type for a tokenizer passed as a parameter to TfidfVectorizer?

I am looking at:
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
It just says:
tokenizer : callable or None (default) Override the string
tokenization step while preserving the preprocessing and n-grams
generation steps. Only applies if analyzer...
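
For reference, such a callable takes one document string and returns a sequence of string tokens. A minimal sketch, written without scikit-learn so it runs on its own; `simple_tokenizer` is a hypothetical name, and the regular expression is assumed to mirror the vectorizer's default `token_pattern`:

```python
import re

# Hypothetical tokenizer: one document string in, a list of string tokens out.
_TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")


def simple_tokenizer(document):
    """Lowercase the document and return tokens of two or more word characters."""
    return _TOKEN_RE.findall(document.lower())


tokens = simple_tokenizer("Sparse PCA on a text corpus")
print(tokens)
```

A function of this shape would then be passed as `TfidfVectorizer(tokenizer=simple_tokenizer)`.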

1

votes

0

answer

88

Views

### Is there a way to install a specific submodule of a package?

I'm working on a serverless project using sklearn.neural_network.MLPClassifier using AWS Lambda.
AWS requires that all dependencies be uploaded with the project during deployment. Is there a way to install only the files needed to use a specific classifier, so I can save some bandwidth?

1

votes

1

answer

263

Views

### Zero-padding a compressed sparse matrix (for NLP)?

I am using a recurrent neural network to classify text sentiment. I used TfidfVectorizer to convert the text into counts.
My code is as follows:
vectorizer = TfidfVectorizer(max_features = 5000)
vectorizer.fit(X_train)
Xtrain = vectorizer.fit_transform(X_train)
Xtest = vectorizer.fit_transform(X...

1

votes

0

answer

514

Views

### Is passing sklearn tfidf matrix to train MultinomialNB model proper?

I'm doing some text classification tasks. I have observed that when fed a tf-idf matrix (from sklearn's TfidfVectorizer), the Logistic Regression model always outperforms the MultinomialNB model. Below is my code for training both:
X = df_new['text_content']
y = df_new['label']
X_train, X_test, y_train,...

1

votes

0

answer

475

Views

### How to access hyperparameters in case of nested cross-validation using scikit-learn

The code for hyperparameter tuning using scikit-learn looks like this:
gs = GridSearchCV(estimator=pipe_svc,
param_grid=param_grid,
scoring='accuracy',
cv=10,
n_jobs=-1)
gs = gs.fit(X_train, y_train)
clf = gs.best_estimator_
clf.fit(X_train, y_train)
where for each combination of hyperparameters K-f...

1

votes

2

answer

361

Views

### How to make polynomial features using sparse matrix in Scikit-learn

I am using scikit-learn to convert my training data to polynomial features and then fit a linear model to it.
model = Pipeline([('poly', PolynomialFeatures(degree=3)),
('linear', LinearRegression(fit_intercept=False))])
model.fit(X, y)
But it throws an error
TypeError: A sparse matrix was passed,...

1

votes

0

answer

329

Views

### Inferior performance of Tensorflow compared to sklearn

I'm comparing the performance of Tensorflow with sklearn on two datasets:
A toy dataset in sklearn
MNIST dataset
Here is my code (Python):
from __future__ import print_function
# Import MNIST data
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('/tmp/dat...

1

votes

1

answer

71

Views

### How to create an easy to explain regression model in Python with categorical features?

I have a dataset that looks like this, where each row is a user.
gender age_group c1 c2 c3 total_cost
F 0-10 10 F1234 3456 135.2
F 65-100 10 G5143 876 523.6
M 18-35 15 F3457 876 98.5
F 0-10 10 F1234 545 1052.1
M 35-65 2...

1

votes

1

answer

674

Views

### Grouping data by sklearn.model_selection.GroupShuffleSplit

I have a dataset in a CSV with header as
PRODUCT_ID CATEGORY_NAME PRODUCT_TYPE DISPLAY_COLOR_NAME IMAGE_ID
with the same product having multiple rows, each with a different image_id. I made the image id the index column when reading the CSV into a pandas data frame.
I want to create test and...
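
The idea behind a group-aware split can be sketched in plain Python: shuffle the distinct group ids, assign whole groups to the test side, and only then map back to row indices. This is an illustration under assumptions, not the library's implementation; `group_shuffle_split` and the product ids are hypothetical.

```python
import random


def group_shuffle_split(groups, test_size=0.25, seed=0):
    """Return (train_idx, test_idx) so that no group appears on both sides."""
    unique_groups = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique_groups)
    # At least one whole group always goes to the test side.
    n_test = max(1, round(test_size * len(unique_groups)))
    test_groups = set(unique_groups[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx


# Hypothetical product ids: several image rows share one product.
product_ids = ["p1", "p1", "p2", "p3", "p3", "p3", "p4", "p4"]
train_idx, test_idx = group_shuffle_split(product_ids, test_size=0.25, seed=0)
print(train_idx, test_idx)
```

With scikit-learn itself, the equivalent step is to pass the product id column as `groups` to `GroupShuffleSplit.split`.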

1

votes

0

answer

359

Views

### Area of a cluster with DBSCAN

I've an array of X and Y coordinates of some points spread in a field. I want to know if these points form clusters.
This coordinate array is a numpy array, where the first column is the X and the second column the Y. Here is an example:
>>> ch1_data[0:5]
[[ 11743. 17707.7]
[ 15850.9 16474. ]
[...

1

votes

0

answer

83

Views

### Why does MultinomialNB output 0.5 when there is only one feature?

A naive Bayesian classification problem
Code
import numpy as np
X = np.array([[1], [2], [3], [4], [5], [6]], dtype=int)
Y = np.array([1, 1, 1, 0, 0, 0], dtype=int)
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.partial_fit(X, Y, [0, 1])
print clf.predict_proba(X)
Output
[[ 0...
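
A sketch of the arithmetic, assuming the standard smoothed multinomial naive Bayes estimate: with a single feature, each class's feature probability is (count + alpha) / (count + alpha) = 1, so the likelihood term vanishes and the posterior equals the class prior. The function below is a hypothetical re-implementation for illustration, not scikit-learn's code.

```python
import math


def multinomial_nb_posteriors(X, y, alpha=1.0):
    """Posterior class probabilities of a smoothed multinomial naive Bayes fit on (X, y)."""
    classes = sorted(set(y))
    n_features = len(X[0])
    log_prior, log_theta = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Smoothed per-feature counts, normalised over all features of the class.
        counts = [sum(r[j] for r in rows) + alpha for j in range(n_features)]
        total = sum(counts)
        log_theta[c] = [math.log(v / total) for v in counts]
    posteriors = []
    for x in X:
        joint = [
            log_prior[c] + sum(xj * t for xj, t in zip(x, log_theta[c]))
            for c in classes
        ]
        m = max(joint)
        exp = [math.exp(v - m) for v in joint]
        posteriors.append([e / sum(exp) for e in exp])
    return posteriors


X = [[1], [2], [3], [4], [5], [6]]
Y = [1, 1, 1, 0, 0, 0]
probs = multinomial_nb_posteriors(X, Y)
print(probs)
```

Because both classes have three samples each, every row comes out as equal probabilities for the two classes.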

1

votes

0

answer

154

Views

### Role of coefficient in Multinomial NB

My MultinomialNB classifier was instantiated and trained on vectorized fake/real news articles, and now I'm trying to understand the meaning behind the coefficients.
nb_classifier = MultinomialNB()
# Extracting the class labels: ('Fake' or 'Real')
class_labels = nb_classifier.classes_
# Extract the...

1

votes

0

answer

211

Views

### How to compute custom loss for tensorflow using numpy/sklearn based on the predictions

I'm facing an issue combining numpy with tensorflow. For instance, I want to create a custom loss function and use it for the training.
Let the loss function be
loss = tf.reduce_mean(C * tf.nn.softmax_cross_entropy_with_logits(logits=self.y_logits, labels=self.y10))
where C is some value computed bas...

1

votes

0

answer

256

Views

### How to insert Training and Testing data into Tensorflow's high API WITHOUT manually splitting the CSV myself

Everyone, I found it super inefficient to literally create and prepare two CSVs (training and testing data) and then feed them separately into Tensorflow's high-level API (tf.estimator).
Is there a way to make this process more efficient? I want to do something like sklearn's model_selection module whe...

1

votes

2

answer

416

Views

### Overriding tokenizer of scikitlearn vectorizer with spacy

I want to implement lemmatization with the spaCy package.
Here is my code:
regexp = re.compile( '(?u)\\b\\w\\w+\\b' )
en_nlp = spacy.load('en')
old_tokenizer = en_nlp.tokenizer
en_nlp.tokenizer = lambda string: old_tokenizer.tokens_from_list(regexp.findall(string))
def custom_tokenizer(document):
doc_...

1

votes

1

answer

280

Views

### sklearn PCA random_state parameter function

I am using PCA to visualize clusters and noticed that sklearn added a 'random_state' parameter to the PCA method (http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html). My question is: what does the random_state parameter do there?
My understanding is that PCA should return the sa...

1

votes

0

answer

374

Views

### Machine learning - PCA and KNN on rgb images is too slow (python)

I work with Python and images of tables (taken from above). My aim is to take a photo of a random table and then find the most similar tables to it in my database. Obviously, the main feature which distinguishes the tables is their shape (square, rectangular, round, oval) but there are also other d...

0

votes

0

answer

17

Views

### Cannot import sklearn module in Python

When using the following line of code:
from sklearn import datasets
I get the error
ImportError: No module named sklearn
I have tried installing sklearn with:
pip3 install -U sklearn
and
pip3 install scikit-learn
But that seems to have no effect. The packages are installed in
sklearn
/Library/Framew...
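
One common cause of this symptom is that the package was installed for a different interpreter than the one running the script. A small diagnostic sketch using only the standard library; `is_importable` is a hypothetical helper name:

```python
import importlib.util
import sys


def is_importable(module_name):
    """Return True if this interpreter can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


# Shows which Python binary is actually executing the script.
print("interpreter:", sys.executable)
print("sklearn importable:", is_importable("sklearn"))
```

If the interpreter path printed is not the one `pip3` installs into, running `python -m pip install scikit-learn` with that same interpreter ties the package to it.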

1

votes

0

answer

110

Views

### The comparison of manifold learning example of scikit learn is not working with my dataset

I replaced the X and color variables of the example below with my dataset, which has exactly the same shape as the original.
I get the error below:
C:\ProgramData\Anaconda3\lib\site-packages\scipy\linalg\decomp_lu.py:71: RuntimeWarning: Diagonal number 1 is exactly zero. Singular matrix.
RuntimeWarning...

1

votes

0

answer

43

Views

### feature hashing on image features (ORB) - scikit

I am trying to train an SVM using ORB features for an image. Each ORB point has 32 integer values showing the intensity for the keypoint. The number of keypoints is variable, so I am looking into using feature hashing.
img = cv2.imread(img_name,cv2.IMREAD_GRAYSCALE)
if img is None: continue
gray = c...

1

votes

0

answer

132

Views

### Sklearn MultinomialNB gives 1 probability for some class for few examples?

I used MultinomialNB from sklearn for some text data.
The data contains 12 classes.
It is a classification task.
After applying MultinomialNB with CountVectorizer, I checked the predicted class probabilities of a few examples. For some reason, one class shows a probability of 1.0.
[[ 3.91049692e-23 , 2.50074669e-...

1

votes

0

answer

20

Views

### Sklearn import issue with cron

I have a Python script that uses the sklearn package. It runs fine when I launch it from the terminal.
But when I try to run this script in a cron job, it fails!
It seems that the import fails:
from sklearn.neural_network import MLPRegressor
from sklearn.externals import joblib
How can I solve this issue?

1

votes

0

answer

265

Views

### KerasClassifier cannot be used as estimator in VotingClassifier

I am using scikit-learn and Keras for Machine Learning in python. I started to use the KerasClassifier wrapper for wrapping a sequential model of Keras to make it compatible with scikit-learn classifier related functions and API. Most of the functionality is easy to use (I could use the methods fit,...

1

votes

2

answer

44

Views

### Python: how to train the same model different times?

I have a small dataset and I want to try to predict the value of some variables using the Multi-layer Perceptron regressor from sklearn.
This is what I am doing:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import M...

1

votes

0

answer

123

Views

### Can Sklearn Imputer distinguish real zero values from zeros in place for missing values?

I am using scipy sparse matrices, which fill missing values in with zeros. I want to impute these missing values with the mean of the respective columns, and planned to do this with the Imputer in sklearn. My question: does the Imputer distinguish real 0 values in the data from missing value...
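
The underlying difficulty can be shown without any library: once a missing entry has been stored as a plain 0, it carries no marker that separates it from a measured 0. A minimal sketch with hypothetical data, using `None` as an explicit missing marker:

```python
def impute_column_means(rows):
    """Fill None entries with the mean of the non-None values in the same column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]


# One column: a real zero, two observed values, one explicitly missing entry.
explicit = [[0.0], [4.0], [8.0], [None]]
filled_explicit = impute_column_means(explicit)

# Same data after an encoding that stores the missing entry as 0.
as_zero = [[0.0], [4.0], [8.0], [0.0]]
filled_as_zero = impute_column_means(as_zero)

print(filled_explicit)
print(filled_as_zero)
```

In the first case the real zero contributes to the column mean and only the marked entry is filled; in the second there is nothing left for the imputer to find.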

1

votes

1

answer

114

Views

### Compare cross_val_score before and after log transformation

I am playing around with the houseprices dataset from kaggle (link) and xgboost.
To improve my model, I want to evaluate whether it makes sense to perform a log transformation on the target variable (sale prices of houses). I am measuring the performance of my model with neg_mean_absolute_error in cr...
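
A point that matters for such a comparison is that both errors must be expressed in the same units. The sketch below uses hypothetical prices and predictions, and assumes a `log1p` transform on the target; it inverts the transform before computing the mean absolute error.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical sale prices and two sets of out-of-fold predictions.
prices = [100000.0, 200000.0, 400000.0]
pred_raw = [110000.0, 190000.0, 380000.0]  # model trained on raw prices
pred_log = [math.log1p(p) for p in (105000.0, 210000.0, 390000.0)]  # model trained on log1p(price)

mae_raw = mean_absolute_error(prices, pred_raw)
# Invert the transform first so both errors are in price units.
mae_log_on_price_scale = mean_absolute_error(prices, [math.expm1(p) for p in pred_log])
# An error measured in log units is not comparable with mae_raw.
mae_in_log_units = mean_absolute_error([math.log1p(t) for t in prices], pred_log)

print(mae_raw, mae_log_on_price_scale, mae_in_log_units)
```

Only `mae_raw` and `mae_log_on_price_scale` can be compared directly; the log-unit figure is far smaller simply because of its scale.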

1

votes

0

answer

269

Views

### Getting names and number of selected features before giving to a classifier in sklearn pipeline

I am using sel = SelectFromModel(ExtraTreesClassifier(10), threshold='mean') to select the most important features in my data set.
Then I want to feed these selected features to my Keras classifier. But my Keras-based neural network classifier needs the number of important features selected in the fi...

1

votes

0

answer

168

Views

### sklearn.linear_model causing ImportError when opening .exe created with PyInstaller

This is my first post to Stack Overflow, though I've been using this site for a good long while now. Excited to be in the community.
(I'm fairly new to python, and especially to PyInstaller) I'm developing a data exploration GUI based on tkinter that relies heavily on pandas and numpy and also ut...

1

votes

1

answer

622

Views

### to find class distribution for breast cancer data set - python

As part of the assignment for the applied machine learning course in Python (assignment 1, question 2), I have to find the class distribution of the breast cancer data set (sklearn.datasets). The code I used is given below. The function answer_one converts the data set into a data frame of 569x30 (...

1

votes

0

answer

273

Views

### Dask client.map returns KeyError on dask dataframe

I'm trying to create an updated example of random forest classification using python dask, as originally described here.
When I attempt to pass a training set to the Client.map function, it's throwing a KeyError and I'm not sure what I'm doing wrong based on the error message.
Here's what I have:
fr...

1

votes

0

answer

39

Views

### How to deploy scikit-learn classifier model into ANN written in another language

I have a scikit-learn classifier:
MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
beta_2=0.999, early_stopping=False, epsilon=1e-08,
hidden_layer_sizes=(13, 13, 13), learning_rate='constant',
learning_rate_init=0.001, max_iter=500, momentum=0.9,
nesterovs_momentum=True,...

1

votes

0

answer

408

Views

### Scikit unknown label type

I've been trying to learn a bit of machine learning by importing some data from an SQL database.
However, I encountered the following problem:
raise ValueError('Unknown label type: %s' % repr(ys)) ValueError:
Unknown label type: (array([-1634.4, -1534.2, -1497.8, ..., 1670.6,
1733.6, 1835. ]),)...

1

votes

0

answer

73

Views

### Unstable behavior of OneClassSVM by changing 'nu'

In the example above, I'm using my dataset to identify outliers. After making slight changes to the nu parameter, there is a huge difference in the number of anomalies identified.
Could this be just a particularity of the dataset? Or a bug in scikit-learn?
P.S. Unfortunately I cannot share the datas...

1

votes

0

answer

162

Views

### RMSE error doesn't converge towards the same value depending on the train/test ratio

I am trying to find a reliable testing method to compute the error of my model / training parameters, but I am seeing weird results when I play with the train/test ratio.
When I change the ratio of my train/test data, the RMSE converges towards different values, see below:
You can see the test ratio...

1

votes

0

answer

444

Views

### How to estimate the error in Leave One Out Cross Validation

Is there a proper way to perform LOOCV and estimate the error as the mean over all the training data?
I already perform K-fold cross-validation as follows:
scores = cross_val_score(clf, X, y, cv = 10, scoring = 'accuracy')
print('mean score and the 95% confidence interval of the score estimate')
print('A...
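
What leave-one-out does can be sketched by hand: each sample is held out once, the model is trained on the rest, and the per-sample hits are averaged. The data and the nearest-neighbour stand-in classifier below are hypothetical and chosen only to keep the example self-contained.

```python
def loocv_accuracy(X, y, fit_predict):
    """Hold out each sample once, train on the rest, and average the hits."""
    hits = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        hits += int(fit_predict(X_train, y_train, X[i]) == y[i])
    return hits / len(X)


def nearest_neighbour(X_train, y_train, x):
    """Hypothetical stand-in classifier: label of the closest training point."""
    distances = [abs(a - x) for a in X_train]
    return y_train[distances.index(min(distances))]


X = [1.0, 1.2, 1.4, 5.0, 5.3, 5.6]
y = [0, 0, 0, 1, 1, 1]
accuracy = loocv_accuracy(X, y, nearest_neighbour)
print(accuracy)
```

In scikit-learn the same loop is obtained by passing `cv=LeaveOneOut()` to `cross_val_score`; each fold then yields a 0 or 1, and the mean of those values is the LOOCV accuracy.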

1

votes

0

answer

399

Views

### Does scikit-cuda support the newest version of pycuda (9.1) or do I have to revert to 7.5?

I have installed pycuda and scikit cuda via pip install on python 3.6 and have been trying to run the following example from scikit cuda:
from __future__ import print_function
import pycuda.autoinit
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray
import numpy as np
import skcuda.linal...

1

votes

1

answer

142

Views

### How to implement 'And' function in perceptron in scikit-learn

I am a newbie to machine learning and scikit-learn. I was trying to implement the 'and' function in scikit-learn and have written a small piece of code, shown below:
import pandas as pd
from pandas import Series,DataFrame
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection imp...
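
For comparison, the logical AND is linearly separable, so the classic perceptron update rule learns it in a handful of passes. The following is a from-scratch sketch in plain Python rather than the scikit-learn estimator; the function names, learning rate and epoch count are assumptions made for illustration.

```python
def train_perceptron(samples, labels, epochs=20, lr=1.0):
    """Classic perceptron rule: weights change only on misclassified samples."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), target in zip(samples, labels):
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            error = target - pred  # zero when the sample is already correct
            w[0] += lr * error * x1
            w[1] += lr * error * x2
            b += lr * error
    return w, b


def predict(w, b, x):
    """Threshold the weighted sum at zero."""
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0


inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
targets = [0, 0, 0, 1]
w, b = train_perceptron(inputs, targets)
predictions = [predict(w, b, x) for x in inputs]
print(predictions)
```

Note that the four truth-table rows are the entire dataset here, so holding some of them out for testing would remove information the model needs.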