Questions tagged [scikit-learn]

1 vote, 0 answers, 406 views

Feature selection for a large dataset in Python

I have a document-term matrix of dimension 3144469 x 268496 on which I need to do feature selection. I tried doing it with scikit-learn's feature selection using the code fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=40) documenttermmatrix_train= fs.fit_transform(documentter...
Ranjana Girish
1 vote, 1 answer, 118 views

MiniBatchSparsePCA on Text Data

Goal: I'm trying to replicate an application described in this paper (section 4.1), where Sparse Principal Component Analysis is applied to a text corpus with the output being K principal components, each displaying a 'structure that is otherwise hidden'. In other words, the principal components shou...
SeánMcK
1 vote, 0 answers, 532 views

Stock price prediction, choosing amount of time in the future using scikit learn

I'm trying to use machine learning to predict stock prices. I'm having issues choosing how far out to predict; I want to be able to predict 100-200 days into the future. It seems like my code is cutting off the last 200 days and adding its prediction there, instead of adding an additional 200 day...
phillyphil
1 vote, 0 answers, 176 views

What is the expected return type for a tokenizer that is passed as a parameter into TfidfVectorizer

I am looking at: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html It just says: tokenizer : callable or None (default) Override the string tokenization step while preserving the preprocessing and n-grams generation steps. Only applies if analyzer...
Kaizer Sozay
1 vote, 0 answers, 88 views

Is there a way to install a specific submodule of a package?

I'm working on a serverless project that uses sklearn.neural_network.MLPClassifier on AWS Lambda. AWS requires that all dependencies get uploaded with the project during deploy. Is there a way to install only the files needed to use a specific classifier so I can save some bandwidth?
Ramon Balthazar
1 vote, 1 answer, 263 views

Zero-padding a compressed sparse matrix (for NLP)?

I am using a recurrent neural network to classify text sentiment. I used TfidfVectorizer to convert the text into counts. My code is as follows: vectorizer = TfidfVectorizer(max_features = 5000) vectorizer.fit(X_train) Xtrain = vectorizer.fit_transform(X_train) Xtest = vectorizer.fit_transform(X...
anticavity123
1 vote, 0 answers, 514 views

Is passing sklearn tfidf matrix to train MultinomialNB model proper?

I'm doing some text classification tasks. I have observed that, when fed a tf-idf matrix (from sklearn's TfidfVectorizer), the Logistic Regression model always outperforms the MultinomialNB model. Below is my code for training both: X = df_new['text_content'] y = df_new['label'] X_train, X_test, y_train,...
ZEE
1 vote, 0 answers, 475 views

How to access hyperparameters in case of nested cross-validation using scikit-learn

The code for hyperparameter tuning using scikit-learn looks like this: gs = GridSearchCV(estimator=pipe_svc, param_grid=param_grid, scoring='accuracy', cv=10, n_jobs=-1) gs = gs.fit(X_train, y_train) clf = gs.best_estimator_ clf.fit(X_train, y_train) where for each combination of hyperparameters K-f...
Royalblue
1 vote, 2 answers, 361 views

How to make polynomial features using sparse matrix in Scikit-learn

I am using scikit-learn to convert my training data to polynomial features and then fit a linear model to them. model = Pipeline([('poly', PolynomialFeatures(degree=3)), ('linear', LinearRegression(fit_intercept=False))]) model.fit(X, y) But it throws an error TypeError: A sparse matrix was passed,...
Niyamat Ullah
1 vote, 0 answers, 329 views

Inferior performance of Tensorflow compared to sklearn

I'm comparing the performance of Tensorflow with sklearn on two datasets: a toy dataset in sklearn and the MNIST dataset. Here is my code (Python): from __future__ import print_function # Import MNIST data from tensorflow.examples.tutorials.mnist import input_data mnist = input_data.read_data_sets('/tmp/dat...
lenhhoxung
1 vote, 1 answer, 71 views

How to create an easy to explain regression model in Python with categorical features?

I have a dataset that looks like this, where each row is a user.

gender  age_group  c1  c2     c3    total_cost
F       0-10       10  F1234  3456  135.2
F       65-100     10  G5143  876   523.6
M       18-35      15  F3457  876   98.5
F       0-10       10  F1234  545   1052.1
M       35-65      2...
sfactor
1 vote, 1 answer, 674 views

Grouping data by sklearn.model_selection.GroupShuffleSplit

I have a dataset in a CSV with the header PRODUCT_ID CATEGORY_NAME PRODUCT_TYPE DISPLAY_COLOR_NAME IMAGE_ID, where the same product has multiple rows, each with a different image_id. I made IMAGE_ID the index column when reading the CSV into a pandas data frame. I want to create test and...
Aravind Chamakura
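
For the grouping question above, the property that matters is that every row sharing a PRODUCT_ID ends up on the same side of the split. The sketch below illustrates that idea in plain Python. It is not scikit-learn's GroupShuffleSplit implementation; the `group_shuffle_split` helper, its parameters, and the product IDs are hypothetical.

```python
import random


def group_shuffle_split(groups, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) so that no group appears on both sides."""
    unique_groups = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique_groups)
    # Hold out whole groups (at least one), never individual rows.
    n_test = max(1, round(len(unique_groups) * test_fraction))
    test_groups = set(unique_groups[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx


# Hypothetical PRODUCT_ID column: one entry per image row.
product_ids = ["p1", "p1", "p2", "p3", "p3", "p3", "p4", "p4"]
train_idx, test_idx = group_shuffle_split(product_ids, test_fraction=0.25, seed=0)
train_products = {product_ids[i] for i in train_idx}
test_products = {product_ids[i] for i in test_idx}
```

In scikit-learn itself, the `split` method of `GroupShuffleSplit` takes a `groups` argument for the same purpose, so the product ID column would be passed there rather than used as a feature.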
1 vote, 0 answers, 359 views

Area of a cluster with DBSCAN

I have an array of X and Y coordinates of some points spread in a field. I want to know if these points form clusters. This coordinate array is a numpy array, where the first column is the X and the second column the Y. Here is an example: >>> ch1_data[0:5] [[ 11743. 17707.7] [ 15850.9 16474. ] [...
Rg111
1 vote, 0 answers, 83 views

Why does MultinomialNB output 0.5 when there is only one feature?

A naive Bayesian classification problem Code import numpy as np X = np.array([[1], [2], [3], [4], [5], [6]], dtype=int) Y = np.array([1, 1, 1, 0, 0, 0], dtype=int) from sklearn.naive_bayes import MultinomialNB clf = MultinomialNB() clf.partial_fit(X, Y, [0, 1]) print clf.predict_proba(X) Output [[ 0...
hijkzzz
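
One way to see where 0.5 can come from in the question above: in a multinomial model the per-class feature probabilities are normalised across features, so with a single feature that probability is 1 for every class and the likelihood carries no information. The posterior then equals the class prior, which is 0.5 for balanced classes. The sketch below is a hand-rolled multinomial naive Bayes in plain Python, written under that assumption; it is not scikit-learn's code, and `multinomial_nb_posteriors` is a hypothetical helper.

```python
import math


def multinomial_nb_posteriors(X, y, alpha=1.0):
    """Posterior class probabilities from a hand-rolled multinomial naive Bayes."""
    classes = sorted(set(y))
    n_features = len(X[0])
    log_prior = {}
    log_theta = {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Smoothed per-feature counts, normalised over the features of this class.
        counts = [sum(r[j] for r in rows) + alpha for j in range(n_features)]
        total = sum(counts)
        log_theta[c] = [math.log(cnt / total) for cnt in counts]
    posteriors = []
    for x in X:
        joint = [
            log_prior[c] + sum(v * lt for v, lt in zip(x, log_theta[c]))
            for c in classes
        ]
        norm = math.log(sum(math.exp(j) for j in joint))
        posteriors.append([math.exp(j - norm) for j in joint])
    return posteriors


# Same data as in the question: one feature, two balanced classes.
X = [[1], [2], [3], [4], [5], [6]]
Y = [1, 1, 1, 0, 0, 0]
probs = multinomial_nb_posteriors(X, Y)
```

With one feature, every entry of `log_theta` is `log(1) = 0`, so each row of `probs` reduces to the class priors regardless of the feature value.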
1 vote, 0 answers, 154 views

Role of coefficient in Multinomial NB

My MultinomialNB classifier was instantiated and trained on vectorized fake/real news articles, and now I'm trying to understand the meaning behind the coefficients. nb_classifier = MultinomialNB() # Extracting the class labels: ('Fake' or 'Real') class_labels = nb_classifier.classes_ # Extract the...
Adam Schroeder
1 vote, 0 answers, 211 views

How to compute custom loss for tensorflow using numpy/sklearn based on the predictions

I'm facing an issue combining numpy with tensorflow. For instance, I want to create a custom loss function and use it for the training. Let the loss function be loss = tf.reduce_mean(C * tf.nn.softmax_cross_entropy_with_logits(logits=self.y_logits, labels=self.y10)) where C is some value computed bas...
Ehab AlBadawy
1 vote, 0 answers, 256 views

How to insert Training and Testing data into Tensorflow's high API WITHOUT manually splitting the CSV myself

Everyone, I found it super inefficient to literally create and prepare two CSVs - training and testing data - to then feed them separately into Tensorflow's high API (tf.estimator). Is there a way to make this process more efficient? I want to do something like sklearn's model_selection module whe...
Haebichan Jung
1 vote, 2 answers, 416 views

Overriding tokenizer of scikitlearn vectorizer with spacy

I want to implement lemmatization with Spacy package. Here is my code : regexp = re.compile( '(?u)\\b\\w\\w+\\b' ) en_nlp = spacy.load('en') old_tokenizer = en_nlp.tokenizer en_nlp.tokenizer = lambda string: old_tokenizer.tokens_from_list(regexp.findall(string)) def custom_tokenizer(document): doc_...
Antenna_
1 vote, 1 answer, 280 views

sklearn PCA random_state parameter function

I am using PCA to visualize clusters and noticed that sklearn added a 'random_state' parameter to the PCA method (http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html). My question is: what does the random_state parameter do there? My understanding is that PCA should return the sa...
Max
1 vote, 0 answers, 374 views

Machine learning - PCA and KNN on rgb images is too slow (python)

I work with Python and images of tables (taken from above). My aim is to take a photo of a random table and then find the most similar tables to it in my database. Obviously, the main feature which distinguishes the tables is their shape (square, rectangular, round, oval) but there are also other d...
Poete Maudit
0 votes, 0 answers, 17 views

Cannot import sklearn module in Python

When using the following line of code: from sklearn import datasets I get the error ImportError: No module named sklearn I have tried installing sklearn with: pip3 install -U sklearn and pip3 install scikit-learn But that seems to have no effect. The packages are installed in sklearn /Library/Framew...
Kokodoko
1 vote, 0 answers, 110 views

The comparison of manifold learning example of scikit learn is not working with my dataset

I changed the X and color variables of the example below with my dataset, which has exactly the same shape as the original version. I get the error below: C:\ProgramData\Anaconda3\lib\site-packages\scipy\linalg\decomp_lu.py:71: RuntimeWarning: Diagonal number 1 is exactly zero. Singular matrix. RuntimeWarning...
zied hosni
1 vote, 0 answers, 43 views

feature hashing on image features (ORB) - scikit

I am trying to train an SVM using ORB features for an image. Each ORB point has 32 integer values showing the intensity for the keypoint. The number of keypoints is variable, so I am looking into using feature hashing. img = cv2.imread(img_name,cv2.IMREAD_GRAYSCALE) if img is None: continue gray = c...
Santino
1 vote, 0 answers, 132 views

Sklearn MultinomialNB gives probability 1 for one class for a few examples?

I used MultinomialNB from sklearn for some text data. The data contains 12 classes, and it's a classification task. After applying MultinomialNB with CountVectorizer, I checked the predicted class probabilities for a few examples. For some reason, one class shows a probability of 1.0. [[ 3.91049692e-23 , 2.50074669e-...
Poojan
1 vote, 0 answers, 20 views

Sklearn import issue with cron

I have a Python script that uses the sklearn package. It runs fine when I launch it from the terminal, but when I try to run it in a cron job, it fails. It seems that these imports fail: from sklearn.neural_network import MLPRegressor from sklearn.externals import joblib How can I solve this issue?
Cezembre
1 vote, 0 answers, 265 views

KerasClassifier cannot be used as estimator in VotingClassifier

I am using scikit-learn and Keras for Machine Learning in python. I started to use the KerasClassifier wrapper for wrapping a sequential model of Keras to make it compatible with scikit-learn classifier related functions and API. Most of the functionality is easy to use (I could use the methods fit,...
Quan
1 vote, 2 answers, 44 views

Python: how to train the same model different times?

I have a small dataset and I want to try to predict the value of same variables using the Multi-layer Perceptron regressor from sklearn. This is what I am doing: from sklearn.model_selection import train_test_split from sklearn.preprocessing import StandardScaler from sklearn.neural_network import M...
emax
1 vote, 0 answers, 123 views

Can Sklearn Imputer distinguish real zero values from zeros in place for missing values?

I am using scipy sparse matrices, which fill missing values in with zeros. I want to impute these missing values with the mean of the respective columns, and planned to do this with the Imputer in Sklearn. My question: does the Imputer distinguish between real 0 values in the data from missing value...
Randy
1 vote, 1 answer, 114 views

Compare cross_val_score before and after log transformation

I am playing around with the house prices dataset from Kaggle (link) and xgboost. To improve my model, I want to evaluate whether it makes sense to perform a log transformation on the target variable (sale prices of houses). I am measuring the performance of my model with neg_mean_absolute_error in cr...
busy_c
1 vote, 0 answers, 269 views

Getting names and number of selected features before giving to a classifier in sklearn pipeline

I am using sel = SelectFromModel(ExtraTreesClassifier(10), threshold='mean') to select the most important features in my data set. Then I want to feed these selected features to my keras classifier. But my keras based Neural Network classifier needs the number of important features selected in the fi...
Abdul Karim Khan
1 vote, 0 answers, 168 views

sklearn.linear_model causing ImportError when opening .exe created with PyInstaller

This is my first post to Stack Overflow, though I've been utilizing this site for a good long while now. Excited to be in the community. (I'm fairly new to Python, and especially to PyInstaller) I'm developing a data exploration GUI based on tkinter that relies heavily on pandas and numpy and also ut...
Tariq
1 vote, 1 answer, 622 views

How to find the class distribution for the breast cancer data set - Python

As part of the assignment for the applied machine learning course in Python (assignment1 question 2), I have to find the class distribution of the breast cancer data set (sklearn.datasets). The code I used is given below. The function answer_one converts the data set into a data frame of 569x30 (...
solly bennet
1 vote, 0 answers, 273 views

Dask client.map returns KeyError on dask dataframe

I'm trying to create an updated example of random forest classification using python dask, as originally described here. When I attempt to pass a training set to the Client.map function, it's throwing a KeyError and I'm not sure what I'm doing wrong based on the error message. Here's what I have: fr...
shellcat_zero
1 vote, 0 answers, 39 views

How to deploy scikit-learn classifier model into ANN written in another language

I have a scikit-learn classifier: MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9, beta_2=0.999, early_stopping=False, epsilon=1e-08, hidden_layer_sizes=(13, 13, 13), learning_rate='constant', learning_rate_init=0.001, max_iter=500, momentum=0.9, nesterovs_momentum=True,...
Robert Boyer
1 vote, 0 answers, 408 views

Scikit unknown label type

I've been trying to learn a bit of machine learning by importing some data from an SQL database. However, I encountered the following problem: raise ValueError('Unknown label type: %s' % repr(ys)) ValueError: Unknown label type: (array([-1634.4, -1534.2, -1497.8, ..., 1670.6, 1733.6, 1835. ]),)...
MathiasRa
1 vote, 0 answers, 73 views

Unstable behavior of OneClassSVM by changing 'nu'

In the example above, I'm using my dataset to identify outliers. After making slight changes to the nu parameter, there is a huge difference in the number of anomalies identified. Could this be just a particularity of the dataset? Or a bug in scikit-learn? P.S. Unfortunately I cannot share the datas...
Stergios
1 vote, 0 answers, 162 views

RMSE error doesn't converge towards the same value depending on the train/test ratio

I am trying to find a reliable testing method to compute the error of my model / training parameters, but I am seeing weird results when I play with the train/test ratio. When I change the ratio of my train/test data, the RMSE converges towards different values, see below: You can see the test ratio...
ben
1 vote, 0 answers, 444 views

How to estimate the error in Leave One Out Cross Validation

Is there a proper way to perform LOOCV and estimate the error as the mean over all the training data? I already perform k-fold cross-validation as follows: scores = cross_val_score(clf, X, y, cv = 10, scoring = 'accuracy') print('mean score and the 95% confidence interval of the score estimate') print('A...
Steve Jade
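
For the LOOCV question above, the procedure can be sketched as: hold out one sample, fit on the rest, record whether the held-out prediction is wrong, and average those 0/1 errors over all samples. The plain-Python sketch below uses a 1-nearest-neighbour rule on made-up 1-D data purely for illustration; the helpers and the data are hypothetical, and the asker's `clf` is not reproduced here.

```python
def nearest_neighbour_predict(train_x, train_y, query):
    """Label of the training point closest to query (1-NN on 1-D data)."""
    best = min(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    return train_y[best]


def leave_one_out_errors(xs, ys):
    """Return one 0/1 error per sample, each from a model fit without that sample."""
    errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        prediction = nearest_neighbour_predict(train_x, train_y, xs[i])
        errors.append(int(prediction != ys[i]))
    return errors


# Made-up data: the last point carries a label that disagrees with its neighbour.
xs = [1.0, 1.3, 0.8, 3.0, 3.4, 2.7, 2.6]
ys = [0, 0, 0, 1, 1, 1, 0]
errors = leave_one_out_errors(xs, ys)
loo_error = sum(errors) / len(errors)  # mean error over all held-out samples
```

With scikit-learn, the same loop is usually expressed by passing `cv=LeaveOneOut()` (from `sklearn.model_selection`) to `cross_val_score`; each returned score then comes from a single held-out sample, so only their mean is meaningful.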
1 vote, 0 answers, 399 views

Does scikit-cuda support the newest version of pycuda (9.1) or do I have to revert to 7.5?

I have installed pycuda and scikit cuda via pip install on python 3.6 and have been trying to run the following example from scikit cuda: from __future__ import print_function import pycuda.autoinit import pycuda.driver as drv import pycuda.gpuarray as gpuarray import numpy as np import skcuda.linal...
Ian Campbell Moore
1 vote, 1 answer, 142 views

How to implement 'And' function in perceptron in scikit-learn

I am a newbie to machine learning and scikit-learn. I was trying to implement the 'and' function in scikit-learn and have written a small piece of code as below: import pandas as pd from pandas import Series,DataFrame import numpy as np from sklearn.preprocessing import StandardScaler from sklearn.model_selection imp...
sjrk
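
For the 'and' question above, the logical AND truth table is linearly separable, so the classic perceptron update rule can learn it from the four rows alone. The sketch below is a minimal plain-Python version of that rule, not scikit-learn's `Perceptron` class; the function names, the learning rate, and the epoch limit are assumptions made for illustration.

```python
def train_perceptron(samples, labels, epochs=20, learning_rate=1.0):
    """Fit weights and bias with the perceptron rule; stop once an epoch has no mistakes."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, target in zip(samples, labels):
            activation = sum(w * v for w, v in zip(weights, x)) + bias
            prediction = 1 if activation > 0 else 0
            error = target - prediction
            if error != 0:
                # Move the decision boundary towards the misclassified point.
                weights = [w + learning_rate * error * v for w, v in zip(weights, x)]
                bias += learning_rate * error
                mistakes += 1
        if mistakes == 0:
            break
    return weights, bias


def predict(weights, bias, x):
    """Threshold the weighted sum at zero."""
    return 1 if sum(w * v for w, v in zip(weights, x)) + bias > 0 else 0


# Truth table for logical AND.
inputs = [[0, 0], [0, 1], [1, 0], [1, 1]]
targets = [0, 0, 0, 1]
weights, bias = train_perceptron(inputs, targets)
predictions = [predict(weights, bias, x) for x in inputs]
```

Note that the four rows are used for both training and checking; splitting such a tiny truth table into train and test sets would hide part of the function from the model.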
