Questions tagged [scikit-learn]

1

votes
1

answer
888

Views

Feature scaling using python StandardScaler produces negative values

I am a newbie in Machine learning. I am trying to use feature scaling on my input training and test data using the python StandardScaler class. However, when I see the scaled values some of them are negative values even though the input values do not have negative values. Is this normal or am I miss...
Amit Rastogi
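For the question above, a minimal sketch with made-up values (the array `X` is an assumption, not the asker's data). StandardScaler subtracts each column's mean and divides by its standard deviation, so any value below the mean comes out negative even when every input is positive:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical all-positive feature column.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Each value becomes (x - mean) / std, computed per column.
scaled = StandardScaler().fit_transform(X)

print(scaled.ravel())   # values below the mean (3.0) are negative
print(scaled.mean())    # approximately 0
print(scaled.std())     # approximately 1
```

Negative outputs are therefore expected; they only say that the original value was below the column mean.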
1

votes
3

answer
498

Views

Get features from sklearn feature union

I have a feature union which uses some custom transformers to select text and parts of a dataframe. I would like to understand which features it's using. The pipeline selects and transforms columns and then selects k best. I'm able to pull out the features from k best using the following code: mask...
avocet
0

votes
1

answer
12

Views

How to print sklearn confusion_matrix output from within a function?

confusion_matrix works properly from the command line on my notebook, but I can't seem to make it print its output when found inside a function. It is applied on the same arrays in both cases. Am I missing something?
Helen
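A likely explanation, sketched under the assumption that the function simply calls `confusion_matrix` without printing or returning it: a notebook echoes the value of the last expression in a cell, but not values computed inside a function. The helper name `report` and the label lists are hypothetical:

```python
from sklearn.metrics import confusion_matrix

def report(y_true, y_pred):
    # Inside a function nothing is echoed automatically,
    # so print the matrix explicitly and return it as well.
    cm = confusion_matrix(y_true, y_pred)
    print(cm)
    return cm

cm = report([0, 1, 1, 0], [0, 1, 0, 0])
```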
1

votes
2

answer
290

Views

How to install auto-sklearn on Google Colab?

I'd like to use auto-sklearn. I used the code from this document. All packages are installed, but I got an error like this. !curl https://raw.githubusercontent.com/automl/auto-sklearn/master/requirements.txt | xargs -n 1 -L 1 pip install !pip install auto-sklearn import autosklearn.classification ---...
Nori
1

votes
2

answer
204

Views

“ImportError: DLL load failed: The specified procedure could not be found” while doing digit recognition with a CNN in Python using Keras

I am trying to write a simple character recognition program using a convolutional neural network in Python on Windows. I am following this tutorial, but I am getting the following error message and could not find the reason for it. It would be helpful for me if anyone can breakdow...
Mahin
1

votes
1

answer
39

Views

K-means on 3D matrix

I am currently learning k-means and wanted to try it on 3D matrix, this is the link through which I am passing 2D matrix. from sklearn.cluster import KMeans import numpy as np X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]) kmeans = KMeans(n_clusters=2, random_state=0).fit(X) kmea...
user730119
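KMeans expects a 2D array of shape (n_samples, n_features), so one option is to flatten the leading axes before fitting. This is only a sketch: the random array `X3` and the assumption that the last axis holds the features are illustrative, not taken from the question.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 3D data: a 4 x 5 grid where each cell has 3 features.
X3 = np.random.RandomState(0).rand(4, 5, 3)

# Collapse the first two axes so every cell becomes one sample.
X2 = X3.reshape(-1, X3.shape[-1])          # shape (20, 3)

kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X2)

# Put the cluster labels back onto the original grid.
labels = kmeans.labels_.reshape(X3.shape[:2])
print(labels.shape)
```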
1

votes
1

answer
43

Views

sklearn.model_selection module not found

I am trying to run a linear regression on a dataset, but when I try to import from sklearn.model_selection import train_test_split from sklearn.linear_model import LinearRegression it gives me this error: line 4, in from sklearn.model_selection import train_test_split KeyError: 32 sklearn, numpy and scipy are...
S.Shiro
1

votes
2

answer
667

Views

tf-idf with TfidfVectorizer on Japanese text

I am working with a huge collection of documents written in several languages. I want to compute cosine distance between documents from their tf-idf scores. So far I have: from sklearn.feature_extraction.text import TfidfVectorizer # The documents are located in the same folder as the script text_fi...
Edgar Derby
1

votes
2

answer
2.3k

Views

How to print the order of important features in Random Forest regression using python?

I am trying to create a Random Forest regression model on one of my datasets. I need to find the order of importance of each variable along with their names as well. I have tried a few things but can't achieve what I want. Below is the sample code I tried on Boston Housing dataset: from sklearn.en...
CodeHunter
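One way to pair importances with feature names, sketched on the diabetes dataset as a stand-in (an assumption: the Boston Housing loader mentioned in the question is no longer shipped with recent scikit-learn releases):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

data = load_diabetes()
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(data.data, data.target)

# Indices of the features, most important first.
order = np.argsort(model.feature_importances_)[::-1]

# (name, importance) pairs in descending order of importance.
ranked = [(data.feature_names[i], model.feature_importances_[i]) for i in order]
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```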
0

votes
0

answer
11

Views

Iterating through Pandas groups to create DataFrames

I have a table containing production data on parts and the variables that were recorded during their production. FORMAT: Part | Variable1 | Variable 2 etc _____________________________ 1-----------X---------------X 1-----------X---------------X 2-----------X---------------X 2-----------X-----------...
Andrew Pynch
1

votes
1

answer
4k

Views

Calculating accuracy scores of predicted continuous values

from sklearn.metrics import accuracy_score accuracy_score(y_true, y_pred) I believe this code will return the accuracy of our predictions. However, I am comparing predicted and actual values of continuous values and I believe that most of them are not going to be exactly same. Should I fit the test...
Aditya
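`accuracy_score` counts exact matches, which suits class labels rather than continuous targets. For continuous predictions, regression metrics are the usual choice; a small sketch with made-up numbers (the values below are illustrative, not from the question):

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical continuous targets and predictions.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

mae = mean_absolute_error(y_true, y_pred)   # average absolute error
mse = mean_squared_error(y_true, y_pred)    # average squared error
r2 = r2_score(y_true, y_pred)               # 1.0 would be a perfect fit

print(mae, mse, r2)
```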
1

votes
2

answer
3.9k

Views

How to use inverse_transform in MinMaxScaler for a column in a matrix

I scaled a matrix based on its columns, like this: scaler = MinMaxScaler(feature_range=(-1, 1)) data = np.array([[-1, 2], [-0.5, 6], [0, 10], [1, 18]]) scaler = scaler.fit(data) data_scaled = scaler.transform(data) the data_scaled gave me the following: array([[-1. , -1. ], [-0.5, -0.5], [ 0. , 0...
cyberic
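A possible approach for undoing the scaling on one column only, using the fitted scaler's `min_` and `scale_` attributes. The data and `feature_range` come from the question; the column index `col` is an assumption for illustration. MinMaxScaler computes `X_scaled = X * scale_ + min_`, so the inverse for a single column is:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(-1, 1))
data = np.array([[-1, 2], [-0.5, 6], [0, 10], [1, 18]])
data_scaled = scaler.fit_transform(data)

col = 1  # hypothetical choice: invert only the second column
restored = (data_scaled[:, col] - scaler.min_[col]) / scaler.scale_[col]
print(restored)
```

This avoids having to rebuild a full-width matrix just to call `inverse_transform`.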
1

votes
1

answer
481

Views

Fitting in nested cross-validation with cross_val_score with pipeline and GridSearch

I am working in scikit and I am trying to tune my XGBoost. I made an attempt to use a nested cross-validation using the pipeline for the rescaling of the training folds (to avoid data leakage and overfitting) and in parallel with GridSearchCV for param tuning and cross_val_score to get the roc_auc s...
inatos
1

votes
1

answer
309

Views

What could be the reason for “TypeError: 'StratifiedShuffleSplit' object is not iterable”?

I have to deliver a Machine Learning project, and I received a file called tester.py. After finishing my code in another file, I have to run tester.py to see the results, but I am getting an error: TypeError: 'StratifiedShuffleSplit' object is not iterable I have researched this error in other topics...
Leandro Baruch
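One common cause, assuming tester.py was written for the older `sklearn.cross_validation` API: the `sklearn.model_selection` version of StratifiedShuffleSplit is not itself iterable, and the index pairs come from its `split(X, y)` method. A sketch with toy data (`X` and `y` below are made up):

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical data: 10 samples, two balanced classes.
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 5 + [1] * 5)

sss = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=42)

# Iterate over sss.split(X, y), not over sss itself.
folds = list(sss.split(X, y))
for train_idx, test_idx in folds:
    print(train_idx, test_idx)
```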
1

votes
1

answer
126

Views

Machine learning algorithm score changes without any change in data or step

I am new to Machine learning and getting started with Titanic problem on Kaggle. I have written a simple algorithm to predict the result on test data. My question/confusion is, every time, I execute the algorithm with the same dataset and the same steps, the score value changes (last statement in th...
YoungHobbit
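Run-to-run changes like this often come from unseeded randomness in the train/test split or in the estimator. A sketch on synthetic data (not the Titanic set; the helper `run` and the choice of RandomForestClassifier are assumptions) showing that fixing `random_state` in both places makes the score repeatable:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a fixed seed.
X, y = make_classification(n_samples=200, random_state=0)

def run(seed):
    # Seed both the split and the model so the whole run is deterministic.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=20, random_state=seed)
    return model.fit(X_tr, y_tr).score(X_te, y_te)

first, second = run(42), run(42)
print(first, second)
```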
1

votes
2

answer
50

Views

Merge distance matrix results and original indices with Python Pandas

I have a pandas df with a list of bus stops and their geolocations:
   stop_id   stop_lat   stop_lon
0        1  32.183939  34.917812
1        2  31.870034  34.819541
2        3  31.984553  34.782828
3        4  31.888550  34.790904
4        6  31.956576  34.898125
stop_id isn't nece...
Shakedk
1

votes
2

answer
41

Views

Scikit learn order of coefficients for multiple linear regression and polynomial features

I'm fitting a simple polynomial regression model, and I want to get the coefficients from the fitted model. Given the prep code: import pandas as pd from itertools import product from sklearn.linear_model import LinearRegression from sklearn.preprocessing import PolynomialFeatures from sklearn.pipeline...
rovyko
1

votes
2

answer
46

Views

How to perform SMOTE with cross validation in sklearn in python

I have a highly imbalanced dataset and would like to perform SMOTE to balance the dataset and perform cross validation to measure the accuracy. However, most of the existing tutorials use only a single training and testing iteration to perform SMOTE. Therefore, I would like to know the correct...
Emi
1

votes
2

answer
27

Views

One-class-only folds tested through GridSearchCV

When using GridSearchCV on a custom estimator that is a wrapper on SVC, I get the error: 'ValueError: The number of classes has to be greater than one; got 1 class' The custom estimator is made to add gridsearch parameters to the estimator and seemed to work fine. Using the debugger, I found that i...
Abel Adary
1

votes
0

answer
13

Views

Scikit-learn transformer pipeline produces different results than running individually

When I tried using the pipeline to combine a couple of transformers, the second transformer (log) appears not to be applied. I have tried to simplify the log transformer to perform simple addition but the same problem persists. import pandas as pd import numpy as np from sklearn.pipeline import Pipeline f...
user11392601
1

votes
0

answer
406

Views

Feature selection for a large dataset in Python

I have a Document-term matrix of dimension 3144469 x 268496 for which I need to do feature selection. I tried doing it with scikit-learn's feature selection using the code fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=40) documenttermmatrix_train= fs.fit_transform(documentter...
Ranjana Girish
1

votes
1

answer
118

Views

MiniBatchSparsePCA on Text Data

Goal I'm trying to replicate an application described in this paper (section 4.1), where Sparse Principal Component Analysis is applied to a text corpus with the output being K principal components, each displaying a 'structure that is otherwise hidden'. In other words, the principal components shou...
SeánMcK
1

votes
0

answer
532

Views

Stock price prediction, choosing amount of time in the future using scikit learn

I'm trying to use machine learning to predict stock prices. I'm having issues choosing how far out to predict; I want to be able to predict 100-200 days into the future. It seems like my code is cutting off the last 200 days and adding its prediction there, instead of adding an additional 200 day...
phillyphil
1

votes
0

answer
176

Views

What is the expected return type for a tokenizer that is passed as a parameter into TfidfVectorizer

I am looking at: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html It just says: tokenizer : callable or None (default) Override the string tokenization step while preserving the preprocessing and n-grams generation steps. Only applies if analyzer...
Kaizer Sozay
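In practice the callable takes one document string and returns a list (or other iterable) of token strings. A minimal sketch, where `whitespace_tokenizer` and the two documents are hypothetical:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def whitespace_tokenizer(text):
    # Receives one (already preprocessed) document string,
    # returns a list of token strings.
    return text.split()

docs = ["red apple", "green apple"]
vec = TfidfVectorizer(tokenizer=whitespace_tokenizer)
X = vec.fit_transform(docs)

print(sorted(vec.vocabulary_))
```

Recent scikit-learn versions may emit a warning that `token_pattern` is ignored when a tokenizer is supplied; that is expected here.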
1

votes
0

answer
88

Views

Is there a way to install a specific submodule of a package?

I'm working on a serverless project using sklearn.neural_network.MLPClassifier using AWS Lambda. AWS requires that all dependencies get uploaded with the project during deploy, is there a way to install only the files needed to use a specific classifier so I can save some bandwidth?
Ramon Balthazar
1

votes
1

answer
263

Views

Zero-padding a compressed sparse matrix (for NLP)?

I am using a recurrent neural network to classify text sentiment. I used TfidfVectorizer to convert the text into counts. My code is as follows: vectorizer = TfidfVectorizer(max_features = 5000) vectorizer.fit(X_train) Xtrain = vectorizer.fit_transform(X_train) Xtest = vectorizer.fit_transform(X...
anticavity123
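Separately from the padding question, calling `fit_transform` on the test set relearns the vocabulary, so train and test columns no longer line up. A sketch of the usual pattern, with made-up documents standing in for `X_train` and `X_test`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-ins for the asker's X_train / X_test.
train_docs = ["good movie", "bad movie", "great plot"]
test_docs = ["good plot", "terrible acting"]

vectorizer = TfidfVectorizer(max_features=5000)

# Learn the vocabulary and idf weights from the training data only...
Xtrain = vectorizer.fit_transform(train_docs)
# ...and reuse them for the test data, so the columns line up.
Xtest = vectorizer.transform(test_docs)

print(Xtrain.shape, Xtest.shape)
```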
1

votes
0

answer
514

Views

Is passing sklearn tfidf matrix to train MultinomialNB model proper?

I'm doing some text classification tasks. What I have observed is that when fed a tf-idf matrix (from sklearn's TfidfVectorizer), the Logistic Regression model always outperforms the MultinomialNB model. Below is my code for training both: X = df_new['text_content'] y = df_new['label'] X_train, X_test, y_train,...
ZEE
1

votes
0

answer
475

Views

How to access hyperparameters in case of nested cross-validation using scikit-learn

The code for hyperparameter tuning using scikit-learn looks like this: gs = GridSearchCV(estimator=pipe_svc, param_grid=param_grid, scoring='accuracy', cv=10, n_jobs=-1) gs = gs.fit(X_train, y_train) clf = gs.best_estimator_ clf.fit(X_train, y_train) where for each combination of hyperparameters K-f...
Royalblue
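One possible way to see which hyperparameters each outer fold selected is `cross_validate` with `return_estimator=True`, which keeps the fitted GridSearchCV object of every outer fold. The sketch below uses the iris data and a small SVC grid as stand-ins for the asker's `pipe_svc` and `param_grid`:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Inner loop: hyperparameter search (hypothetical grid).
gs = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]},
                  scoring="accuracy", cv=3)

# Outer loop: keep the fitted search object from every outer fold.
results = cross_validate(gs, X, y, cv=5, return_estimator=True)

best_per_fold = [est.best_params_ for est in results["estimator"]]
print(best_per_fold)
print(results["test_score"])
```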
1

votes
2

answer
361

Views

How to make polynomial features using sparse matrix in Scikit-learn

I am using Scikit-learn for converting my train data to polynomials features and then fit it to a linear model. model = Pipeline([('poly', PolynomialFeatures(degree=3)), ('linear', LinearRegression(fit_intercept=False))]) model.fit(X, y) But it throws an error TypeError: A sparse matrix was passed,...
Niyamat Ullah
1

votes
0

answer
329

Views

Inferior performance of Tensorflow compared to sklearn

I'm comparing the performance of Tensorflow with sklearn on two datasets: (1) a toy dataset in sklearn, (2) the MNIST dataset. Here is my code (Python): from __future__ import print_function # Import MNIST data from tensorflow.examples.tutorials.mnist import input_data mnist = input_data.read_data_sets('/tmp/dat...
lenhhoxung
1

votes
1

answer
71

Views

How to create an easy to explain regression model in Python with categorical features?

I have a dataset that looks like this, where each row is a user.
gender  age_group  c1  c2     c3    total_cost
F       0-10       10  F1234  3456  135.2
F       65-100     10  G5143  876   523.6
M       18-35      15  F3457  876   98.5
F       0-10       10  F1234  545   1052.1
M       35-65      2...
sfactor
1

votes
1

answer
674

Views

Grouping data by sklearn.model_selection.GroupShuffleSplit

I have a dataset in a CSV with header as PRODUCT_ID CATEGORY_NAME PRODUCT_TYPE DISPLAY_COLOR_NAME IMAGE_ID with same product having multiple rows each with different image_id. I made Image Id as index col when reading CSV into pandas data frame. I want to create test and...
Aravind Chamakura
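A sketch of keeping all rows of a product on the same side of the split by passing the product ID as `groups`. The IDs and the feature array below are made up for illustration:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical product IDs: two image rows per product.
groups = np.array(["p1", "p1", "p2", "p2", "p3", "p3", "p4", "p4"])
X = np.arange(len(groups)).reshape(-1, 1)

gss = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(gss.split(X, groups=groups))

# No product appears in both the train and the test indices.
print(set(groups[train_idx]), set(groups[test_idx]))
```

Note that `test_size` here is a fraction of the groups, not of the rows.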
1

votes
0

answer
359

Views

Area of a cluster with DBSCAN

I've an array of X and Y coordinates of some points spread in a field. I want to know if these points form clusters. This coordinate array is a numpy array, where the first column is the X and the second column the Y. Here is an example: >>> ch1_data[0:5] [[ 11743. 17707.7] [ 15850.9 16474. ] [...
Rg111
1

votes
0

answer
83

Views

Why MultinomialNB outputs 0.5 when there is only one feature?

A naive Bayesian classification problem Code import numpy as np X = np.array([[1], [2], [3], [4], [5], [6]], dtype=int) Y = np.array([1, 1, 1, 0, 0, 0], dtype=int) from sklearn.naive_bayes import MultinomialNB clf = MultinomialNB() clf.partial_fit(X, Y, [0, 1]) print clf.predict_proba(X) Output [[ 0...
hijkzzz
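A sketch of why this happens, using `fit` rather than the `partial_fit` call from the question. MultinomialNB models how counts are distributed across features; with a single feature, that feature's smoothed probability is (N + alpha) / (N + alpha * 1) = 1 for every class, so the likelihood carries no information and the prediction falls back to the class prior, which is 0.5 here because the classes are balanced:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

X = np.array([[1], [2], [3], [4], [5], [6]], dtype=int)
Y = np.array([1, 1, 1, 0, 0, 0], dtype=int)

clf = MultinomialNB().fit(X, Y)

# log P(feature | class) is log(1) = 0 for both classes.
print(clf.feature_log_prob_)
# Posterior equals the prior for every sample.
proba = clf.predict_proba(X)
print(proba)
```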
1

votes
0

answer
154

Views

Role of coefficient in Multinomial NB

My MultinomialNB classifier was instantiated and trained on vectorized fake/real news articles, and now I'm trying to understand the meaning behind the coefficients. nb_classifier = MultinomialNB() # Extracting the class labels: ('Fake' or 'Real') class_labels = nb_classifier.classes_ # Extract the...
Adam Schroeder
1

votes
0

answer
211

Views

How to compute custom loss for tensorflow using numpy/sklearn based on the predictions

I'm facing an issue combining numpy with tensorflow. For instance, I want to create a custom loss function and use it for the training. Let the loss function be loss = tf.reduce_mean(C * tf.nn.softmax_cross_entropy_with_logits(logits=self.y_logits, labels=self.y10)) where C is some value computed bas...
Ehab AlBadawy
1

votes
0

answer
256

Views

How to insert Training and Testing data into Tensorflow's high API WITHOUT manually splitting the CSV myself

Everyone, I found it super inefficient to literally create and prepare two CSV's - training and testing data - to then feed them separately into Tensorflow's high API (tf.estimator). Is there a way to make this process more efficient? I want to do something like sklearn's model_selection module whe...
Haebichan Jung
1

votes
2

answer
416

Views

Overriding tokenizer of scikit-learn vectorizer with spacy

I want to implement lemmatization with Spacy package. Here is my code : regexp = re.compile( '(?u)\\b\\w\\w+\\b' ) en_nlp = spacy.load('en') old_tokenizer = en_nlp.tokenizer en_nlp.tokenizer = lambda string: old_tokenizer.tokens_from_list(regexp.findall(string)) def custom_tokenizer(document): doc_...
Antenna_
1

votes
1

answer
280

Views

sklearn PCA random_state parameter function

I am using PCA to visualize clusters and noticed Sklearn added 'random_state' parameter to the PCA method (http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html), my question is what does the random_state parameter do there? My understanding is that PCA should return the sa...
Max
1

votes
0

answer
374

Views

Machine learning - PCA and KNN on rgb images is too slow (python)

I work with python and images of tables (taken from above). My aim is to take a photo of a random table and then find the most similar tables to it in my database. Obviously, the main feature which distinguishes the tables is their shape (square, rectangular, round, oval) but there are also other d...
Poete Maudit
