Questions tagged [data-science]

1

votes
1

answer
25

Views

How can I get similar distribution from different groups?

I have to find subgroups in the dataset whose averages for 2 metrics are similar to those of my original group. For example, I'd like to find a city or group of cities whose averages are closest to average(metric 1) = 10 and average(metric 2) = 5. Dataset example: How can I do it?
gabriel.almeida
0

votes
0

answer
6

Views

Understanding CatBoost multi class accuracy score

I'm new to CatBoost and trying it out on a project. I'm doing a multiclass classification with classes ranging from 1 to 10. Here are my parameters for training: model = CatBoostClassifier( custom_loss=['Accuracy'], random_seed=42, logging_level='Silent', loss_function='MultiClass' ) model.fit( df_train_featur...
0

votes
0

answer
4

Views

Test Train Split in a small Dataset

I am performing sentiment analysis on a relatively small dataset of roughly 2K observations. What would be the ideal train/test split for training my model? And for such a scenario, is TF-IDF or a count vectorizer the better choice?
Saud
0

votes
3

answer
33

Views

What is the right way to use LIKE in SQL

I am working on the "Not Boring Movies" problem on LeetCode. The problem is described as follows: "X city opened a new cinema, many people would like to go to this cinema. The cinema also gives out a poster indicating the movies’ ratings and descriptions. Please write a SQL query to output movies with...
pipi
0

votes
0

answer
5

Views

statsmodels API in Python giving a p-value of 0 for all features

I am doing a logistic regression on US census income data. Most of the columns hold categorical values, so I created dummy variables for them. But when fitting the model using the statsmodels API, I get a p-value of 0 for all the features. This suggests that all the features are significant. B...
Sarthak Sahu
0

votes
1

answer
28

Views

datetime combine date & time stamp

I'm trying to use this SO post to combine a date and time stamp but not having any luck. #df= pd.read_csv('C:\\Users\\desktop\\master.csv', index_col='Date', parse_dates=True) df= pd.read_csv('C:\\Users\\desktop\\master.csv') This is where I'm stuck, I don't know how to import the package correctly.....
HenryHub
1

votes
0

answer
5

Views

LightGBM - sklearnAPI vs training and data structure API and lgb.cv vs gridsearchcv/randomisedsearchcv

What are the differences between the sklearn API (LGBMModel, LGBMClassifier etc.) and the default API (lgb.Dataset, lgb.cv, lgb.train) of LightGBM? Which one should I prefer using? Is it better to use lgb.cv or sklearn's GridSearchCV/RandomizedSearchCV when using LightGBM?
Sift
-1

votes
2

answer
101

Views

How to load a 3D array with mixed data types into Tensorflow for training?

I am working with a dataset that looks something like this- PERSON1 = [["Person1Id", "Rome", "Frequent Flyer", "1/2/2018"],["Person1Id", "London", "Frequent Flyer", "3/4/2018"],["Person1Id", "Paris", "Frequent Flyer", "2/4/2018"], ...] PERSON2 = [["Person2Id", "Shenzen", "Frequent Flyer", "1/2/2018"...
unicornication
1

votes
1

answer
47

Views

I would like help optimising a triple for loop with an if in statement

I am doing a triple for loop on a dataframe with almost 70 thousand entries. How do I optimize it? My ultimate goal is to create a new column that has the country of a seismic event. I have a latitude, longitude and 'place' (ex: '17km N of North Nenana, Alaska') column. I tried to reverse geocode, b...
laythstag
1

votes
2

answer
679

Views

Subsample size in scikit-learn RandomForestClassifier

How is it possible to control the size of the subsample used for the training of each tree in the forest? According to the documentation of scikit-learn: A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and use averaging to imp...
user6903745
1

votes
3

answer
151

Views

What are training and test data sets

I am getting started on Kaggle. I have just gone through various data science and machine learning competitions, and I have seen that for every competition they have uploaded training data, test data and original data. Can someone explain to me what those are and how we use those datasets while solving a...
Abhishek Sharma
1

votes
1

answer
38

Views

Disconnect points to plot overlay in Vega-lite / Vega

An example in vega-editor here I don’t want dateTime 5 & dateTime 7 to be connected since they are not consecutive. Idea is to plot on overlay based on some condition and connect only when the count is >=5. Has anyone tried this already?
false9striker
1

votes
1

answer
34

Views

Explanation for R code used to delete column

Can anyone tell me the piece-by-piece meaning of the following code used to conditionally delete a column of a data frame? df2=df[,!names(df)%in%c("column")] Conditions: column is the column I want to delete from the dataframe df. df2 is the new dataframe.
0

votes
0

answer
6

Views

Trouble conceptualising solution to tough pandas/data problem

Let's say I have a pandas dataframe with two columns x and y, which, when visualised with a scatter plot, naturally form a main cluster like this: https://www.dropbox.com/s/ko6pvai7dkh35gs/first.png I would like to be able to detect the area that makes up the immediate surrounding of the cluster m...
rortest
1

votes
0

answer
16

Views

Train/test set for LSTM on multivariate time series with varying length samples

I have a time series dataset with a 30-second time-step and 20 features. Each observation/sample has a length between 188 and 200 time-steps. I have just over 2000 samples collected from the past three years. I want to implement an LSTM to make a prediction at time t=1. I.e first 30 sec for all (20) fe...
MatN
0

votes
0

answer
8

Views

Keras Load_module : Unknown Layer:Name

The following is my model that I am using when training model = Sequential() model.add(Conv2D(64, kernel_size=3, use_bias=False, input_shape=(size, size, 1))) model.add(BatchNormalization()) model.add(Activation("relu")) model.add(Conv2D(32, kernel_size=3, use_bias=False)) model.add(BatchNormalizat...
BearsBeetBattlestar
1

votes
2

answer
665

Views

find the “elbow point” on an optimization curve with Python

I have a list of points which are the inertia values of a k-means algorithm. To determine the optimum number of clusters I need to find the point where this curve starts to flatten. Data example Here is how my list of values is created and filled: sum_squared_dist = [] K = range(1,50) for k in K:...
ItFreak
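
One common heuristic for this kind of question, sketched here as an illustration rather than as the accepted answer: draw a straight line from the first to the last inertia value and pick the point that lies farthest from that line. The function name `elbow_index` and the sample inertia values are hypothetical, and the result depends on how the two axes are scaled.

```python
from math import hypot


def elbow_index(values):
    """Return the index of the point farthest from the first-to-last chord."""
    if len(values) < 3:
        raise ValueError("need at least three points to locate an elbow")
    x1, y1 = 0, values[0]
    x2, y2 = len(values) - 1, values[-1]
    chord_length = hypot(x2 - x1, y2 - y1)
    best_index, best_distance = 0, -1.0
    for x, y in enumerate(values):
        # Perpendicular distance from the point (x, y) to the chord.
        distance = abs((y2 - y1) * x - (x2 - x1) * y + x2 * y1 - y2 * x1) / chord_length
        if distance > best_distance:
            best_index, best_distance = x, distance
    return best_index


# Hypothetical inertia values for k = 1..6 (made up for illustration).
inertia = [100.0, 50.0, 20.0, 18.0, 17.0, 16.0]
elbow_k = elbow_index(inertia) + 1  # +1 because k starts at 1, not 0
print(elbow_k)
```

With the question's own list, `elbow_index(sum_squared_dist) + 1` would give the candidate k, since K starts at 1.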
1

votes
0

answer
11

Views

How to snip part of a rows' data and only leave the first 3 digits in Python.

0 546/001441 1 540/001495 2 544/000796 3 544/000797 4 544/000798 I have a column in my dataframe that I've provided above. It can have any number of rows depending on the data being crunched. It is one of many columns, but the first three numbers match another column's data. I need to c...
J_Millar
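
A minimal sketch of the trimming step, using the five sample values shown in the question. It assumes every value has the form `NNN/NNNNNN`; the helper name `leading_code` is hypothetical.

```python
# Sample values copied from the question excerpt.
values = ["546/001441", "540/001495", "544/000796", "544/000797", "544/000798"]


def leading_code(value):
    """Keep only the part before the first slash."""
    return value.split("/", 1)[0]


codes = [leading_code(v) for v in values]
print(codes)  # ['546', '540', '544', '544', '544']
```

On a pandas column of strings, the same idea is usually written with the `.str` accessor, for example `df[col].str[:3]` when the prefix is always exactly three characters (`col` stands in for the real column name).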
7

votes
1

answer
1.8k

Views

How to explain the outcome of k-means clustering?

I am currently conducting some analysis using the NTSB aviation accident database. There are cause statements for most of the aviation incidents in this dataset that describe the factors that led to such events. One of my objectives here is to try to group the causes, and clustering seems to be a feasible w...
mightyheptagon
0

votes
0

answer
2

Views

Data abstraction from unstructured PDF analysis: what tools and libraries should I go through?

I want to abstract data from unstructured invoice PDFs and need a good machine learning analysis tool for it; even a good direction to learn would be helpful.
Jalaj Chandnani
2

votes
1

answer
23

Views

Removing parentheses and everything in them with Regex

Having a bit of trouble with some code I'm working through. Basically, I have transcripts (txt files) for a few Japanese anime, of which I want to remove everything but the spoken lines (Japanese sentences) in order to do some NLP experiments. I've managed to accomplish a good bit of cleaning, but w...
capsulemage
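
A small sketch of the regex idea, under two assumptions that are not stated in the excerpt: parentheses are not nested, and the transcripts may use full-width parentheses (`（` and `）`) as well as ASCII ones. The names `PAREN_SPAN` and `strip_parentheticals` are hypothetical.

```python
import re

# One parenthesised span, ASCII "(...)" or full-width "（...）",
# containing no further parentheses (nested spans are not handled).
PAREN_SPAN = re.compile(r"[(（][^()（）]*[)）]")


def strip_parentheticals(line):
    """Remove every parenthesised span, including the parentheses themselves."""
    return PAREN_SPAN.sub("", line)


print(strip_parentheticals("こんにちは（笑）元気？"))  # こんにちは元気？
```

Removing a span from the middle of a space-separated sentence leaves two adjacent spaces behind, so a follow-up whitespace cleanup may be needed for non-Japanese text.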
0

votes
0

answer
25

Views

extract data from multiple urls stored in a column of dataframe

I want to extract data from multiple URLs, but the URLs are in a column of a data frame. I tried data extraction with the code below but no luck. from urllib.request import urlopen,Request link = data.column1 f = urlopen(link) myfile = f.read() print(myfile) It shows: AttributeError: 'Series' object...
Harpreet Singh
1

votes
2

answer
39

Views

Best algorithm for finding most occurring combinations of values in a dataset

---------------------------------------- ColumnA | ColumnB | ColumnC | ---------------------------------------- Cat | Shirt | Pencil | Dog | Shirt | Eraser | Worm | Dress | Pen | Cow | Shirt | Pen | Cat | Shirt | Pen |...
NeverPhased
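
For a table this small, a brute-force count is one possible starting point: turn each row into (column, value) pairs and count every combination of a given size. This is only a sketch using the five rows visible in the excerpt; the function name `combination_counts` is hypothetical, and for large datasets a frequent-itemset algorithm such as Apriori is the usual alternative.

```python
from collections import Counter
from itertools import combinations

columns = ("ColumnA", "ColumnB", "ColumnC")
# The rows visible in the question excerpt.
rows = [
    ("Cat", "Shirt", "Pencil"),
    ("Dog", "Shirt", "Eraser"),
    ("Worm", "Dress", "Pen"),
    ("Cow", "Shirt", "Pen"),
    ("Cat", "Shirt", "Pen"),
]


def combination_counts(rows, columns, size=2):
    """Count how often each combination of `size` (column, value) pairs occurs."""
    counts = Counter()
    for row in rows:
        items = tuple(zip(columns, row))
        counts.update(combinations(items, size))
    return counts


pair_counts = combination_counts(rows, columns)
for combo, count in pair_counts.most_common(3):
    print(count, combo)
```

On these rows, the pairs (ColumnA=Cat, ColumnB=Shirt) and (ColumnB=Shirt, ColumnC=Pen) each occur twice and every other pair occurs once.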
1

votes
3

answer
227

Views

How can I use regex as a delimiter when importing a csv file with pandas with extra commas?

The csv file was sent to me/ I can not re delimit the columns 239845723,28374,2384234,AEVNE EFU 5 GN OR WNV,Owinv Vnwo Badvw 5 VIN,Ginq 2 jnwve wef evera wve 6 vwe as fgsb bfd bdfwd dsf (sdv seves 4-6), sebsbe sve(sevsev esvse 7-10) fsesef fesevsesv PaVvin (1 evesve vEV VEWee, 2 for WVEee VEWE. pape...
Adi Srinivasan
2

votes
2

answer
56

Views

Scoring in Gridsearch CV

I just started with GridSearchCV in Python, but I am confused about what scoring means here. Somewhere I have seen scorers = { 'precision_score': make_scorer(precision_score), 'recall_score': make_scorer(recall_score), 'accuracy_score': make_scorer(accuracy_score) } grid_search = GridSearchCV(clf, param_...
KMittal
3

votes
3

answer
40

Views

Numpy obtain dtype per column

I need to obtain the type for each column to properly preprocess it. Currently I do this via the following method: import pandas as pd # input is of type List[List[any]] # but has one type (int, float, str, bool) per column df = pd.DataFrame(input, columns=key_labels) column_types = dict(df.dtypes)...
Daan Luttik
2

votes
1

answer
147

Views

Pandas: How to analyse data with start and end timestamp?

I have to analyze the activity of users who use an application during a given period; periods are given by a start and an end timestamp. I tried with a bar chart but I do not know how to include the hours in the interval. Ex: the user with uid=2 uses the application at [18, 19, 20, 21] My dataframe is like: uid s...
Adil Blanco
5

votes
3

answer
3.8k

Views

InvalidArgumentError: Expected dimension in the range [-1, 1) but got 1

I'm not sure what this error means. This error occurs when I try to calculate acc: acc = accuracy.eval(feed_dict = {x: batch_images, y: batch_labels, keep_prob: 1.0}) I've tried looking up solutions, but I couldn't find any online. Any ideas on what's causing my error? Here's a link to my full cod...
mdlee6
3

votes
2

answer
35

Views

Is there a standard approach to counting repetitions in an oscillating signal?

I am collecting sensor data from a repetitive physical process (think an elevator moving up and down). This is an example of what the signal looks like. The y-axis reflects our equivalent of 'height' and the x-axis is just time. Perhaps not surprising, this particular image reflects 5 repetitions...
migsvult
2

votes
1

answer
24

Views

How to select columns with dplyr/tidyverse depending on minimal value in column in R

I have a data set of certain counts of land cover pixels per point. species_distr 50%) / natural vegetation (tree, shrub, herbaceous cover) (50%) / cropland (15%)` = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_ ), `Tree cover, broadleaved, deciduous, closed to open (>...
m4D_guY
2

votes
0

answer
30

Views

How to cluster points based on the function they belong in Python?

Sorry if the title is ambiguous. Let me explain the problem. By the way, I'm really new to data science, so sorry if I make a statement that doesn't make sense. I recently came across a problem which was related to clustering. The coordinates were given for a lot of points. The task was to cluster...
Mensur Qulami
-2

votes
0

answer
12

Views

image classification using python

I am trying to use Python and data science concepts, more specifically image classification, to identify liquid products. For example, if there are two liquid products, milk and beer, it should identify the product as milk or beer when I pass the image value to my program. Unfo...
jamesorc
32

votes
2

answer
11.6k

Views

What does the standard Keras model output mean? What is epoch and loss in Keras?

I have just built my first model using Keras and this is the output. It looks like the standard output you get after building any Keras artificial neural network. Even after looking in the documentation, I do not fully understand what the epoch is and what the loss is which is printed in the output....
pr338
0

votes
2

answer
29

Views

Pandas DataFrame replace does not work with inplace=True

In a column of my data frame I have version numbers like 6.3.5, 1.8, 5.10.0 saved as objects and thus likely as strings. I want to remove the dots so I get 635, 18, 5100. My code idea was this: for row in dataset.ver: row.replace(".","",inplace=True) The thing is it works if I don't s...
A. Kazakov
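
A short sketch of why the loop in the question cannot work as written: each `row` is a plain Python string, `str.replace` accepts no `inplace` keyword, and strings are immutable, so the result has to be collected or assigned somewhere. The sample values are the three versions mentioned in the question.

```python
versions = ["6.3.5", "1.8", "5.10.0"]

# str.replace returns a new string and leaves the original untouched,
# so the cleaned values must be stored explicitly.
cleaned = [version.replace(".", "") for version in versions]
print(cleaned)  # ['635', '18', '5100']
```

On the pandas side, the usual equivalent is to assign the result back to the column, for example `dataset["ver"] = dataset["ver"].str.replace(".", "", regex=False)`; passing `regex=False` keeps the dot from being treated as a regex wildcard.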
0

votes
0

answer
3

Views

scikit learn train_test_split function not working as expected

I am using the train_test_split function to separate data for training and testing, but the function assigns the wrong label to the separated train and test data. Instead of assigning the label from the expected row, it assigns the label from the 2nd row after the expected row. Please let me know where I am going wrong. data = pd.read_c...
Kamble Tanaji
2

votes
2

answer
186

Views

How do I load data from a StreamingBody object using Insert to Code to pandas in Watson Studio?

The Insert to Code feature enables you to access data stored in Cloud Object Storage when working in Jupyter notebooks in Watson Studio. Some file types (e.g. txt files) will have just StreamingBody and Credentials as insert to code options: How can I use the StreamingBody object to access my data?
Joe Plumb
2

votes
1

answer
93

Views

How to format date to 1900's?

I'm preprocessing data and one column represents dates such as '6/1/51' I'm trying to convert the string to a date object and so far what I have is: date = row[2].strip() format = "%m/%d/%y" datetime_object = datetime.strptime(date, format) date_object = datetime_object.date() print(date_object) pri...
elbertkim
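
The behaviour behind this question is that `%y` in Python's `strptime` maps two-digit years 00 to 68 onto 2000 to 2068, so '6/1/51' parses as 2051. A minimal sketch of one workaround, assuming every date in the column belongs to the 1900s (an assumption, not something the excerpt confirms); the helper name `parse_1900s_date` is hypothetical.

```python
from datetime import date, datetime


def parse_1900s_date(text):
    """Parse 'M/D/YY' and place the two-digit year in the 1900s."""
    month, day, two_digit_year = (int(part) for part in text.strip().split("/"))
    return date(1900 + two_digit_year, month, day)


print(datetime.strptime("6/1/51", "%m/%d/%y").date())  # 2051-06-01
print(parse_1900s_date("6/1/51"))                      # 1951-06-01
```

If the data mixes centuries, a cutoff year would be needed instead of a fixed 1900 offset.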
3

votes
1

answer
2.1k

Views

Difference between Machine Learning and explicit programming [closed]

I'm a newbie to the data science field, so I'm trying to understand its basics step by step. Among its most important fields, we find machine learning. I found this definition : "the machine learning is the field of study to give the ability to a Machine to learn without being explicitly programmed."...
sarah
2

votes
2

answer
31

Views

How to remove rows in a dataframe with more than 7 Null values?

I am trying to remove the rows in the data frame with more than 7 null values. Please suggest something that is efficient to achieve this.
tia
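
A plain-Python sketch of the filter, with rows as lists and `None` standing in for a null value. The helper name `drop_sparse_rows` and the sample rows are hypothetical.

```python
def drop_sparse_rows(rows, max_missing=7):
    """Keep only the rows that have at most `max_missing` None values."""
    return [row for row in rows if sum(value is None for value in row) <= max_missing]


# Hypothetical rows with 0, 7 and 8 missing values respectively.
rows = [
    [1] * 10,
    [None] * 7 + [1, 2, 3],
    [None] * 8 + [1, 2],
]
kept = drop_sparse_rows(rows)
print(len(kept))  # 2
```

In pandas the same condition is typically a single expression: `df[df.isnull().sum(axis=1) <= 7]`, or `df.dropna(thresh=len(df.columns) - 7)`, which keeps rows that have at least that many non-null values.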
4

votes
2

answer
3.1k

Views

Summary statistics on Large csv file using python pandas

Let's say I have a 10 GB csv file and I want to get the summary statistics of the file using the DataFrame describe method. In this case I first need to create a DataFrame for all of the 10 GB of csv data. text_csv=Pandas.read_csv("target.csv") df=Pandas.DataFrame(text_csv) df.describe() Does this mean all the...
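
One way to avoid holding the whole file in memory is to stream it and keep running statistics. The sketch below uses the standard `csv` module and Welford's online update for mean and variance; the function name `streaming_describe` and the column name `value` are hypothetical, and it covers only part of what `describe` reports (the quartiles cannot be computed exactly in a single streaming pass like this).

```python
import csv
import io
import math


def streaming_describe(lines, column):
    """Compute count, mean, sample std, min and max for one numeric column."""
    count, mean, m2 = 0, 0.0, 0.0
    minimum, maximum = math.inf, -math.inf
    for record in csv.DictReader(lines):
        raw = record[column]
        if raw == "":
            continue  # skip missing values
        x = float(raw)
        count += 1
        delta = x - mean
        mean += delta / count          # Welford's running mean
        m2 += delta * (x - mean)       # running sum of squared deviations
        minimum = min(minimum, x)
        maximum = max(maximum, x)
    std = math.sqrt(m2 / (count - 1)) if count > 1 else float("nan")
    return {"count": count, "mean": mean, "std": std, "min": minimum, "max": maximum}


# Tiny in-memory stand-in for a large file.
sample = io.StringIO("value\n1\n2\n3\n4\n")
stats = streaming_describe(sample, "value")
print(stats)
```

For a real file, pass an open file handle instead of the `StringIO` object, for example `with open("target.csv", newline="") as handle: streaming_describe(handle, "value")`. Within pandas, `read_csv` also accepts a `chunksize` argument that yields the file piece by piece.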
