Questions tagged [data-analysis]

0

votes
0

answer
6

Views

What are some dynamic knowledge-based authentication questions that could be asked to a customer for fraud prevention?

I'm looking to frame some dynamic knowledge-based authentication questions based on a customer's data to prevent fraud. For example, based on the customer's call records, I could ask them to identify the number they call most frequently as an authentication step. Or I could ask them to identify the clos...
Surya Murali
0

votes
5

answer
35

Views

I have data in one column, how to extract that?

I have data in one column; how can I extract it? For example: {"name":"circle","cx":371,"cy":2921,"r":73} I want output as follows: name cx cy r circle 371 2921 73 Note: all 4 values are in the same column, shape_attributes.
Shikha Mishra
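For the question above about a column holding JSON text, a minimal sketch of one possible approach is to parse each cell and build a new frame from the resulting dicts. The one-row frame below is an assumption made for illustration; only the column name shape_attributes and the sample value come from the question.

```python
import json

import pandas as pd

# Hypothetical frame: one column holding JSON strings, as in the question.
df = pd.DataFrame(
    {"shape_attributes": ['{"name":"circle","cx":371,"cy":2921,"r":73}']}
)

# Parse each JSON string into a dict, then build one column per key.
parsed = df["shape_attributes"].map(json.loads)
expanded = pd.DataFrame(parsed.tolist(), index=df.index)

print(expanded)
```

If the cells are already dicts rather than strings, the json.loads step would not be needed.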
1

votes
0

answer
8

Views

Cluster Size is too big after BIRCH clustering

I have a dataset of 2.4 million rows and about 56 variables. I sampled 10,000 rows and used PCA to reduce them to 10 dimensions. Then I used BIRCH clustering, as k-means and hierarchical clustering were showing bad silhouette coefficients. Scikit-learn says that the use case of BIRCH is large datasets and data reduction. As the re...
Elbert
0

votes
1

answer
17

Views

How to use if statements with pandas/csv files

I want to check if the string data in a series is equal to a given string, but this returns: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). I know that to use and/or I use & / |, but I don't understand how to do this with an if statement: for i in range(...
BOBTHEBUILDER
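The error quoted in the question above arises because comparing a Series with a string produces a whole Series of booleans. A minimal sketch of the usual ways around it follows; the column name status and its values are assumptions, since the asker's code is truncated.

```python
import pandas as pd

# Hypothetical data; the column name and target string are illustrative only.
df = pd.DataFrame({"status": ["open", "closed", "open"]})

# Comparing a Series to a string yields a boolean Series, not a single bool.
is_open = df["status"] == "open"

# Reduce the mask to one bool before using it in an if statement.
if is_open.any():
    print("at least one row is open")

# Or skip the if entirely and select the matching rows with the mask.
open_rows = df[is_open]

# Combine conditions element-wise with & and |, wrapping each in parentheses.
first_open = df[(df["status"] == "open") & (df.index == 0)]
```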
1

votes
2

answer
6.8k

Views

Categorize Data in a column in dataframe

I have a column of numbers in my dataframe, and I want to categorize these numbers into, e.g., high, low, excluded. How do I accomplish that? I am clueless; I have tried looking at the cut function and the category datatype.
Nathaniel Babalola
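The cut function mentioned in the question above can do this in one call. The sketch below is only illustrative: the column name score, the sample values, and the bin edges are all assumptions, because the question does not say where the thresholds lie.

```python
import pandas as pd

# Hypothetical values and thresholds; adjust the bin edges to your own data.
df = pd.DataFrame({"score": [5, 42, 77, 120]})

# Bins are right-inclusive by default: (0, 50], (50, 100], (100, inf].
df["category"] = pd.cut(
    df["score"],
    bins=[0, 50, 100, float("inf")],
    labels=["low", "high", "excluded"],
)
```

The result is a categorical column, which matches the category datatype the asker was looking at.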
1

votes
1

answer
211

Views

How do I find the mean/median of the different values in a row?

I have a dataset in a csv file that looks like this: teacher student student grade Jon marin 99 Jon Rob 81 Jon marly 90 Bon martin 76 Bon marie 56 Ton Seri...
JJ123
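For the teacher/student question above, a grouped aggregation is one plausible reading of what is wanted. This sketch uses only the rows visible in the excerpt, and the column names are assumed from its header line.

```python
import pandas as pd

# Only the rows visible in the excerpt; column names are assumed from its header.
df = pd.DataFrame(
    {
        "teacher": ["Jon", "Jon", "Jon", "Bon", "Bon"],
        "student": ["marin", "Rob", "marly", "martin", "marie"],
        "student grade": [99, 81, 90, 76, 56],
    }
)

# One row per teacher, with the mean and median of that teacher's grades.
summary = df.groupby("teacher")["student grade"].agg(["mean", "median"])
print(summary)
```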
1

votes
2

answer
1.2k

Views

How do I calculate popularity of content?

I'm developing a web site where the user rates content (1-5 stars). I need to measure the popularity of the content (also referred to as importance/hotness/interest). My first thought was just to add up the user ratings for a piece of content: Popularity = SUM(Rating - 2.5) If two users give it 5 stars and on...
Pking
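The formula quoted in the question above, Popularity = SUM(Rating - 2.5), can be written out as a short function. This only illustrates the asker's first idea and is not a recommended scoring method; the function name, the neutral parameter, and the sample ratings are assumptions.

```python
def popularity(ratings, neutral=2.5):
    """Sum of each rating's offset from the neutral point, as in the question."""
    return sum(rating - neutral for rating in ratings)


# Hypothetical ratings, for illustration only.
print(popularity([5, 5]))     # two 5-star ratings
print(popularity([1, 1, 1]))  # three 1-star ratings
```

Because every rating adds or subtracts from the total, the score grows with the number of votes as well as with their quality.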
1

votes
1

answer
243

Views

Data Analysis - Manipulating Auction Data in Excel - VBA

I have a .csv file with the following data from eBay auctions: auctionid - unique identifier of an auction bidtime - the time (in days) that the bid was placed, from the start of the auction bidder - eBay username of the bidder I am trying to create new variables for how long any given bidder partic...
user3408811
1

votes
2

answer
157

Views

Different result of code example on book: Python for Data Analysis

I have a question on a book "Python for Data Analysis" if anyone is interested in this book. After running an example on page 244 (Plotting Maps: Visualizing Haiti Earthquake Crisis Data), my result of dummy_frame.ix doesn't look the same as what the book says as below: dummy_frame = DataFrame(np.ze...
ronnefeldt
1

votes
1

answer
1.1k

Views

Counting qualitative values based on the date range in Pandas

I am learning to use the pandas library and need to perform analysis and plot the crime data set below. Each row represents one occurrence of crime. The date_rep column contains daily dates for a year. Data needs to be grouped by month and instances of specific crime need to be added up per month, like in...
verkter
1

votes
1

answer
1.1k

Views

Calculating the mean of groups in python/pandas

My grouped data looks like: deviceid time 01691cbb94f16f737e4c83eca8e5f5e5390c2801 January 10 022009f075929be71975ce70db19cd47780b112f April 566 August 210 January 4 July 578 June 1048 May 1483 02bad1cdf92fbaa932...
user866098
1

votes
2

answer
1k

Views

Lua library for data analysis (data frames)

Is there any Lua implementation of data frames (structures for data analysis), something like Python's pandas? I want to do some statistical operations using LuaJIT.
Robert Zaremba
1

votes
2

answer
90

Views

Mean and standard deviation by groups where a condition is satisfied

I have a data frame (df) like this, which is just a sample: group condition values 1 0 12 1 1 15 1 1 23 1 1 14 2 1 34 2 1 37 2 0 31 2 0 36 2 1 35 Namely; df
oercim
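For the grouped mean and standard deviation question above, here is a hedged sketch in pandas. The question does not name its language and its text is cut off, so this assumes "condition is satisfied" means condition == 1.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    {
        "group": [1, 1, 1, 1, 2, 2, 2, 2, 2],
        "condition": [0, 1, 1, 1, 1, 1, 0, 0, 1],
        "values": [12, 15, 23, 14, 34, 37, 31, 36, 35],
    }
)

# Keep rows where the condition holds (assumed to mean condition == 1),
# then summarise per group.
# df["values"] is used because df.values is a built-in DataFrame attribute.
stats = df[df["condition"] == 1].groupby("group")["values"].agg(["mean", "std"])
print(stats)
```

The std here is the sample standard deviation (ddof=1), which is the pandas default.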
-1

votes
1

answer
48

Views

How to optimise the performance of this code?

I am trying to run the code below. It works fine for small data sizes, but for larger data sizes it takes almost a day. Can anyone help optimise the code or suggest an approach? Can we use apply with a lambda to solve the issue? for index in df.index: for i in df.index: if ((df.loc[...
user10261014
1

votes
1

answer
82

Views

K-Mean Clustering: Evaluating new Cluster centers

Is it better to evaluate new Cluster centers after each iteration of all data points, or after assigning a cluster to each data point? To clarify, which of the two methods is preferred: You assign all the data points to various clusters and then find the new cluster center Or, you assign the next da...
Dipped Bits
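The two update schemes contrasted in the k-means question above can be put side by side in a small one-dimensional sketch. The function names and sample points are assumptions. The batch form is commonly associated with Lloyd's algorithm and the per-point form with MacQueen's variant.

```python
def batch_step(points, centers):
    """Assign every point first, then recompute each center once (Lloyd's style)."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
        clusters[nearest].append(p)
    # An empty cluster keeps its old center.
    return [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]


def online_step(points, centers):
    """Move the nearest center right after each assignment (MacQueen's style)."""
    centers = list(centers)
    counts = [1] * len(centers)  # each center starts as a cluster of one
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
        counts[nearest] += 1
        # Incremental mean: shift the center toward the new point.
        centers[nearest] += (p - centers[nearest]) / counts[nearest]
    return centers


points = [1.0, 2.0, 9.0, 10.0]
print(batch_step(points, [0.0, 5.0]))
print(online_step(points, [0.0, 5.0]))
```

The online variant depends on the order in which points are visited, while the batch variant does not. That is one reason the two can end a pass with different centers.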
1

votes
2

answer
1k

Views

Python Script to run a command over all files in a folder

For converting pdf to text I am using the following command: pdf2txt.py -o text.txt example.pdf # It will convert example.pdf to text.txt But I have more than 1000 pdf files which I need to convert to text file first and then do the analysis. Is there a way through which I can use this command to i...
python
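For the pdf2txt.py question above, one way to cover a whole folder is to build the same command once per file and run each in turn. This is a sketch under assumptions: pdf2txt.py is taken to be on the PATH and executable, and the helper names build_commands and convert_all are hypothetical.

```python
import subprocess
from pathlib import Path


def build_commands(folder):
    """Return one pdf2txt.py command per PDF in folder, writing <name>.txt beside it."""
    commands = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        txt = pdf.with_suffix(".txt")
        commands.append(["pdf2txt.py", "-o", str(txt), str(pdf)])
    return commands


def convert_all(folder):
    """Run each command in turn; check=True stops at the first failure."""
    for command in build_commands(folder):
        subprocess.run(command, check=True)
```

Calling convert_all("pdfs") would process every PDF in a folder named pdfs, a name used here only as an example.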
1

votes
1

answer
1.9k

Views

How to merge two large numpy arrays if slicing doesn't resolve memory error?

I have two numpy arrays container1 and container2 where container1.shape = (900,4000) and container2.shape = (5000,4000). Merging them using vstack results in a MemoryError. After searching through the old questions posted here, I tried to merge them using slicing like this: mergedContainer = numpy....
Tehmas
1

votes
1

answer
48

Views

How can I create a pivot_table with pandas that displays fields other than the ones I use for the index

I use the "pandas" package for Python, and I have a question. I have a DataFrame like this: | first | last | datr |city| |Zahir |Petersen|22.11.15|9 | |Zahir |Petersen|22.11.15|2 | |Mason |Sellers |10.04.16|4 | |Gannon |Cline |29.10.15|2 | |Craig |Sampson |20.04.16|2 | |C...
NCNecros
1

votes
1

answer
531

Views

NoSQL as a data mining solution?

In what ways are NoSQL databases more useful in data mining than, say, OLAP databases, and in what ways are they less useful? Is there an advantage in having fast data retrieval from a gigantic volume of data while also having a schema-less database?
DazedNConfused
0

votes
0

answer
6

Views

Updating sklearn LabelEncoder

I've used sklearn.LabelEncoder to encode some items, but when I feed it some new items it throws an error: ValueError: y contains previously unseen labels So, is there any method to update the label encoder with new items? Thanks
Mohammed Magdy Ismael
0

votes
0

answer
2

Views

Data extraction from unstructured PDF analysis: what tools and libraries should I go through?

I want to extract data from unstructured invoice PDFs and need a good machine learning analysis tool for it; even a good direction to learn in would be helpful.
Jalaj Chandnani
0

votes
1

answer
49

Views

Dictionary not handling multiple values

I am trying to create a dataframe of states and cities. Each state name in the table I am reading from ends with the letters [edit]; city names, on the other hand, end with either (text) or [number]. I have used regex to remove the text within the parentheses and square brackets, saved states in a list for states...
Emm
3

votes
1

answer
1.7k

Views

Combine two pandas dataframes adding corresponding values

I have two dataframes like these: df1 = pd.DataFrame({'A': [1,0,3], 'B':[0,0,1], 'C':[0,2,2]}, index =['a','b','c']) df2 = pd.DataFrame({'A': [0,0], 'B':[2,1]}, index =['a','c']) df1 and df2: | A | B | C | | A | B | ---|---|---|---| ---|---|---| a | 1 | 0 | 0 | a | 0 | 2 |...
kiril
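For the two frames in the question above, pandas' DataFrame.add with fill_value aligns on both index and columns, which seems to match what is being asked. The expected output in the excerpt is truncated, so treat this as a sketch rather than a confirmed answer.

```python
import pandas as pd

# Frames copied from the question.
df1 = pd.DataFrame({"A": [1, 0, 3], "B": [0, 0, 1], "C": [0, 2, 2]}, index=["a", "b", "c"])
df2 = pd.DataFrame({"A": [0, 0], "B": [2, 1]}, index=["a", "c"])

# Align on index and columns; a cell missing from one frame counts as 0.
combined = df1.add(df2, fill_value=0)
print(combined)
```

The result comes back as floats because alignment introduces missing values before they are filled. An astype(int) call would restore integers if that matters.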
3

votes
2

answer
127

Views

Filling a column in a dataframe based on a column in another dataframe in r

I have a dataframe of comments which looks like this(df1) Comments Apple laptops are really good for work,we should buy them Apple Iphones are too costly,we can resort to some other brands Google search is the best search engine Android phones are great these days I lost my visa card today I have a...
function
3

votes
1

answer
2.2k

Views

Pandas compare each row with all rows in data frame and save results in list for each row

I am trying to compare each row with all rows in a pandas DataFrame using fuzzywuzzy.fuzz.partial_ratio() >= 85 and write the results to a list for each row. in: df = pd.DataFrame( {'id':[1, 2, 3, 4, 5, 6], 'name':['dog', 'cat', 'mad cat', 'good dog', 'bad dog', 'chicken']}) use pandas function with fuzzywuzzy library g...
pirr
1

votes
3

answer
1.2k

Views

How do you test speed of sorting algorithm?

I want to do an empirical test on the speed of sorting algorithms. Initially I randomly generated data, but this seems to be unfair and messes up some algorithms. For example, with quicksort the pivot selection is important, and one method of picking the pivot is to always pick the first and another meth...
Celeritas
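One way to address the fairness concern in the sorting question above is to time each algorithm on several input shapes rather than on random data alone. The sketch below uses the standard timeit module. The helper names, input shapes, and sizes are assumptions, and the built-in sorted stands in for whichever algorithm is under test.

```python
import random
import timeit


def make_inputs(n, seed=0):
    """Build several input shapes so no single pivot strategy is favoured."""
    rng = random.Random(seed)
    shuffled = [rng.random() for _ in range(n)]
    return {
        "random": shuffled,
        "sorted": sorted(shuffled),
        "reversed": sorted(shuffled, reverse=True),
        "few_unique": [rng.randrange(10) for _ in range(n)],
    }


def benchmark(sort_fn, n=2000, repeats=5):
    """Best-of-repeats wall time per input shape; each run sorts a fresh copy."""
    results = {}
    for name, data in make_inputs(n).items():
        timings = timeit.repeat(lambda: sort_fn(list(data)), number=1, repeat=repeats)
        results[name] = min(timings)
    return results


print(benchmark(sorted))
```

Each timing includes the cost of copying the list, which is the same for every algorithm and so should not change the ranking. The fixed seed keeps the inputs identical between runs.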
1

votes
1

answer
8.5k

Views

How to get the scores of each feature from sklearn.feature_selection.SelectKBest?

I am trying to get the scores of all the features of my data set. file_data = numpy.genfromtxt(input_file) y = file_data[:,-1] X = file_data[:,0:-1] x_new = SelectKBest(chi2, k='all').fit_transform(X,y) Before the first row of X had the "Feature names" in string format but I was getting "Input conta...
Black Dragon
1

votes
3

answer
492

Views

Subtracting rows between data frames in pandas

I have two dataframes, df1 Name | std kumar | 8 Ravi | 10 Sri | 2 Ram | 4 df2, Name | std Sri | 2 Ram | 4 I want to subtract df2 rows from df1 and I tried, df1.subtract(df2,fill_value=None) but I am getting error, TypeError: unsupported operand type(s) for -: 'str' and 'str' My...
pyd
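The TypeError in the question above comes from subtract doing element-wise arithmetic, which is not defined for the string Name column. If the goal is to drop from df1 the rows that also appear in df2 (an assumption, since the excerpt is cut off), a merge with indicator=True is one possible sketch.

```python
import pandas as pd

# Frames copied from the question.
df1 = pd.DataFrame({"Name": ["kumar", "Ravi", "Sri", "Ram"], "std": [8, 10, 2, 4]})
df2 = pd.DataFrame({"Name": ["Sri", "Ram"], "std": [2, 4]})

# Left-merge on all shared columns; _merge records where each row was found.
merged = df1.merge(df2, how="left", indicator=True)

# Keep only rows that appear in df1 but not in df2.
remaining = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(remaining)
```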
2

votes
1

answer
684

Views

How to merge two dataframes based on a column in pandas [duplicate]

This question already has an answer here: Pandas Merging 101 1 answer I have two data frames, df1=pd.DataFrame({"Req":["Req 1","Req 2","Req 3"],"Count":[1,2,1]}) Req Count 0 Req 1 1 1 Req 2 2 2 Req 3 1 df2=pd.DataFrame({"Req":["Req 1","Req 2"],"Count":[0,1]}) Req Count 0...
pyd
1

votes
1

answer
1.5k

Views

How to add a new column and aggregate values in R

I am completely new to gnuplot and am only trying this because I need to learn it. I have a values in three columns where the first represents the filename (date and time, one hour interval) and the remaining two columns represent two different entities Prop1 and Prop2. Datetime Prop1...
sfactor
2

votes
1

answer
1.5k

Views

seaborn multiple variables group bar plot

I have a pandas dataframe with one index (datetime) and three variables (int): date A B C 2017-09-05 25 261 31 2017-09-06 261 1519 151 2017-09-07 188 1545 144 2017-09-08 200 2110 232 2017-09-09 292 2391 325 I can create a grouped bar plot with the basic pandas plot. df.plot(ki...
jaykodeveloper
2

votes
1

answer
35

Views

BigQuery: Xoring Elements of Two Arrays

I have two arrays. a = [1, 2, 3, 4] b = [11, 22, 33, 44] How can I XOR the respective elements of the two arrays to get result = [10, 20, 34, 40], i.e. 1^11 = 10, 2^22 = 20 and so on? I have tried BIT_XOR(x), but it takes one array and XORs all of the elements of the array. SELECT BIT_XOR(x) AS bit_...
john
2

votes
1

answer
53

Views

How to slice a dataframe column based on another column

I have a df like this, Main Length Sri playnig well cricket 5 sri went out 2 Ram is in 1 Ram went to UK,US 2 I am trying to slice the df["Main"] based on df["Length"] My expected output is, Main Length Sri p...
pyd
3

votes
2

answer
488

Views

Remove columns with values 0 or 999999

I am working with a large data set with 400 columns. Some of the columns have all values zero, and others have all zeros with a few '999999999'. I want to get rid of such columns. I was able to do it for the columns containing just zeroes but am not sure how to do it for columns containing zeroes and '999...
Uasthana
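For the question above about columns that hold only zeros and a sentinel value, a boolean mask built with isin is one possible approach. The frame and column names below are hypothetical, and the sentinel is assumed to be stored as the integer 999999999. If it is stored as text, the string form would need to be listed as well.

```python
import pandas as pd

# Hypothetical frame; the real data has about 400 columns.
df = pd.DataFrame(
    {
        "all_zero": [0, 0, 0],
        "zero_or_sentinel": [0, 999999999, 0],
        "useful": [0, 5, 7],
    }
)

# True for columns whose every value is either 0 or the sentinel.
SENTINEL = 999999999
drop_mask = df.isin([0, SENTINEL]).all()

cleaned = df.loc[:, ~drop_mask]
```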
-2

votes
0

answer
24

Views

Selecting all strings in a list that contain a given expression in R

I am reading a crime statistic .csv file into a variable called crimes. New data frame crimes contains a column titled Text_General_Code that briefly describes the crime that was committed. I want to find out what is the percentage of burglaries and robberies in crimes committed. I am using the foll...
Jakov Sergo
19

votes
1

answer
37.8k

Views

Python Pandas join dataframes on index

I am trying to join two dataframes on the same column "Date"; the code is as follows: import pandas as pd from datetime import datetime df_train_csv = pd.read_csv('./train.csv',parse_dates=['Date'],index_col='Date') start = datetime(2010, 2, 5) end = datetime(2012, 10, 26) df_train_fly = pd.date_range(...
wuha
4

votes
1

answer
1.3k

Views

Calculating subtractions of pairs of columns in pandas DataFrame

I work with significantly sized (48K rows, up to tens of columns) DataFrames. At a certain point in their manipulation, I need to do pair-wise subtractions of column values and I was wondering if there is a more efficient way to do so rather than the one I'm doing (see below). My current code: # Mat...
Einar
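For the pair-wise column subtraction question above, one common pattern is to enumerate column pairs with itertools.combinations and subtract whole columns at once. The asker's own code is truncated, so this sketch is not a comparison against it, and the frame and column names are illustrative only.

```python
from itertools import combinations

import pandas as pd

# Hypothetical frame; column names are illustrative only.
df = pd.DataFrame({"a": [1, 2], "b": [10, 20], "c": [100, 200]})

# Build every column pair once and subtract whole columns at a time.
diffs = pd.DataFrame(
    {f"{left}-{right}": df[left] - df[right] for left, right in combinations(df.columns, 2)}
)
print(diffs)
```

With tens of columns the number of pairs grows quadratically, so the result can be much wider than the input.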
19

votes
12

answer
6.9k

Views

What's the best approach to recognize patterns in data, and what's the best way to learn more on the topic?

A developer I am working with is developing a program that analyzes images of pavement to find cracks in the pavement. For every crack his program finds, it produces an entry in a file that tells me which pixels make up that particular crack. There are two problems with his software though: 1) It pr...
Phil
5

votes
2

answer
157

Views

Analyze similarities in model data using Elasticsearch and Rails

I would like to use Elasticsearch to analyze data and display it to the user. When a user views a record for a model, I want to display a list of 'similar' records in the database for that model, and the percentage of similarity. This would match against every field on the model. I am aware that wi...
Drew
2

votes
1

answer
1.3k

Views

Plotting event density in Python with ggplot and pandas

I am trying to visualize data of this form: timestamp senderId 0 735217 106758968942084595234 1 735217 114647222927547413607 2 735217 106758968942084595234 3 735217 106758968942084595234 4 735217 114647222927547413607 5 etc... geom_density works if I don't...
MasterScrat
