# Questions tagged [data-analysis]

421 questions

0

votes

0

answer

6

Views

### What are some dynamic knowledge-based authentication questions that could be asked to a customer for fraud prevention?

I'm looking to frame some dynamic knowledge-based authentication questions based on a customer's data to prevent fraud. For example, based on the customer's call records, I could ask them to identify the number they call most frequently as an authentication step. Or I could ask them to identify the clos...

0

votes

5

answer

35

Views

### I have data in one column, how to extract that?

I have data in one column; how do I extract it?
For example:
{"name":"circle","cx":371,"cy":2921,"r":73}
I want output as follows:
name cx cy r
circle 371 2921 73
Note: all 4 values are in same column shape_attributes
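One possible approach, sketched in pandas: parse each JSON string and expand the resulting dicts into columns. The one-row frame below is a hypothetical stand-in built from the example value; only the column name `shape_attributes` comes from the question.

```python
import json

import pandas as pd

# Hypothetical frame: one column holding JSON strings, as in the example above.
df = pd.DataFrame(
    {"shape_attributes": ['{"name":"circle","cx":371,"cy":2921,"r":73}']}
)

# Parse each JSON string into a dict, then build one column per key.
parsed = df["shape_attributes"].apply(json.loads)
expanded = pd.DataFrame(parsed.tolist(), index=df.index)

print(expanded)
```

If the other columns are still needed, `df.join(expanded)` puts the new columns next to the original ones.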

1

votes

0

answer

8

Views

### Cluster Size is too big after BIRCH clustering

I have a dataset of 2.4 million rows and about 56 variables. I sampled 10,000 rows and reduced them to 10 dimensions with PCA.
Then I used BIRCH clustering, because k-means and hierarchical clustering were giving a bad silhouette coefficient. Scikit-learn says that the use case for BIRCH is large datasets and data reduction.
As the re...

0

votes

1

answer

17

Views

### How to use if statements with pandas/ csv files

I want to check whether the string data in a series is equal to a given string,
but this returns:
The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I know that to combine conditions with and/or I use & / |, but I don't understand how to do this with an if statement.
for i in range(...
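A minimal sketch of why the error appears and two common ways around it. The series `s` and the target string are hypothetical; the question's own loop is truncated, so this does not reproduce it.

```python
import pandas as pd

s = pd.Series(["yes", "no", "yes"])  # hypothetical data

# Comparing a Series to a string is element-wise and yields a boolean Series.
matches = s == "yes"

# `if` needs a single bool, so the Series has to be reduced explicitly.
if matches.any():
    print("at least one row equals 'yes'")
if not matches.all():
    print("not every row equals 'yes'")

# To act on the matching rows, filter with the mask instead of looping.
selected = s[matches]
```

Writing `if s == "yes":` directly is what raises the "truth value of a Series is ambiguous" error, because pandas cannot decide whether any or all elements are meant.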

1

votes

2

answer

6.8k

Views

### Categorize Data in a column in dataframe

I have a column of numbers in my dataframe, and I want to categorize these numbers into, e.g., high, low, and excluded. How do I accomplish that? I am clueless; I have tried looking at the cut function and the category datatype.
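As a rough illustration of the `cut` route mentioned above: the bin edges and sample values here are assumptions, since the question does not say where "excluded", "low", and "high" begin.

```python
import pandas as pd

df = pd.DataFrame({"value": [3, 18, 42, 77, 95]})  # hypothetical data

# Assumed edges: (-inf, 10] -> excluded, (10, 50] -> low, (50, inf) -> high.
bins = [float("-inf"), 10, 50, float("inf")]
labels = ["excluded", "low", "high"]

# pd.cut returns a categorical column with one label per row.
df["category"] = pd.cut(df["value"], bins=bins, labels=labels)

print(df)
```

By default each bin includes its right edge, so a value of exactly 10 lands in "excluded" under these assumed edges.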

1

votes

1

answer

211

Views

### How do I find the mean/median of the different values in a row?

I have a dataset in a csv file that looks like this:
teacher student student grade
Jon marin 99
Jon Rob 81
Jon marly 90
Bon martin 76
Bon marie 56
Ton Seri...
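A small sketch of a per-teacher mean and median with `groupby`, using only the five complete rows visible above. The column name `grade` is an assumption for the grade column.

```python
import pandas as pd

# Rows copied from the excerpt; "grade" is an assumed column name.
df = pd.DataFrame(
    {
        "teacher": ["Jon", "Jon", "Jon", "Bon", "Bon"],
        "student": ["marin", "Rob", "marly", "martin", "marie"],
        "grade": [99, 81, 90, 76, 56],
    }
)

# One row per teacher, one column per statistic.
stats = df.groupby("teacher")["grade"].agg(["mean", "median"])

print(stats)
```

With a CSV file, `pd.read_csv(...)` would replace the hand-built frame.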

1

votes

2

answer

1.2k

Views

### How do I calculate popularity of content?

I'm developing a web site where the user rates content (1-5 stars). I need to measure the popularity of the content (also referred to as importance/hotness/interest). My first thought was just to add the user ratings for a content:
Popularity = SUM(Rating - 2.5)
If two users give it 5 stars and on...
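The formula above can be written out as a few lines of Python. The function name and the sample ratings are illustrative only; the 2.5 offset is the one given in the question.

```python
def popularity(ratings, neutral=2.5):
    """Sum of (rating - neutral): ratings above 2.5 add, ratings below subtract."""
    return sum(r - neutral for r in ratings)


print(popularity([5, 5]))     # two 5-star ratings
print(popularity([5, 5, 1]))  # the same plus one 1-star rating
```

A property worth noting: with this formula the score keeps growing with the number of ratings, so heavily rated content outranks lightly rated content with the same average.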

1

votes

1

answer

243

Views

### Data Analysis - Manipulating Auction Data in Excel - VBA

I have a .csv file with the following data from eBay auctions:
auctionid - unique identifier of an auction
bidtime - the time (in days) that the bid was placed, from the start of the auction
bidder - eBay username of the bidder
I am trying to create new variables for how long any given bidder partic...

1

votes

2

answer

157

Views

### Different result of code example on book: Python for Data Analysis

I have a question about the book "Python for Data Analysis", if anyone is interested in this book.
After running an example on page 244 (Plotting Maps: Visualizing Haiti Earthquake Crisis Data), my result of dummy_frame.ix doesn't look the same as what the book says as below:
dummy_frame = DataFrame(np.ze...

1

votes

1

answer

1.1k

Views

### Counting qualitative values based on the date range in Pandas

I am learning to use Pandas library and need to perform analysis and plot the crime data set below. Each row represents one occurrence of crime. date_rep column contains daily dates for a year.
Data needs to be grouped by month and instances of specific crime need to be added up per month, like in...
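One way to get per-month counts per crime type, as a sketch. Only the `date_rep` column name comes from the question; the `crime` column and the sample rows are assumptions.

```python
import pandas as pd

# Hypothetical data: one row per occurrence of a crime.
df = pd.DataFrame(
    {
        "date_rep": pd.to_datetime(
            ["2015-01-03", "2015-01-17", "2015-01-20", "2015-02-02", "2015-02-09"]
        ),
        "crime": ["burglary", "robbery", "burglary", "burglary", "robbery"],
    }
)

# Group by calendar month and crime type, count rows, then pivot crime types
# into columns so each row is one month.
month = df["date_rep"].dt.month.rename("month")
monthly = df.groupby([month, "crime"]).size().unstack(fill_value=0)

print(monthly)
```

Calling `monthly.plot(kind="bar")` on the result would draw one bar group per month, assuming matplotlib is available.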

1

votes

1

answer

1.1k

Views

### Calculating the mean of groups in python/pandas

My grouped data looks like:
deviceid time
01691cbb94f16f737e4c83eca8e5f5e5390c2801 January 10
022009f075929be71975ce70db19cd47780b112f April 566
August 210
January 4
July 578
June 1048
May 1483
02bad1cdf92fbaa932...
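A sketch of averaging the monthly values per device, assuming the grouped data is a Series with a two-level index (`deviceid`, `time`). The short device ids below are placeholders for the long hashes, and the numbers are the ones visible in the excerpt.

```python
import pandas as pd

# Placeholder ids "dev_a"/"dev_b" stand in for the hashes shown above.
index = pd.MultiIndex.from_tuples(
    [
        ("dev_a", "January"),
        ("dev_b", "April"),
        ("dev_b", "August"),
        ("dev_b", "January"),
        ("dev_b", "July"),
        ("dev_b", "June"),
        ("dev_b", "May"),
    ],
    names=["deviceid", "time"],
)
grouped = pd.Series([10, 566, 210, 4, 578, 1048, 1483], index=index)

# Average over the month level, leaving one value per device.
mean_per_device = grouped.groupby(level="deviceid").mean()

print(mean_per_device)
```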

1

votes

2

answer

1k

Views

### Lua library for data analysis (data frames)

Is there any Lua implementation of data frames (structures for data analysis), something like Python's pandas? I want to do some statistical operations using LuaJIT.

1

votes

2

answer

90

Views

### Mean and standard deviation by groups where a condition is satisfied

I have a data frame (df) like this, which is just a sample:
group condition values
1 0 12
1 1 15
1 1 23
1 1 14
2 1 34
2 1 37
2 0 31
2 0 36
2 1 35
Namely;
df
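For illustration, here is how the per-group mean and standard deviation over rows with `condition == 1` could look in pandas. The question does not say which language it uses, so treat this as one possible rendering of the idea; the data is copied from the table above.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "group": [1, 1, 1, 1, 2, 2, 2, 2, 2],
        "condition": [0, 1, 1, 1, 1, 1, 0, 0, 1],
        "values": [12, 15, 23, 14, 34, 37, 31, 36, 35],
    }
)

# Keep only rows where the condition holds, then summarise each group.
result = df[df["condition"] == 1].groupby("group")["values"].agg(["mean", "std"])

print(result)
```

Note that pandas' `std` is the sample standard deviation (divides by n - 1).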

-1

votes

1

answer

48

Views

### How to optimise the performance of this code?

I am trying to run the code below. It works fine for small data sizes, but for larger data it takes almost a day.
Can anyone help optimise the code or suggest an approach? Can we use apply with a lambda to solve the issue?
for index in df.index:
for i in df.index:
if ((df.loc[...

1

votes

1

answer

82

Views

### K-Mean Clustering: Evaluating new Cluster centers

Is it better to evaluate new Cluster centers after each iteration of all data points, or after assigning a cluster to each data point? To clarify, which of the two methods is preferred:
You assign all the data points to various clusters and then find the new cluster center
Or, you assign the next da...
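To make the two options concrete, here is a sketch of both update styles in NumPy. The first is the batch step usually called Lloyd's algorithm, which is what "standard" k-means normally refers to; the second is an online, MacQueen-style pass. Function names are illustrative, and treating each initial center as one observation is an assumption of this sketch.

```python
import numpy as np


def batch_update(points, centers):
    """Assign every point first, then recompute all centers at once."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    new_centers = centers.copy()
    for k in range(len(centers)):
        members = points[labels == k]
        if len(members):  # an empty cluster keeps its old center
            new_centers[k] = members.mean(axis=0)
    return new_centers, labels


def online_update(points, centers):
    """Move the winning center immediately after each single point."""
    centers = centers.copy()
    counts = np.ones(len(centers))  # assumption: each start center counts once
    for p in points:
        k = np.linalg.norm(centers - p, axis=1).argmin()
        counts[k] += 1
        centers[k] += (p - centers[k]) / counts[k]  # incremental mean
    return centers
```

The batch version does not depend on the order of the points, while the online version does, which is one practical difference between the two.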

1

votes

2

answer

1k

Views

### Python Script to run a command over all files in a folder

For converting pdf to text I am using the following command:
pdf2txt.py -o text.txt example.pdf # It will convert example.pdf to text.txt
But I have more than 1000 pdf files which I need to convert to text file first and then do the analysis.
Is there a way through which I can use this command to i...
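A sketch of looping the same command over a folder from Python. It assumes `pdf2txt.py` is on the PATH exactly as in the command above, and it writes each `.txt` next to its `.pdf`; the helper names are made up for this example.

```python
import subprocess
from pathlib import Path


def build_commands(folder):
    """Return one pdf2txt.py command per PDF in `folder`."""
    commands = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        txt = pdf.with_suffix(".txt")  # example.pdf -> example.txt
        commands.append(["pdf2txt.py", "-o", str(txt), str(pdf)])
    return commands


def convert_all(folder):
    """Run the commands one by one; stops with an error if a conversion fails."""
    for cmd in build_commands(folder):
        subprocess.run(cmd, check=True)


# Usage (hypothetical folder name):
# convert_all("pdfs")
```

Keeping command construction separate from execution makes it easy to print the commands first and check them before converting 1000 files.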

1

votes

1

answer

1.9k

Views

### How to merge two large numpy arrays if slicing doesn't resolve memory error?

I have two numpy arrays container1 and container2 where container1.shape = (900,4000) and container2.shape = (5000,4000). Merging them using vstack results in a MemoryError. After searching through the old questions posted here, I tried to merge them using slicing like this:
mergedContainer = numpy....
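If the merged array simply does not fit in RAM, one option is to build it in a disk-backed `numpy.memmap` and copy the inputs in block by block. This is a sketch under that assumption; the function name and file path are hypothetical.

```python
import numpy as np


def vstack_to_memmap(arrays, path):
    """Stack 2-D arrays with equal column counts into a disk-backed array."""
    rows = sum(a.shape[0] for a in arrays)
    cols = arrays[0].shape[1]
    out = np.memmap(path, dtype=arrays[0].dtype, mode="w+", shape=(rows, cols))
    start = 0
    for a in arrays:
        out[start:start + a.shape[0]] = a  # copy one input at a time
        start += a.shape[0]
    out.flush()  # make sure the data is written to the file
    return out


# Usage with the names from the question (path is hypothetical):
# merged = vstack_to_memmap([container1, container2], "merged.dat")
```

The result behaves like an ordinary array for indexing and slicing, but the operating system pages it in from disk as needed.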

1

votes

1

answer

48

Views

### How can I create a pivot_table with pandas that displays fields other than the ones I use for the index?

I use the pandas package for Python, and I have a question.
I have DataFrame like this:
| first | last | datr |city|
|Zahir |Petersen|22.11.15|9 |
|Zahir |Petersen|22.11.15|2 |
|Mason |Sellers |10.04.16|4 |
|Gannon |Cline |29.10.15|2 |
|Craig |Sampson |20.04.16|2 |
|C...

1

votes

1

answer

531

Views

### NoSQL as a data mining solution?

In what ways are NoSQL databases more useful in data mining than, say, OLAP databases, and in what ways are they less useful?
Is there an advantage in having a fast data-retrieval from gigantic volume of data but also having a schema-less database?

0

votes

0

answer

6

Views

### Updating sklearn LabelEncoder

I've used sklearn.preprocessing.LabelEncoder to encode some items, but when I feed it some new items it throws an error:
ValueError: y contains previously unseen labels
So, is there any method to update the label encoder with new items?
Thanks
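One workaround is to keep your own mapping that grows when a new label shows up. The class below is a hypothetical helper, not part of scikit-learn, and it differs from `LabelEncoder` in that ids follow first-seen order rather than sorted order.

```python
class GrowingLabelEncoder:
    """Hypothetical helper (not scikit-learn): unseen labels get the next id."""

    def __init__(self):
        self.label_to_id = {}
        self.id_to_label = []

    def transform(self, labels):
        ids = []
        for label in labels:
            if label not in self.label_to_id:
                # First time this label appears: assign the next free integer.
                self.label_to_id[label] = len(self.id_to_label)
                self.id_to_label.append(label)
            ids.append(self.label_to_id[label])
        return ids

    def inverse_transform(self, ids):
        return [self.id_to_label[i] for i in ids]


enc = GrowingLabelEncoder()
first = enc.transform(["a", "b", "a"])
second = enc.transform(["c", "a"])  # "c" is new and does not raise
```

Because existing ids never change, anything already encoded stays valid after new labels are added.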

0

votes

0

answer

2

Views

### Data abstraction from unstructured PDF analysis: what tools and libraries should I go through?

I want to abstract data from unstructured invoice PDFs and need a good machine learning analysis tool for it; even a good direction to learn would be helpful.

0

votes

1

answer

49

Views

### Dictionary not handling multiple values

I am trying to create a dataframe of states and cities.
Each state name in the table I am reading from ends with the letters [edit], while city names end with (text)[number].
I have used regex to remove the text within the parentheses and square brackets, saved states in a list for states...

3

votes

1

answer

1.7k

Views

### Combine two pandas dataframes adding corresponding values

I have two dataframes like these:
df1 = pd.DataFrame({'A': [1,0,3], 'B':[0,0,1], 'C':[0,2,2]}, index =['a','b','c'])
df2 = pd.DataFrame({'A': [0,0], 'B':[2,1]}, index =['a','c'])
df1 and df2:
| A | B | C | | A | B |
---|---|---|---| ---|---|---|
a | 1 | 0 | 0 | a | 0 | 2 |...
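Assuming the goal is to add values at matching row and column labels and treat anything missing from one frame as 0, `DataFrame.add` with `fill_value=0` is one way to do it. The frames are the ones defined in the question.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 0, 3], "B": [0, 0, 1], "C": [0, 2, 2]}, index=["a", "b", "c"])
df2 = pd.DataFrame({"A": [0, 0], "B": [2, 1]}, index=["a", "c"])

# Align on index and columns; cells missing from one frame count as 0.
combined = df1.add(df2, fill_value=0)

print(combined)
```

The result comes back as floats because of the alignment step; `combined.astype(int)` converts it back if every cell is filled.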

3

votes

2

answer

127

Views

### Filling a column in a dataframe based on a column in another dataframe in r

I have a dataframe of comments which looks like this(df1)
Comments
Apple laptops are really good for work,we should buy them
Apple Iphones are too costly,we can resort to some other brands
Google search is the best search engine
Android phones are great these days
I lost my visa card today
I have a...

3

votes

1

answer

2.2k

Views

### Pandas compare each row with all rows in data frame and save results in list for each row

I am trying to compare each row with all rows in a pandas DataFrame using fuzzywuzzy.fuzz.partial_ratio() >= 85 and write the results to a list for each row.
in: df = pd.DataFrame( {'id':[1, 2, 3, 4, 5, 6], 'name':['dog', 'cat', 'mad cat', 'good dog', 'bad dog', 'chicken']})
use pandas function with fuzzywuzzy library g...
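A sketch of the row-against-all-rows comparison. To keep it dependency-free, it uses a stand-in scorer built on `difflib`, which is an assumption of this example and not fuzzywuzzy's algorithm; passing `fuzz.partial_ratio` as `scorer` should slot in the real one.

```python
from difflib import SequenceMatcher

import pandas as pd


def stand_in_score(a, b):
    """Stand-in scorer (not fuzzywuzzy): 100 if the shorter string occurs in
    the longer one, otherwise difflib's similarity ratio scaled to 0-100."""
    short, long_ = sorted((a, b), key=len)
    if short in long_:
        return 100
    return round(100 * SequenceMatcher(None, a, b).ratio())


def similar_ids(df, scorer=stand_in_score, threshold=85):
    """For each row, list the ids of the other rows scoring >= threshold."""
    result = []
    for i, name in zip(df["id"], df["name"]):
        hits = [
            j
            for j, other in zip(df["id"], df["name"])
            if j != i and scorer(name, other) >= threshold
        ]
        result.append(hits)
    return result


df = pd.DataFrame(
    {"id": [1, 2, 3, 4, 5, 6],
     "name": ["dog", "cat", "mad cat", "good dog", "bad dog", "chicken"]}
)
df["similar"] = similar_ids(df)
```

This compares every pair, so the cost grows with the square of the number of rows.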

1

votes

3

answer

1.2k

Views

### How do you test speed of sorting algorithm?

I want to do an empirical test on the speed of sorting algorithms. Initially I randomly generated data but this seems to be unfair and mess up some algorithms. For example with quicksort the pivot selection is important and one method of picking the pivot is to always pick the first and another meth...
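One way to make such a test fairer is to time each algorithm on several input shapes rather than only on random data. The sketch below uses `timeit`; the chosen shapes, sizes, and function names are assumptions for illustration.

```python
import random
import timeit


def make_inputs(n, seed=0):
    """Build several input shapes; a fixed seed gives every algorithm the same data."""
    rng = random.Random(seed)
    shuffled = [rng.random() for _ in range(n)]
    return {
        "random": shuffled,
        "sorted": sorted(shuffled),
        "reversed": sorted(shuffled, reverse=True),
        "few_unique": [rng.randrange(10) for _ in range(n)],
    }


def benchmark(sort_fn, n=2000, repeat=5):
    """Return the best-of-`repeat` time in seconds for each input shape."""
    results = {}
    for name, data in make_inputs(n).items():
        # Copy inside the timed call so in-place sorts never see sorted input.
        timer = timeit.Timer(lambda d=data: sort_fn(list(d)))
        results[name] = min(timer.repeat(repeat=repeat, number=1))
    return results


print(benchmark(sorted))
```

Taking the minimum over repeats reduces noise from other processes. The list copy is included in every timing, so it affects all algorithms equally.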

1

votes

1

answer

8.5k

Views

### How to get the scores of each feature from sklearn.feature_selection.SelectKBest?

I am trying to get the scores of all the features of my data set.
file_data = numpy.genfromtxt(input_file)
y = file_data[:,-1]
X = file_data[:,0:-1]
x_new = SelectKBest(chi2, k='all').fit_transform(X,y)
Before the first row of X had the "Feature names" in string format but I was getting "Input conta...
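For reference, a fitted `SelectKBest` exposes the per-feature scores as `scores_` (and p-values as `pvalues_`). The tiny arrays and the feature names below are made up for this sketch; `chi2` needs non-negative features.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical data: 4 samples, 3 non-negative features, binary target.
X = np.array([[1, 0, 3], [2, 0, 1], [0, 5, 2], [0, 6, 1]])
y = np.array([0, 0, 1, 1])
feature_names = ["f0", "f1", "f2"]  # assumed names, kept outside the array

# Fit first, then read the scores from the fitted selector.
selector = SelectKBest(chi2, k="all").fit(X, y)
scores = selector.scores_
pvalues = selector.pvalues_

for name, score in zip(feature_names, scores):
    print(name, score)
```

Calling `fit_transform` alone returns only the transformed matrix, so keeping a reference to the selector object is what gives access to the scores.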

1

votes

3

answer

492

Views

### Subtracting rows between data frames in pandas

I have two dataframes,
df1
Name | std
kumar | 8
Ravi | 10
Sri | 2
Ram | 4
df2,
Name | std
Sri | 2
Ram | 4
I want to subtract df2 rows from df1 and I tried,
df1.subtract(df2,fill_value=None)
but I am getting error,
TypeError: unsupported operand type(s) for -: 'str' and 'str'
My...
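Assuming "subtract" here means "drop the rows of df1 that also appear in df2", a merge with `indicator=True` is one way to express it. The expected output is cut off above, so this interpretation is an assumption.

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["kumar", "Ravi", "Sri", "Ram"], "std": [8, 10, 2, 4]})
df2 = pd.DataFrame({"Name": ["Sri", "Ram"], "std": [2, 4]})

# Left-merge on all shared columns and mark where each row came from.
merged = df1.merge(df2, on=["Name", "std"], how="left", indicator=True)

# Rows tagged "left_only" exist in df1 but not in df2.
remaining = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

print(remaining)
```

`DataFrame.subtract` is arithmetic subtraction, which is why it fails on the string column.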

2

votes

1

answer

684

Views

### how to merge two dataframes based on a column in pandas [duplicate]

This question already has an answer here:
Pandas Merging 101
1 answer
I have two data frames,
df1=pd.DataFrame({"Req":["Req 1","Req 2","Req 3"],"Count":[1,2,1]})
Req Count
0 Req 1 1
1 Req 2 2
2 Req 3 1
df2=pd.DataFrame({"Req":["Req 1","Req 2"],"Count":[0,1]})
Req Count
0...
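A minimal sketch of merging the two frames on `Req`. The desired output is cut off above, so the left join and the suffix names are assumptions.

```python
import pandas as pd

df1 = pd.DataFrame({"Req": ["Req 1", "Req 2", "Req 3"], "Count": [1, 2, 1]})
df2 = pd.DataFrame({"Req": ["Req 1", "Req 2"], "Count": [0, 1]})

# Keep every row of df1; rows with no match in df2 get NaN in Count_2.
merged = df1.merge(df2, on="Req", how="left", suffixes=("_1", "_2"))

print(merged)
```

Switching `how` to `"inner"` would keep only the requests present in both frames.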

1

votes

1

answer

1.5k

Views

### How to add a new column and aggregate values in R

I am completely new to gnuplot and am only trying this because I need to learn it. I have values in three columns, where the first represents the filename (date and time, at one-hour intervals) and the remaining two columns represent two different entities, Prop1 and Prop2.
Datetime Prop1...

2

votes

1

answer

1.5k

Views

### seaborn multiple variables group bar plot

I have a pandas dataframe with one index (datetime) and three variables (int):
date A B C
2017-09-05 25 261 31
2017-09-06 261 1519 151
2017-09-07 188 1545 144
2017-09-08 200 2110 232
2017-09-09 292 2391 325
I can create a grouped bar plot with the basic pandas plot.
df.plot(ki...
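seaborn's bar plot generally wants long-format data, so a common first step is to melt the wide frame. The sketch below only does the reshaping with pandas and shows the plotting call as a comment; the column names `variable` and `count` are choices made for this example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": [25, 261, 188, 200, 292],
        "B": [261, 1519, 1545, 2110, 2391],
        "C": [31, 151, 144, 232, 325],
    },
    index=pd.to_datetime(
        ["2017-09-05", "2017-09-06", "2017-09-07", "2017-09-08", "2017-09-09"]
    ),
)
df.index.name = "date"

# Wide -> long: one row per (date, variable) pair.
long_df = df.reset_index().melt(id_vars="date", var_name="variable", value_name="count")

# With seaborn installed, the grouped bars would then be drawn with:
#   import seaborn as sns
#   sns.barplot(data=long_df, x="date", y="count", hue="variable")
print(long_df.head())
```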

2

votes

1

answer

35

Views

### BigQuery: Xoring Elements of Two Arrays

I have two arrays.
a = [1, 2, 3, 4]
b = [11, 22, 33, 44]
How can I XOR the respective elements of the two arrays to get a result such as
result = [10, 20, 34, 40], i.e. 1^11 = 10, 2^22 = 20 and so on?
I have tried BIT_XOR(x), but it takes one array and XORs all of its elements.
SELECT BIT_XOR(x) AS bit_...

2

votes

1

answer

53

Views

### How to slice a dataframe column based on another column

I have a df like this,
Main Length
Sri playnig well cricket 5
sri went out 2
Ram is in 1
Ram went to UK,US 2
I am trying to slice the df["Main"] based on df["Length"]
My expected output is,
Main Length
Sri p...
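Judging by the visible start of the expected output ("Sri p" for length 5), the slice appears to be by character count. Under that assumption, a row-wise slice could look like this sketch, with the data copied from the excerpt:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Main": ["Sri playnig well cricket", "sri went out", "Ram is in", "Ram went to UK,US"],
        "Length": [5, 2, 1, 2],
    }
)

# Pair each string with its own length and keep that many leading characters.
df["Main"] = [text[:n] for text, n in zip(df["Main"], df["Length"])]

print(df)
```

The `.str` accessor only takes a single fixed slice for the whole column, which is why the per-row length is handled with `zip`.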

3

votes

2

answer

488

Views

### Remove columns with values 0 or 999999

I am working with a large data set with 400 columns. Some of the columns have all values zero, and others are all zeros with a few '999999999'. I want to get rid of such columns. I was able to do it for the columns containing just zeroes but am not sure how to do it for columns containing zeroes and '999...
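One way to cover both cases at once is to drop every column whose values all belong to a set of placeholder values. The sketch assumes the placeholders are stored as numbers; if they are strings, the list would need `"999999999"` instead. The sample frame is hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "a": [0, 0, 0],                   # all zeros -> drop
        "b": [0, 999999999, 0],           # zeros plus placeholder -> drop
        "c": [1, 0, 999999999],           # has a real value -> keep
        "d": [5, 6, 7],                   # keep
    }
)

placeholders = [0, 999999999]  # assumed numeric placeholder values

# True for columns in which every value is one of the placeholders.
only_placeholders = df.isin(placeholders).all()

cleaned = df.loc[:, ~only_placeholders]

print(cleaned.columns.tolist())
```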

-2

votes

0

answer

24

Views

### Selecting all strings in a list that contain a given expression in R

I am reading a crime statistics .csv file into a variable called crimes. The new data frame crimes contains a column titled Text_General_Code that briefly describes the crime that was committed. I want to find out what percentage of the crimes committed are burglaries and robberies. I am using the foll...

19

votes

1

answer

37.8k

Views

### Python Pandas join dataframes on index

I am trying to join two dataframes on the same column "Date"; the code is as follows:
import pandas as pd
from datetime import datetime
df_train_csv = pd.read_csv('./train.csv',parse_dates=['Date'],index_col='Date')
start = datetime(2010, 2, 5)
end = datetime(2012, 10, 26)
df_train_fly = pd.date_range(...

4

votes

1

answer

1.3k

Views

### Calculating subtractions of pairs of columns in pandas DataFrame

I work with significantly sized (48K rows, up to tens of columns) DataFrames. At a certain point in their manipulation, I need to do pair-wise subtractions of column values and I was wondering if there is a more efficient way to do so rather than the one I'm doing (see below).
My current code:
# Mat...
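The question's own code is cut off, so as an independent sketch: all pair-wise column differences can be computed in a single NumPy subtraction by selecting the left and right column of every pair up front. The sample frame and the `"a-b"` naming scheme are assumptions.

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [4, 6], "c": [10, 20]})  # hypothetical data

# Every unordered pair of columns, e.g. (a, b), (a, c), (b, c).
pairs = list(combinations(df.columns, 2))
left = [p[0] for p in pairs]
right = [p[1] for p in pairs]

# One vectorised subtraction covering all pairs at once.
diffs = pd.DataFrame(
    df[left].to_numpy() - df[right].to_numpy(),
    index=df.index,
    columns=[f"{l}-{r}" for l, r in pairs],
)

print(diffs)
```

With tens of columns this yields a few hundred difference columns, so memory is worth a quick check at 48K rows.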

19

votes

12

answer

6.9k

Views

### What's the best approach to recognize patterns in data, and what's the best way to learn more on the topic?

A developer I am working with is developing a program that analyzes images of pavement to find cracks in the pavement. For every crack his program finds, it produces an entry in a file that tells me which pixels make up that particular crack. There are two problems with his software though:
1) It pr...

5

votes

2

answer

157

Views

### Analyze similarities in model data using Elasticsearch and Rails

I would like to use Elasticsearch to analyze data and display it to the user.
When a user views a record for a model, I want to display a list of 'similar' records in the database for that model, and the percentage of similarity. This would match against every field on the model.
I am aware that wi...

2

votes

1

answer

1.3k

Views

### Plotting event density in Python with ggplot and pandas

I am trying to visualize data of this form:
timestamp senderId
0 735217 106758968942084595234
1 735217 114647222927547413607
2 735217 106758968942084595234
3 735217 106758968942084595234
4 735217 114647222927547413607
5 etc...
geom_density works if I don't...