# Questions tagged [statistics]

5030 questions

votes
1

answer
25

Views

### Simulation of multiple binomial random numbers in R

I have the following algorithm: Step 1. Generate X1=x1~Bin(6,1/3). Step 2. Generate X2|X1=x1~Bin(6-x1,(1/3)/(1-1/3)). Step 3. Generate X3|X1=x1,X2=x2~Bin(6-x1-x2,(1/3)/(1-1/3-1/3)). Step 4. Repeat steps 1-3 N times. Here is my approach to implement this algorithm in R: mult_binom
Isa
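
A minimal Python sketch of Steps 1-4 above, offered as an illustration rather than the asker's R code (which is cut off in this excerpt). It assumes NumPy is available; the function name `sample_conditional_binomials` and its parameters are hypothetical. Each conditional success probability is the category probability divided by the probability mass still unassigned, which reproduces (1/3)/(1-1/3) in Step 2.

```python
import numpy as np

def sample_conditional_binomials(n_trials, probs, n_draws, rng):
    """Draw n_draws vectors (X1, ..., Xk) by sequential conditional binomials.

    Hypothetical helper: names and signature are assumptions for illustration.
    """
    probs = np.asarray(probs, dtype=float)
    # tail[j] = probs[j] + ... + probs[-1], the mass still unassigned at step j
    tail = np.cumsum(probs[::-1])[::-1]
    draws = np.zeros((n_draws, len(probs)), dtype=int)
    remaining = np.full(n_draws, n_trials)  # trials left for each replicate
    for j, p in enumerate(probs):
        # Conditional success probability, e.g. (1/3) / (1 - 1/3) at Step 2
        cond_p = min(1.0, p / tail[j])
        draws[:, j] = rng.binomial(remaining, cond_p)
        remaining = remaining - draws[:, j]
    return draws

rng = np.random.default_rng(0)
samples = sample_conditional_binomials(6, [1/3, 1/3, 1/3], 1000, rng)
```

With three equal probabilities the last conditional probability is 1, so every row of `samples` sums to 6.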

votes
1

answer
2.7k

Views

### test statistic for Spearman rank correlation

My problem is that I have calculated Spearman rank correlations between two variables. One reviewer asked whether I could also add test statistics for all coefficients where p < 0.001. Here is one result: > cor.test(pl$data, pl$rang, method= 'spearman') # Spearman's rank correlation rho data: p...
Eco06

votes
2

answer
22

Views

### Is it possible to get the PERCENT_RANK for a single record, but relative to the entire table?

I would like the PERCENT_RANK value for a single record, but in relation to the entire table. Is this possible? Examples I've seen are like this: SELECT Name, Salary, PERCENT_RANK() OVER (ORDER BY Salary) FROM Employees Notice that it's calculating the percentile for the returned recordset. If you're...
Deane

votes
1

answer
1.9k

Views

### R ggplot geom_bar() label bars (with 'count')

I have a ggplot like this: ggplot(df,aes(x=DateDiff, fill=TEAM)) + geom_bar() How can I label the bars with the results from the y axis, when there's no y axis defined? (without altering the df)
adlisval

votes
0

answer
5

Views

### What type of machine learning or AI Model can I use for Factor Ranking

What type of machine learning or AI model can I use for factor ranking? I have some factors and am trying to rank them by how well they predict in my model. What kind of machine learning, AI, or deep learning model would work for this?
tplshams

votes
1

answer
10

Views

### How to remove extra statistical mean results from JSON dictionary in python?

I'm working in Python 3. I'm trying to determine the mean of measurements in a JSON dictionary of contaminants in a well. When I run the code, it shows the mean of the data for each line. Essentially I want to find one mean for all results of one contaminant. There are multiple results for the...
ceagle

votes
4

answer
93

Views

### Linear regression with two variables on python

I am developing code to analyze the relation between two variables. I am using a DataFrame to save the variables in two columns as follows: column A = 132.54672, 201.3845717, 323.2654551 column B = 51.54671995, 96.38457166, 131.2654551 I have tried to use statsmodels but it says that I do not hav...
Hugo Assis Brandao

votes
1

answer
103

Views

### Julia - describe() function display incomplete summary statistics

I'm trying basic data analysis with Julia I'm following this tutorial with the train datasets that can be found here (the one named train_u6lujuX_CVtuZ9i.csv) with the following code: using DataFrames, RDatasets, CSV, StatsBase train = CSV.read('/Path/to/train_u6lujuX_CVtuZ9i.csv'); describe(train[...
ecjb

votes
0

answer
94

Views

### Choosing the right design for nparLD

I have to use the nparLD package because of the data distribution and heteroscedasticity, but I have trouble selecting the right design (F1-LD-F1 or F2-LD-F1). I have two groups (one for patients and one for controls) and all participants underwent the same MR examination three times (before ex...
N. Szilvia

votes
1

answer
262

Views

### Bootstrap for Confidence Intervals

My problem is as follows: Firstly, I have to create 1000 bootstrap samples of size 100 of a 'theta hat'. I have a random variable X which follows a scaled t_5-distribution. The following code creates 1000 bootstrap samples of theta hat: library('metRology', lib.loc='~/R/win-library/3.4') # Draw some...
Hans Christensen

votes
0

answer
164

Views

### How do I propagate the error of a linear regression when projecting from Y to X?

I'm trying to figure out how to propagate errors in the following case I am calibrating a machine with a couple of standards (a, b, c) with accepted values x. My machine measures y for these standards, with a certain error (standard deviation of 1 in this example). Then I measure replicates of a sam...
Japhir

votes
0

answer
68

Views

### IHS (inverse hyperbolic sine) option of mhurdle in r does not work

I hope this thread finds you well. I am trying to use Box-Cox double hurdle model and Inverse Hyperbolic Sine (IHS) double hurdle model since the dependent variable of my data is non-normal. The problem I face is that I could not find some appropriate package to run the IHS model. I tried 'mhurdle'...
Jaecheol Lee

votes
0

answer
317

Views

### How to estimate weibull hazard rate function for a feature based on components failure data?

A mechanical component was run continuously until it failed (test-to-failure). We have data from one such experiment. Data Dictionary age --> time in mins. life_per --> age/total time to failure life_status --> failure==1 and 0== non failure (note: we only have 1 record with life_status=1 i.e. the tim...
GeorgeOfTheRF

votes
0

answer
145

Views

### BCa confidence interval using pre-bootstrapped data

I want to calculate a BCa confidence interval using the function boot.ci in the boot package. I do already have bootstrap replicates (not calculated using the boot function provided by the package). To calculate a BCa confidence interval, it should be sufficient to pass the bootstrap replicates tog...
UweM.

votes
0

answer
200

Views

### Can Kruskal-Wallis test be used to test significance of multiple groups within multiple factors?

I have tried to read what I can on Kruskal-Wallis and while I have found some useful information, I still seem to not find the answer to my question. I am trying to use the Kruskal-Wallis test to determine the significance of multiple groups, within multiple factors, in predicting a set of dependen...
user3245629

votes
0

answer
108

Views

### Correlation (pandas) between integer and boolean?

I have my data in the form of: price | bool_qual_1 | bool_qual_2 | bool_qual_3 13000 | True | True | True 20000 | False | True | True 15000 | True | True | False 13000 | False | False | False 15000 | True |...
Zeruno

votes
1

answer
118

Views

### Trying to display NONE when there is no mode in R

I am trying to figure the mode of a data set, while displaying 'NONE' if there is no mode. I am currently using Gregor's function as commented below. examples: {1,1,2,2,3} Expected results 1 2(success) {NA,NA,NA,1,1,1,3,3} Expected results NA 1(success) {1,2,3,4,5} Expected result NONE (success) {...
Jerry Lim

votes
0

answer
159

Views

### How to perform Welch's Ttest in Spark 2.0.0 using StreamingTest

I want to try Welch's T-test in Spark 2.0.0. As far as I know, I can use StreamingTest() from mllib on this website. [https://spark.apache.org/docs/2.0.0/api/scala/index.html#org.apache.spark.mllib.stat.test.StreamingTest] This is my code in spark-shell: import org.apache.spark.mllib.stat.test.{BinarySample,...
Jchanho

votes
0

answer
64

Views

### failure to get the 95%CI of D index using 'survcomp' and 'boot' package

I have used 'survcomp' to calculate the D index of the prediction model. The code was listed as follows: source('http://bioconductor.org/biocLite.R') biocLite('survcomp') library(survcomp) set.seed(12345) age
QW Zhang

votes
1

answer
139

Views

### Error with bspline smoothing for functional data in R with high number of basis functions

I have this error in R that I can't solve: Error in chol.default(temp) : the leading minor of order 445 is not positive definite In addition: Warning message: In smooth.basis1(argvals, y, fdParobj, wtvec = wtvec, fdnames = fdnames, : Matrix of basis function values has rank 799 < dim(fdobj$basis)...
Victoire Louis

votes
0

answer
332

Views

### Multivariable linear regression in JS

I am trying to perform multivariable linear regression with a single dependent variable Y and two independent variables x1, x2. This is simply an OLS regression with an additional independent variable: Y = b0 + b1 x1 + b2 x2 I need to also calculate the correlation coefficient R^2 for this relationsh...
Martin
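
The model Y = b0 + b1 x1 + b2 x2 from the excerpt above can be fit by ordinary least squares. The following is a minimal sketch in Python with NumPy, not the JavaScript the asker is looking for; the function name `fit_two_predictor_ols` and the toy data are hypothetical. R^2 is computed here as 1 - SS_res / SS_tot, the usual coefficient of determination.

```python
import numpy as np

def fit_two_predictor_ols(x1, x2, y):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2 (hypothetical helper)."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    y = np.asarray(y, dtype=float)
    # Design matrix with an intercept column followed by the two predictors
    design = np.column_stack([np.ones_like(x1), x1, x2])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coefs
    ss_res = np.sum(residuals ** 2)        # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)   # total variation around the mean
    r_squared = 1.0 - ss_res / ss_tot
    return coefs, r_squared

# Toy data generated from b0=1, b1=2, b2=3 without noise (illustrative only)
x1 = [0, 1, 2, 3, 4]
x2 = [1, 0, 2, 1, 3]
y = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]
coefs, r_squared = fit_two_predictor_ols(x1, x2, y)
```

Because the toy data are noise-free, `coefs` recovers the generating values up to floating-point error and `r_squared` is essentially 1.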

votes
1

answer
89

Views

### Linear regression with two independent variables in javascript

The following will output the slope, intercept and correlation coefficient R^2 for a given set of x and y values. let linearRegression = (y,x) => { let lr = {} let n = y.length let sum_x = 0 let sum_y = 0 let sum_xy = 0 let sum_xx = 0 let sum_yy = 0 for (let i = 0; i < y.length; i++) { sum_x += x[i]...
Martin

votes
0

answer
26

Views

### Distinguishing random gnp graphs from preferential attachment graphs using the powerlaw python package

My goal is to find the point where scale-free networks become indistinguishable from random (non-scale-free) networks using the powerlaw python package As stated in their paper one should determine the goodness of a power-law fit always by comparing it to the fit to another distribution. I would exp...
David Nathan

votes
0

answer
48

Views

### Latent factor recovery with probabilistic matrix factorization using Edward

I implemented a probabilistic matrix factorization model (R = U'V) following the example in Edward's repo: # data U_true = np.random.randn(D, N) V_true = np.random.randn(D, M) R_true = np.dot(np.transpose(U_true), V_true) + np.random.normal(0, 0.1, size=(N, M)) # model I = tf.placeholder(tf.float32,...
charlesh

votes
0

answer
16

Views

### how can I write a loop in python to get the difference between first and last date for one id

opptyId field oldValue newValue updateTime 0 Stage Qualify 2014-05-27T18:50:14 0 Forecast Best Case 2014-05-27T18:50:14 0 created 2014-05-27T18:50:14 0 Amount 795.53 2014-06-17T18:54:00 0 Stage Qualify Closed - Won 2014-07-09T20:11:05 0 Forecast...
bella

votes
0

answer
12

Views

### How to learn R as a statistical system?

I am more familiar with Python (at a beginner level), which I have found way easier than R thus far. Nevertheless, for professional reasons I want to learn R as well. My main intention is to learn how to use it as a statistical system. I looked in Stackoverflow for some recommendations on learning...
Alejandro Ruiz

votes
0

answer
194

Views

### Using scipy.stats.entropy on gmm.predict_proba() values

Background so I don't throw out an XY problem -- I'm trying to check the goodness of fit of a GMM because I want statistical back-up for why I'm choosing the number of clusters I've chosen to group these samples. I'm checking AIC, BIC, entropy, and root mean squared error. This question is about en...
Manner

votes
0

answer
181

Views

### Pandas vectorize statistical odds-ratio test

I'm looking for a faster way to do odds ratio tests on a large dataset. I have about 1200 variables (see var_col) that I want to test against each other for mutual exclusion/co-occurrence. An odds ratio test is defined as (a * d) / (b * c), where a, b, c, d are the number of samples with (a) altered in neith...
XyledMonkey
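
One possible way to vectorize the (a * d) / (b * c) ratio over all variable pairs is with matrix products, sketched below in Python with NumPy. The excerpt only defines a (altered in neither); the meanings of b, c and d used here (first variable only, second variable only, both) are assumptions, and the function name `pairwise_odds_ratios` and the input layout (samples in rows, variables in columns, 0/1 entries) are hypothetical.

```python
import numpy as np

def pairwise_odds_ratios(altered):
    """Odds ratio (a*d)/(b*c) for every pair of 0/1 columns.

    Assumed cell definitions for the pair (i, j):
      a: altered in neither, b: altered in i only,
      c: altered in j only,  d: altered in both.
    """
    x = np.asarray(altered, dtype=float)
    not_x = 1.0 - x
    a = not_x.T @ not_x   # counts of samples where neither is altered
    b = x.T @ not_x       # i altered, j not altered
    c = not_x.T @ x       # j altered, i not altered
    d = x.T @ x           # both altered
    # Zero cells in b or c give inf or nan; suppress the warnings here
    with np.errstate(divide="ignore", invalid="ignore"):
        return (a * d) / (b * c)

# Six samples, two variables (illustrative data only)
altered = np.array([[1, 1], [1, 0], [0, 1], [0, 0], [0, 0], [1, 1]])
odds = pairwise_odds_ratios(altered)
```

Diagonal entries compare a variable with itself and come out as `inf` or `nan`; in practice a continuity correction or a significance test would likely be added on top of the raw ratio.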

votes
1

answer
401

Views

### How can I make all-possible-regressions in R also include exponents and logs of the variables?

I primarily work in research and statistics and so am not as familiar with programming. I'm using the OLSRR package for statistical analysis when trying to compare as many model specifications as possible using all-possible-regressions. I use the code: model
Zayaan

votes
1

answer
897

Views

### Confidence interval in python

Is there a function or a package in python to get the 95% or 99% confidence interval of the distributions present in scipy.stats. If not what is easiest alternative for that Here are the distributions (but my priority is halfgennorm and loglaplace) st.alpha,st.anglit,st.arcsine,st.beta,st.betaprime...
Will_smith12

votes
1

answer
109

Views

### NA value in portfolio sorts

I am currently doing portfolio sorts on panel data meaning every month I form 5 portfolios based on the volatility of stocks. I have the following function: arguments are x: a vector of returns P: the number of portfolios we want sortPort
user9259005

votes
0

answer
167

Views

### Is this right way to generate the following heavy-tailed distribution?

I'm trying to generate an error distribution which follows a heavy-tailed distribution based on the statement : (2) A heavy-tailed distribution, i.e., t-df=2, a t-distribution with 2 degrees of freedom. This distribution has C95=3.61, where C95 is a measure of the distribution tail weight and defin...
Ryan

votes
0

answer
248

Views

### Multivariate linear mixed effects model in Python

I am playing around with this code which is for Univariate linear mixed effects modelling. The data set denotes: students as s instructors as d departments as dept service as service In the syntax of R's lme4 package (Bates et al., 2015), the model implemented can be summarized as: y ~ 1 + (1|stude...
laza

votes
0

answer
229

Views

### Fair quantiles with lots of same values?

Doing RFM analysis. I want to divide ranks by 5. I've done this OK. The problem is when I try the quantile function by 5 in the frequency part to divide buyers by 'great customer' if they buy often with rank 5, 'good' if they buy kind of less (rank 4) and so on, the fact that a large number of buyers...
J_p

votes
1

answer
298

Views

### Bootstrap Confidence Interval for Prediction

I would like to calculate a confidence interval for the RMSE of a machine learning regression in the out-of-sample test set predictions. My train set is the first 80% of the sample, and the 'out-of-sample' test set is the last 20% of the sample. I treat the RMSE of the test set predictions as the o...
Ben Smith

votes
0

answer
89

Views

### BTYD pnbd.EstimateParameters(): cal.cbs must have a frequency column labelled “x”

I am working on building a CLV model. I have faced an error, but answers to sharp's question helped in resolving it. Now, when I try and estimate the parameters using params
Akshat Bajaj

votes
2

answer
92

Views

### knn algorithm output one result per test data

I used knn to do a basic predictive model build. After I run: predictions = knn(train,test,cl,k=3) and then output predictions, the R console has a dozen results per row. Such as:  Yes No Yes Yes Yes No Yes Yes Yes No No  No Yes Yes Yes No No No Yes Yes Yes No etc etc for 10000 rows. I need t...
JayA

votes
1

answer
374

Views

### Standard deviation of binned values with `scipy.stats.binned_statistic`

When I bin my data accordingly to scipy.stats.binned_statistic (see here for example), how do I get the error (that is the standard deviation) on the average binned values? For example, if I bin my data as following: windspeed = 8 * np.random.rand(500) boatspeed = .3 * windspeed**.5 + .2 * np.random...
Py-ser

votes
0

answer
4

Views

### What statistical method should I resort to to analyze the generational difference?

So I would like to analyze the value attitude differences between two generations (Young people's generation and their parents' generation). I understand that normally independent t test or Anova can be used for this problem. The thing is, I PAIRED the children with their respective parent(so I coll...
jeff.lian

votes
0

answer
35

Views

### Get the variance of a “list” of values with multiplicities?

I'm using python and I've pulled in numpy/scipy as dependencies. It's OK to pull in more if they're well-tested and so on. Suppose I've got a dataset with a relatively small number of distinct values, each of which has a high multiplicity. I'll represent it as a map (Value -> Multiplicity), say some...
Richard Rast
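
For a dataset stored as a map from value to multiplicity, the variance can be computed from weighted sums without expanding the data. The sketch below is one way to do it in Python with NumPy; the function name `variance_with_multiplicities`, its `ddof` parameter and the sample map are hypothetical.

```python
import numpy as np

def variance_with_multiplicities(counts, ddof=0):
    """Variance of a dataset given as {value: multiplicity} (hypothetical helper)."""
    values = np.array(list(counts.keys()), dtype=float)
    weights = np.array(list(counts.values()), dtype=float)
    total = weights.sum()                           # number of observations
    mean = np.average(values, weights=weights)      # weighted mean
    sum_sq = np.sum(weights * (values - mean) ** 2)
    # ddof=0 gives the population variance, ddof=1 the sample variance
    return sum_sq / (total - ddof)

counts = {1.0: 2, 3.0: 2}   # stands for the dataset [1, 1, 3, 3]
var_population = variance_with_multiplicities(counts)
var_sample = variance_with_multiplicities(counts, ddof=1)
```

The result should agree with `np.var` applied to the expanded data, which is an easy way to check the function on small inputs.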