# Questions tagged [statistics]

5030 questions

1

25

Views

### Simulation of multiple binomial random numbers in R

I have the following algorithm Step 1. Generate X1=x1~Bin(6,1/3) Step 2. Generate X2|X1=x1~Bin(6-x1,(1/3)/(1-1/3)) Step 3. Generate X3|X1=x1,X2=x2~Bin(6-x1-x2,(1/3)/(1-1/3-1/3)) Step 4. Repeat step 1-3 N times. Here is my approach to implement this algorithm in R: mult_binom
Isa
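
The conditional-binomial steps in the excerpt above can be sketched as follows. This is a minimal illustration, not the asker's `mult_binom` (whose body is cut off in the listing); it is written in Python rather than R, and the helper `draw_binomial`, the parameter names, and the seed are assumptions made for the example. Exact fractions keep the last conditional probability at exactly 1.

```python
import random
from fractions import Fraction


def draw_binomial(n, p, rng):
    """Draw one Bin(n, p) variate as a sum of n Bernoulli(p) trials."""
    return sum(1 for _ in range(n) if rng.random() < p)


def mult_binom(n_rep, size=6, probs=(Fraction(1, 3),) * 3, seed=0):
    """Repeat steps 1-3 n_rep times; each row is one draw (x1, x2, x3)."""
    rng = random.Random(seed)  # fixed seed only for reproducibility
    draws = []
    for _ in range(n_rep):
        remaining_size = size          # 6, then 6 - x1, then 6 - x1 - x2
        remaining_prob = Fraction(1)   # 1, then 1 - 1/3, then 1 - 1/3 - 1/3
        row = []
        for p in probs:
            cond_p = float(p / remaining_prob)  # 1/3, then 1/2, then 1
            x = draw_binomial(remaining_size, cond_p, rng)
            row.append(x)
            remaining_size -= x
            remaining_prob -= p
        draws.append(row)
    return draws


samples = mult_binom(n_rep=5)
```

Because the last conditional probability is 1, every row sums to 6, which is a quick sanity check on any implementation of the algorithm.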

1

2.7k

Views

### test statistic for Spearman rank correlation

My problem is that I have calculated Spearman rank correlations between two variables. One reviewer asked whether I could also add test statistics for all coefficients where p < 0.001. Here is one result: > cor.test(pl$data, pl$rang, method= 'spearman') # Spearman's rank correlation rho data: p...
Eco06

2

22

Views

### Is it possible to get the PERCENT_RANK for a single record, but relative to the entire table?

I would like the PERCENT_RANK value for a single record, but in relation to the entire table. Is this possible? Examples I've seen are like this: SELECT Name, Salary, PERCENT_RANK() OVER (ORDER BY Salary) FROM Employees Notice that it's calculating the percentile for the returned recordset. If you're...
Deane
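
As a rough illustration of what "relative to the entire table" means, the sketch below computes a percent rank for one value against all values, using the usual definition (rank - 1) / (rows - 1). It is plain Python, not SQL; the function name and the salary figures are hypothetical and not taken from the question.

```python
def percent_rank(value, all_values):
    """Percent rank of `value` within `all_values`: (rank - 1) / (n - 1).

    rank - 1 equals the number of rows with a strictly smaller value,
    so tied rows share the same percent rank.
    """
    n = len(all_values)
    if n <= 1:
        return 0.0
    smaller = sum(1 for v in all_values if v < value)
    return smaller / (n - 1)


# Hypothetical salary column standing in for the whole Employees table.
salaries = [30000, 40000, 40000, 55000, 70000]
rank_of_55000 = percent_rank(55000, salaries)
```

The point of the sketch is that the denominator comes from the full table, not from whatever subset of rows a filtered query returns.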

1

1.9k

Views

### R ggplot geom_bar() label bars (with 'count')

I have a ggplot like this: ggplot(df,aes(x=DateDiff, fill=TEAM)) + geom_bar() How can I label the bars with the results from the y axis, when there's no y axis defined? (without altering the df)

0

5

Views

### What type of machine learning or AI Model can I use for Factor Ranking

What type of machine learning or AI model can I use for factor ranking? I have some factors and am trying to rank them by how well they predict the outcome in my model. What kind of machine learning, AI, or deep learning model works for this?
tplshams

1

10

Views

### How to remove extra statistical mean results from JSON dictionary in python?

I'm working in Python 3 and trying to determine the mean of measurements in a JSON dictionary of contaminants in a well. When I run the code, it shows the mean of the data for each line. Essentially, I want to find one mean for all results of one contaminant. There are multiple results for the...
ceagle

4

93

Views

### Linear regression with two variables on python

I am developing a code to analyze the relation of two variables. I am using a DataFrame to save the variables in two columns as it follows: column A = 132.54672, 201.3845717, 323.2654551 column B = 51.54671995, 96.38457166, 131.2654551 I have tried to use statsmodels but it says that I do not hav...
Hugo Assis Brandao

1

103

Views

### Julia - describe() function display incomplete summary statistics

I'm trying basic data analysis with Julia. I'm following this tutorial, using the train dataset that can be found here (the one named train_u6lujuX_CVtuZ9i.csv), with the following code: using DataFrames, RDatasets, CSV, StatsBase train = CSV.read('/Path/to/train_u6lujuX_CVtuZ9i.csv'); describe(train[...
ecjb

0

94

Views

### Choosing the right design for nparLD

I have to use nparLD package due to some data distribution and heteroscedasticity reasons but I have some trouble to select the right design (F1-LD-F1 or F2-LD-F1). I have two groups (one for patients and one for controls) and all participants underwent the same MR examination three times (before ex...
N. Szilvia

1

262

Views

### Bootstrap for Confidence Intervals

My problem is as follows: Firstly, I have to create 1000 bootstrap samples of size 100 of a 'theta hat'. I have a random variable X which follows a scaled t_5-distribution. The following code creates 1000 bootstrap samples of theta hat: library('metRology', lib.loc='~/R/win-library/3.4') # Draw some...
Hans Christensen

0

164

Views

### How do I propagate the error of a linear regression when projecting from Y to X?

I'm trying to figure out how to propagate errors in the following case: I am calibrating a machine with a couple of standards (a, b, c) with accepted values x. My machine measures y for these standards, with a certain error (standard deviation of 1 in this example). Then I measure replicates of a sam...
Japhir

0

68

Views

### IHS (inverse hyperbolic sine) option of mhurdle in r does not work

I hope this thread finds you well. I am trying to use Box-Cox double hurdle model and Inverse Hyperbolic Sine (IHS) double hurdle model since the dependent variable of my data is non-normal. The problem I face is that I could not find some appropriate package to run the IHS model. I tried 'mhurdle'...
Jaecheol Lee

0

317

Views

### How to estimate weibull hazard rate function for a feature based on components failure data?

A mechanical component was run continuously till it failed (test-to-failure). We have data of one such experiment. Data Dictionary age --> time in mins. life_per --> age/total time to failure life_status --> failure==1 and 0== non failure (note: we only have 1 record with life_status=1 i.e. the tim...
GeorgeOfTheRF

0

145

Views

### BCa confidence interval using pre-bootstrapped data

I want to calculate a BCa confidence interval using the function boot.ci in the boot package. I do already have bootstrap replicates (not calculated using the boot function provided by the package). To calculate a BCa confidence interval, it should be sufficient to pass the bootstrap replicates tog...
UweM.

0

200

Views

### Can Kruskal-Wallis test be used to test significance of multiple groups within multiple factors?

I have tried to read what I can on Kruskal-Wallis and while I have found some useful information, I still seem to not find the answer to my question. I am trying to use the Kruskal-Wallis test to determine the significance of multiple groups, within multiple factors, in predicting a set of dependen...
user3245629

0

108

Views

### Correlation (pandas) between integer and boolean?

I have my data in the form of: price | bool_qual_1 | bool_qual_2 | bool_qual_3 13000 | True | True | True 20000 | False | True | True 15000 | True | True | False 13000 | False | False | False 15000 | True |...
Zeruno

1

118

Views

### Trying to display NONE when there is no mode in R

I am trying to figure the mode of a data set, while displaying 'NONE' if there is no mode. I am currently using Gregor's function as commented below. examples: {1,1,2,2,3} Expected results 1 2(success) {NA,NA,NA,1,1,1,3,3} Expected results NA 1(success) {1,2,3,4,5} Expected result NONE (success) {...
Jerry Lim
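
The expected results listed in the excerpt can be reproduced with a small sketch like the one below. It is Python rather than R and is not the function the asker refers to; it assumes "no mode" means that no value occurs more than once, and it uses `None` to stand in for R's `NA`.

```python
from collections import Counter


def modes_or_none(values):
    """Return every most-frequent value, or 'NONE' if nothing repeats.

    Assumption: a data set has no mode when its highest count is 1.
    """
    counts = Counter(values)
    if not counts:
        return "NONE"
    top = max(counts.values())
    if top == 1:
        return "NONE"
    # Counter keeps first-seen order, so ties come back in input order.
    return [v for v, c in counts.items() if c == top]


result_two_modes = modes_or_none([1, 1, 2, 2, 3])
result_with_missing = modes_or_none([None, None, None, 1, 1, 1, 3, 3])
result_no_mode = modes_or_none([1, 2, 3, 4, 5])
```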

0

159

Views

### How to perform Welch's Ttest in Spark 2.0.0 using StreamingTest

I want to try Welch's T-test in Spark 2.0.0 As I know I can use StreamingTest() from mllib on this website. [https://spark.apache.org/docs/2.0.0/api/scala/index.html#org.apache.spark.mllib.stat.test.StreamingTest] this is my code in spark-shell import org.apache.spark.mllib.stat.test.{BinarySample,...
Jchanho

0

64

Views

### failure to get the 95%CI of D index using 'survcomp' and 'boot' package

I have used 'survcomp' to calculate the D index of the prediction model. The code was listed as follows: source('http://bioconductor.org/biocLite.R') biocLite('survcomp') library(survcomp) set.seed(12345) age
QW Zhang

1

139

Views

### Error with bspline smoothing for functional data in R with high number of basis functions

I get this error in R and can't solve it: Error in chol.default(temp) : the leading minor of order 445 is not positive definite In addition: Warning message: In smooth.basis1(argvals, y, fdParobj, wtvec = wtvec, fdnames = fdnames, : Matrix of basis function values has rank 799 < dim(fdobj$basis)...
Victoire Louis

0

332

Views

### Multivariable linear regression in JS

I am trying to perform multivariable linear regression with a single dependent variable Y and two independent variables x1, x2. This is simply an OLS regression with an additional independent variable: Y = b0 + b1 x1 + b2 x2 I need to also calculate the coefficient of determination R^2 for this relationsh...
Martin
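
For the model Y = b0 + b1 x1 + b2 x2 described above, one way to sketch the fit is to build the normal equations and solve the resulting 3x3 system. The example is in Python rather than JavaScript, the function names are made up for illustration, and the data are synthetic (generated exactly from y = 1 + 2 x1 + 3 x2), so it shows the mechanics only.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ols_two_predictors(y, x1, x2):
    """Fit y = b0 + b1*x1 + b2*x2 by least squares; return (coeffs, R^2)."""
    rows = [[1.0, u, v] for u, v in zip(x1, x2)]  # design matrix with intercept
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    b0, b1, b2 = solve_linear_system(xtx, xty)
    fitted = [b0 + b1 * u + b2 * v for u, v in zip(x1, x2)]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return (b0, b1, b2), 1.0 - ss_res / ss_tot


# Synthetic data: y = 1 + 2*x1 + 3*x2 with no noise.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0]
y = [9.0, 8.0, 19.0, 18.0, 29.0]
coeffs, r_squared = ols_two_predictors(y, x1, x2)
```

Solving the normal equations directly is fine for a handful of well-scaled predictors; for larger or ill-conditioned problems a QR-based solver from a numerical library would usually be the safer choice.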

1

89

Views

### Linear regression with two independent variables in javascript

The following will output the slope, intercept and correlation coefficient R^2 for a given set of x and y values. let linearRegression = (y,x) => { let lr = {} let n = y.length let sum_x = 0 let sum_y = 0 let sum_xy = 0 let sum_xx = 0 let sum_yy = 0 for (let i = 0; i < y.length; i++) { sum_x += x[i]...
Martin
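
The single-predictor calculation that the excerpt above starts (running sums of x, y, xy, xx and yy) can be sketched in Python as below. This is not a completion of the truncated JavaScript; it only illustrates the standard closed-form expressions, and the sample data are invented.

```python
import math


def linear_regression(y, x):
    """Slope, intercept and R^2 of y on x from the five running sums."""
    n = len(y)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_xx = sum(xi * xi for xi in x)
    sum_yy = sum(yi * yi for yi in y)

    cov_term = n * sum_xy - sum_x * sum_y   # n^2 times the covariance
    var_x_term = n * sum_xx - sum_x ** 2    # n^2 times the variance of x
    var_y_term = n * sum_yy - sum_y ** 2    # n^2 times the variance of y

    slope = cov_term / var_x_term
    intercept = (sum_y - slope * sum_x) / n
    r = cov_term / math.sqrt(var_x_term * var_y_term)
    return slope, intercept, r * r


# Invented data lying exactly on y = 2x + 1.
slope, intercept, r2 = linear_regression([3.0, 5.0, 7.0, 9.0], [1.0, 2.0, 3.0, 4.0])
```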

0

26

Views

### Distinguishing random gnp graphs from preferential attachment graphs using the powerlaw python package

My goal is to find the point where scale-free networks become indistinguishable from random (non-scale-free) networks using the powerlaw python package. As stated in their paper, one should always determine the goodness of a power-law fit by comparing it to the fit to another distribution. I would exp...
David Nathan

0

48

Views

### Latent factor recovery with probabilistic matrix factorization using Edward

I implemented a probabilistic matrix factorization model (R = U'V) following the example in Edward's repo: # data U_true = np.random.randn(D, N) V_true = np.random.randn(D, M) R_true = np.dot(np.transpose(U_true), V_true) + np.random.normal(0, 0.1, size=(N, M)) # model I = tf.placeholder(tf.float32,...
charlesh

0

16

Views

### how can I write a loop in python to get the difference between first and last date for one id

opptyId field oldValue newValue updateTime 0 Stage Qualify 2014-05-27T18:50:14 0 Forecast Best Case 2014-05-27T18:50:14 0 created 2014-05-27T18:50:14 0 Amount 795.53 2014-06-17T18:54:00 0 Stage Qualify Closed - Won 2014-07-09T20:11:05 0 Forecast...
bella
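
A loop of the kind the title asks about might look like the sketch below. It keeps only the `opptyId` and `updateTime` columns, uses three of the timestamps visible in the excerpt, and assumes the timestamps are ISO 8601 strings; the function name is made up for the example.

```python
from datetime import datetime


def span_per_id(records):
    """Map each id to (latest updateTime - earliest updateTime)."""
    first_last = {}
    for oppty_id, stamp in records:
        t = datetime.fromisoformat(stamp)
        earliest, latest = first_last.get(oppty_id, (t, t))
        first_last[oppty_id] = (min(earliest, t), max(latest, t))
    return {key: latest - earliest for key, (earliest, latest) in first_last.items()}


# (opptyId, updateTime) pairs copied from the visible rows of the excerpt.
records = [
    (0, "2014-05-27T18:50:14"),
    (0, "2014-06-17T18:54:00"),
    (0, "2014-07-09T20:11:05"),
]
spans = span_per_id(records)
```

Each value in `spans` is a `datetime.timedelta`, so `spans[0].days` gives the whole-day part of the difference.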

0

12

Views

### How to learn R as a statistical system?

I am more familiar with Python (at a beginner level), which I have found way easier than R thus far. Nevertheless, for professional reasons I want to learn R as well. My main intention is to learn how to use it as a statistical system. I looked in Stackoverflow for some recommendations on learning...
Alejandro Ruiz

0

194

Views

### Using scipy.stats.entropy on gmm.predict_proba() values

Background so I don't throw out an XY problem -- I'm trying to check the goodness of fit of a GMM because I want statistical back-up for why I'm choosing the number of clusters I've chosen to group these samples. I'm checking AIC, BIC, entropy, and root mean squared error. This question is about en...
Manner

0

181

Views

### Pandas vectorize statistical odds-ratio test

I'm looking for a faster way to do odds ratio tests on a large dataset. I have about 1200 variables (see var_col) I want to test against each other for mutual exclusion/co-occurrence. An odds ratio test is defined as (a * d) / (b * c), where a, b, c, d are the numbers of samples with (a) altered in neith...
XyledMonkey
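
For a single pair of variables, the (a * d) / (b * c) ratio can be sketched as below. The definitions of b, c and d are cut off in the excerpt, so the cell layout here (a = altered in neither, b = only the first, c = only the second, d = both) is an assumption, as are the function name and the toy data. It is a plain-Python illustration of the quantity, not the vectorized pandas solution the question asks for.

```python
def odds_ratio(x, y):
    """Odds ratio (a*d)/(b*c) for two equally long sequences of 0/1 flags.

    Assumed cell layout: a = neither altered, b = only x altered,
    c = only y altered, d = both altered.
    """
    a = sum(1 for xi, yi in zip(x, y) if not xi and not yi)
    b = sum(1 for xi, yi in zip(x, y) if xi and not yi)
    c = sum(1 for xi, yi in zip(x, y) if not xi and yi)
    d = sum(1 for xi, yi in zip(x, y) if xi and yi)
    if b * c == 0:
        # Undefined when a*d is also 0; otherwise the ratio is unbounded.
        return float("inf") if a * d > 0 else float("nan")
    return (a * d) / (b * c)


# Toy data: 8 samples, two binary variables.
gene_1 = [1, 1, 1, 0, 0, 0, 1, 0]
gene_2 = [1, 1, 0, 0, 0, 1, 1, 0]
ratio = odds_ratio(gene_1, gene_2)
```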

1

401

Views

### How can I make all-possible-regressions in R also include exponents and logs of the variables?

I primarily work in research and statistics and so am not as familiar with programming. I'm using the OLSRR package for statistical analysis when trying to compare as many model specifications as possible using all-possible-regressions. I use the code: model
Zayaan

1

897

Views

### Confidence interval in python

Is there a function or package in Python to get the 95% or 99% confidence interval of the distributions present in scipy.stats? If not, what is the easiest alternative? Here are the distributions (but my priority is halfgennorm and loglaplace): st.alpha,st.anglit,st.arcsine,st.beta,st.betaprime...
Will_smith12

1

109

Views

### NA value in portfolio sorts

I am currently doing portfolio sorts on panel data meaning every month I form 5 portfolios based on the volatility of stocks. I have the following function: arguments are x: a vector of returns P: the number of portfolios we want sortPort
user9259005

0

167

Views

### Is this right way to generate the following heavy-tailed distribution?

I'm trying to generate an error distribution which follows a heavy-tailed distribution based on the statement : (2) A heavy-tailed distribution, i.e., t-df=2, a t-distribution with 2 degrees of freedom. This distribution has C95=3.61, where C95 is a measure of the distribution tail weight and defin...
Ryan

0

248

Views

### Multivariate linear mixed effects model in Python

I am playing around with this code which is for Univariate linear mixed effects modelling. The data set denotes: students as s instructors as d departments as dept service as service In the syntax of R's lme4 package (Bates et al., 2015), the model implemented can be summarized as: y ~ 1 + (1|stude...
laza

0

229

Views

### Fair quantiles with lots of same values?

I'm doing RFM analysis and want to divide customers into 5 ranks, which works fine. The problem arises when I apply the quantile function with 5 groups to the frequency part, to label buyers 'great customer' if they buy often (rank 5), 'good' if they buy somewhat less (rank 4), and so on: the fact that a large number of buyers...
J_p

1

298

Views

### Bootstrap Confidence Interval for Prediction

I would like to calculate a confidence interval for the RMSE of a machine learning regression in the out-of-sample test set predictions. My train set is the first 80% of the sample, and the 'out-of-sample' test set is the last 20% of the sample. I treat the RMSE of the test set predictions as the o...
Ben Smith
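
One common way to attach an interval to a test-set RMSE is a percentile bootstrap over the test-set errors, sketched below. This is not necessarily the approach the asker ended up with; it assumes the errors can be treated as independent draws (which may not hold for time-ordered data), and the error values, function names and seed are invented for the example.

```python
import math
import random


def rmse(errors):
    """Root mean squared error of a list of prediction errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def bootstrap_rmse_ci(errors, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the RMSE of `errors`."""
    rng = random.Random(seed)  # fixed seed only for reproducibility
    n = len(errors)
    # Resample the errors with replacement and recompute the RMSE each time.
    stats = sorted(rmse(rng.choices(errors, k=n)) for _ in range(n_boot))
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Invented test-set errors (prediction minus actual).
test_errors = [0.5, -1.2, 0.3, 2.1, -0.7, 0.9, -1.5, 0.2, 1.1, -0.4]
ci_lower, ci_upper = bootstrap_rmse_ci(test_errors)
```

Only the test-set errors are resampled here; the model is not refit, so the interval reflects sampling variability of the test set given a fixed model.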

0

89

Views

### BTYD pnbd.EstimateParameters(): cal.cbs must have a frequency column labelled “x”

I am working on building a CLV model. I have faced an error, but answers to sharp's question helped in resolving it. Now, when I try and estimate the parameters using params
Akshat Bajaj

2

92

Views

### knn algorithm output one result per test data

I used knn to do a basic predictive model build. After I run: predictions = knn(train,test,cl,k=3) and then output predictions, the R console has a dozen results per row. Such as: [1] Yes No Yes Yes Yes No Yes Yes Yes No No [12] No Yes Yes Yes No No No Yes Yes Yes No etc etc for 10000 rows. I need t...
JayA

1

374

Views

### Standard deviation of binned values with `scipy.stats.binned_statistic`

When I bin my data accordingly to scipy.stats.binned_statistic (see here for example), how do I get the error (that is the standard deviation) on the average binned values? For example, if I bin my data as following: windspeed = 8 * np.random.rand(500) boatspeed = .3 * windspeed**.5 + .2 * np.random...
Py-ser

0

4

Views

### What statistical method should I resort to to analyze the generational difference?

So I would like to analyze the value attitude differences between two generations (Young people's generation and their parents' generation). I understand that normally independent t test or Anova can be used for this problem. The thing is, I PAIRED the children with their respective parent(so I coll...
jeff.lian

0

35

Views

### Get the variance of a “list” of values with multiplicities?

I'm using python and I've pulled in numpy/scipy as dependencies. It's OK to pull in more if they're well-tested and so on. Suppose I've got a dataset with a relatively small number of distinct values, each of which has a high multiplicity. I'll represent it as a map (Value -> Multiplicity), say some...
Richard Rast
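
With the data held as a map from value to multiplicity, the variance can be computed directly from the counts without expanding the list. The sketch below uses only the standard library; the function name, the `ddof` parameter and the example map are assumptions for illustration.

```python
def variance_from_counts(counts, ddof=0):
    """Variance of a data set given as {value: multiplicity}.

    ddof=0 gives the population variance, ddof=1 the sample variance.
    """
    n = sum(counts.values())
    mean = sum(value * mult for value, mult in counts.items()) / n
    sum_sq = sum(mult * (value - mean) ** 2 for value, mult in counts.items())
    return sum_sq / (n - ddof)


# Hypothetical map: the value 1.0 occurs twice and 3.0 occurs twice.
example_counts = {1.0: 2, 3.0: 2}
population_variance = variance_from_counts(example_counts)
sample_variance = variance_from_counts(example_counts, ddof=1)
```

The cost depends on the number of distinct values rather than on the total count, which is the advantage when multiplicities are high.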