# Questions tagged [statistics]

5030 questions

1

votes

1

answer

25

Views

### Simulation of multiple binomial random numbers in R

I have the following algorithm
Step 1. Generate X1=x1~Bin(6,1/3)
Step 2. Generate X2|X1=x1~Bin(6-x1,(1/3)/(1-1/3))
Step 3. Generate X3|X1=x1,X2=x2~Bin(6-x1-x2,(1/3)/(1-1/3-1/3))
Step 4. Repeat step 1-3 N times.
Here is my approach to implement this algorithm in R:
mult_binom
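
The four steps above draw a multinomial vector with 6 trials and equal cell probabilities, one coordinate at a time. A minimal Python sketch of that scheme follows (the asker works in R). The names `binomial` and `sequential_multinomial` and the use of the standard `random` module are assumptions for illustration, not the asker's `mult_binom`.

```python
import random

def binomial(n, p, rng):
    """Draw one Bin(n, p) variate as a sum of n Bernoulli(p) trials."""
    return sum(rng.random() < p for _ in range(n))

def sequential_multinomial(n, probs, n_draws, seed=0):
    """Repeat the conditional-binomial steps n_draws times.

    Coordinate i is Bin(remaining trials, probs[i] / remaining probability).
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        remaining_n, remaining_p, row = n, 1.0, []
        for p in probs[:-1]:
            x = binomial(remaining_n, min(1.0, p / remaining_p), rng)
            row.append(x)
            remaining_n -= x
            remaining_p -= p
        # Step 3 has success probability 1, so the last cell takes what is left.
        row.append(remaining_n)
        draws.append(row)
    return draws

samples = sequential_multinomial(6, [1 / 3, 1 / 3, 1 / 3], n_draws=1000)
```

Every row of `samples` sums to 6 by construction, which is a quick sanity check for any implementation of the algorithm.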

1

votes

1

answer

2.7k

Views

### test statistic for Spearman rank correlation

My problem is that I have calculated Spearman rank correlations between two variables. One reviewer asked whether I could also add test statistics for all coefficients where p < 0.001.
Here is one result:
> cor.test(pl$data, pl$rang, method= 'spearman')
# Spearman's rank correlation rho
data: p...
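
For context, the statistic R's `cor.test` prints for `method = 'spearman'` is S, the sum of squared rank differences, and a t approximation with n - 2 degrees of freedom is also commonly reported. A hedged Python sketch of both, assuming plain lists as input; the helper names and the toy data are made up and are not the asker's `pl` data.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_statistics(x, y):
    """Return rho, the S statistic, and the t approximation (n - 2 df)."""
    n = len(x)
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    rho = cov / math.sqrt(sxx * syy)
    s = (n ** 3 - n) * (1 - rho) / 6
    if abs(rho) < 1:
        t = rho * math.sqrt((n - 2) / (1 - rho ** 2))
    else:
        t = math.copysign(math.inf, rho)
    return rho, s, t

rho, s, t = spearman_statistics([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
```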

1

votes

2

answer

22

Views

### Is it possible to get the PERCENT_RANK for a single record, but relative to the entire table?

I would like the PERCENT_RANK value for a single record, but in relation to the entire table. Is this possible?
Examples I've seen are like this:
SELECT Name, Salary,
PERCENT_RANK() OVER (ORDER BY Salary)
FROM Employees
Notice that it's calculating the percentile for the returned recordset. If you're...
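
The usual idea is to compute the rank over the whole table first and only then pick out the one record. As a language-neutral illustration, here is a small Python sketch of the PERCENT_RANK definition, (rank - 1) / (rows - 1), applied to one value against all values. The names and salaries are made up.

```python
def percent_rank(all_values, target):
    """PERCENT_RANK of target relative to all_values: (rank - 1) / (rows - 1)."""
    n = len(all_values)
    if n <= 1:
        return 0.0
    # RANK() is 1 plus the number of rows that sort strictly before the target.
    rank = 1 + sum(v < target for v in all_values)
    return (rank - 1) / (n - 1)

# Hypothetical table contents, for illustration only.
salaries = {"Ann": 100, "Bob": 200, "Cy": 200, "Dee": 300, "Eve": 400}
bob = percent_rank(list(salaries.values()), salaries["Bob"])
```

In SQL the same effect is typically obtained by putting the window function in a subquery or CTE over the full table and filtering on the outer query.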

1

votes

1

answer

1.9k

Views

### R ggplot geom_bar() label bars (with 'count')

I have a ggplot like this:
ggplot(df,aes(x=DateDiff, fill=TEAM)) + geom_bar()
How can I label the bars with the results from the y axis, when there's no y axis defined? (without altering the df)

-1

votes

0

answer

5

Views

### What type of machine learning or AI Model can I use for Factor Ranking

What type of machine learning or AI Model can I use for Factor Ranking?
I have some factors and am trying to rank them by how well they predict the outcome in my model. What kind of machine learning, AI, or deep learning model would work for this?

2

votes

1

answer

10

Views

### How to remove extra statistical mean results from JSON dictionary in python?

I'm working in Python 3 and trying to determine the mean of measurements in a JSON dictionary of contaminants in a well. When I run the code, it shows the mean of the data for each line. Essentially, I want to find one mean for all results of one contaminant. There are multiple results for the...
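
One way to get a single mean per contaminant is to collect all results under their contaminant name first and average afterwards. A minimal sketch; the JSON layout and the keys `contaminant` and `result` are assumptions, since the asker's data is not shown.

```python
import json
from collections import defaultdict
from statistics import mean

# Hypothetical records; the real file layout is not shown in the question.
raw = """[
  {"contaminant": "arsenic", "result": 4.0},
  {"contaminant": "lead", "result": 1.5},
  {"contaminant": "arsenic", "result": 6.0},
  {"contaminant": "lead", "result": 2.5}
]"""

def mean_by_contaminant(records):
    """Group results by contaminant, then take one mean per group."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["contaminant"]].append(float(rec["result"]))
    return {name: mean(values) for name, values in grouped.items()}

means = mean_by_contaminant(json.loads(raw))
```

Computing the mean inside the loop over records is what produces one value per line; moving it after the grouping step avoids that.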

1

votes

4

answer

93

Views

### Linear regression with two variables on python

I am writing code to analyze the relationship between two variables. I am using a DataFrame to store the variables in two columns, as follows:
column A = 132.54672, 201.3845717, 323.2654551
column B = 51.54671995, 96.38457166, 131.2654551
I have tried to use statsmodels but it says that I do not hav...
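
For a single predictor, the least-squares slope and intercept have closed forms and need no extra packages. A small sketch using the three values quoted above, assuming column B is regressed on column A (the question does not say which is the dependent variable).

```python
def simple_ols(x, y):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

col_a = [132.54672, 201.3845717, 323.2654551]
col_b = [51.54671995, 96.38457166, 131.2654551]
slope, intercept = simple_ols(col_a, col_b)

# Residuals of a least-squares fit with an intercept sum to zero.
residuals = [b - (intercept + slope * a) for a, b in zip(col_a, col_b)]
```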

1

votes

1

answer

103

Views

### Julia - describe() function display incomplete summary statistics

I'm trying basic data analysis with Julia
I'm following this tutorial with the train datasets that can be found here (the one named train_u6lujuX_CVtuZ9i.csv) with the following code:
using DataFrames, RDatasets, CSV, StatsBase
train = CSV.read("/Path/to/train_u6lujuX_CVtuZ9i.csv");
describe(train[...

1

votes

0

answer

94

Views

### Choosing the right design for nparLD

I have to use the nparLD package because of the data distribution and heteroscedasticity, but I am having trouble selecting the right design (F1-LD-F1 or F2-LD-F1).
I have two groups (one for patients and one for controls) and all participants underwent the same MR examination three times (before ex...

1

votes

1

answer

262

Views

### Bootstrap for Confidence Intervals

My problem is as follows:
Firstly, I have to create 1000 bootstrap samples of size 100 of a 'theta hat'. I have a random variable X which follows a scaled t_5-distribution. The following code creates 1000 bootstrap samples of theta hat:
library('metRology', lib.loc='~/R/win-library/3.4')
# Draw some...

1

votes

0

answer

164

Views

### How do I propagate the error of a linear regression when projecting from Y to X?

I'm trying to figure out how to propagate errors in the following case
I am calibrating a machine with a couple of standards (a, b, c) with
accepted values x. My machine measures y for these standards, with a
certain error (standard deviation of 1 in this example).
Then I measure replicates of a sam...

1

votes

0

answer

68

Views

### IHS (inverse hyperbolic sine) option of mhurdle in R does not work

I hope this thread finds you well.
I am trying to use a Box-Cox double hurdle model and an Inverse Hyperbolic Sine (IHS) double hurdle model, since the dependent variable in my data is non-normal. The problem is that I could not find an appropriate package to run the IHS model.
I tried 'mhurdle'...

1

votes

0

answer

317

Views

### How to estimate Weibull hazard rate function for a feature based on components failure data?

A mechanical component was run continuously till it failed (test-to-failure). We have data of one such experiment.
Data Dictionary
age --> time in mins.
life_per --> age/total time to failure
life_status --> failure==1 and non-failure==0 (note: we only have 1
record with life_status=1 i.e. the tim...

1

votes

0

answer

145

Views

### BCa confidence interval using pre-bootstrapped data

I want to calculate a BCa confidence interval using the function boot.ci in the boot package. I do already have bootstrap replicates (not calculated using the boot function provided by the package).
To calculate a BCa confidence interval, it should be sufficient to pass the bootstrap replicates tog...

1

votes

0

answer

200

Views

### Can Kruskal-Wallis test be used to test significance of multiple groups within multiple factors?

I have tried to read what I can on Kruskal-Wallis and while I have found some useful information, I still seem to not find the answer to my question.
I am trying to use the Kruskal-Wallis test to determine the significance of multiple groups, within multiple factors, in predicting a set of dependen...

1

votes

0

answer

108

Views

### Correlation (pandas) between integer and boolean?

I have my data in the form of:
price | bool_qual_1 | bool_qual_2 | bool_qual_3
13000 | True | True | True
20000 | False | True | True
15000 | True | True | False
13000 | False | False | False
15000 | True |...
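
Treating the booleans as 0/1 and taking an ordinary Pearson correlation gives the point-biserial correlation, which is one common reading of this question. A minimal sketch on the four complete rows shown above (the fifth row is truncated and left out); the helper name `pearson` is an assumption.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(sxx * syy)

price = [13000, 20000, 15000, 13000]
bool_qual_1 = [True, False, True, False]

# Booleans become 0/1 before correlating (point-biserial correlation).
r = pearson(price, [int(flag) for flag in bool_qual_1])
```

With only four rows the value is purely illustrative; the same call works column by column on the full data.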

1

votes

1

answer

118

Views

### Trying to display NONE when there is no mode in R

I am trying to figure the mode of a data set, while displaying 'NONE' if there is no mode. I am currently using Gregor's function as commented below.
examples:
{1,1,2,2,3}
Expected results 1 2(success)
{NA,NA,NA,1,1,1,3,3}
Expected results NA 1(success)
{1,2,3,4,5}
Expected result NONE (success)
{...
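
The three expected results above can be reproduced with a frequency count, under the assumption that "no mode" means no value occurs more than once. This is a Python sketch, not the R function attributed to Gregor in the question; `None` stands in for `NA`.

```python
from collections import Counter

def modes_or_none(values):
    """Return every most-frequent value, or 'NONE' if nothing repeats."""
    if not values:
        return "NONE"
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:  # assumption: all-distinct data has no mode
        return "NONE"
    # Counter keeps first-seen order, so ties come out in input order.
    return [value for value, count in counts.items() if count == top]

example_1 = modes_or_none([1, 1, 2, 2, 3])
example_2 = modes_or_none([None, None, None, 1, 1, 1, 3, 3])
example_3 = modes_or_none([1, 2, 3, 4, 5])
```

How to treat a data set where every value repeats equally often (for example {1,1,2,2}) is a design choice the excerpt does not settle; this sketch reports both values.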

1

votes

0

answer

159

Views

### How to perform Welch's t-test in Spark 2.0.0 using StreamingTest

I want to try Welch's t-test in Spark 2.0.0.
As far as I know, I can use StreamingTest() from mllib,
which is documented here:
[https://spark.apache.org/docs/2.0.0/api/scala/index.html#org.apache.spark.mllib.stat.test.StreamingTest]
this is my code in spark-shell
import org.apache.spark.mllib.stat.test.{BinarySample,...
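
To check the numbers coming out of any implementation, it helps to have the Welch statistic computed by hand. A small Python sketch of the t statistic and the Welch-Satterthwaite degrees of freedom; the two samples are made up and this is not Spark code.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

group_a = [1, 2, 3, 4, 5]    # made-up sample
group_b = [2, 4, 6, 8, 10]   # made-up sample
t_stat, dof = welch_t(group_a, group_b)
```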

1

votes

0

answer

64

Views

### Failure to get the 95% CI of the D index using the 'survcomp' and 'boot' packages

I have used 'survcomp' to calculate the D index of the prediction model. The code was listed as follows:
source('http://bioconductor.org/biocLite.R')
biocLite('survcomp')
library(survcomp)
set.seed(12345)
age

1

votes

1

answer

139

Views

### Error with bspline smoothing for functional data in R with high number of basis functions

I get this error in R and can't solve it:
Error in chol.default(temp) :
the leading minor of order 445 is not positive definite
In addition: Warning message:
In smooth.basis1(argvals, y, fdParobj, wtvec = wtvec, fdnames = fdnames, :
Matrix of basis function values has rank 799 < dim(fdobj$basis)...

1

votes

0

answer

332

Views

### Multivariable linear regression in JS

I am trying to perform multivariable linear regression with a single dependent variable Y and two independent variables x1, x2.
This is simply an OLS regression with an additional independent variable:
Y = b0 + b1 x1 + b2 x2
I also need to calculate the coefficient of determination R^2 for this relationsh...
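
The model Y = b0 + b1 x1 + b2 x2 can be fitted by solving the 3x3 normal equations, and R^2 follows from the residual and total sums of squares. A dependency-free sketch in Python rather than JavaScript; the function names and the data are illustrative assumptions.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def ols_two_predictors(y, x1, x2):
    """Return (b0, b1, b2) and R^2 for y = b0 + b1*x1 + b2*x2."""
    rows = [[1.0, a, b] for a, b in zip(x1, x2)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(3)]
    b0, b1, b2 = solve(xtx, xty)
    fitted = [b0 + b1 * a + b2 * b for a, b in zip(x1, x2)]
    mean_y = sum(y) / len(y)
    ss_res = sum((v - f) ** 2 for v, f in zip(y, fitted))
    ss_tot = sum((v - mean_y) ** 2 for v in y)
    return (b0, b1, b2), 1 - ss_res / ss_tot

# Made-up data generated exactly from y = 1 + 2*x1 + 3*x2.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0]
y = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]
coefs, r_squared = ols_two_predictors(y, x1, x2)
```

The normal equations are fine for two or three well-scaled predictors; for larger or nearly collinear designs a QR-based solver is the safer choice.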

1

votes

1

answer

89

Views

### Linear regression with two independent variables in javascript

The following will output the slope, intercept and coefficient of determination R^2 for a given set of x and y values.
let linearRegression = (y,x) => {
let lr = {}
let n = y.length
let sum_x = 0
let sum_y = 0
let sum_xy = 0
let sum_xx = 0
let sum_yy = 0
for (let i = 0; i < y.length; i++) {
sum_x += x[i]...

1

votes

0

answer

26

Views

### Distinguishing random gnp graphs from preferential attachment graphs using the powerlaw python package

My goal is to find the point where scale-free networks become indistinguishable from random (non-scale-free) networks using the powerlaw python package.
As stated in their paper one should determine the goodness of a power-law fit always by comparing it to the fit to another distribution.
I would exp...

1

votes

0

answer

48

Views

### Latent factor recovery with probabilistic matrix factorization using Edward

I implemented a probabilistic matrix factorization model (R = U'V) following the example in Edward's repo:
# data
U_true = np.random.randn(D, N)
V_true = np.random.randn(D, M)
R_true = np.dot(np.transpose(U_true), V_true) + np.random.normal(0, 0.1, size=(N, M))
# model
I = tf.placeholder(tf.float32,...

0

votes

0

answer

16

Views

### How can I write a loop in Python to get the difference between the first and last date for one id?

opptyId field oldValue newValue updateTime
0 Stage Qualify 2014-05-27T18:50:14
0 Forecast Best Case 2014-05-27T18:50:14
0 created 2014-05-27T18:50:14
0 Amount 795.53 2014-06-17T18:54:00
0 Stage Qualify Closed - Won 2014-07-09T20:11:05
0 Forecast...
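
A loop that tracks the earliest and latest timestamp per id is enough here. A minimal sketch using only the `opptyId` and `updateTime` values visible in the excerpt above; the tuple layout and the function name are assumptions.

```python
from datetime import datetime

# (opptyId, updateTime) pairs taken from the rows shown in the excerpt.
rows = [
    (0, "2014-05-27T18:50:14"),
    (0, "2014-06-17T18:54:00"),
    (0, "2014-07-09T20:11:05"),
]

def span_per_id(rows):
    """Return {id: last update time minus first update time}."""
    first_last = {}
    for oppty_id, stamp in rows:
        t = datetime.fromisoformat(stamp)
        lo, hi = first_last.get(oppty_id, (t, t))
        first_last[oppty_id] = (min(lo, t), max(hi, t))
    return {key: hi - lo for key, (lo, hi) in first_last.items()}

spans = span_per_id(rows)
```

Each value is a `timedelta`, so `spans[0].days` gives whole days. With pandas, a group-by on the id column with min and max of the parsed timestamps does the same job.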

-1

votes

0

answer

12

Views

### How to learn R as a statistical system?

I am more familiar with Python (at a beginner level), which I have found way easier than R thus far. Nevertheless, for professional reasons I want to learn R as well. My main intention is to learn how to use it as a statistical system.
I looked on Stack Overflow for some recommendations on learning...

1

votes

0

answer

194

Views

### Using scipy.stats.entropy on gmm.predict_proba() values

Background so I don't throw out an XY problem -- I'm trying to check the goodness of fit of a GMM because I want statistical back-up for why I'm choosing the number of clusters I've chosen to group these samples. I'm checking AIC, BIC, entropy, and root mean squared error. This question is about en...
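
For reference, the Shannon entropy of one sample's membership probabilities is -sum(p log p), computed per row of the `predict_proba` output. A plain-Python sketch with made-up probabilities; note that `scipy.stats.entropy` works along `axis=0` by default, so a (samples x components) array usually needs `axis=1` to get one value per sample.

```python
import math

def row_entropy(probs):
    """Shannon entropy (natural log) of one probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_entropy(proba_rows):
    """Average per-sample entropy over all rows."""
    return sum(row_entropy(row) for row in proba_rows) / len(proba_rows)

# Made-up membership probabilities for three samples and two components.
proba = [[0.5, 0.5], [1.0, 0.0], [0.9, 0.1]]
per_sample = [row_entropy(row) for row in proba]
average = mean_entropy(proba)
```

A confidently assigned sample has entropy near 0, while an evenly split one reaches log(k) for k components.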

1

votes

0

answer

181

Views

### Pandas vectorize statistical odds-ratio test

I'm looking for a faster way to do odds ratio tests on a large dataset. I have about 1200 variables (see var_col) that I want to test against each other for mutual exclusion/co-occurrence. An odds ratio test is defined as (a * d) / (b * c), where a, b, c, d are the numbers of samples with (a) altered in neith...
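
As a baseline to vectorize against, here is the odds ratio for one pair of variables in plain Python. The definitions of b, c, and d are cut off in the excerpt, so the usual 2x2 layout is assumed: a = altered in neither, b = altered in the first only, c = altered in the second only, d = altered in both.

```python
def odds_ratio(x, y):
    """Odds ratio (a * d) / (b * c) for two equal-length boolean sequences."""
    pairs = list(zip(x, y))
    a = sum(1 for p, q in pairs if not p and not q)  # altered in neither
    b = sum(1 for p, q in pairs if p and not q)      # first only (assumed)
    c = sum(1 for p, q in pairs if not p and q)      # second only (assumed)
    d = sum(1 for p, q in pairs if p and q)          # both (assumed)
    if b * c == 0:
        # Undefined ratio: report infinity, or NaN when the numerator is 0 too.
        return float("inf") if a * d > 0 else float("nan")
    return (a * d) / (b * c)

# Made-up alteration flags for six samples.
var_1 = [True, True, False, False, True, False]
var_2 = [True, False, False, True, True, False]
ratio = odds_ratio(var_1, var_2)
```

For all pairs at once, the four counts can be obtained from matrix products of the 0/1 data matrix and its complement, which is where the speed-up over a pairwise loop comes from.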

1

votes

1

answer

401

Views

### How can I make all-possible-regressions in R also include exponents and logs of the variables?

I primarily work in research and statistics and so am not as familiar with programming. I'm using the OLSRR package for statistical analysis when trying to compare as many model specifications as possible using all-possible-regressions.
I use the code:
model

1

votes

1

answer

897

Views

### Confidence interval in python

Is there a function or a package in python to get the 95% or 99% confidence interval of the distributions present in scipy.stats?
If not, what is the easiest alternative?
Here are the distributions (but my priority is halfgennorm and loglaplace)
st.alpha,st.anglit,st.arcsine,st.beta,st.betaprime...

1

votes

1

answer

109

Views

### NA value in portfolio sorts

I am currently doing portfolio sorts on panel data, meaning that every month I form 5 portfolios based on the volatility of stocks. I have the following function:
arguments are
x: a vector of returns
P: the number of portfolios we want
sortPort

1

votes

0

answer

167

Views

### Is this the right way to generate the following heavy-tailed distribution?

I'm trying to generate errors that follow a heavy-tailed distribution, based on this statement: (2) A heavy-tailed distribution, i.e., t-df=2, a t-distribution with 2 degrees of freedom. This distribution has C95=3.61, where C95 is a measure of the distribution tail weight and defin...

1

votes

0

answer

248

Views

### Multivariate linear mixed effects model in Python

I am playing around with this code, which is for univariate linear mixed effects modelling. The data set denotes:
students as s
instructors as d
departments as dept
service as service
In the syntax of R's lme4 package (Bates et al., 2015), the model implemented can be summarized as:
y ~ 1 + (1|stude...

1

votes

0

answer

229

Views

### Fair quantiles with lots of same values?

I'm doing RFM analysis and want to divide customers into 5 ranks, which works fine so far. The problem is when I apply the quantile function with 5 groups to the frequency part, to label buyers as 'great customer' if they buy often (rank 5), 'good' if they buy somewhat less (rank 4) and so on, the fact that large number of buyers...

1

votes

1

answer

298

Views

### Bootstrap Confidence Interval for Prediction

I would like to calculate a confidence interval for the RMSE of a machine learning regression in the out-of-sample test set predictions.
My train set is the first 80% of the sample, and the 'out-of-sample' test set is the last 20% of the sample. I treat the RMSE of the test set predictions as the o...
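
One common approach is a percentile bootstrap: resample the test-set (actual, predicted) pairs with replacement, recompute the RMSE each time, and read off the quantiles. A minimal sketch with made-up data; it assumes the test observations can be treated as independent, which may not hold if the 80/20 split is chronological (a block bootstrap would then be the safer variant).

```python
import math
import random

def rmse(y_true, y_pred):
    """Root mean squared error of paired actual and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def bootstrap_rmse_ci(y_true, y_pred, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the test-set RMSE."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample pairs
        stats.append(rmse([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lo = stats[int((1 - level) / 2 * n_boot)]
    hi = stats[min(n_boot - 1, int((1 + level) / 2 * n_boot))]
    return lo, hi

# Made-up test set: predictions equal the truth plus Gaussian noise.
data_rng = random.Random(1)
actual = [data_rng.uniform(0, 10) for _ in range(200)]
predicted = [a + data_rng.gauss(0, 1) for a in actual]
point_estimate = rmse(actual, predicted)
ci_low, ci_high = bootstrap_rmse_ci(actual, predicted)
```

The model is not refitted here, so the interval reflects sampling variability of the test set only, not of the training procedure.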

1

votes

0

answer

89

Views

### BTYD pnbd.EstimateParameters(): cal.cbs must have a frequency column labelled “x”

I am working on building a CLV model. I have faced an error, but answers to sharp's question helped in resolving it. Now, when I try and estimate the parameters using
params

1

votes

2

answer

92

Views

### knn algorithm output one result per test data

I used knn to do a basic predictive model build.
After I run:
predictions = knn(train,test,cl,k=3)
and then output predictions, the R console has a dozen results per row.
Such as:
[1] Yes No Yes Yes Yes No Yes Yes Yes No No
[12] No Yes Yes Yes No No No Yes Yes Yes No
etc etc for 10000 rows.
I need t...

1

votes

1

answer

374

Views

### Standard deviation of binned values with `scipy.stats.binned_statistic`

When I bin my data according to scipy.stats.binned_statistic (see here for example), how do I get the error (that is the standard deviation) on the average binned values?
For example, if I bin my data as follows:
windspeed = 8 * np.random.rand(500)
boatspeed = .3 * windspeed**.5 + .2 * np.random...
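
To make the quantities explicit, here is a plain-Python sketch that returns the count, mean, sample standard deviation, and standard error of the mean for each bin. `binned_statistic` itself accepts `statistic='std'`, so this is only meant to show what is being computed. The noise term of `boatspeed` is truncated in the excerpt, so uniform noise is used as a stand-in, and the bin range 0 to 8 is an assumption.

```python
import math
import random

def binned_mean_std(x, y, n_bins, x_min, x_max):
    """Per-bin count, mean, sample std, and standard error of the mean of y."""
    width = (x_max - x_min) / n_bins
    bins = [[] for _ in range(n_bins)]
    for xi, yi in zip(x, y):
        if not x_min <= xi <= x_max:
            continue  # ignore points outside the binning range
        k = min(int((xi - x_min) / width), n_bins - 1)  # right edge: last bin
        bins[k].append(yi)
    out = []
    for vals in bins:
        n = len(vals)
        mean = sum(vals) / n if n else float("nan")
        if n > 1:
            sd = math.sqrt(sum((v - mean) ** 2 for v in vals) / (n - 1))
            sem = sd / math.sqrt(n)
        else:
            sd = sem = float("nan")
        out.append({"n": n, "mean": mean, "std": sd, "sem": sem})
    return out

rng = random.Random(0)
windspeed = [8 * rng.random() for _ in range(500)]
# Stand-in noise: the original noise term is cut off in the excerpt.
boatspeed = [0.3 * w ** 0.5 + 0.2 * rng.random() for w in windspeed]
stats = binned_mean_std(windspeed, boatspeed, n_bins=4, x_min=0.0, x_max=8.0)
```

The standard deviation describes the spread within a bin, while the standard error (std divided by the square root of the count) is the uncertainty on the bin average.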

0

votes

0

answer

4

Views

### What statistical method should I resort to to analyze the generational difference?

I would like to analyze the differences in value attitudes between two generations (young people and their parents). I understand that normally an independent t-test or ANOVA can be used for this problem. The thing is, I PAIRED the children with their respective parent (so I coll...

1

votes

0

answer

35

Views

### Get the variance of a “list” of values with multiplicities?

I'm using python and I've pulled in numpy/scipy as dependencies. It's OK to pull in more if they're well-tested and so on.
Suppose I've got a dataset with a relatively small number of distinct values, each of which has a high multiplicity. I'll represent it as a map (Value -> Multiplicity), say some...
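
With a value-to-multiplicity map, the mean and variance can be computed directly from the counts without expanding the list. A minimal sketch; the function name, the `ddof` argument, and the example map are assumptions for illustration.

```python
def mean_and_variance(counts, ddof=0):
    """Mean and variance of a dataset given as {value: multiplicity}.

    ddof=0 gives the population variance, ddof=1 the sample variance.
    """
    n = sum(counts.values())
    mean = sum(value * count for value, count in counts.items()) / n
    ss = sum(count * (value - mean) ** 2 for value, count in counts.items())
    return mean, ss / (n - ddof)

# Made-up map: the values 1.0 and 3.0, each occurring twice.
m, var = mean_and_variance({1.0: 2, 3.0: 2})
```

In numpy the same weighted mean is available through `np.average(values, weights=counts)`, and `np.cov` takes integer multiplicities via its `fweights` argument.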