Questions tagged [statistics]

1

votes
1

answer
25

Views

Simulation of multiple binomial random numbers in R

I have the following algorithm:
Step 1. Generate X1=x1~Bin(6,1/3)
Step 2. Generate X2|X1=x1~Bin(6-x1,(1/3)/(1-1/3))
Step 3. Generate X3|X1=x1,X2=x2~Bin(6-x1-x2,(1/3)/(1-1/3-1/3))
Step 4. Repeat steps 1-3 N times.
Here is my approach to implement this algorithm in R: mult_binom
Isa
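
The four steps in this question amount to drawing a multinomial vector one coordinate at a time. A minimal Python sketch of that idea follows, assuming equal cell probabilities of 1/3 and n = 6 as in the excerpt. The function names (`binomial`, `conditional_binomial_draw`, `simulate`) are hypothetical and are not taken from the asker's R code.

```python
import random


def binomial(n, p, rng):
    """Draw one Bin(n, p) variate as a sum of n Bernoulli(p) trials."""
    return sum(rng.random() < p for _ in range(n))


def conditional_binomial_draw(rng, n=6, probs=(1 / 3, 1 / 3, 1 / 3)):
    """One draw of (X1, X2, X3) using the sequential conditional binomials."""
    remaining_n = n      # trials not yet assigned to a category
    remaining_p = 1.0    # probability mass not yet used
    draw = []
    for i, p in enumerate(probs):
        if i == len(probs) - 1:
            # In exact arithmetic the last conditional probability is 1,
            # so the last count is simply whatever is left.
            cond_p = 1.0
        else:
            cond_p = p / remaining_p
        x = binomial(remaining_n, cond_p, rng)
        draw.append(x)
        remaining_n -= x
        remaining_p -= p
    return draw


def simulate(num_draws, seed=None):
    """Step 4: repeat the sequential draw num_draws times."""
    rng = random.Random(seed)
    return [conditional_binomial_draw(rng) for _ in range(num_draws)]


if __name__ == "__main__":
    for row in simulate(5, seed=42):
        print(row)
```

Each simulated row sums to 6 by construction, which is a quick sanity check on any implementation of the algorithm.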
1

votes
1

answer
2.7k

Views

test statistic for Spearman rank correlation

My problem is that I have calculated Spearman rank correlations between two variables. One reviewer asked me if I could also add test statistics for all coefficients where p < 0.001. Here is one result: > cor.test(pl$data, pl$rang, method= 'spearman') # Spearman's rank correlation rho data: p...
Eco06
1

votes
2

answer
22

Views

Is it possible to get the PERCENT_RANK for a single record, but relative to the entire table?

I would like the PERCENT_RANK value for a single record, but in relation to the entire table. Is this possible? Examples I've seen are like this: SELECT Name, Salary, PERCENT_RANK() OVER (ORDER BY Salary) FROM Employees Notice that it's calculating the percentile for the returned recordset. If you're...
Deane
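
As a rough illustration of what the question is after, the standard definition of PERCENT_RANK is (rank - 1) / (rows - 1), where rank is one plus the number of rows that sort strictly before the record. The Python sketch below applies that definition to one value against the full list of values. The function name `percent_rank` and the salary figures are made up for illustration.

```python
def percent_rank(value, all_values):
    """Percent rank of `value` relative to every entry in `all_values`.

    Follows the usual (rank - 1) / (rows - 1) definition with ascending
    order, where ties share the lowest rank.
    """
    n = len(all_values)
    if n <= 1:
        return 0.0  # a single-row table always has percent rank 0
    rows_before = sum(1 for v in all_values if v < value)
    return rows_before / (n - 1)


if __name__ == "__main__":
    salaries = [30000, 40000, 40000, 50000, 60000]  # hypothetical table
    print(percent_rank(40000, salaries))
```

The point of the sketch is that the denominator comes from the whole table, not from the filtered result set, which is the behaviour the question asks about.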
1

votes
1

answer
1.9k

Views

R ggplot geom_bar() label bars (with 'count')

I have a ggplot like this: ggplot(df,aes(x=DateDiff, fill=TEAM)) + geom_bar() How can I label the bars with the results from the y axis, when there's no y axis defined? (without altering the df)
adlisval
-1

votes
0

answer
5

Views

What type of machine learning or AI Model can I use for Factor Ranking

What type of machine learning or AI model can I use for factor ranking? I have some factors and am trying to rank them by how well they predict in my model. What kind of machine learning, AI, or deep learning model works for this?
tplshams
2

votes
1

answer
10

Views

How to remove extra statistical mean results from JSON dictionary in python?

I'm working in Python 3. I'm trying to determine the mean from measurements in a JSON dictionary of contaminants in a well. When I run the code it shows the mean of the data for each line. Essentially I want to find one mean for all results of one contaminant. There are multiple results for the...
ceagle
1

votes
4

answer
93

Views

Linear regression with two variables on python

I am developing code to analyze the relation between two variables. I am using a DataFrame to save the variables in two columns as follows: column A = 132.54672, 201.3845717, 323.2654551 column B = 51.54671995, 96.38457166, 131.2654551 I have tried to use statsmodels but it says that I do not hav...
Hugo Assis Brandao
1

votes
1

answer
103

Views

Julia - describe() function display incomplete summary statistics

I'm trying basic data analysis with Julia. I'm following this tutorial with the train datasets that can be found here (the one named train_u6lujuX_CVtuZ9i.csv), using the following code: using DataFrames, RDatasets, CSV, StatsBase train = CSV.read('/Path/to/train_u6lujuX_CVtuZ9i.csv'); describe(train[...
ecjb
1

votes
0

answer
94

Views

Choosing the right design for nparLD

I have to use the nparLD package for reasons of data distribution and heteroscedasticity, but I am having trouble selecting the right design (F1-LD-F1 or F2-LD-F1). I have two groups (one for patients and one for controls) and all participants underwent the same MR examination three times (before ex...
N. Szilvia
1

votes
1

answer
262

Views

Bootstrap for Confidence Intervals

My problem is as follows: Firstly, I have to create 1000 bootstrap samples of size 100 of a 'theta hat'. I have a random variable X which follows a scaled t_5-distribution. The following code creates 1000 bootstrap samples of theta hat: library('metRology', lib.loc='~/R/win-library/3.4') # Draw some...
Hans Christensen
1

votes
0

answer
164

Views

How do I propagate the error of a linear regression when projecting from Y to X?

I'm trying to figure out how to propagate errors in the following case. I am calibrating a machine with a couple of standards (a, b, c) with accepted values x. My machine measures y for these standards, with a certain error (standard deviation of 1 in this example). Then I measure replicates of a sam...
Japhir
1

votes
0

answer
68

Views

IHS (inverse hyperbolic sine) option of mhurdle in r does not work

I hope this thread finds you well. I am trying to use Box-Cox double hurdle model and Inverse Hyperbolic Sine (IHS) double hurdle model since the dependent variable of my data is non-normal. The problem I face is that I could not find some appropriate package to run the IHS model. I tried 'mhurdle'...
Jaecheol Lee
1

votes
0

answer
317

Views

How to estimate weibull hazard rate function for a feature based on components failure data?

A mechanical component was run continuously until it failed (test-to-failure). We have data from one such experiment.
Data dictionary:
age --> time in mins
life_per --> age/total time to failure
life_status --> failure==1 and non-failure==0 (note: we only have 1 record with life_status=1 i.e. the tim...
GeorgeOfTheRF
1

votes
0

answer
145

Views

BCa confidence interval using pre-bootstrapped data

I want to calculate a BCa confidence interval using the function boot.ci in the boot package. I do already have bootstrap replicates (not calculated using the boot function provided by the package). To calculate a BCa confidence interval, it should be sufficient to pass the bootstrap replicates tog...
UweM.
1

votes
0

answer
200

Views

Can Kruskal-Wallis test be used to test significance of multiple groups within multiple factors?

I have tried to read what I can on Kruskal-Wallis and while I have found some useful information, I still seem to not find the answer to my question. I am trying to use the Kruskal-Wallis test to determine the significance of multiple groups, within multiple factors, in predicting a set of dependen...
user3245629
1

votes
0

answer
108

Views

Correlation (pandas) between integer and boolean?

I have my data in the form of:
price | bool_qual_1 | bool_qual_2 | bool_qual_3
13000 | True | True | True
20000 | False | True | True
15000 | True | True | False
13000 | False | False | False
15000 | True |...
Zeruno
1

votes
1

answer
118

Views

Trying to display NONE when there is no mode in R

I am trying to figure the mode of a data set, while displaying 'NONE' if there is no mode. I am currently using Gregor's function as commented below. examples: {1,1,2,2,3} Expected results 1 2(success) {NA,NA,NA,1,1,1,3,3} Expected results NA 1(success) {1,2,3,4,5} Expected result NONE (success) {...
Jerry Lim
1

votes
0

answer
159

Views

How to perform Welch's t-test in Spark 2.0.0 using StreamingTest

I want to try Welch's t-test in Spark 2.0.0. As far as I know, I can use StreamingTest() from mllib, documented on this website: [https://spark.apache.org/docs/2.0.0/api/scala/index.html#org.apache.spark.mllib.stat.test.StreamingTest] This is my code in spark-shell: import org.apache.spark.mllib.stat.test.{BinarySample,...
Jchanho
1

votes
0

answer
64

Views

failure to get the 95%CI of D index using 'survcomp' and 'boot' package

I have used 'survcomp' to calculate the D index of the prediction model. The code was listed as follows: source('http://bioconductor.org/biocLite.R') biocLite('survcomp') library(survcomp) set.seed(12345) age
QW Zhang
1

votes
1

answer
139

Views

Error with bspline smoothing for functional data in R with high number of basis functions

I have this error in R, I can't solve it : Error in chol.default(temp) : the leading minor of order 445 is not positive definite In addition: Warning message: In smooth.basis1(argvals, y, fdParobj, wtvec = wtvec, fdnames = fdnames, : Matrix of basis function values has rank 799 < dim(fdobj$basis)...
Victoire Louis
1

votes
0

answer
332

Views

Multivariable linear regression in JS

I am trying to perform multivariable linear regression with a single dependent variable Y and two independent variables x1, x2. This is simply an OLS regression with an additional independent variable: Y = b0 + b1 x1 + b2 x2. I also need to calculate the coefficient of determination R^2 for this relationsh...
Martin
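
For readers who want to see the model Y = b0 + b1 x1 + b2 x2 fitted end to end, here is a small dependency-free Python sketch that solves the normal equations and reports R^2. It is one possible approach under the assumption that the two predictors are not collinear, not the accepted answer to the question. The helper names (`solve_linear_system`, `fit_two_predictor_ols`) are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build the augmented matrix [a | b] without mutating the inputs.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system (collinear predictors?)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_two_predictor_ols(y, x1, x2):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2, plus R^2."""
    rows = [(1.0, u, v) for u, v in zip(x1, x2)]  # design matrix rows
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    b0, b1, b2 = solve_linear_system(xtx, xty)
    fitted = [b0 + b1 * u + b2 * v for u, v in zip(x1, x2)]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return {"b0": b0, "b1": b1, "b2": b2, "r_squared": 1.0 - ss_res / ss_tot}


if __name__ == "__main__":
    x1 = [1, 2, 3, 4, 5]
    x2 = [2, 1, 4, 3, 6]
    y = [1 + 2 * u + 3 * v for u, v in zip(x1, x2)]  # noise-free toy data
    print(fit_two_predictor_ols(y, x1, x2))
```

The same structure carries over to JavaScript: build the 3x3 matrix X'X and the vector X'y, solve for the coefficients, then compute R^2 as 1 - SS_res / SS_tot.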
1

votes
1

answer
89

Views

Linear regression with two independent variables in javascript

The following will output the slope, intercept and coefficient of determination R^2 for a given set of x and y values. let linearRegression = (y,x) => { let lr = {} let n = y.length let sum_x = 0 let sum_y = 0 let sum_xy = 0 let sum_xx = 0 let sum_yy = 0 for (let i = 0; i < y.length; i++) { sum_x += x[i]...
Martin
1

votes
0

answer
26

Views

Distinguishing random gnp graphs from preferential attachment graphs using the powerlaw python package

My goal is to find the point where scale-free networks become indistinguishable from random (non-scale-free) networks using the powerlaw python package. As stated in their paper, one should always determine the goodness of a power-law fit by comparing it to the fit to another distribution. I would exp...
David Nathan
1

votes
0

answer
48

Views

Latent factor recovery with probabilistic matrix factorization using Edward

I implemented a probabilistic matrix factorization model (R = U'V) following the example in Edward's repo: # data U_true = np.random.randn(D, N) V_true = np.random.randn(D, M) R_true = np.dot(np.transpose(U_true), V_true) + np.random.normal(0, 0.1, size=(N, M)) # model I = tf.placeholder(tf.float32,...
charlesh
0

votes
0

answer
16

Views

How can I write a loop in Python to get the difference between the first and last date for one id

opptyId field oldValue newValue updateTime 0 Stage Qualify 2014-05-27T18:50:14 0 Forecast Best Case 2014-05-27T18:50:14 0 created 2014-05-27T18:50:14 0 Amount 795.53 2014-06-17T18:54:00 0 Stage Qualify Closed - Won 2014-07-09T20:11:05 0 Forecast...
bella
-1

votes
0

answer
12

Views

How to learn R as a statistical system?

I am more familiar with Python (at a beginner level), which I have found way easier than R thus far. Nevertheless, for professional reasons I want to learn R as well. My main intention is to learn how to use it as a statistical system. I looked in Stackoverflow for some recommendations on learning...
Alejandro Ruiz
1

votes
0

answer
194

Views

Using scipy.stats.entropy on gmm.predict_proba() values

Background so I don't throw out an XY problem -- I'm trying to check the goodness of fit of a GMM because I want statistical back-up for why I'm choosing the number of clusters I've chosen to group these samples. I'm checking AIC, BIC, entropy, and root mean squared error. This question is about en...
Manner
1

votes
0

answer
181

Views

Pandas vectorize statistical odds-ratio test

I'm looking for a faster way to do odds ratio tests on a large dataset. I have about 1200 variables (see var_col) I want to test against each other for mutual exclusion/co-occurrence. An odds ratio test is defined as (a * d) / (b * c), where a, b, c, d are the numbers of samples with (a) altered in neith...
XyledMonkey
1

votes
1

answer
401

Views

How can I make all-possible-regressions in R also include exponents and logs of the variables?

I primarily work in research and statistics and so am not as familiar with programming. I'm using the OLSRR package for statistical analysis when trying to compare as many model specifications as possible using all-possible-regressions. I use the code: model
Zayaan
1

votes
1

answer
897

Views

Confidence interval in python

Is there a function or a package in Python to get the 95% or 99% confidence interval of the distributions present in scipy.stats? If not, what is the easiest alternative? Here are the distributions (but my priority is halfgennorm and loglaplace): st.alpha,st.anglit,st.arcsine,st.beta,st.betaprime...
Will_smith12
1

votes
1

answer
109

Views

NA value in portfolio sorts

I am currently doing portfolio sorts on panel data meaning every month I form 5 portfolios based on the volatility of stocks. I have the following function: arguments are x: a vector of returns P: the number of portfolios we want sortPort
user9259005
1

votes
0

answer
167

Views

Is this right way to generate the following heavy-tailed distribution?

I'm trying to generate an error distribution which follows a heavy-tailed distribution based on the statement : (2) A heavy-tailed distribution, i.e., t-df=2, a t-distribution with 2 degrees of freedom. This distribution has C95=3.61, where C95 is a measure of the distribution tail weight and defin...
Ryan
1

votes
0

answer
248

Views

Multivariate linear mixed effects model in Python

I am playing around with this code which is for univariate linear mixed effects modelling. The data set denotes: students as s, instructors as d, departments as dept, service as service. In the syntax of R's lme4 package (Bates et al., 2015), the model implemented can be summarized as: y ~ 1 + (1|stude...
laza
1

votes
0

answer
229

Views

Fair quantiles with lots of same values?

I'm doing RFM analysis. I want to divide ranks by 5, and I've done this OK. The problem is when I try the quantile function by 5 in the frequency part to divide buyers into 'great customer' if they buy often (rank 5), 'good' if they buy somewhat less (rank 4) and so on, the fact that a large number of buyers...
J_p
1

votes
1

answer
298

Views

Bootstrap Confidence Interval for Prediction

I would like to calculate a confidence interval for the RMSE of a machine learning regression in the out-of-sample test set predictions. My train set is the first 80% of the sample, and the 'out-of-sample' test set is the last 20% of the sample. I treat the RMSE of the test set predictions as the o...
Ben Smith
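
One common way to attach an interval to a test-set RMSE is a percentile bootstrap over the per-observation prediction errors. The Python sketch below shows that technique under a strong assumption that the errors can be treated as independent and exchangeable. If the test set is a time-ordered tail of the sample, as the excerpt suggests it may be, a block bootstrap would likely be more appropriate. The function names and defaults are hypothetical.

```python
import math
import random


def rmse(errors):
    """Root mean squared error of a list of prediction errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def bootstrap_rmse_ci(errors, num_resamples=2000, alpha=0.05, seed=None):
    """Percentile bootstrap interval for the RMSE.

    Resamples the errors with replacement, recomputes the RMSE each time,
    and returns the alpha/2 and 1 - alpha/2 empirical quantiles.
    """
    rng = random.Random(seed)
    n = len(errors)
    stats = sorted(
        rmse([errors[rng.randrange(n)] for _ in range(n)])
        for _ in range(num_resamples)
    )
    lo_idx = int((alpha / 2) * (num_resamples - 1))
    hi_idx = int((1 - alpha / 2) * (num_resamples - 1))
    return stats[lo_idx], stats[hi_idx]


if __name__ == "__main__":
    demo_rng = random.Random(0)
    demo_errors = [demo_rng.gauss(0, 1) for _ in range(200)]  # synthetic errors
    print(rmse(demo_errors), bootstrap_rmse_ci(demo_errors, seed=1))
```

Here `errors` would be the differences between predictions and observed values on the held-out 20%; the model itself is not refitted inside the loop.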
1

votes
0

answer
89

Views

BTYD pnbd.EstimateParameters(): cal.cbs must have a frequency column labelled “x”

I am working on building a CLV model. I have faced an error, but answers to sharp's question helped in resolving it. Now, when I try and estimate the parameters using params
Akshat Bajaj
1

votes
2

answer
92

Views

knn algorithm output one result per test data

I used knn to do a basic predictive model build. After I run: predictions = knn(train,test,cl,k=3) and then output predictions, the R console has a dozen results per row. Such as: [1] Yes No Yes Yes Yes No Yes Yes Yes No No [12] No Yes Yes Yes No No No Yes Yes Yes No etc etc for 10000 rows. I need t...
JayA
1

votes
1

answer
374

Views

Standard deviation of binned values with `scipy.stats.binned_statistic`

When I bin my data according to scipy.stats.binned_statistic (see here for example), how do I get the error (that is, the standard deviation) on the average binned values? For example, if I bin my data as follows: windspeed = 8 * np.random.rand(500) boatspeed = .3 * windspeed**.5 + .2 * np.random...
Py-ser
0

votes
0

answer
4

Views

What statistical method should I resort to to analyze the generational difference?

So I would like to analyze the value attitude differences between two generations (young people's generation and their parents' generation). I understand that normally an independent t-test or ANOVA can be used for this problem. The thing is, I PAIRED the children with their respective parent (so I coll...
jeff.lian
1

votes
0

answer
35

Views

Get the variance of a “list” of values with multiplicities?

I'm using python and I've pulled in numpy/scipy as dependencies. It's OK to pull in more if they're well-tested and so on. Suppose I've got a dataset with a relatively small number of distinct values, each of which has a high multiplicity. I'll represent it as a map (Value -> Multiplicity), say some...
Richard Rast
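
A variance over a (Value -> Multiplicity) map can be computed directly from the counts without expanding the data. The Python sketch below assumes the map is a plain dict with numeric keys and positive integer counts. The function name `variance_from_counts` and the `sample` flag are hypothetical.

```python
def variance_from_counts(counts, sample=False):
    """Variance of a dataset given as {value: multiplicity}.

    With sample=False this is the population variance (divide by N);
    with sample=True it is the unbiased sample variance (divide by N - 1).
    """
    total = sum(counts.values())
    denom = total - 1 if sample else total
    if denom <= 0:
        raise ValueError("not enough observations")
    mean = sum(value * mult for value, mult in counts.items()) / total
    sum_sq = sum(mult * (value - mean) ** 2 for value, mult in counts.items())
    return sum_sq / denom


if __name__ == "__main__":
    data = {1: 2, 3: 2}  # stands for the list [1, 1, 3, 3]
    print(variance_from_counts(data))
    print(variance_from_counts(data, sample=True))
```

With NumPy available, `numpy.average(values, weights=counts)` gives the weighted mean and the same sum-of-squares step follows from it.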
