**Course Content**: Efficiency, information inequality, simulation, bootstrap, hypothesis testing, p-values, likelihood ratio, nonparametrics, descriptive statistics, experimental design, regression, multiple linear regression and analysis of variance, categorical data, chi-squared tests, Bayesian statistics.

**Instructor**: Larry Goldstein, larry at usc dot edu, KAP 406D, 213 740 2405. Office Hours: MW 10:45-11:45

**Main Text**: Mathematical Statistics and Resampling with R, Chihara and Hesterberg, Textbook Supplements, including datasets

**R Resources**: Introduction to R, Another Introduction to R, R homepage, Introductory example, R-Tutorial, an R Graphics Gallery

**Teaching Assistant**: Michael Hankin, mhankin at usc dot edu. Office Hours: Tuesday 3-4, Wednesday 1-2, Thursday 3-4, Math Center

**Exams and Grading Policy**

- Homework: 15%

- Midterm 1: 20%, Wednesday, February 19th.

Scores: 23 26 33 34 34 35 39 44 46 54 58 69

- Midterm 2: 20%, Wednesday, April 2nd. Includes material in Chapter 4, Chapter 5 up to Section 5.6 (excluding bootstrap percentile intervals), Sections 6.3.1 and 6.3.3, and Sections 7.1.1, 7.1.2, and 7.3.

Scores: 37 42 47 49 52 71 76 87 98 99 105

**Course Project:** 20%. Analysis of a data set using the techniques learned in class. You should prepare a report, or writeup, for distribution to the class that explains the data, what inferences were drawn from it, and what statistical techniques were applied. Presentations should last 25 minutes. If interesting coding or new R issues arose during your work, you may also consider discussing these with the class. 15% credit for your own presentation, 5% for participation with questions, comments, or suggestions on the presentations of others. Presentation dates: April 21st, 23rd, 28th and 30th.

**Final Exam:** 25%, Monday, May 12th, 2-4PM. The exam will be comprehensive over all material covered during the semester, with emphasis on what was covered since the second midterm, including Sections 8.2.1, 8.2.2, and 9.1-9.4.

**Assignments**

1. Listen to Radiolab, Stochasticity, parts A: A Very Lucky Wind, and B: Seeking Patterns

2. Write a simulation in R to estimate the probability of obtaining seven heads in a row in 100 tosses of a fair coin. It may be simpler to break the task down into two pieces, as follows:

a) Write a function that takes in a vector of 100 0’s and 1’s and returns either TRUE or FALSE depending on whether or not it contains a run of seven consecutive heads.

Hint: check to see if each flip is the beginning of a 7 flip run.

b) Create 1000 100-flip-samples and use your function to count the number that contain such runs.
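One possible way to organize parts a) and b) is sketched below. It is only an illustration: the names `has_run` and `n_sims`, the coding of heads as 1, and the seed are assumptions, and it uses base R's `rle()` to find runs instead of checking each flip as the start of a run, as the hint suggests.

```r
# TRUE if the 0/1 vector `flips` contains a run of at least `k` ones (heads).
# Assumption: heads are coded as 1 and tails as 0.
has_run <- function(flips, k = 7) {
  runs <- rle(flips)                      # lengths and values of consecutive runs
  any(runs$values == 1 & runs$lengths >= k)
}

set.seed(1)                               # arbitrary seed, for reproducibility only
n_sims <- 1000                            # number of simulated 100-flip samples
hits <- replicate(n_sims, has_run(sample(c(0, 1), 100, replace = TRUE)))

# Proportion of samples containing a run of seven or more heads
mean(hits)
```

Increasing `n_sims` reduces the Monte Carlo error of the estimate.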

3. Exercises 1.11: 2, 3, 5, 6.

4. Exercises 2.8: 2,4,6,8,13,14,15,17

5. Test the null hypothesis that the Salk Vaccine is ineffective, using the Hypergeometric distribution, and compare the exact p-value obtained there to those computed using the Binomial and the Normal approximations. Recall that both treatment and control groups were of size 200,000, and that the treatment group had 56 cases, while the control group had 141.
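A rough sketch of how the three p-values might be computed in R follows. It assumes a one-sided alternative (the vaccine reduces the number of cases), conditions on the total of 197 cases, and uses a continuity correction in the normal approximation. These choices, and the variable names, are assumptions and not part of the problem statement.

```r
n_treat   <- 200000                 # size of the treatment group
n_control <- 200000                 # size of the control group
x_treat   <- 56                     # cases in the treatment group
n_cases   <- 56 + 141               # total cases in both groups

# Exact: under H0 the number of cases falling in the treatment group is
# hypergeometric (n_cases draws from n_treat + n_control subjects).
p_hyper <- phyper(x_treat, m = n_treat, n = n_control, k = n_cases)

# Binomial approximation: each case lands in the treatment group w.p. 1/2.
p_binom <- pbinom(x_treat, size = n_cases, prob = 0.5)

# Normal approximation to the binomial, with continuity correction.
p_norm <- pnorm((x_treat + 0.5 - n_cases / 2) / sqrt(n_cases / 4))

c(hypergeometric = p_hyper, binomial = p_binom, normal = p_norm)
```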

6. Perform a permutation test for the data NCBirths2004 to test the null hypothesis that Tobacco use by the mother does not affect the birth weight of newborns.

7. Exercises 3.9: 4,8,11,13,17,19,22,25,29

8. Exercises 4.4: 1,2,6,9,10,15,18,20,22,25,27,28

9. Find EZ^n for Z ~ N(0,1) for n=0,1,2,3,4,5 and 6.

10. Find the moment generating function of the chi squared distribution on k degrees of freedom, and use it to calculate the mean and variance.

11. The World Series in baseball is awarded to the first team to win four games. Hence the series can be 4, 5, 6 or 7 games long. Over the 50 year period starting in 1952, the numbers of times the series lasted for those numbers of games were 8, 8, 10 and 24, respectively. Test the hypothesis H_0 that the games of the World Series are independent, with each team having an equal chance of winning.
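One way to set this up, sketched here under the assumption that a chi-squared goodness-of-fit test is the intended approach, is to derive the null distribution of the series length and pass it to `chisq.test()`. Under H_0, a series ends in exactly g games when one of the two teams wins game g after winning exactly three of the first g-1 games. The variable names are illustrative.

```r
games    <- 4:7
observed <- c(8, 8, 10, 24)              # counts from the problem statement

# P(series lasts g games) = 2 * choose(g - 1, 3) * (1/2)^g under H0
p_null <- 2 * choose(games - 1, 3) * 0.5^games

expected <- sum(observed) * p_null       # expected counts over 50 series
gof <- chisq.test(observed, p = p_null)  # goodness-of-fit test, 3 degrees of freedom

rbind(games, observed, expected)
gof
```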

12. Find the expected number of contiguous subsequences of the form 01111111 in 100 tosses of a fair coin. Assuming that the distribution of the number of occurrences of this subsequence is approximately Poisson, find an approximation to the probability that the 100 toss sequence contains at least one subsequence of this type. Compare the result obtained this way to the estimate computed by simulation in Problem 2, above.
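The calculation can be sketched in a few lines of R, assuming the pattern may start at any of the 100 - 8 + 1 positions and that each position matches with probability (1/2)^8. The names `lambda` and `p_at_least_one` are illustrative.

```r
n_tosses    <- 100
pattern_len <- 8                                  # a 0 followed by seven 1's

# Expected number of occurrences, by linearity of expectation
lambda <- (n_tosses - pattern_len + 1) * 0.5^pattern_len

# Poisson approximation to P(at least one occurrence)
p_at_least_one <- 1 - exp(-lambda)

c(expected_count = lambda, poisson_approx = p_at_least_one)
```

Note that this pattern does not capture a run of seven heads that begins at the very first toss, which is one source of discrepancy with the simulation in Problem 2.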

13. The unbiased estimate of variance, scaled by n-1, is typically preferred to the variance estimate scaled by the sample size n. Use the bootstrap to estimate the bias of these two variance estimates for a small sample of independent normal variables.
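A minimal sketch of one way to do this is given below. It assumes a sample of size 10, 10,000 resamples, and an arbitrary seed. It measures bias relative to the variance of the empirical distribution, which is the quantity being estimated in the bootstrap world. Other conventions compare the bootstrap mean to each original estimate instead, so treat this choice as an assumption.

```r
set.seed(2)                                   # arbitrary seed
x <- rnorm(10)                                # small normal sample (size is an assumption)
n <- length(x)

var_n <- function(v) mean((v - mean(v))^2)    # variance estimate scaled by n

B <- 10000                                    # number of bootstrap resamples
boot <- replicate(B, {
  xs <- sample(x, n, replace = TRUE)
  c(unbiased = var(xs), plugin = var_n(xs))   # var() scales by n - 1
})

# Bootstrap bias estimate: mean over resamples minus the bootstrap-world
# "true" variance, i.e. the variance of the empirical distribution of x.
bias <- rowMeans(boot) - var_n(x)
bias
```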

14. Find the distribution and density function of the second largest observation from a sample of n independent and identically distributed random variables with density function f. What is the expected value of this variable when the density is uniform over [0,1]?

15. Exercises 5.10: 5,6,8,9,10,11,12,14

S. Find an observational study reported in a ‘reliable source’ (e.g. LA Times, CNN News, etc.) where you can name an overlooked confounded effect that would partially, or fully, negate the conclusion drawn.

16. Exercises 6.4: 1,2,4,5,10,12,14,16,25,27,34,36

17. Exercises 7.6: 1,2,7,8,10,12,17,20,24,25,31,34,37

18. Exercises 8.5: 4,6,11,14,16,17,18,25,36,37

19. Exercises 9.7: 7,9,10,11,17,18,21

20. Find the power function for the one sided hypothesis test of H_0: μ = μ_0 vs H_1: μ > μ_0 at significance level α when observing n i.i.d. normal variables with unknown mean μ and variance σ^2=1. Plot the power function for α = 0.05, μ_0 = 0 and n = 100. How large a sample is needed in order to have power 0.80 to detect that μ=1?
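As a sketch, assuming the usual z-test that rejects when sqrt(n)(x̄ - μ_0) exceeds the upper α quantile of the standard normal, the power at μ is 1 - Φ(z_{1-α} - sqrt(n)(μ - μ_0)). The function name `power_fn`, the plotting grid, and the reading of the plot parameter as μ_0 = 0 are assumptions.

```r
# Power of the one-sided z-test of H0: mu = mu0 against mu > mu0, sigma = 1
power_fn <- function(mu, mu0 = 0, n = 100, alpha = 0.05) {
  pnorm(qnorm(1 - alpha) - sqrt(n) * (mu - mu0), lower.tail = FALSE)
}

# Plot the power function for alpha = 0.05, mu0 = 0 and n = 100
mu_grid <- seq(-0.1, 0.5, by = 0.01)
plot(mu_grid, power_fn(mu_grid), type = "l",
     xlab = "mu", ylab = "power")
abline(h = 0.05, lty = 2)                 # the level alpha, attained at mu = mu0

# Smallest n with power at least 0.80 at mu = 1 (with mu0 = 0, alpha = 0.05):
# solve sqrt(n) * (mu - mu0) >= z_{0.95} + z_{0.80}
n_needed <- ceiling(((qnorm(0.95) + qnorm(0.80)) / (1 - 0))^2)
n_needed
```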

21. Find the least squares estimate of, and a confidence interval for, β in the linear model y_i = βx_i + ε_i, i = 1, 2, ..., n, when the errors are i.i.d. normal variables.

22. Run a linear regression analysis on the Pearson father-son height data set. Form a scatter plot for the data, estimate all the parameters of the model, and test the hypothesis that there is no association between father and son height; that is, test the hypothesis that β equals zero against the alternative that it is non-zero.

F. Write a problem for possible use in the final exam. Though direct variations of already assigned problems are one possibility, higher credit will be given for problems that fairly test a student’s ability to understand, use, manipulate and extend the concepts taught in the course.

**Due Dates:**

1,3 Jan 30th.

2,4,5,6,7 Feb 18th

8,9,10,11 Mar 6th

12,13,14,15 Mar 25th

S. Mar 26th

16,17 Apr 10th

18,19,20 Apr 22

21,22 May 1

F May 2

**Data Links of Interest**

- CDC On Line Data Bases
- www.data.gov
- http://sda.berkeley.edu/
- http://archive.ics.uci.edu/ml/datasets.html
- http://www.bigdata-startups.com/public-data/
- http://www.statsci.org/datasets.html
- http://www.pro-football-reference.com/play-index/play_finder.cgi
- https://www.kaggle.com/
- logistic regression tutorial