Econ 424: Computer Methods in Economics

Fall 2009

Session 0101: Tuesday & Thursday 9:30am-10:45am
Session 0201: Tuesday & Thursday 8:00am-9:15am
Plant Sciences 1129

Instructor:
Ginger Z. Jin (Session 1)
Paul Bailey (Session 2)

Final, project 6 and overall grades are available online.


Contact and office hours

In any case, the best way to reach us is via e-mail.
 
Session 1 Session 2
Ginger Z. Jin
Office: 3115 F
Office Hours:  Thursday 2-3pm
Phone: 301-405-3484
E-mail: ginger@umd.edu
Web: kuafu.umd.edu/~ginger/
Paul Bailey
Office: 4101D Tydings
Office Hours: Tuesday 9:30-10:30am
Phone: 301-405-3521
E-mail: pdbailey@umd.edu

Comments and suggestions for the class design are always welcome.



Syllabus
 

Goal
 

As a first step to hi-tech economics, Econ 424 introduces the most basic data handling techniques in economic studies. The ultimate goal is three-fold. At the end of the semester:
In order to fulfill this goal, all classes, including mid-term and final, will meet in a computer lab and use two popular statistical softwares -- Excel and SAS. In addition, students will learn how to use the World Wide Web and how to complete computer projects. Through hands-on experience, students are expected to master both softwares at the introductory level and apply them to economic issues in the real world.


Prerequisites
 

Eligible Students must major in Economics and have completed Econ 305 (Intermediate Macroeconomics Theory and Policy), Econ 306 (Intermediate Microeconomics Theory) and Econ 321 (Economic Statistics).  we will devote classes to introduce Excel and SAS, so experience with either software is not required. However, if you need extra help in getting started, please contact us as soon as possible.


Waiting List Policy
 

Due to the limit of lab capacity, each session can only accomodate 36 students. If you are number x on the waiting list, you won't get in the class unless x enrolled students drop the class during the semester.


Recommended reference books
 

The class will follow lecture notes instead of textbook. The most updated lecture notes are always available on this website. However, if you want published books as reference, we recommend:
#1 --- Malcolm Getz, "e.stat for Business and Economics" (CD-ROM), published by Southern-Western, a division of Thompson-Learning. ISBN: 0-324-00895-3.
#2 --- Lora D. Delwiche and Susan J. Slaughter: "The little SAS book: A Primer," third edition (paperback). ISBN: 1-59047-333-7.
#3 --- Ron Cody & Ray Pass, "SAS Programming by Examples", published by SAS Institute Inc.  ISBN: 1-55544-681-7. 
The SAS books won't be relevant until after midterm.


Evaluation
 

Grades for the course will be based on


Class attendance

Hands-on teaching is much more effective than remote communication by emails. If you miss a class, you can download the lecture notes or consult your classmates. If you still have questions after reading the lecture notes, you are welcome to contact us via email or in person. Please don't expect the instructor to re-lecture every point covered in the missed class.



Exams

The course involves one mid-term and one final (grading weights in parentheses):

(30) Mid-term : Session 1 (October 15 (new!), 9:30-10:45am) Session 2 (October 15 (new!), 8:00-9:15am), open book, on-line in Plant Sciences 1129 (the same room for every class meeting).
Old midterm examples:
Fall 2008: Exam text   Data
Fall 2005: Exam text   Data
(30) Final: cumulative, Session 1 (December 15, 8-10am) Session 2 (December 18, 10:30am-12:30pm), open book, on-line in Plant Sciences 1129 (the same room for every class meeting).
Practice Final:
Hardcopy handout (.doc)
Excel data set (.xls)
SAS data set 1(.csv)
SAS data set 2(.csv)
SAS program (.sas)
answer key
Old final examples:
Fall 2008:
Hardcopy handout (.doc)
Excel data set (.xls)
SAS data set (.csv)
SAS data set (.sas)

Attention: Grades will be posted online as econ424-fall2009-publicgrade.xls. The same link is provided at the beginning of the syllabus. The file is to be listed by a class id, and class id will be assigned to every one in the first class. Please remember your class id so you can check your grades anytime online.

If you are going to miss the midterm or the final for a legitimate reason (following university policy), please notify the instructor AT LEAST 12 HOURS IN ADVANCE. Any excuse delivered after the exam is invalid and will result in zero test score.


If you miss the midterm for a legitimate reason, you have two choices: either take a makeup midterm, or skip the midterm and allow your final to automatically account for 60% of the course grade.




Projects

Each student will complete six projects during the course. They may require you to choose an interesting topic or conduct statistical analysis based on a given topic. Either way, the quality of presentation matters, treat it as though you were giving it to your new employer. It is not necessarily fancy, but must be clear, right to the point and well explained.

Out of the six projects described below, projects 1,2,4,5 are designed as individual work. Each student must work on them independently and submit by him/herself. For Projects 3 and 6, you may choose to do them on your own (for both projects) or partner with another student as a group (for both projects). If you choose the group option, each group member is required to submit a confidential evaluation of the partner's team work at the end of semester, which accounts for 5 points. In correspondence, the total points you get out of the two group projects will be (the earned points*15/20+team evaluation). For example, as described below, the grading weight is 5 points for project 3 and 15 for project 6. If you complete both projects on your own and we grade them as 5 and 11, the total points you get from the three projects will be 5+11=16. If you complete both projects with your partner and your partner gives you 5 points for team work evaluation, the same grading will yield the total points of your three projects as (5+11)*15/20+5=17. However, if your partner gives you 2 points for team work, the total points of your group projects will be (5+11)*15/20+2=14. Generally speaking, two hard-working team-members will produce better group projects and help each other learn new materials. We highly recommend team work whenever feasible. You may search for a team partner until the due date of Project 3. Once you commit to a team on Project 3, you must work within the team on project 6.

Each project should be submitted by email to both Ginger Jin (ginger@umd.edu) and Paul Bailey (pdbailey@umd.edu)simultaneously (so that we have a back up in case something goes wrong), with your name and project number in the subject.  For the files attached to the email, please name them following the convention of yourlastname_yourfirstname_project_number.

For example, if your name is Joe Smith, your first project should be named Smith_Joe_project_1.* where * denotes the file's extension name. If you and your partner Andy Johnson submit project 3 together, please name the file Smith_Johnson_project_3.*. Any project report won't be fully considered unless it is submitted by deadline. Should a submission be delayed for less than two hours, 20% of the full points (for that project) is deducted automatically. Delays over 2 hours is unacceptable. Should there be a legitimate reason for the delay (following university policy), the grading points will be carried over to your midterm or the final (whichever comes first).

Each project is described below, with grading weights in parentheses:

(5) Project 1: Descriptive Statistics

Collect original data from a self-designed questionnaire. Make histograms and compute descriptive statistics for at least two random variables across 30 or more subjects. More specifically:

1. Describe your sample design and one alternative design. By sample design, we mean who to survey, how many to survey, etc. Explain why your design is better than the alternative.

2. Describe your sampling method and one alternative method. By sampling method, we mean how to approach and obtain answers from your subjects. The typical methods include phone, postal mail, email, website entry, and in-person interviews. Explain why your method is better than the alternative.

3. Ask at least two questions. For each question, describe an alternative way to ask and explain why your question is better than the alternative.

4. Analyze the collected data by summary statistics and histograms. Summary statistics should include minimum, maximum, mean, median, all the four quartiles, variance, standard deviation, trimmed mean, skewness and kurtosis for each variable. Make separate histogram for each variable. If you believe the data allows you to compare the two variables, plot relative frequency polygon for each variable and put the two polygons in one chart.

The final report should include (1) one excel sheet as a work sheet including all the detailed step-by-step calculation, and (2) a second excel sheet or a .doc file that includes your answers to the four items listed above and explains why you are interested in these two variables and what you have learned from the data summary. The second excel sheet or the .doc file should be clear, concise and to the point, as if you are submitting a summary report to your employer!

Examples: Survey fellow students for their daily commuting time and methods.  Survey freshmen and seniors for their monthly expenditure on long distance phone calls.

Project 1 is due at 11:59pm of Sept. 15. Each individual student should submit his/her own original report, including a dataset in excel format, summary statistics in the excel file, and a separate word file (or a second excel sheet with text box) describing why you collect this data set. Duplication is not acceptable.

(5) Project 2: Monto Carlo Study (Answer Key)
This project is designed for you to understand data simulation, central limit theorem and sample size.

1. Focus on a normal distribution by choosing the mean and the standard deviation at your own discretion. Take this normal distribution as the "population".

2. From the population, draw 100 random samples, each with 30 observations. Calculate sample mean for each sample. What does the Central Limit Theorem predicts for the distribution of the sample mean?

3. Observe the distribution of the sample mean. In one chart, plot a relative frequency polygon for the sample mean and a relative frequency polygon for the raw data you have drawn from the population. How is the distribution of the sample mean compared to the population? Is it similar to what the Central Limit Theorem predicts? Explain it in a summary paragraph.

4. Repeat all above exercise (1,2,3) for two different sample sizes, one bigger than 30 one smaller than 30 (you have the discretion of choosing sample sizes). How does sample size affect your results? Explain.

5. Repeat all above exercise (1,2,3,4) for a t distribution with the degree of freedom 10. How do results differ? Explain.

Project 2 is due at 11:59pm, Sept 29.  Each individual student should submit his/her own excel file with simulated data and a text box summarizing answers to questions listed above . Duplicate data or duplicate summary is not acceptable.

(5) Project 3: Regression
This project is designed for you to carry out basic data cleaning from a public data set and perform mean estimates, mean comparison, and ordinary least square regressions on the cleaned data.
The data you will use is from www.census.gov/hhes/www/cpstc/cps_table_creator.html which allows for tabulating just about anything from the current population survey (the survey used to estimate the unemployment rate in the US).
I created a data set using this tool that you can download here, the data is formatted in blocks. Each block starts with the year. Then the data lines have columns (1) state, (2) total inhabitants of the state, (3) number insured, (4) number uninsured.
1. Import the data into excel and clean the format so that the excel file have:
column 1 called YEAR = calendar year 2003 to 2009
column 2 called STATE = AL, AK, .... WY
column 3 called TOTAL = total inhabitants of the state at the year
column 4 called INSURED = number insured
column 5 called UNINSURED = number uninsured

Hint: You do NOT need to retype every number into Excel. Instead, you can copy the whole text from the raw data file and paste it in excel. The pasted results should appear in one column just as plain text. Highlight this column, go to "data" menu, choose "text to column", you will see a box pop up. In this box, tell the computer that your data is "Delimited" by "space" and "tab". With this information, the computer should convert the text into columns. You still need to construct the column of YEAR so that we know each row is for which year.

2. It is always a good idea to make sure that your data makes some sense before you begin. Make sure that the population of the United States is about right, it should be between 285 and 305 million in these years, calculate and report the sum for at least one year and verify that it was in this range.
3. Add column 6 with the fraction uninsured. This is the ratio of number uninsured over total population for each state-year.
4. Focus on year 2009. What is the state with the highest fraction uninsured? What is the state with the lowest fraction uninsured? What state is closest to the national average of 15.37% uninsured?
Hint: you can answer these questions by sorting the data. You have to highlight all of the columns to sort them all as a set, and you will have to sort by more than one column to answer this.
5. Compute the mean of the fraction uninsured for the 50 states plus the district for 2009. What is the confidence interval for this mean? Is it significantly different from the national average of 15.37% uninsured? Make sure you write down your null hypothesis, alternative hypothesis, how you conduct the test, and how you reach the conclusion.
6. Compare years 2009 and 2008 for the 51 states. Conduct a statistical test on whether the mean fraction uninsured is equal for the two years. Since we don't know if this will shrink or grow, form your alternative accordingly. Also answer these questions: is this a one-tail or two-tail test? You are comparing two samples, are they independent or matched pairs?
7. Focus on the years 2003 and 2009 for the 51 states. Copy these two years of data into a new data sheet and reshape it as
column 1 called STATE = AL, AK, ... WY
column 2 called TOTAL2003 = total population of a specific state in 2003
column 3 called INSURED2003 = number insured in a state in 2003
column 4 called UNINSURE2003 = number uninsured in a state in 2003
column 5 called FRACUNINSURE2003 = fraction uninsured in a state in 2003
column 6 called TOTAL2009 = total population of a specific state in 2009
column 7 called INSURED2009 = number insured in a state in 2009
column 8 called UNINSURE2009 = number uninsured in a state in 2009
column 9 called FRACUNINSURE2009 = fraction uninsured in a state in 2009
Draw a scatterplot with fraction uninsured in 2003 on the x-axis and fraction uninsured in 2009 on the y-axis. What does this graph tell you? Use excel function CORREL to calculate the correlation coefficient between the two. Hint: you will probably have to make a new column or sheet to do this comparison.
8. Continue from question 7, what would you get if you ran a regression where the dependent variable is fraction uninsured in 2009 and the independent variable is fraction uninsured in 2003 (include an intercept as well)? What model does the regression imply? (Write down the regression equation.) How do you interpret the economic meaning of the coefficient of fraction uninsured in 2003? Is the coefficient of fraction uninsured in 2003 significantly different from zero? Use "equal to zero" as your null hypothesis and "not equal to zero" as your alternative hypothesis.

Project 3's deadline is 11:59pm, Oct. 15 (new!). Each student (or group) submits one excel file as the working sheet and one separate word file answering all the questions.

(5) Project 4: Data cleaning by SAS, part 1 ( answer key )

This project is designed for you to understand the basic SAS commands on data read-in and data cleaning. Based on what we have learned in the class, you should be able to answer the questions directly without any use of the SAS software.

Click either .rtf or .doc for the project description. You can simply edit this file with your answers and turn in the edited file.

Project 4 is due on 11:59pm, Nov. 5. (New!)

(5) Project 5: Data cleaning by SAS, part 2 Answer Key

This project is designed for you to understand how SAS merges two data sets. Based on what we have learned in the class, you should be able to answer the questions directly without any use of the SAS software.

Click here for the project description in .doc. You can simply edit this .doc file with your answers and turn in the edited file.

Project 5 is due on 11:59pm, Nov. 19. (new!)

(15) Project 6: Comprehensive exercise of SAS

This project is designed to improve your understanding of sas in a real, large data set.

The data include all the individuals interviewed by the US Census in the 2009-March Current Population Survey (CPS). For each individual, we have information on age, gender, race, education, marital status, personal earnings and state. To save space, CPS reports each variable is numerically and elaborates the meaning of these numbers are explained in a codebook.

Before starting SAS programming, you need to understand the basic data structure. You can obtain such understanding by opening the codebook and the original .csv data in a text editor (e.g. wordpad or MS words).

Task 1: Read the data into SAS. You can use either a data step or "proc import". How many observations are there in the data set? How many variables do we have?

Task 2: Calculate the number of non-missing observations, mean, standard deviation, median, and quartiles for earnings and age respectively.

Task 3: Define an age categorical variable by age range 0-6, 7-17, 18-24, 25-34, 35-44, 45-54, 55-64, and 65+. Define an education categorical variable so that we group individuals in high school dropout, high school completion, some college, college degree, and above college degree. Count number of observations by (a) age category, (b) education category, and (c) all the possible combinations of age and education categories.

Task 4: Plot a histogram for the age categorical variable as defined in Task 3. Plot a scatter plot of earnings by age.

Task 5: Focus on the subsample with age at or above 25. Call it DATA1. Generate a separate data file that describes average earnings per state. Call it DATA2. How many individuals does DATA1 have in the state of Maryland? Which state has the largest count of individuals in DATA1? Which state has the highest average earnings in DATA2?

Task 6: Merge DATA1 with DATA2 by state. How many observations are matched? How many observations are from DATA1 but unmatched with DATA2? How many observations are from DATA2 but unmatched with DATA1?

Task 7: Use DATA1 to test (1) the average earnings of Maryland is statistically equal to the average earnings of Virginia; and (2) the average earnings are similar across Maryland, California, New York and Massecheusetts.

Task 8: Use the merged data of DATA1 and DATA2 to regress earnings as a linear function of age, gender and education. Write down the regressional model. What is the R-square of the regression? Is the coefficient of age significantly different from zero? Is the coefficient of gender significantly different from zero?

Task 9: Repeat Task 8 but include state fixed effect on the right hand side of the model. What is the R-square? Is the coefficient of age significantly different from zero? Is the coefficient of gender significantly different from zero?

Special note: even if you find Projects 4 and 5 easy, writing your own program is much harder than reading an existing program. We encourage you start Project 6 as early as possible. You are welcome to ask us questions about Project 6 even before we officially assign it in class.

Project 6 is due at 11:59pm of Dec. 18. The final report should include your sas program (.sas), the log file (.log), the output file (.lst) and a separate word file answering all the questions. If you were in a team for project 3, you must work with the same team partner for project 6.


Course Outline

Sept. 1: Introduction

Introduce instructor
Discuss syllabus
Talk about recommended books
Assign Class IDs
Clarify login IDs and other computer issues
First-class Questionnaire
Manage a small data set collected from the questionnaire

Assign project 1

Readings: Getz e.stat Chapters 1, 2.

Sept. 3: Data collection and data description

Follow the data in our first-class questionnaire, emphasize sample design and sampling method Go over data-summary-example.xls:
Mean (weighted and non weighted)
Median
Order statistics
Variance and standard deviation
Skewness

Readings: All sections in e.stat Chapters 3 and 4 except 4.14 and 4.15.

Sept. 8: Histogram

Histogram
Relative frequency polygon

Readings: e.stat 4.3, 4.4, 4.5.

Sept. 10: Probability

Distinguish population and sample

Flip coins and dice

Theory and practice do not necessarily match
law of large numbers
CDF and PDF
statistical independence
expectation.

Readings: Getz e.stat Chapter 5.1-5.6, 6.1-6.7.

Sept. 15: Distribution and simulation

Bernoulli Process (flip coins or roll dice)
Uniform PDF
Normal PDF
t distribution
X2 distribution
F distribution
Data simulation and the law of large numbers

Readings: Getz e.stat Chapters 7,8.
Emphasis: 7.1-7.5, 7.12, 7.14, 7.18-7.20, 8.1-8.3, 8.8-8.13.

Project 1 due. Assign Project 2

Sept. 17: Practice on simulation and the law of large numbers

Assign Project 2

Sept.22 : Simulation Exercise

Sept. 24: Mean estimation


Use Monto Carlo to generate estimates for:

population mean
t distribution
Confidence interval
Testing
Why sample size matters?

Readings: Getz e.stat Chapter 11,12.
Emphasis: 11.1-11.4, 11.6, 11.7, 11.9, 12.1-12.9, 12.10-12.12.

Sept. 29: Hypothesis Testing

Null hypothesis versus alternatives
Type I and Type II errors
One tail test vs. two tail test
Assign Project 3

Readings: Getz e.stat Chapter 13.1-13.10, 13.13, 13.15-13.18.

Oct. 1: Testing of two samples

Testing equal mean
Independent samples
Matched pairs

Readings: Getz e.stat Chapter 14.1-14..5, 14.8-14.11, 14.14-14.16.

Oct. 6: Regression

Scatter plot
basic regression theory
R square , F test
Standard error of coefficients
Testing
Results Interpretation
Readings: Getz e.stat Chapters 19, 20, 21, 22. Emphasis: all sections in Chapter 19, 21.6-21.7, 21.10, 21.12-21.14, 22.1-22.5, 22.7

Oct. 8-13: Practice of Regression, Midterm review

Oct. 15 (new!): Midterm

Oct. 20: Review of Midterm and Introduction to SAS

Shortcomings of Excel
Introduce SAS programming rationale
Read in data
Assign Project 4

Readings: The Little SAS Book Chapters 1and 2.

Oct. 22-Oct.29: Data manipulation of a single data set

Read in complicated data
Data recoding
Converting Date to values
SAS functions
Use SAS to generate and present sample statistics

Readings: The Little SAS Book Chapters 2,3,4.

Nov. 3 - Nov. 17 : Data Manipulation of multiple data sets

Sort a data set
Generate a subset of data
Add in new observations to a data set
Merge and update data sets
Assign Project 5

Readings: Cody & Pass SAS Programming by Examples Chapters 3, 4.

Nov. 19 - Nov. 24: Hypothesis testing in SAS
Dec. 1 - Dec.3: Regression and Testing
Generate dummy variables
Imputing missing values
Regression command
Testing linear restrictions

Assign project 6
Dec. 8- Dec. 10: Comprehensive example of SAS. Review of Excel and SAS materials, practice final

Dec. 15: Session 1 Final 8am-10am Plant Sciences 1129

Dec. 18: Session 2 Final 10:30am-12:30pm Plant Sciences 1129


Important Dates
 

Sept. 15 -- Project 1 due
Sept. 29 -- Project 2 due
Oct. 15 (new!) -- Midterm and Project 3 due.
Nov. 5 (new!) -- Project 4 due
Nov. 19 (new!) -- Project 5 due
Dec. 15 -- Session 1 Final
Dec. 18 -- Session 2 Final
Dec. 18 -- Project 6 and team evaluation (if you belong to a group) due


Example Files Used in Class

Excel examples

Notes for Excel.ppt (to be updated before every class)

data-summary-formula.doc

data-summary-example.xls

coinflip-dierolling-example.xls

simulation-formula.xls

show-central-limit-theorem.xls

SAS program
Notes for SAS Programming.ppt

sas-classexample-proj3data.sas

sas-classexample-reshape.sas

sas-classexample-meancompare-regression.sas

A comprehensive example: reg-cityreg.sas

Final review
final-review.ppt

practice-final.doc

data used in practice final




Links to computer related resources

On-campus computer labs

On-line tutoring for Intermediate Excel - made available by UMD peer training program

MS Excel help

Hands-on Tutor for Windows, MS Office and Internet, a CD-ROM published by the Corporation for Research and Educational Networking (CREN), is available at the UMD information technology library (computer and space science building #1400) for a free on-site review or an on-site purchase with $20.
Microsoft Frequently-asked-questions and highlights for Excel 2000
SAS on-line help
SAS Institute, Inc.
Samples & SAS Notes
UMD on campus SAS help

UCLA web resources for SAS

SAS Topics - Data Management (offered by UCLA)
SAS Topics - Regression (offered by UCLA)


Links to public data sets: on-campus, U.S. domestic and international

Data available on campus


U.S. Domestic Data sets

International Data sets