Sunday, February 13, 2011

Convert pre-2002 CPS industry and occupation codes

The idea is straightforward. Most of the pre-2002 occupations have been redistributed into other occupations post-2002. In order to get to the new codes generate a random number over the entire data set. If the old occupation has a random number within the range defined by Census then we set this to be its new code.

The rules for the proportions of old occupations and industries to be redistributed to the new codes are available at the Census web site or at IPUMS. The spreadsheets containing the rules for industry were downloaded from here while the rules for the occupations were downloaded from here.

The caveats are that the porportions are assumed to remain constant over time and that the data set is large enough that how the random numbers are generated does not matter 'much'.

The SAS code first converts the occupation spreadsheet to SAS data to be written out to a text file:
PROC IMPORT OUT= WORK.temp 
DATAFILE= "I:\occ_90-00.xls"
DBMS=EXCEL REPLACE;
RANGE="Sheet1$";
GETNAMES=YES;
MIXED=YES;
SCANTEXT=YES;
USEDATE=YES;
SCANTIME=YES;
RUN;

data temp1(drop = Table_2___1990_Census_Occupation f6);
set temp(drop = f5 f7 f8);
rename /* Table_2___1990_Census_Occupation = oldcode*/
f2 = oldlabel f3 = newcode f4 = newlabel ;
if _N_ < 5 then delete;
if index (Table_2___1990_Census_Occupation, 'categor')>0 then delete;
oldcode = input (Table_2___1990_Census_Occupation,4.);
if f6 ne 'NA' then percent = input(f6, 8.);
run;

data temp1(drop = oldcode oldlabel newcode);
retain c clabel;
length clabel $ 65;
set temp1;
if oldcode ne . then c = oldcode;
if oldlabel ne ' ' then clabel = oldlabel;
ncode = input(newcode, 4.);
if percent = . then delete;
run;

data temp2;
retain c ncode;
set temp1(where = (ncode ne .)); by c;
retain cumpct;
if first.c then do;
cumpct = percent;
lpercent = 0;
end;
else do;
lpercent = cumpct;
cumpct = cumpct + percent;
end;
run;

data _null_;
set temp2; by c;
file "I:\CPS\SAS Programs\occ1990_2000_xwalk.sas";
if first.c then do;
if _N_ = 1 then put "if old_code = " c " then do;";
else put "else if old_code = " c " then do;";
put " if " lpercent "< y <= " cumpct " then new_code = " ncode ";";
end;
else if not first.c then do;
put " if " lpercent "< y <= " cumpct " then new_code = " ncode ";";
end;
if last.c then do;
put "end;";
end;
run;


The following does the same for industry:
PROC IMPORT OUT= WORK.temp 
DATAFILE= "I:\ind_90-00.xls"
DBMS=EXCEL REPLACE;
RANGE="Sheet1$";
GETNAMES=YES;
MIXED=NO;
SCANTEXT=YES;
USEDATE=YES;
SCANTIME=YES;
RUN;

data temp1;
set temp(drop = f5 f7 f8);
rename Table_1___1990_Census_Industry_C = oldcode
f2 = oldlabel f3 = newcode f4 = newlabel f6 = percent;
if _N_ < 5 then delete;
if index (Table_1___1990_Census_Industry_C, 'categor')>0 then delete;
run;

data temp1(drop = oldcode oldlabel newcode);
retain c clabel;
length clabel $ 65;
set temp1;
if oldcode ne ' ' then c = input(oldcode, 4.);
if oldlabel ne ' ' then clabel = oldlabel;
ncode = input(newcode, 4.);
if percent = . then delete;
run;

data temp2;
retain c ncode;
set temp1; by c;
retain cumpct;
if first.c then do;
cumpct = percent;
lpercent = 0;
end;
else do;
lpercent = cumpct;
cumpct = cumpct + percent;
end;
run;

data _null_;
set temp2; by c;
file "I:\CPS\SAS Programs\ind1990_2000_xwalk.sas";
if first.c then do;
if _N_ = 1 then put "if old_code = " c " then do;";
else put "else if old_code = " c " then do;";
put " if " lpercent "< x <= " cumpct " then new_code = " ncode ";";
end;
else if not first.c then do;
put " if " lpercent "< x <= " cumpct " then new_code = " ncode ";";
end;
if last.c then do;
put "end;";
end;
run;


Now there are two SAS programs to be included. First a search and replace is used to replace the "old_code" to PEIO1OCD (occupation) and PEIO1ICD (industry) and "new_code" to NEW_OCC and NEW_IND respectively (for example). It would have been simpler to have SAS write them out but I wanted to be able to use this if some other data set used a different variable name and I needed to remember which was old and new. The following is a snippet to do the actual recode:

  call streaminit (6875309&yr);
x = rand('uniform') * 100;
y = rand('uniform') * 100;

%include "occ1990_2000_xwalk.sas";
%include "ind1990_2000_xwalk.sas";


Here's a snippet of what the actual code will look like (for occupation):
if peio1ocd = 3  then do;
if 0 < y <= 100 then peio1ocd_r = 3 ;
end;
else if peio1ocd = 4 then do;
if 0 < y <= 77.143 then peio1ocd_r = 1 ;
if 77.143 < y <= 100 then peio1ocd_r = 43 ;
end;
else if peio1ocd = 5 then do;
if 0 < y <= 1.571 then peio1ocd_r = 1 ;
if 1.571 < y <= 4.712 then peio1ocd_r = 2 ;
if 4.712 < y <= 6.806 then peio1ocd_r = 10 ;
if 6.806 < y <= 8.377 then peio1ocd_r = 11 ;
if 8.377 < y <= 13.613 then peio1ocd_r = 12 ;
if 13.613 < y <= 14.66 then peio1ocd_r = 13 ;
if 14.66 < y <= 16.754 then peio1ocd_r = 15 ;
if 16.754 < y <= 17.278 then peio1ocd_r = 16 ;
if 17.278 < y <= 17.802 then peio1ocd_r = 22 ;
if 17.802 < y <= 19.896 then peio1ocd_r = 23 ;
if 19.896 < y <= 20.42 then peio1ocd_r = 30 ;
if 20.42 < y <= 20.944 then peio1ocd_r = 36 ;
if 20.944 < y <= 21.991 then peio1ocd_r = 41 ;
if 21.991 < y <= 26.703 then peio1ocd_r = 42 ;
if 26.703 < y <= 67.541 then peio1ocd_r = 43 ;
if 67.541 < y <= 71.73 then peio1ocd_r = 54 ;
if 71.73 < y <= 73.824 then peio1ocd_r = 62 ;
if 73.824 < y <= 76.965 then peio1ocd_r = 81 ;
if 76.965 < y <= 77.489 then peio1ocd_r = 84 ;
if 77.489 < y <= 82.725 then peio1ocd_r = 93 ;
if 82.725 < y <= 88.484 then peio1ocd_r = 122 ;
if 88.484 < y <= 89.008 then peio1ocd_r = 164 ;
if 89.008 < y <= 91.102 then peio1ocd_r = 202 ;
if 91.102 < y <= 93.72 then peio1ocd_r = 211 ;
if 93.72 < y <= 94.244 then peio1ocd_r = 215 ;
if 94.244 < y <= 94.768 then peio1ocd_r = 354 ;
if 94.768 < y <= 95.292 then peio1ocd_r = 382 ;
if 95.292 < y <= 95.816 then peio1ocd_r = 395 ;
if 95.816 < y <= 100.005 then peio1ocd_r = 525 ;
end;

No comments: