21.8.09

Combining SAS data sets using UPDATE statement

Prepared by Rajeev V (rajeev@kreara.com)


UPDATE is an executable type SAS statement generally used in DATA steps. By updating a SAS dataset replaces the values of variables in one dataset with values from another dataset. The dataset containing the original information is the master data set, and the data set containing the new information is the transaction data set. UPDATE performs much the same function as merge with two exceptions:


Ø Only two datasets can be combined using UPDATE statement

Ø If the observation of a variable in the transaction dataset (i.e. data set containing new information) is missing, then the updated dataset containing the value of the observation same as in the master dataset


Syntax

DATA updated data set;

UPDATE master-data-set transaction-data-set;

BY variable-list;

RUN;


Where Master-data-set names the SAS data set that is used as the master file. Transaction-data-set names the SAS dataset that contains the changes to be applied to the master data set. Variable-list specifies the variables by which observations are matched.


Basic use of UPDATE statement

Consider two datasets, Lab1 and Lab2. Both data sets have four subjects with information about their blood pressure at two different times. The first dataset Lab1 contains the subject id, name of the subject and Blood pressure reading. The second dataset contain the latest blood pressure reading for the same subjects but the name of the subjects is not given.

It is required to replace the latest Blood pressure values in LAB1 with the values in LAB2. Here LAB1 is known as Master data set and LAB2 is transaction dataset.



Lab 1

Subject Name BP

001 AAA 120

002 BBB 130

003 CCC 140

004 DDD 150


Lab 2

Subject BP

001 160

002 160

003 160

004 160

The following program updates LAB1 (Master Dataset) with LAB2 (Transaction dataset).


DATA lab1_updated;

UPDATE lab1 lab2;

BY Subject;

RUN;

Printed output of lab1_updated is given below;




In the above example it is found that the latest blood pressure value is populated into updated dataset based on the BY variable Subject identifier.

Suppose Lab1 contains a duplicate observation. For example,

Lab 1

Subject Name BP

001 AAA 120

002 BBB 130

002 BBB 130

003 CCC 140

004 DDD 150


DATA lab1_updated;

UPDATE lab1 lab2;

BY Subject;

RUN;

If the above program is executed updating will not work for duplicate BY value.


That means if master data set contains two observations with the same value of the BY variable, the first observation is updated and the second observation is ignored. SAS writes a warning message to the log when the data step executes.

Printed output of lab1_updated is given below;




Observations two and three have duplicate values of BY variable subject identifier. However, the value of variable Blood pressure was not updated in the second occurrence of the duplicate BY value.

A situation may arise where a missing observation can occur in master dataset or in transaction dataset. Below example illustrates updating when dataset contains unmatched and missing observations.


Lab A

Subject Name BP

001 AAA .

002 BBB 130

003 CCC 140

004 DDD 150


Lab B

Subject BP

001 160

002 .

003 160

004 160


In the above dataset Lab A (Master Dataset) contains missing value corresponding to the subject 001 and the Lab B (Transaction dataset) contains missing value corresponding to the subject 002. The output of the below code gives the latest known BP value for each subject.


DATA lab1_updated;

UPDATE LabA LabB;

BY Subject;

RUN;

Printed output is given below;



The following is a more advanced example where the UPDATE statement is used to “flatten” a dataset containing different variable data values for a key spread across several observations. The goal is to combine non-missing values into one record per unique key value. The master dataset structure is read, but the OBS=0 option stops the DATA step from reading any data from it. The same dataset is then uploaded as transaction dataset, flattening non-missing observations for each unique key into one observation.


DATA lab;

INPUT Subject $ Calcium Albumin Chloride;

CARDS;

001 2.25 . .

001 . 49 .

001 . . 100

002 3.1 . .

002 . 50 90

002 . . .

;

RUN;


DATA lab;

UPDATE lab (obs=0) lab;

BY subject;

RUN;


Output is given below;





No comments: