10.10.08

SURVIVAL ANALYSIS
Prepared by Jose Abraham

Survival analysis (also called time-to-event analysis) is concerned with studying the time between entry to a study and a subsequent event. These methods were originally designed for, and are most often applied to, the study of deaths, which explains the name survival analysis. They are equally useful for studying other clinically important events such as onset of disease and recurrence of disease.

The point of survival analysis is to follow subjects over time and observe at which point in time they experience the event of interest. The data which is obtained from survival studies may contain censored observations. Censoring comes in many forms and occurs for many different reasons.

For example, consider a cancer study in which subjects who responded to treatment were followed up for a specific period for recurrence of cancer (the event of interest). If recurrence is known only to have occurred after some time T (i.e. the exact event time t satisfies t > T), then the last time at which the subject was observed is recorded and the survival time for that subject is right censored. Conversely, if recurrence is known only to have occurred before a specific time, the recorded survival time is left censored. Thus the times from subjects who have no recurrence by the end of the study, and from those lost to follow-up before the study ends, are right censored.

In the aforesaid study, the basic structure of the data is that each case has one variable containing either the time at which recurrence happened or, for censored cases, the last time at which the case was observed, both measured from the chosen origin. Another variable denotes the censoring status of each case (event observed = 1, censored = 0). The data also contain values of other variables such as markers and tissue characteristics. A small data set in this form is given below.

data molecules;
input marker surv censor stage histo;
datalines;
0 75 1 2 1
1 115 0 3 2
1 96 1 1 1
0 110 0 2 3
0 178 0 3 2
1 149 1 2 3
1 163 1 4 4
0 211 1 1 2
1 167 1 2 1
0 195 0 2 1
1 140 1 3 4
0 202 0 4 4
0 153 0 2 2
1 147 0 1 3
0 132 0 4 1
0 178 1 3 2
;
run;

Analysis of censored data can be performed easily in SAS with procedures such as PROC LIFETEST and PROC PHREG. The purpose of the analysis is to model the underlying distribution of the survival time variable and to assess its dependence on the independent variables.

The Kaplan-Meier curve is plotted with disease-free survival time on the horizontal axis and survival probability on the vertical axis. The curve shows the estimated proportion of patients surviving beyond any given time. We can also compare the survival experience of two groups by comparing their curves, using the STRATA statement in PROC LIFETEST. Whether the Kaplan-Meier curves differ significantly can be tested with the log-rank test; if its p-value is large (>0.05), we cannot conclude that survival differs between the groups. The following SAS code compares the survival curves of cases in which the marker is present (marker=1) with those in which it is absent (marker=0).

proc lifetest data=molecules method=km plots=(s,lls) outsurv=survest;
time surv*censor (0);
strata marker;
run;

The STRATA statement provides the log-rank and Wilcoxon test statistics. The OUTSURV= option in the PROC LIFETEST statement names a SAS data set to hold the Kaplan-Meier survival estimates. PLOTS=(s,lls) produces log-log curves as well as survival curves; the log-log survival curves will be parallel, or nearly so, if the proportional hazards assumption is met.
We can use the LIFETEST procedure to compute nonparametric estimates of the survivor function by the Kaplan-Meier method: METHOD=KM produces the Kaplan-Meier survival estimates and PLOTS=(s) plots the estimated survival function. A part of the SAS output showing the survival estimates and the Kaplan-Meier curve is given below.
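As a quick cross-check of what METHOD=KM computes, the product-limit estimate S(t) = Π(1 − d_i/n_i) over the event times can be reproduced in a few lines outside SAS. The sketch below (Python, purely illustrative; the three-subject sample is made up) mirrors the surv/censor coding, with 1 = event and 0 = censored.

```python
def km_estimate(times, events):
    """Product-limit (Kaplan-Meier) survival estimates.

    times  : observed follow-up times (event or censoring)
    events : 1 = event observed, 0 = right-censored (like the censor variable)
    Returns (event_time, S(t)) pairs at each observed event time.
    """
    # at tied times, process events before censorings (the usual convention)
    data = sorted(zip(times, events), key=lambda p: (p[0], -p[1]))
    n = len(data)        # number still at risk
    s = 1.0              # running survival estimate
    out = []
    for t, e in data:
        if e == 1:                   # survival steps down only at event times
            s *= 1 - 1 / n
            out.append((t, s))
        n -= 1                       # each subject leaves the risk set
    return out

# three subjects: event at t=2, censored at t=3, event at t=4
print(km_estimate([2, 3, 4], [1, 0, 1]))
```

Note how the censored subject at t=3 does not step the curve down but still shrinks the risk set, exactly as in the SAS estimates.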

[Figure: Kaplan-Meier curves]


The hazard ratio is a convenient summary of the effect of a factor on event occurrence. The Cox regression model, which models time-to-event data, can be fitted in SAS with PROC PHREG. The following code models the data:

proc phreg data=molecules;
model surv*censor(0) =marker stage histo /rl ties=breslow selection=b;
baseline out=out1 survival=s logsurv=ls loglogs=lls;
run;

The backward selection procedure (option SELECTION=B) in Cox regression removes non-significant variables from the model, so only significant variables remain in the final model. The option TIES=BRESLOW handles tied event times. PROC PHREG reports the regression coefficients and their standard errors for the variables in the final model, along with p-values from Wald chi-square tests. Hazard ratios and their 95% confidence intervals (requested with the RL option) are also included in the output.

A part of the output is given below


Hazard ratios are interpreted much like odds ratios: a hazard ratio of 1 for an explanatory variable means it has no effect on the hazard, a hazard ratio less than 1 means the variable is associated with a decreased hazard, and a hazard ratio greater than 1 means it is associated with an increased hazard.
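For reference, the hazard ratio PROC PHREG reports for a covariate is exp(coefficient), and the Wald 95% confidence interval (the RL option) is exp(b ± 1.96·SE). A hedged arithmetic sketch (Python; the coefficient and standard error are invented for illustration, not study output):

```python
import math

b, se = 0.405, 0.180    # hypothetical Cox coefficient and its standard error
hr = math.exp(b)                                           # hazard ratio
lo, hi = math.exp(b - 1.96 * se), math.exp(b + 1.96 * se)  # Wald 95% CI
print(round(hr, 3), round(lo, 3), round(hi, 3))
```

A confidence interval that excludes 1 corresponds to a significant Wald test for that covariate.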
Quality Control (QC): steps towards effective programs and outputs
Prepared by Mohanan K K

The term SAS QC refers to the maintenance of the quality of data and quality of programming. The quality of data refers to the Accuracy, Completeness, Consistency, Timeliness, Uniqueness and Validity of data. The quality of programming on the other hand means that the SAS program should produce correct and meaningful outputs and at the same time meet all standards like indentation of statements, optimization of code, use of drop and keep statements and error/warning free logs to name a few.

The QC personnel involved in the SAS QC process are SAS programmers who are independent of the actual programming that is being carried out for the study. At Kreara, the SAS QC process involves the following steps

a) Checking the quality of SAS code and outputs developed.
b) Entering the review comments into the issue tracker.
c) Tracking the resolution of the review comments.


The QC personnel review the code, log and output and raise issues in an issue tracker which is a web application accessible to the team members and which helps in tracking the resolution of issues. The issues raised by the QC are made available to the SAS Programmer who in turn corrects the code as per comments and flags the issue as resolved. The issues are closed by the QC personnel when the correction is satisfactory.

At Kreara, intensive QC of code and outputs is done in four stages, as described below

1. Output QC
2. Log QC
3. Code QC
4. Parallel Programming

1. Output QC

The Output QC further includes

a. Sample or complete check against actual data
b. Check against the template

During output QC, the QC personnel check whether the outputs produced follow the template and requirements. In addition, a sample check is performed against the database; the sample size varies depending on the study. In the case of tables, a complete check is done on the outputs produced.

2. Log QC

A SAS program always generates a log file. As part of Log QC, the QC personnel look for errors, warnings and critical notes in the log. Critical notes include, for example, "Missing values were generated as a result of performing an operation on missing values". The Log QC is aimed at making the log free of errors, warnings and such notes.

3. Code QC

The code QC involves step by step review of the code. The following aspects of the code need to be reviewed

a. Adherence to the requirements
b. Logic
c. Syntax
d. Optimization
e. Presentation
f. Completeness of datasets

A peer review is carried out during which the SAS Programmer is required to explain the code to the QC personnel. Any discrepancies are discussed and corrected by the SAS Programmer.

4. Parallel Programming

Parallel programming is included in the QC process at Kreara. It helps in both output QC and optimization of code. During parallel programming the QC personnel generate the outputs as per the requirements, without emphasis on the template or SAS programming standards and practices. The outputs generated by the parallel program are compared with the actual outputs for discrepancies. The parallel program may also be compared with the actual program in order to improve the quality of programming.

The following example illustrates parallel programming in the QC process.

If a SAS programmer calculates the confidence interval using the TINV function in a DATA step, as shown below, a parallel programmer or QC person can instead use a simple PROC MEANS step, and the two outputs are then compared.

PROC UNIVARIATE DATA=old;
BY id;
VAR anal1 anal2; /* two analysis variables assumed, matching n1/n2 etc. below */
OUTPUT out=new n = n1 n2
mean = mean1 mean2
std = std1 std2
;
RUN;

DATA new1;
SET new;
LENGTH cn1 cn2 $4 mn1 mn2 $6 sd1 sd2 $7 ci1 ci2 $16;
/* PUT() returns character values, so store them in new character variables */
cn1 = PUT(n1,4.);
cn2 = PUT(n2,4.);
mn1 = PUT(mean1,6.1);
mn2 = PUT(mean2,6.1);
sd1 = PUT(std1,7.2);
sd2 = PUT(std2,7.2);

IF n1 > 1 THEN DO; /* TINV needs df = n - 1 >= 1 */
clh1 = mean1 + TINV(0.975,n1-1) * std1 / SQRT(n1);
cll1 = mean1 - TINV(0.975,n1-1) * std1 / SQRT(n1);
ci1 = "(" || PUT(cll1,6.2) || "," || PUT(clh1,6.2) || ")"; /* (lower, upper) */
END;
IF n2 > 1 THEN DO;
clh2 = mean2 + TINV(0.975,n2-1) * std2 / SQRT(n2);
cll2 = mean2 - TINV(0.975,n2-1) * std2 / SQRT(n2);
ci2 = "(" || PUT(cll2,6.2) || "," || PUT(clh2,6.2) || ")";
END;
RUN;
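As an aside, the TINV-based interval is the usual t confidence interval mean ± t(0.975, n−1)·sd/√n, which can also be cross-checked entirely outside SAS. A minimal sketch (Python; the sample values are invented, and the critical value 2.262 for 9 degrees of freedom is taken from standard t tables):

```python
import math
import statistics

x = [4.1, 3.9, 4.4, 4.0, 4.3, 3.8, 4.2, 4.1, 4.0, 4.2]  # invented sample, n = 10
n = len(x)
mean = statistics.mean(x)
sd = statistics.stdev(x)            # sample standard deviation (n - 1 divisor)
t = 2.262                           # t(0.975, df = 9) from standard tables
half = t * sd / math.sqrt(n)        # half-width: t * sd / sqrt(n)
lo, hi = mean - half, mean + half
print(f"({lo:.2f}, {hi:.2f})")
```

The interval is symmetric about the mean, which is an easy property to verify when comparing two programs' outputs.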

The same output can be obtained more simply with PROC MEANS, which computes the confidence limits directly (LCLM/UCLM):

proc means data=old noprint; /* run on the raw data, not the summary data set */
BY id;
VAR anal1 anal2; /* the same two analysis variables assumed */
OUTPUT out=sum1
n = n1 n2
mean = mean1 mean2
std = std1 std2
lclm = ll1 ll2
uclm = ul1 ul2;
RUN;

The parallel program suggests a reduction in the number of steps, while its output is compared with the actual output as part of the output QC.

The Quality Control process is carried out until good quality outputs and code are obtained.

Cochran-Armitage Trend Test
Prepared by Sreeja E V

In clinical trials it is often of interest to investigate the relationship between increasing dosage and the effect of the drug under study. Usually the dose levels tested are ordinal and the effect of the drug is measured as a binary response. In such cases the Cochran-Armitage trend test is the test most frequently used for trend.

Here the null hypothesis (H0) is that there is no linear trend in the effect of the drug across increasing dose levels, and the alternative hypothesis (H1) is that there is a linear trend.

Consider an example. The data set effect contains hypothetical data from a case-control study. The study investigates whether the variable cascon is related to genotype status. Subjects have one of three genotype statuses, 1, 2 or 3, where 1 represents abnormal, 2 partially abnormal and 3 normal. The variable cascon has values 1='Case' and 2='Control'. The number of subjects in each group is given by the variable Count.

data effect;
input status cascon Count @@;
datalines;
1 1 15 1 2 26
2 1 19 2 2 10
3 1 20 3 2 3
;
run;

proc freq data=effect;
tables status*cascon / trend measures cl;
weight Count;
title 'Clinical Trial for case control study';
run;

The output will appear as follows

Cochran-Armitage Trend Test
*************************
Statistic (Z)          4.0252
One-sided Pr > Z       <.0001
Two-sided Pr > |Z|     <.0001

Sample Size = 93
We consider the two-sided p-value, which tests against either an increasing or a decreasing alternative. The trend test gives Pr > |Z| < .0001, so the null hypothesis is rejected and we conclude that there is a trend in the binomial proportions of response across genotype status.
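The reported statistic can be reproduced by hand. With scores s_i = 1, 2, 3, case counts x_i and group totals n_i, the trend statistic is Z = Σ s_i(x_i − n_i·p̄) / √(p̄(1 − p̄)[Σ n_i s_i² − (Σ n_i s_i)²/N]), where p̄ is the overall case proportion. A quick cross-check outside SAS (Python, using the counts from the effect data set above):

```python
import math

scores = [1, 2, 3]           # ordinal genotype status
cases = [15, 19, 20]         # cascon = 1 counts per status
controls = [26, 10, 3]       # cascon = 2 counts per status

n = [x + y for x, y in zip(cases, controls)]   # per-status totals
N = sum(n)
pbar = sum(cases) / N                          # overall case proportion

num = sum(s * (x - ni * pbar) for s, x, ni in zip(scores, cases, n))
var = pbar * (1 - pbar) * (
    sum(ni * s * s for s, ni in zip(scores, n))
    - sum(ni * s for s, ni in zip(scores, n)) ** 2 / N
)
z = num / math.sqrt(var)
print(round(abs(z), 4))      # agrees with the Z in the SAS output
```

The sign of Z depends on the scoring direction, so the absolute value is what should be compared with the printed statistic.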
Execute a set of files in SAS.
Prepared by Rupesh R

There are situations where, before running a SAS program, one may be required to run a list of other files (SAS or non-SAS). Our objective is to call them all in a single statement.

Suppose we need to run a set of SAS files before a SAS program. The usual approach is to execute them one by one, but it is possible to access a set of files or members from a single storage location with one statement, as follows.

Using a FILENAME statement we assign a fileref, storage, to an aggregate storage location.

filename storage "aggregate-storage-location";

Several files or members from this storage location can then be accessed by listing them in parentheses after the fileref in a single %INCLUDE statement:

%inc storage (Monthly, Quarterly);

Non-SAS files can also be accessed by placing quotation marks around the complete filenames listed inside the parentheses:

%inc storage ("file-1.txt","file-2.dat","file-3.cat");

Autocall SAS macros and formats

When the SAS files contain user-defined macros, the %INCLUDE statement above does not work. In this situation we use the SASAUTOS option. Usually the formats and macros are kept in separate folders, in which case they can be called as follows (FMTSEARCH takes librefs, while SASAUTOS takes filerefs):

libname project "project-path";
libname formlib "format-path";
filename mymacros "macro-path";
options fmtsearch=(formlib project) sasautos=(mymacros) mautosource;
The FMTSEARCH= option searches for formats in the following catalogs, in order

1. Work.formats
2. formlib.formats
3. project.formats

The SASAUTOS= option makes the macros in the referenced location available through the autocall facility.

The autocall facility is typically used when all user-defined macros are stored in a standard location; the macros are not compiled until they are actually needed.
When the formats and macros are in the same folder, it is better to use the following statement (here storage must be defined both as a libref, for FMTSEARCH, and as a fileref, for SASAUTOS):

options fmtsearch=(storage project) sasautos=(storage) mautosource;

Even though the WORK library is not specified, FMTSEARCH will search for formats in WORK by default, as mentioned earlier. The formats can also be called without the FMTSEARCH option if we know the name of the SAS file that contains all the formats for the particular study; a single %INCLUDE statement then suffices.

%inc storage (formats);

Here the SAS file named formats contains all the formats for the study.
WARNING: 25% of the cells have expected counts less than 5. Chi-Square may not be a valid test.
Prepared by Rajeev V


The chi-square test is one of the tools in statistics for comparing observed data with the data we would expect under a specific hypothesis. In SAS, PROC FREQ with the CHISQ option helps us assess the evidence for an association between two categorical variables.

While performing a chi-square test using PROC FREQ in SAS, one may sometimes encounter the warning: WARNING: 25% of the cells have expected counts less than 5. Chi-Square may not be a valid test.

This warning appears in the output when the expected count in any cell is less than 5 for the Pearson chi-square test. For example, the expected count for the cell in row 2, column 1 of the table below is (16 × 235)/766 = 4.9086. When this warning appears, the p-value of the Pearson chi-square test should be set aside in favour of the p-value of Fisher's exact test. This is illustrated below with an example:

data _cat_;
input grp $ cat1 $;   /* list input; both variables read as character */
cards;
Case ABN
Control NRM
………………….
………………….
;
run;

ods output Chisq=chitb_1(where =(statistic in ("Chi-Square")))
FishersExact=fishexctb_1(where=(label1="Two-sided Pr<=P"));
proc freq data=_cat_;
table grp*cat1/chisq expected nocol norow nopercent ;
run;
ods output close;
The output will be,

The p-value from the chi-square test above may not be a valid measure of association because the (2,1) expected count is less than 5. SAS automatically performs Fisher's exact test for 2×2 tables; for larger tables, exact tests are requested with the EXACT option on the TABLE statement or with an EXACT FISHER statement after the TABLE statement. SAS generates every possible table compatible with the given marginal totals and calculates the exact probability of each using Fisher's formula (1934). Summing the probabilities of the tables at least as extreme as the one observed gives a p-value that is used in the usual classical way as a test of the null hypothesis. From our example we would report the two-sided p-value (1.00) from Fisher's exact test.
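Both the expected-count rule and Fisher's formula are simple enough to verify by hand. The sketch below (Python, purely illustrative; the 2×2 tables are hypothetical, not the study data) first recomputes the 4.9086 expected count quoted above, then sums the hypergeometric probabilities of all tables no more probable than the observed one to obtain a two-sided Fisher p-value:

```python
from math import comb

# expected count for a cell = (row total x column total) / grand total
expected = 16 * 235 / 766
print(round(expected, 4))              # 4.9086, below the usual cutoff of 5

def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    r1, r2, c1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):   # hypergeometric probability that the (1,1) cell equals x
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # sum every table whose probability does not exceed the observed one
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs + 1e-12)

print(fisher_two_sided(1, 1, 1, 1))    # perfectly balanced hypothetical table
```

For a perfectly balanced table every arrangement is at least as probable as the observed one, so the two-sided p-value is 1, mirroring the p = 1.00 situation described above.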

Calculation of adjusted odds ratio using Proc logistic
Prepared by Resmi Sukumar
Adjusted odds ratios are the odds of a dichotomous event adjusted for, or controlling for, the other variables in the model. For example, in a model where the response is the presence or absence of a disease, the odds ratio for a binary exposure variable is adjusted for the levels of all other risk factors included in the model.

The adjusted odds ratio and its corresponding 95% confidence interval are obtained by performing logistic regression analysis; this technique is implemented in the SAS System using PROC LOGISTIC.

Logistic regression provides adjusted odds ratios when the adjustors are included as additional predictors; otherwise it provides unadjusted odds ratios.
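For a single binary exposure, the unadjusted odds ratio is simply the cross-product ad/bc of the 2×2 table, and a logistic model with that exposure alone reproduces it as exp(coefficient). A toy check (Python; the counts are made up for illustration):

```python
import math

# hypothetical 2x2 table: exposure (rows) by disease status (columns)
a, b = 10, 20    # exposed: diseased, non-diseased
c, d = 5, 40     # unexposed: diseased, non-diseased

or_crossprod = (a * d) / (b * c)     # unadjusted odds ratio = ad / bc
beta = math.log(or_crossprod)        # the logistic coefficient it implies
print(or_crossprod)                  # 4.0
```

Adding covariates to the model changes this estimate, which is exactly the adjustment discussed above.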

The general syntax of PROC LOGISTIC is:

PROC LOGISTIC DATA=dsn ;
MODEL depvar = indepvar(s)/options;
RUN;

Example:
Suppose we are conducting a case-control study to evaluate the relationship between case/control status and different genotypes. The genotype statuses are 'abnormal' and 'normal', with normal as the referent group. For this purpose we generate a data set as follows.

/*Random data generation*/

data genestat;
do i=1 to 50;
gene=round(1 + (3-1)*uniform(10));
age =round(1+(3-1)*uniform(15));
ethnic=round(1+(3-1)*uniform(14));
status=round(1+(2-1)*uniform(16));
cascon=round(1+(2-1)*uniform(17));
output;
end ;
drop i;
run;

/*formats */
proc format;
value gene 1='Gene1' 2='Gene2' 3='Gene3';
value cascon 1='Case' 2='Control';
value age 1='<18' 2='18-35' 3='>35';
value ethnic 1='Asian' 2='Caucasian' 3='Other';
value status 1='Abnormal' 2='Normal';
run;

proc sort data=genestat ;
by gene;
format gene gene. cascon cascon. status status. age age. ethnic ethnic.;
run;

Let's consider a model with the variables status, age and ethnic as predictors.

/*proc logistic for calculating adjusted odds ratio*/
ods trace on;
ods output CLoddsWald=gene_cancer(where=(Effect="status Abnormal vs Normal"));

proc logistic data=genestat;
class status/param=ref ref=last;/* reference parameter ref=last
i.e. ref='Normal'*/
model cascon=status age ethnic / clodds=both;/* clodds =gives WALD confidence Interval for odds ratio*/
by gene ;
run;

ods output close;
ods trace off;

The MODEL statement names the response variable and the explanatory effects, including covariates, main effects, interactions, and nested effects. The CLASS statement names the classification variables to be used in the analysis. The CLASS Statement permits specification of a reference level. By default, the lowest level of the variable placed in the CLASS Statement is treated as the reference category. The BY statement is used to obtain separate analyses on observations in groups defined by the BY variables.

The output table is obtained as

This shows that for Gene1 and Gene2 the abnormal genotype status is less likely in cases than in controls, while the adjusted odds ratio for Gene3 shows that the odds of the abnormal genotype are higher in the case group than in the control group.

Let's now consider the model where status is the only predictor.

/*proc logistic for calculating unadjusted odds ratio*/
ods trace on;
ods output CLoddsWald=gene_cancer(where=(Effect="status Abnormal vs Normal"));

proc logistic data=genestat;
class status/param=ref ref=last;/* reference parameter ref=last
i.e. ref='Normal'*/
model cascon=status / clodds=both;/* clodds =gives WALD confidence interval
for odds ratio*/
by gene ;
run;

ods output close;
ods trace off;

The output table is obtained as


This shows that the odds of the abnormal genotype occurring in the case group are higher than in the control group for Gene3, while they are about the same for Gene1 and Gene2.

Scrum - Agile project management



Scrum is a process skeleton that includes a set of practices and predefined roles. The main roles in Scrum are the ScrumMaster, who maintains the processes and acts much like a project manager; the Product Owner, who represents the stakeholders; and the Team, which includes the developers.
During each sprint, a 15-30 day period (length decided by the team), the team creates an increment of potentially shippable (usable) software. The set of features that go into each sprint comes from the product backlog, a prioritized set of high-level requirements of work to be done. Which backlog items go into the sprint is determined during the sprint planning meeting, at which the Product Owner informs the team of the items in the product backlog that he wants completed. The team then determines how much of this it can commit to complete during the next sprint. During the sprint, no one may change the sprint backlog, which means that the requirements are frozen for that sprint.
There are several implementations of systems for managing the Scrum process which range from yellow stickers and white-boards to software packages. One of Scrum's biggest advantages is that it is very easy to learn and requires little effort to start using.

6.10.08

Maintaining document standards at Kreara
Prepared by Sreedevi Menon

Documentation is an integral part of clinical trial studies and maintenance of standards while preparing various documents is of utmost importance. The documents related to clinical trial study may be anything from SOPs to documents related to project management, data management, statistics or SAS.

At Kreara, all or some of such documents are prepared as per requirement of the study. The emphasis is not only to make these documents as informative as possible but also to convey the information in a concise and effective manner. Further, effort is taken to maintain the quality of the information contained and the way of presentation.

An SOP for General Documentation Guidelines is maintained at Kreara and all personnel in the organization are trained on the same. This standard operating procedure describes the various guidelines to be followed during preparation and amendment of SOPs in general. It also presents guidelines for preparation of project related documents like the naming conventions to be followed.

In addition to this, individual SOPs are maintained for each and every document and to maintain standards, templates with instructions regarding the contents, layout and formatting of the contents are maintained in a central repository. The personnel responsible for writing the documents are required to follow the format in the templates while preparing the documents. The QC personnel check for any non-compliance to templates in the document in addition to the relevance of contents. Further the QA manager is responsible to ensure that the process is followed correctly.

The personnel at Kreara are trained in the SOPs related to document writing. A great deal of exposure in the related field is provided to them so that they are capable of preparing informative and effective documents.

3.10.08

SAS COLON MODIFIER “ =: ”
Prepared by : Sujith K G.

Usually we use the SUBSTR() function to select strings that start with specific characters, or to select part of a string. We can also use the SAS colon modifier "=:" for the same task. Both methods allow comparison of values based on the prefix of a text string, and both are explained in the following example.

Here we have a dataset adverse which contains patient id and name of adverse event.

data adverse;
input id $ ae $; /* id read as character to preserve leading zeros such as 001 */
cards;
001 asthma
002 chesttightness
003 dizziness
004 cold
005 headache
006 dysphonia
007 commoncold
008 nausea
009 cough
;
run;

We are interested in flagging the adverse events starting with "co" (cold, commoncold and cough) as Yes and the others as No.

This can be performed by using the SUBSTR() function as described below

data event;
set adverse;
if lowcase(substr(ae,1,2))='co' then res="Yes";
else res="No";
run;

The same purpose can be served by the colon modifier "=:", which compares only as many leading characters as the shorter operand has:

data event;
set adverse;
if lowcase(ae) =: "co" then res="Yes";
else res="No";
run;

As the examples show, with the SUBSTR function we must specify the starting position and length to extract the first two letters, whereas the colon modifier requires no such specification.
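The same prefix test exists in most languages; for instance, Python's startswith plays the role of the "=:" comparison. A purely illustrative sketch outside SAS:

```python
events = ["asthma", "cold", "commoncold", "cough", "nausea"]

# flag events whose lowercased name begins with "co", like lowcase(ae) =: "co"
flags = {ae: ("Yes" if ae.lower().startswith("co") else "No") for ae in events}
print(flags)
```

As with the colon modifier, no position or length needs to be specified; the prefix itself determines how many characters are compared.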