Since 2012, Prize4Life and The DREAM Project have run two computational challenges using data from
to promote computational ALS research. The data used in these challenges are available on the Data tab in the form in which they were provided during the challenges (separately from the general
data). General and challenge-specific information about the data files is given below.
Before the launch of the
database, we held a $50,000 challenge competition as a demonstration project to predict the progression of ALS. The DREAM-Phil Bowen ALS Prediction Prize4Life was developed and run in partnership with The DREAM Project (Dialogue for Reverse Engineering Assessments and Methods). It ran on the InnoCentive prize platform and was sponsored by Nature, Popular Science, and The Economist.
The challenge used a subset of the
database consisting of 1822 patients. Solvers had to develop an algorithm that used the first three months (time 0–3 months from trial onset) of clinical trial data available for a given patient to predict the slope of change in that patient’s disease over months 3–12, based on a widely used functional rating scale, the ALS Functional Rating Scale (ALSFRS). Prediction success was measured by root-mean-square deviation (RMSD) and Pearson correlation. For those interested, the specific data subset used for the challenge is available
(see more below).
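As an illustration of the two success measures named above, the following is a minimal Python sketch of RMSD and Pearson correlation between predicted and observed slopes. The function names `rmsd` and `pearson` are hypothetical, and this is not the organizers' scoring code; it only shows the standard definitions of the two measures.

```python
import math


def rmsd(predicted, observed):
    """Root-mean-square deviation between two equal-length sequences."""
    if len(predicted) != len(observed) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def pearson(predicted, observed):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(predicted) != len(observed) or len(predicted) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_p == 0 or var_o == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_p * var_o)
```

A lower RMSD means predictions are closer to the observed slopes in absolute terms, while a Pearson correlation near 1 means the predictions rank and scale patients consistently with the observations, even if they are offset.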
The challenge ran from July 11 to October 15 and drew 1073 solvers from 64 countries. The first prize was split between two teams: team y7717, consisting of Lilly Fang, a recent JD, and Dr. Lester Mackey, a mathematician, both of Stanford University; and team Sentrana, consisting of Mr. Guang Li and Dr. Liuxia Wang of the Washington DC-based scientific marketing group Sentrana and its spin-off Origent Data Sciences. A second prize was awarded to Dr. Torsten Hothorn from the University of Zurich.
You can read more about the specific challenge results in a recent publication summarizing the challenge: Küffner R, Zach N et al. Crowdsourced analysis of clinical trial data to predict amyotrophic lateral sclerosis progression. Nat Biotechnol. 2015 33:51-7. You can also read more about Dr. Hothorn’s approach in: Hothorn T, Jung HH. RandomForest4Life: a Random Forest for predicting ALS disease progression. Amyotroph Lateral Scler Frontotemporal Degener. 2014 15(5-6):444-52
The data used in the challenge were similar to those available in the full
database (as of 2012), except the data were less standardized and contained more errors, including misspellings and lack of standard units for lab data. As the challenge dataset only included 1822 (out of 8635) patients, a few less common measures contained within the full
dataset were not included.
Another important difference is that the challenge data were not subdivided into different types of data (demographic, vital signs, etc.) as they are in the
data. Rather, in the challenge dataset all of these data types are found in the same file one after another. In addition, while the
data are in a tabular format and saved as an Excel file, the challenge data were provided in a non-tabular format as a .csv file.
To allow you to view the data as they were presented for the challenge, the data are separated into training data (n=918; given in full to the solvers to train their models), test data (n=279; used during the challenge for the leaderboard, i.e. solvers could submit their code to be tested on the first three months of these data, and the results were published on the challenge's leaderboard), and validation data (n=625; used only at the very end of the challenge to test the submitted algorithms and declare a winner). During the challenge, solvers never saw the test and validation sets, and all testing was done only on the first three months of data. Here, however, the validation and test data are available in full to allow assessment.
Specific considerations for looking at the challenge data:
The algorithms developed for the challenge were required to predict the slope of change between months 3 to 12 after trial onset. The slopes are available to be downloaded
. In order to determine slopes, solvers were required to do the following:
- To ensure consistency when designing the prediction algorithms, solvers were asked to predict slopes based on 10 questions (either questions 1-10 in ALSFRS or questions 1-9 plus 10a (dyspnea) in ALSFRS-R). For patients with ALSFRS scores, their ALSFRS Total sum
should be used. For patients with ALSFRS-R scores, the total is generated using the sum of the following parameters:
[ALSFRS-R: Speech, Salivation, Swallowing, Handwriting, Cutting (with and without Gastrostomy), Dressing and Hygiene, Turning in Bed, Walking, Climbing Stairs, Dyspnea]
(the results of questions 10b and 10c are discarded when calculating the sum). In both cases, the number should range between 0–40. See more here
- Merge ALSFRS questions 5a (Cutting without Gastrostomy) and 5b (Cutting with Gastrostomy).
- Remove all ALSFRS values for time points at which not all 10 ALSFRS questions are available.
- Convert days to months: m = (days / 365.24) * 12
- Slopes between months 3 and 12 are then calculated as:
(ALSFRS(LastVisit) - ALSFRS(FirstVisit)) / months(LastVisit - FirstVisit)
- First Visit: assign the first visit more than 3 months (= 92 days) after the first time ALSFRS was fully measured (the Reference Visit) as the "First Visit" (that is, the first visit after 3 months).
- Note: for the calculation, set the first visit with 10 ALSFRS questions as the Reference Visit for slope calculation and hence calculate all differences relative to this visit. Note that the Reference Visit is not necessarily at delta=0.
- Last Visit:
- If there are multiple visits > 12th month, assign the earliest visit > 12th month (from the Reference Visit for slope calculation) as ‘Last Visit’.
- Otherwise: use final assessment of ALSFRS as “Last Visit”.
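The steps above can be sketched in Python as follows. This is an illustrative reading of the rules, not the organizers' code. The function name `alsfrs_slope` and the input layout (a list of `(delta_days, scores)` pairs, with questions 5a/5b already merged into one score) are assumptions made for the example. It also assumes that "after the 12th month" is tested using the same days-to-months conversion, and that no slope is returned when no visit falls after day 92 or when the First and Last Visit coincide.

```python
DAYS_PER_YEAR = 365.24


def days_to_months(days):
    """Convert days to months as m = (days / 365.24) * 12."""
    return days / DAYS_PER_YEAR * 12


def alsfrs_slope(visits):
    """Slope of the ALSFRS total between months 3 and 12.

    visits: list of (delta_days, scores); scores is a list of the 10
    question scores, with None marking a missing answer (assumed layout).
    Returns points per month, or None if the slope cannot be computed.
    """
    # Keep only time points where all 10 questions are available.
    complete = sorted(
        (delta, sum(scores))
        for delta, scores in visits
        if len(scores) == 10 and all(s is not None for s in scores)
    )
    if not complete:
        return None

    # Reference Visit: first fully measured visit; all deltas are relative to it.
    ref_delta = complete[0][0]
    relative = [(delta - ref_delta, total) for delta, total in complete]

    # First Visit: first visit more than 92 days after the Reference Visit.
    after_3_months = [v for v in relative if v[0] > 92]
    if not after_3_months:
        return None
    first = after_3_months[0]

    # Last Visit: earliest visit after month 12, otherwise the final assessment.
    after_12_months = [v for v in relative if days_to_months(v[0]) > 12]
    last = after_12_months[0] if after_12_months else relative[-1]

    span_months = days_to_months(last[0] - first[0])
    if span_months <= 0:
        return None  # assumption: First and Last Visit coincide, no slope
    return (last[1] - first[1]) / span_months
```

A negative result indicates functional decline; for example, a patient who loses 6 ALSFRS points between day 100 and day 400 after the Reference Visit has a slope of roughly -0.61 points per month.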
The DREAM ALS Stratification Prize4Life Challenge
The challenge webpage including full challenge information is available
While the ALS Prediction Prize brought in significant benefit for the ALS community, it focused on predicting the disease progression over time for the entire population of ALS patients. As it turned out, such models typically performed best for the "average" patient but were less able to predict the disease course of very slow or fast progressing patients. Prediction for the entire population is therefore limited by the inherent heterogeneity of the ALS manifestation. Therefore, with the
DREAM ALS Stratification Prize4Life Challenge
(ALS Stratification Challenge), a collaboration between Prize4Life, the DREAM Project and Sage Bionetworks, we aimed to develop tools that accurately assign individual patients to specific subgroups of patients with clear clinical implications for either survival or disease progression. There are good reasons to believe that the population contains identifiable, meaningful subgroups of ALS patients that are more homogeneous but have not yet been fully characterized.
The goal of the ALS Stratification Challenge is to identify subgroups of patients with distinct clinical outcomes that can be distinguished by their clinical features. Challenge participants will create models to (1) cluster patients according to the outcome clinical targets, and (2) based on this classification, identify for each patient a small subset of predictive features and predict the outcome clinical targets. Performance will be assessed based on the quality of the predictions. The challenge prize was $28,000, raised through a crowdfunding campaign by over 100 donors.
Participants are asked to use data collected from patients over a 3 month period to predict one of the following:
- Disease progression, as measured using the ALS functional rating scale (ALSFRS), the score used for monitoring ALS patients. Participants are provided with ALSFRS measured between 0-3 months and are asked to predict the slope of ALSFRS changes between 3-12 months.
- Survival, given as the probability of death within 0-12, 0-18, and 0-24 months from trial onset.
There are two data sources within this challenge - data collected within clinical trials from the
database and data collected directly from patients within a community through the Irish and Italian ALS Registries.
The challenge ran from June 22 to September 30, 2015, with 288 participants and 80 final submissions by 32 teams from 15 countries.
- In Sub-challenge 1, the winner, Team UglyDuckling is from the Department of Computer Science and Information Engineering at the National Cheng Kung University, Taiwan. The team comprises Dr. Wen-Chieh Fang, Huan-Jui Chang, Chen Yang, Prof. Hsih-Te Yang, and Prof. Jung-Hsien Chiang.
- In Sub-challenge 2 and Sub-challenge 4, the winner, Team Guanlab_Umich, is from the Department for Computational Medicine and Bioinformatics at the University of Michigan, USA, and consists of Prof. Yuanfang Guan.
- In Sub-challenge 3, the winner, Team Jinfeng_Xiao, is from the Center for Biophysics and Quantitative Biology at the University of Illinois at Urbana-Champaign, USA, and consists of Jinfeng Xiao.
The data used in the challenge were similar to those available in the full
database (as of end of Dec. 2015).
As in the first challenge (see above), the data types were merged into one file. Data are provided in a tabular format. Each line represents a single feature with several pipe-delimited columns. The columns specify the patient ID (column 1), the feature context (form name, column 2), the unique feature name (column 3), the feature value (column 4), and, if applicable, the feature’s unit (column 5) and time delta in days (column 6). Beyond the SubjectID, the data therefore contain the different assessments and their results, including the data type, the specific measure, its value, the unit of measurement, and the delta (time in days from trial onset when the assessment was made). You can identify these variables through the column name in the data file.
The data format lists:
PatientID|datatype (form name)|feature name|value|units|delta
7824|Vital Signs|Blood Pressure (Systolic)|140|MMHG|14
Patient 7824 had, at delta = 14 (day 14 from the beginning of measurement), the following vital signs: a blood pressure (systolic) of 140 MMHG and a pulse of 76 BEATS/MIN. At a delta of 0 (first day of measurements) their ALSFRS total is 30.
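To make the six-column layout concrete, here is a minimal Python sketch that parses one pipe-delimited line into named fields. The `Record` type and `parse_line` function are hypothetical names introduced for illustration. The sketch assumes that values are kept as text, that an empty unit or delta means "not applicable", and that deltas are whole days. It does not handle the different layout used for Adverse Events and Concomitant Medication.

```python
from typing import NamedTuple, Optional


class Record(NamedTuple):
    patient_id: str
    form: str              # feature context (form name)
    feature: str           # unique feature name
    value: str             # kept as text; callers convert as needed
    unit: Optional[str]    # None when no unit applies
    delta: Optional[int]   # days from trial onset, None when absent


def parse_line(line):
    """Parse one 'PatientID|form|feature|value|units|delta' line."""
    fields = [field.strip() for field in line.strip().split("|")]
    # Pad short lines so that missing trailing columns become empty strings.
    fields += [""] * (6 - len(fields))
    patient_id, form, feature, value, unit, delta = fields[:6]
    return Record(
        patient_id=patient_id,
        form=form,
        feature=feature,
        value=value,
        unit=unit or None,
        delta=int(delta) if delta else None,
    )


record = parse_line("7824|Vital Signs|Blood Pressure (Systolic)|140|MMHG|14")
```

Here `record.feature` is `"Blood Pressure (Systolic)"`, `record.unit` is `"MMHG"`, and `record.delta` is the integer 14.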
Because of the limit on the number of features that could be used in this challenge, we added several composite scores combining several intercorrelated ALSFRS questions: Q1-3 are [
], Q4-5 are [
], Q6-7 are [
], Q8-9 are [
] and either Q10 or R1, whichever is available, is [
]. For ALSFRS-R there is also the composite score [
], a combination of questions R1-3. Similarly, for FVC, the average of the (up to three) attempts is available as [
FVC and FVC_percent].
Note that the data format for Adverse Events and Concomitant Medication differs from that of the rest of the data:
PatientID|Adverse Event|High_level_term|Lowest_level_term, preferred_term, System_Organ_Class, Outcome, Severity|Unit|Start Delta, Stop Delta
PatientID|Concomitant Medication|Medication Coded|Dose, Frequency, Route|Unit|Start Delta, Stop Delta
The data are divided into three datasets: two training datasets, one derived from the
dataset as of 2013 and the second from the additional trials added in 2015; a leaderboard set, used during the challenge so that participants could submit their models for assessment by the organizers, with results published on the challenge webpage; and a final validation set, used by the organizers to assess the performance of the models at the end of the challenge in order to identify the best-performing teams.
In the challenge, participants had to predict either ALSFRS slope or survival. Therefore, for each dataset there is a calculation of Slope prediction (see above for how it was calculated) and survival ([
]=time in days since trial onset). For subjects who did not die (status=0), Time_event is the delta indicated by the trial managers or the delta of the last time the patient was assessed.