27 Nov – Is near enough good enough? How reliable is your sleep study data and analysis?
Is every sleep study the same? Are the results of a home sleep study as accurate as those of a hospital study monitored by a technician or nurse? Would a human and a computer analysing the same study produce different results, and would two humans even agree with each other?
In an attempt to answer some of these questions, here at TSGQ we are researching the specific ways the analysis is performed, to see whether they affect the quality and reliability of the results.
Currently the gold standard for sleep study analysis is for a trained human to score the study manually against specific criteria. The criteria used play an important role in the outcome, as they have changed over the years. For instance, the criteria for marking a hypopnoea vary between requiring a 30% or a 50% reduction in flow, accompanied by either an arousal from sleep or a 3% or 4% oxygen desaturation. More events are marked when the criteria are only a 30% reduction and a 3% desaturation, giving a higher AHI. The diagnostic thresholds based on AHI have not changed, however, and Ruehland et al11 found that up to 40% of patients may be misclassified depending on the hypopnoea rule used.
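The sensitivity of the AHI to the hypopnoea rule can be illustrated with a short sketch. The events and thresholds below are invented for demonstration and deliberately simplified; they do not reproduce the exact AASM definitions:

```python
# A minimal sketch of how the hypopnoea scoring rule changes the AHI.
# Event data and rule parameters are hypothetical, for illustration only.

def ahi(events, tst_hours, min_flow_reduction, min_desat, allow_arousal):
    """Apnoeas plus qualifying hypopnoeas per hour of sleep."""
    count = 0
    for e in events:
        if e["type"] == "apnoea":
            count += 1  # apnoeas are scored under every rule
        elif e["flow_reduction"] >= min_flow_reduction and (
            e["desat"] >= min_desat or (allow_arousal and e["arousal"])
        ):
            count += 1
    return count / tst_hours

# Hypothetical events from one hour of sleep:
# flow_reduction and desat in percent; arousal marks an EEG arousal.
events = [
    {"type": "apnoea",    "flow_reduction": 95, "desat": 4.0, "arousal": True},
    {"type": "hypopnoea", "flow_reduction": 35, "desat": 3.0, "arousal": False},
    {"type": "hypopnoea", "flow_reduction": 55, "desat": 2.0, "arousal": True},
    {"type": "hypopnoea", "flow_reduction": 40, "desat": 3.5, "arousal": False},
]

# Lenient rule: 30% flow reduction with a 3% desaturation or an arousal.
lenient = ahi(events, 1.0, min_flow_reduction=30, min_desat=3, allow_arousal=True)
# Stricter rule: 50% flow reduction with a 4% desaturation, no arousal criterion.
strict = ahi(events, 1.0, min_flow_reduction=50, min_desat=4, allow_arousal=False)
print(lenient, strict)  # 4.0 vs 1.0 from the very same recording
```

The same night's data crosses common diagnostic thresholds under one rule but not the other, which is the misclassification risk the Ruehland paper quantifies.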
If two different scientists were to score the same study, how similar would the results be? Magalang et al9 showed that sleep stage agreement between scientists is strong, with a kappa of 0.78 for all-stage agreement, while respiratory indices are highly correlated, with a correlation coefficient of 0.95. However, large variations exist, particularly in assessing sleep depth, with differences in the amount of N1 and N3 sleep scored, as Younes et al7 reported.
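For readers unfamiliar with the kappa statistic cited above: it corrects raw agreement for the agreement two scorers would reach by chance alone. A minimal sketch, using ten invented epochs of staging:

```python
# Cohen's kappa: chance-corrected agreement between two scorers,
# applied here to epoch-by-epoch sleep staging. Labels are invented.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two scorers beyond what chance would predict."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    stages = set(labels_a) | set(labels_b)
    # Chance agreement: probability both scorers pick the same stage at random
    expected = sum(freq_a[s] * freq_b[s] for s in stages) / n ** 2
    return (observed - expected) / (1 - expected)

# Two scorers' labels for ten 30-second epochs (W, N1-N3, R = REM)
scorer1 = ["W", "N1", "N2", "N2", "N3", "N3", "N2", "R", "R", "W"]
scorer2 = ["W", "N2", "N2", "N2", "N3", "N2", "N2", "R", "R", "W"]
kappa = cohens_kappa(scorer1, scorer2)
print(round(kappa, 2))  # raw agreement is 0.80; kappa corrects it to 0.73
```

Note that the two disagreements in this toy example are exactly the hard cases the literature describes: N1 versus N2, and the depth boundary between N2 and N3.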
Recently, automatic computerised scoring has become more widely adopted in an attempt to reduce the time spent on manual scoring. Studies1,2,3 have shown that some autoscoring algorithms appear to be as accurate as human scorers. Other studies have shown that automatic scoring can improve concordance between scorers, particularly for sections that are difficult to classify5.
The research to date does not examine agreement on other sleep metrics, including sleep and REM latencies and the scoring of limb movements. TSGQ has been gathering data in these areas for comparison.
Additionally, there is limited data comparing autoscoring of Type 1 attended studies with Type 2 home-based PSG. Because Type 2 devices are unattended, signal loss is usually higher and signal artefact more common, which can reduce the clarity of the analysis for both human and computer scorers.
With increases in home sleep testing and Medicare billing, the government has scrutinised this practice. More people need sleep studies for diagnosis, and many cannot access an in-laboratory test in their area. As study numbers grow, so does the analysis burden, because each study takes considerable time to score manually. But how appropriate is it to use a computer to analyse these studies? It may help improve accuracy, provided a proper human review is performed.
Anecdotal experience has shown me that autoscoring is ‘mostly’ accurate and can be a definite time-saving tool. While it can be accurate for some parts of a study, other sections can be very inaccurate, with misclassified respiratory events, misplaced arousals and incorrect sleep staging. A human review can fix these errors, but how much human review is really required, and how much trust can be placed in the computer-derived analysis?
There is real potential for autoscoring to reduce the financial burden of sleep studies on Medicare. However, it is important to understand the clinical implications and to determine the most appropriate method of analysis, so that lawmakers don’t short-change the public on the accuracy of sleep results. It is also important that providers ensure a proper, accurate analysis is performed for every metric reported in each sleep study. Is near enough good enough?
Reference List
1 https://www.ncbi.nlm.nih.gov/pubmed/25902809 – Computer-Assisted Automated Scoring of Polysomnograms Using the Somnolyzer System.
2 https://www.ncbi.nlm.nih.gov/pubmed/20829636 – Computer-assisted sleep classification according to the standard of the American Academy of Sleep Medicine: validation study of the AASM version of the Somnolyzer 24 × 7.
3 https://www.ncbi.nlm.nih.gov/pubmed/18002875 – Automatic sleep classification according to Rechtschaffen and Kales.
4 https://www.ncbi.nlm.nih.gov/pubmed/15838184 – An E-Health Solution for Automatic Sleep Classification according to Rechtschaffen and Kales: Validation Study of the Somnolyzer 24 × 7 Utilizing the Siesta Database.
5 https://www.ncbi.nlm.nih.gov/pubmed/27448418 – Minimizing Interrater Variability in Staging Sleep by Use of Computer-Derived Features.
6 https://www.ncbi.nlm.nih.gov/pubmed/27070243 – Staging Sleep in Polysomnograms: Analysis of Inter-Scorer Variability.
7 https://www.ncbi.nlm.nih.gov/pubmed/29351821 – Reliability of the American Academy of Sleep Medicine Rules for Assessing Sleep Depth in Clinical Practice.
8 https://www.ncbi.nlm.nih.gov/pubmed/23565004 – Agreement in computer-assisted manual scoring of polysomnograms across sleep centers.
9 https://www.ncbi.nlm.nih.gov/pubmed/23565005 – Agreement in the Scoring of Respiratory Events and Sleep Among International Sleep Centers.
10 https://www.ncbi.nlm.nih.gov/pubmed/26350603 – Agreement in the Scoring of Respiratory Events Among International Sleep Centers for Home Sleep Testing.
11 https://www.ncbi.nlm.nih.gov/pubmed/19238801 – The new AASM criteria for scoring hypopneas: impact on the apnea hypopnea index.