Resource-frugal classification and analysis of pathology slides using image entropy

https://doi.org/10.1016/j.bspc.2020.102388

Abstract

Pathology slides of lung malignancies are classified using resource-frugal convolutional neural networks (CNNs) that may be deployed on mobile devices. In particular, the challenging task of distinguishing adenocarcinoma (LUAD) and squamous-cell carcinoma (LUSC) lung cancer subtypes is approached in two stages. First, whole-slide histopathology images are downsampled to a size too large for CNN analysis but large enough to retain key anatomic detail. The downsampled images are decomposed into smaller square tiles, which are sifted based on their image entropies. A lightweight CNN produces tile-level classifications that are aggregated to classify the slide. The resulting accuracies are comparable to those obtained with much more complex CNNs and larger training sets. To allow clinicians to visually assess the basis for a classification, that is, the image regions that underlie it, color-coded probability maps are created by overlapping tiles and averaging the tile-level probabilities at a pixel level.

Introduction

“Deep learning” approaches have been applied to a wide range of medical images with the objective of improving diagnostic accuracy and clinical practice. Many efforts have focused on images that are inherently small enough to be processed by convolutional neural networks (CNNs), or which can be downsampled without loss of fine features necessary to the classification task [[1], [2], [3], [4], [5], [6]]. In general, CNNs perform best at image sizes below 600×600 pixels; larger images entail complex architectures that are difficult to train, execute slowly, and require significant memory resources. Among the most challenging medical images to analyze computationally are digital whole-slide histopathology images, which may be quite large — 10,000 to more than 100,000 pixels in each dimension. Such proportions make even traditional visual inspection by trained pathologists difficult.

To render whole-slide images amenable to CNN analysis, researchers have decomposed them into much smaller tiles that are processed individually. A probability framework is then applied to the tile-level classifications to classify the slide. Khosravi et al. [6], for example, use tiles of 224×224 and 299×299 pixels, sizes typical of the complex CNN architectures employed to date. Coudray et al. [3] use 512×512 tiles. Such tiles correspond to an extremely small region of the original slide: for a typical 1 Gigapixel (GP) slide image, even the larger tiles used by Coudray et al. [3] each cover only about 0.026% of the slide area. Moreover, not all tiles are equally meaningful for classification. Coudray et al. [3] exclude tiles whose background exceeds 50% of the image, while Yu et al. [1] take a staged approach: they first decompose a whole-slide image into non-overlapping 1000×1000-pixel squares and select the 200 densest, then further subdivide the selected squares into tiles of unspecified size (presumably comparable to those of Khosravi et al. [6], since similar CNN architectures are used).
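For concreteness, the minimal Python sketch below illustrates the generic decomposition step these pipelines share, cutting an image into non-overlapping square tiles. It is an illustration only, not a reproduction of any cited group's code; the tile size is an arbitrary parameter and edge remainders are simply dropped.

```python
from PIL import Image


def tile_image(image: Image.Image, tile_size: int) -> list:
    """Cut an image into non-overlapping tile_size x tile_size squares.

    Edge remainders smaller than a full tile are dropped.
    """
    width, height = image.size
    return [
        image.crop((left, top, left + tile_size, top + tile_size))
        for top in range(0, height - tile_size + 1, tile_size)
        for left in range(0, width - tile_size + 1, tile_size)
    ]
```

At the tile size used by Coudray et al. [3], a single 512×512 tile covers 512 × 512 / 10⁹ ≈ 0.026% of a 1 GP slide, the figure quoted above.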

The most successful recent studies have achieved performance comparable to that of experienced pathologists. Among the more challenging tasks is distinguishing LUAD from LUSC tissue. Coudray et al. [3] obtained an AUC of 0.950 for LUAD/LUSC classification when tile-level classification probabilities were aggregated by averaging. Yu et al. [1] report AUC values of 0.883–0.932. These studies have involved large datasets and complex CNN architectures. Coudray et al. [3] utilize the Inception v3 architecture, which has 48 convolutional layers; at their tile size of 512×512 pixels, an Inception v3 CNN is about 1.8 Gigabytes (GB) in size. Yu et al. [1] utilize several architectures: VGGNet-16, Residual Network-50 (ResNet), GoogLeNet, and AlexNet. The smallest of these, AlexNet, has only eight functional layers but nonetheless imposes a high bandwidth burden: its design involves 60 million parameters and requires over 729 million floating-point operations (FLOPs) to classify a single image [7]. Coudray's [3] dataset contained 567 LUAD slides and 608 LUSC slides, while Yu et al. [1] employed 427 LUAD slides and 457 LUSC slides. Another study [8] of lung adenocarcinoma slides used 422 whole-slide LUAD images obtained from a single source.

A longstanding impediment to clinical adoption of machine-learning techniques is their frequent inability to convey the rationale behind a classification, diagnosis or other output [9]. Black-box models whose reasoning is opaque or impervious to retrospective analysis may pose clinical dangers that outweigh the benefits of a computational approach [10]. Until recently, CNNs have fallen squarely within the black-box category, but techniques such as gradient-weighted class activation maps (“Grad-CAM”) [11] have pried the box open, highlighting the image regions important to a CNN classification. Yu et al. [1], for example, show Grad-CAM images of individual tiles processed by their system.

While the ability to visualize regions of an image important to classification is useful, it does not necessarily address clinical acceptance. Yu’s [1] Grad-CAM images each represent perhaps 0.0045% of the total slide area. Painstaking analysis of many such images could help validate the proposition that the CNN is “looking” where it should. But for any given slide classification, Grad-CAM cannot realistically illuminate its underlying basis; the Grad-CAM images are too small and a readable map of them superimposed on the slide would be impossibly large. Moreover, identifying which image regions attract the attention of a CNN does not reveal the underlying rationale for a classification — only the pixels on which the classification, whatever its basis, depended most strongly.

The computational demands of CNNs pose an additional challenge to their widespread deployment, particularly given the trend toward telemedicine and increasing clinical use of mobile devices [12]. Although the computational capacity of tablets and phones continues to grow — today’s devices typically include graphics processing units and high-end devices may feature dedicated neural processing units — the processor typically runs many background and foreground tasks that compete for cycles and battery life. Indeed, compute-intensive applications can increase a device’s thermal profile and trigger throttling, which slows computation altogether. In parallel with efforts to augment hardware capacity, researchers have attempted to reduce the memory and energy burden imposed by CNNs. One promising approach uses quantized representations of internal CNN values to make them more compact. Quantization may be employed to make CNN training tractable in constrained execution environments [13] or to simplify processing of already-trained models [14,15]. Practical medical deployments will involve trained CNNs, quantization of which has been shown to accelerate inference and reduce memory footprint in some applications without adversely affecting classification performance [16].
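As a concrete illustration of the second approach, the sketch below applies post-training dynamic-range quantization to a hypothetical trained Keras tile classifier using TensorFlow Lite. The model path is a placeholder, and this is not necessarily the toolchain used in the cited studies [13-16]; it simply shows how weights of an already-trained model can be stored as 8-bit integers to shrink the serialized model for on-device inference.

```python
import tensorflow as tf

# "tile_classifier.h5" is a placeholder path to an already-trained Keras model.
model = tf.keras.models.load_model("tile_classifier.h5")

# Post-training dynamic-range quantization: weights are converted to 8-bit
# integers, reducing model size and typically speeding up mobile inference.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("tile_classifier_quant.tflite", "wb") as f:
    f.write(tflite_model)
```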

Here, rather than simplifying or accelerating execution of a complex model, our strategy is to adapt simple CNN architectures to difficult classification tasks using staged image resizing and visually based data sifting. Like Yu et al. [1], we first downsample whole-slide images but to a much larger intermediate size that preserves visible anatomic characteristics. These intermediate images are decomposed into smaller square tiles of varying sizes that can be evaluated individually. The tiles are sifted using image entropy as a visual criterion. These steps — initial image rescaling followed by trials at multiple tile sizes — provide two “knobs” for optimization. The initial rescaling must preserve sufficient anatomic detail to support accurate classification. Previous work [17] suggests that one or two tile sizes will emerge as optimal for a given classification task, a result observed here as well. Clear winners emerged when the best CNN models were tested with out-of-sample tiles. A model’s performance on the out-of-sample set reflects its generalizability.
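The entropy-based sifting can be sketched as follows: each tile's Shannon entropy is computed from its grayscale intensity histogram, and tiles whose entropy falls below a threshold (typically near-empty or background regions) are discarded. The threshold shown is an arbitrary placeholder for illustration, not the value optimized in this study.

```python
import numpy as np
from PIL import Image


def image_entropy(tile: Image.Image) -> float:
    """Shannon entropy (in bits) of the tile's grayscale intensity histogram."""
    counts = np.asarray(tile.convert("L").histogram(), dtype=np.float64)
    p = counts / counts.sum()
    p = p[p > 0]                      # ignore empty histogram bins
    return float(-(p * np.log2(p)).sum())


def sift_tiles(tiles, threshold=5.0):
    """Keep only tiles whose entropy meets the threshold (placeholder value)."""
    return [t for t in tiles if image_entropy(t) >= threshold]
```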

CNNs for even the largest tile sizes tested are lightweight enough to be deployable on mobile devices such as phones and tablets. To permit visualization on such devices of the basis for classifying an input image, color-coded probability maps are created; in particular, tiles that survived sifting are overlapped and their tile-level probabilities averaged at a pixel level. Unlike Grad-CAM images, these maps show the distribution and intensity of the actual predictions across the image.
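A minimal sketch of such a map follows, assuming that each surviving tile's bounding box in the downsampled image and its tile-level class probability are available: probabilities are accumulated into per-pixel sums and divided by per-pixel coverage counts, so that overlapping tiles are averaged.

```python
import numpy as np


def probability_map(image_shape, tile_boxes, tile_probs):
    """Average overlapping tile-level probabilities at the pixel level.

    image_shape: (height, width) of the downsampled slide image.
    tile_boxes:  (left, top, right, bottom) box of each surviving tile.
    tile_probs:  tile-level probability of one class (e.g., P(LUSC)).
    Returns a float map in [0, 1]; pixels covered by no tile are NaN.
    """
    height, width = image_shape
    prob_sum = np.zeros((height, width))
    coverage = np.zeros((height, width))
    for (left, top, right, bottom), p in zip(tile_boxes, tile_probs):
        prob_sum[top:bottom, left:right] += p
        coverage[top:bottom, left:right] += 1.0
    with np.errstate(invalid="ignore", divide="ignore"):
        return prob_sum / coverage
```

The resulting array can then be rendered as a color-coded overlay on the downsampled slide, for example with a perceptually uniform colormap.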

Section snippets

Whole slides

Forty-two LUAD slides and 42 LUSC slides were downloaded from the GDC portal of the National Cancer Institute. These averaged about 1 GP in size and contained varying amounts of empty, non-image area. The pulmonary pathologists' evaluations from the TCGA study were used as the ground truth for classification.

Tile generation

The whole-slide images were first rescaled such that the longer dimension of the rescaled image did not exceed 6000 pixels. This size seemed appropriate to the diagnostically …
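A minimal sketch of this rescaling step, assuming the slide has already been exported to a raster format that Pillow can read (production pipelines typically read native whole-slide formats with a library such as OpenSlide):

```python
from PIL import Image

# Whole-slide exports can exceed Pillow's default decompression-bomb limit.
Image.MAX_IMAGE_PIXELS = None


def rescale_longest_side(image: Image.Image, max_dim: int = 6000) -> Image.Image:
    """Downsample so the longer dimension does not exceed max_dim pixels."""
    width, height = image.size
    scale = max_dim / max(width, height)
    if scale >= 1.0:
        return image                  # already small enough
    return image.resize((int(width * scale), int(height * scale)), Image.LANCZOS)
```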

Results

The purpose of this study was to explore whether small CNN architectures could be adapted for difficult classification tasks on very large medical images. A deliberately small dataset was used in order to “stress test” the different tile-sifting criteria and reveal performance differences among them. The small dataset also permits an out-of-sample test to be readily defined, as described below, and helps us assess whether acceptable model performance can be attained even when data is limited.

Visualization

The Grad-CAM technique uses the “feature maps” produced by a convolutional layer (typically the final one). Class activation maps (CAMs) are produced by projecting the weights of the output layer back onto the convolutional feature maps, which has the effect of ranking image regions by their importance to the classification. Grad-CAM generalizes this technique to essentially any CNN architecture. While the CAM and Grad-CAM techniques help us recognize the visual sources of CNN classification …
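For reference, a minimal Grad-CAM sketch for a Keras CNN follows. The layer name and class index are placeholders, and this is a generic rendering of the published technique [11], not the code used by Yu et al. [1].

```python
import numpy as np
import tensorflow as tf


def grad_cam(model, image, last_conv_layer="conv_final", class_index=0):
    """Gradient-weighted class activation map for one image, scaled to [0, 1].

    `last_conv_layer` and `class_index` are placeholders; `image` is an
    (H, W, C) float array preprocessed exactly as during training.
    """
    grad_model = tf.keras.Model(
        model.inputs,
        [model.get_layer(last_conv_layer).output, model.output],
    )
    with tf.GradientTape() as tape:
        conv_maps, predictions = grad_model(image[np.newaxis, ...])
        class_score = predictions[:, class_index]
    grads = tape.gradient(class_score, conv_maps)           # d(score)/d(feature maps)
    weights = tf.reduce_mean(grads, axis=(1, 2))             # per-channel importance
    cam = tf.reduce_sum(conv_maps * weights[:, tf.newaxis, tf.newaxis, :], axis=-1)
    cam = tf.nn.relu(cam)[0]                                  # keep positive evidence only
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()
```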

Discussion

The foregoing results suggest that lightweight CNNs can assist in making diagnoses based on very large histopathology images. The strategy of intermediate image scaling followed by training and testing over multiple candidate tile sizes enabled simple CNN models to successfully distinguish difficult-to-classify lung cancer types. Particularly when pre-trained on Entropy tiles, the accuracies were equivalent to those obtained with much larger architectures. Probability maps enable ready …

Declaration of Competing Interest

There are no financial or personal relationships with other people or organisations that could inappropriately influence or bias this work.

References (20)

  • P. Khosravi et al., Deep convolutional neural networks enable discrimination of heterogeneous digital pathology images, EBioMedicine (2018)
  • K.H. Yu et al., Classifying non-small cell lung cancer types and transcriptomic subtypes using convolutional neural networks, J. Am. Med. Inform. Assoc. (2020)
  • B.E. Bejnordi et al., Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer, JAMA (2017)
  • N. Coudray et al., Classification and mutation prediction from non–small cell lung cancer histopathology images using deep learning, Nat. Med. (2018)
  • V. Gulshan et al., Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs, JAMA (2016)
  • A. Esteva et al., Dermatologist-level classification of skin cancer with deep neural networks, Nature (2017)
  • J. Wu et al., Shoot to know what: an application of deep networks on mobile devices
  • J.W. Wei et al., Pathologist-level classification of histologic patterns on resected lung adenocarcinoma slides with deep neural networks, Sci. Rep. (2019)
  • W. Knight, The dark secret at the heart of AI, MIT Technology Review (2017)
  • M. Matheny et al., Artificial Intelligence in Health Care: The Hope, the Hype, the Promise, the Peril (2019)
There are more references available in the full text version of this article.


¹ Co-founder, Art Eye-D Associates LLC, www.art-eye-d.com
