Simulating Auto-Annotated Artificial X-Ray Data Sets

 

Across many areas of modern life, there is a growing need for sophisticated data-driven solutions. Medical imaging, autonomous driving, smart homes, and many other sectors have applications for AI. Such systems require large volumes of training data, such as medical datasets, to function effectively and reliably. This training data must be labelled and, in some cases, annotated. Because of the painstaking nature of the work, annotation of medical images is particularly time-consuming and error-prone in health applications. Well-suited data is not always readily available, particularly in highly regulated domains where data safety or data privacy is critical. Both concerns are central to artificial intelligence in healthcare.


There is no doubt that machine learning has the potential to revolutionize the healthcare industry. The possible applications are diverse and cover the full medical imaging life cycle, from image acquisition and analysis to diagnosis and outcome prediction. However, medical practitioners face many challenges that prevent them from successfully using AI technology in clinical practice. In this article, we will discuss medical data annotation and related topics.

What is medical data annotation?

Medical image annotation is the process of marking up medical imaging data such as X-rays, CT scans, MRI scans, mammograms, or ultrasounds. It is used to train AI algorithms for medical image processing and diagnosis, which helps doctors save time, make better judgments, and improve patient outcomes. Limited access to healthcare datasets is a significant issue and explains the existing limits on the development of robust machine learning algorithms.


Small sample sizes drawn from narrow geographic areas, together with the time-consuming (and costly) data preparation process, create bottlenecks that result in algorithms with limited utility.

For the model to be accurate, your dataset must be representative of the context in which the model will be deployed. Using images from many diverse sources (for example, different imaging machines, people, and medical sites) helps reduce bias. Typically, the training, validation, and testing data are split in a ratio of roughly 80:10:10. After gathering data and training your model, use the validation set to check for overfitting or underfitting and to tune weak parameters as needed.
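A minimal sketch of such a split using only Python's standard library (the function name and the 80:10:10 default are illustrative, not a prescribed API):

```python
import random

def split_dataset(items, ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle and split a list of samples into train/val/test sets."""
    assert abs(sum(ratios) - 1.0) < 1e-9, "ratios must sum to 1"
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed for reproducibility
    n = len(items)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test

train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))  # 800 100 100
```

Fixing the shuffle seed keeps the split reproducible across runs, which matters when the test set later serves as the regulatory reference standard.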


Finally, the model’s performance is measured against a set of test cases. Obtaining a high-quality testing dataset is crucial, since it serves as the reference standard and will determine whether your trained model proceeds to regulatory approval. Although smaller datasets can still be used to train a reasonably trustworthy model for specific, focused applications, larger sample sizes are preferable.


As a result, the more diverse and extensive your dataset is, the more accurate your model will be. When the differences across imaging phenotypes are modest, or when data is collected from populations with significant heterogeneity, large, relevant datasets are especially important. To construct generalizable ML algorithms in medical imaging, statistically powered datasets containing millions of images are required.


The majority of medical imaging data is stored in DICOM format. What exactly is DICOM? DICOM (Digital Imaging and Communications in Medicine) is the industry standard for communicating and managing medical imaging and related data. A DICOM file represents a case and may or may not contain images. From a machine learning standpoint, the DICOM pixel data is usually converted to another lossless image format before training, so using DICOM files directly is not required for AI research.
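As an illustration of that conversion step, the sketch below rescales a raw 12- or 16-bit pixel array to 8-bit grayscale, ready to be saved losslessly (e.g. as PNG). In a real pipeline the array would come from a DICOM reader such as pydicom's `dcmread(path).pixel_array`; the function here is our own, not a library API:

```python
import numpy as np

def to_uint8_grayscale(pixels: np.ndarray) -> np.ndarray:
    """Rescale a raw (e.g. 12- or 16-bit) pixel array to 8-bit grayscale.

    In practice `pixels` would come from a DICOM reader, e.g.
    pydicom.dcmread(path).pixel_array, and the result would be saved
    in a lossless format such as PNG.
    """
    pixels = pixels.astype(np.float64)
    lo, hi = pixels.min(), pixels.max()
    if hi == lo:  # flat image: avoid division by zero
        return np.zeros(pixels.shape, dtype=np.uint8)
    scaled = (pixels - lo) / (hi - lo) * 255.0
    return scaled.round().astype(np.uint8)

raw = np.array([[0, 2048], [3000, 4095]])  # 12-bit intensity range
print(to_uint8_grayscale(raw))
```

Note that this min-max rescaling is lossy in bit depth; it is the file *format* (PNG vs. JPEG) that is lossless, which is why many pipelines store the full 16-bit array instead.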

Data from Synthetic X-rays

The procedure for simulating artificial X-ray datasets based on real physics is straightforward. In general, an X-ray source directs X-rays toward a detector plate. The person being X-rayed sits between the source and the detector. The tissues attenuate the X-rays as they pass through the body. Each type of tissue, such as muscle, fat, and so on, has a particular attenuation coefficient. As a result, depending on the type and amount of tissue between the source and the detector, the image shows a varying shade of grey. Various papers provide typical attenuation coefficients for different tissues.
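This attenuation follows the Beer-Lambert law, I = I0 · exp(-Σ μᵢ tᵢ), where μᵢ is the attenuation coefficient of tissue i and tᵢ its thickness along the ray. A minimal numpy sketch of one ray (the coefficient and thickness values below are placeholders, not measured data):

```python
import numpy as np

def transmitted_intensity(i0, mu, thickness):
    """Beer-Lambert law: I = I0 * exp(-sum(mu_i * t_i)).

    mu:        attenuation coefficients per tissue along the ray (1/cm)
    thickness: path length through each tissue (cm)
    """
    mu = np.asarray(mu, dtype=float)
    thickness = np.asarray(thickness, dtype=float)
    return i0 * np.exp(-np.sum(mu * thickness))

# Placeholder coefficients for fat, muscle, bone (illustrative only;
# real values depend on tissue type and X-ray energy).
mu = [0.19, 0.21, 0.48]
t = [2.0, 5.0, 1.5]  # cm of each tissue along one ray
print(transmitted_intensity(1.0, mu, t))
```

Evaluating this per detector pixel, with per-pixel tissue thicknesses taken from an anatomical model, yields the simulated grayscale image.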


On a consumer-grade laptop, simulating such images takes only a few seconds. Fast simulations, combined with automatically varying anatomical models, enable the generation of vast, diverse datasets. Even more importantly, synthetic data can be annotated automatically. This has several benefits:


  • The ground truth is completely known

  • Rapid and low-cost
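To illustrate why the ground truth is free, the toy sketch below renders a simple phantom (a disc of "tissue" on a background) and derives its segmentation mask from the same geometry, so the annotation is correct by construction (shapes and intensity values are purely illustrative):

```python
import numpy as np

def render_phantom(size=64, center=(32, 32), radius=10):
    """Render a toy X-ray image and its pixel-perfect mask together.

    Because image and mask come from the same geometry, the
    annotation is exact ground truth at zero labeling cost.
    """
    yy, xx = np.mgrid[:size, :size]
    mask = (yy - center[0]) ** 2 + (xx - center[1]) ** 2 <= radius ** 2
    # Toy attenuation: background transmits fully, the disc attenuates.
    image = np.where(mask, 0.3, 1.0)
    return image, mask.astype(np.uint8)

image, mask = render_phantom()
print(image.shape, int(mask.sum()))  # mask.sum() = pixels inside the disc
```

A real simulator replaces the disc with an anatomical model and the constant 0.3 with Beer-Lambert attenuation, but the principle is the same: image and label are generated from one shared description of the scene.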


The annotation style is flexible and determined by your application. The following image depicts, from left to right, an unannotated simulation and three distinct annotation styles:


  • X-ray simulation without annotation

  • Each vertebral body and its processes as a single annotation

  • Only the vertebral bodies

  • Each vertebra and its processes annotated separately
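All of these styles can be derived from one richly labeled mask by remapping its labels. A minimal sketch, assuming a labeling scheme of our own invention (odd labels for vertebral bodies, the following even label for that vertebra's processes):

```python
import numpy as np

# Illustrative instance mask: 0 = background, odd = vertebral bodies,
# even (nonzero) = the processes belonging to the preceding body.
instance = np.array([[0, 1, 2],
                     [0, 3, 4],
                     [0, 0, 0]])

# Style 1: merge each body with its processes into one label per vertebra
# (labels 1 and 2 become vertebra 1; labels 3 and 4 become vertebra 2).
per_vertebra = (instance + 1) // 2

# Style 2: keep only the vertebral bodies (odd labels), drop processes.
bodies_only = np.where(instance % 2 == 1, instance, 0)

# Style 3: keep every structure as its own label (the mask as-is).
separate = instance.copy()
print(per_vertebra)
```

Because the remapping is a cheap array operation, one simulation run can serve several downstream tasks, each with its own annotation style.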



GTS and Medical Datasets

It is usually a good idea to begin by collaborating with a company that has already committed the time and effort required to handle the many data formats, regulatory requirements, and user-experience considerations essential for a successful medical AI project. Global Technology Solutions (GTS) is one such company. We have the expertise and experience required to collect and annotate medical datasets.

