Data collection is one of the most strenuous steps in Machine Learning (ML) model creation. The unique nature of federal Artificial Intelligence (AI) initiatives often requires highly specialized data, which can be environmentally or economically impractical for your agency to acquire. Synthetic data may be able to fill those data collection gaps and help you create a performant ML model.
What is Synthetic Data?
Synthetic data is artificially generated data that computers extrapolate from real-world situations. Instead of trying to source a complete repository of real data, you inject the limited real data you have into a 3D model and let the computer make its own assumptions.
You can think of computer vision synthetic data as creating a video game environment for labeling instead of relying on raw data alone. Maybe you only have a few pictures of a foggy area: there is not enough data, and it is not high enough quality, for a traditional ML model to generate accurate outputs. While ineffective for model training on their own, those photos can seed an entire 3D environment of buildings, vehicles, weaponry, and other objects in the scene, giving you a clear picture.
Since the environment is 3D, you can inject the model with "renders" (sometimes called "image synthesis"), a computer program that generates photorealistic visuals of a 3D model using lighting and materials, to produce every possible angle and environmental condition. This new data can then be labeled and used by ML models to provide accurate outputs.
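To make the render sweep concrete, here is a minimal sketch of how a pipeline might enumerate every viewpoint and environmental condition to render. The angle, lighting, and weather values (and the `render_plan` helper) are hypothetical illustrations; a real pipeline would hand each combination to a 3D engine to produce an actual image.

```python
import itertools

# Hypothetical render settings; a real pipeline would pass each combination
# to a 3D rendering engine to synthesize one labeled image per entry.
camera_angles = [0, 45, 90, 135, 180, 225, 270, 315]  # degrees around the object
lighting = ["dawn", "noon", "dusk", "night"]
weather = ["clear", "fog", "rain", "snow"]

def render_plan(angles, lights, conditions):
    """Enumerate every combination of viewpoint and environment to render."""
    return [
        {"angle_deg": a, "lighting": l, "weather": w}
        for a, l, w in itertools.product(angles, lights, conditions)
    ]

plan = render_plan(camera_angles, lighting, weather)
print(len(plan))  # 8 angles x 4 lighting x 4 weather = 128 renders per object
```

Even this toy sweep shows the leverage involved: a handful of source photos of one object fans out into 128 distinct training images, and every one arrives pre-labeled because you controlled the scene.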
It can be hard to relate the concept of synthetic data to your specific scenario. There is a wide variety of potential uses, and many of them are still being discovered.
Self-driving: synthetic data can help autonomous vehicles learn to operate safely. Self-driving cars need all possible situations mapped out to teach them how to safely navigate various traffic signs, evade potential traffic accidents, watch for terrain changes, and respond to unpredictable pedestrians. That means millions of miles driven, and you still may not cover every edge case. Rather than driving millions of miles in a testing environment, companies like Waymo run simulations built from the available data to make predictions about those scenarios.
Threat Detection: With millions of ever-evolving weapons, it is nearly impossible to capture every data point per weapon, like every potential angle of view, accounting for obstructions and environmental scenarios. Synthetic data can take a sample of photos and create 3D models of each weapon rendered with proper materials in each view and environmental condition.
Biome Obfuscation: As the threat detection example shows, different environments can have big implications for the accuracy of your model's predictions. Maybe all your data is covered in a layer of fog, a thick tree canopy blocks the view of the target, or you cannot get into the area to collect more data beyond what already exists. These environments are called "biomes". Synthetic data can use that limited information to infer other angles and create complete 3D simulations of not only objects, but entire cities.
Humanitarian Relief: In active emergency situations such as wildfires, hurricanes, and blizzards, visibility is often greatly reduced, but by creating synthetic environments, the model can still predict the boundaries of disasters based on climate, topography, and other factors. For example, by applying a synthetic data model to a live wildfire, you can predict where the fire is most likely to spread, evacuate the proper areas, and stage emergency relief beforehand.
Adversarial Subversion: synthetic data could also be used to deliberately plant falsified information to mislead enemies. Deepfakes are one example of this kind of falsified information. By combining small samples of visual and audio data of a person, synthetic data could recreate their face and voice and then make them appear to say or do anything. This extends beyond fake and replicated identities into manipulating satellite imagery to add or remove objects.
Why Consider Synthetic Data for Your Initiative
There are several reasons you might want to consider supplementing your original dataset with synthetic data.
- Less time and money: Synthetic data creates inferences from existing limited real-world datasets, without having to physically collect the data.
- Edge Cases: Situations that occur infrequently, or "edge cases", are critical to the success of deployed ML, but you might wait years to capture those niche situations naturally. With synthetic data, you don't have to wait.
- Overcome privacy concerns: The more you distribute sensitive data, like medical files, the more you increase the likelihood of a privacy breach. Even anonymized data can retain enough extraneous detail to triangulate a person or area. Synthetic data minimizes these concerns because it needs only a small portion of real data to generate fictitious, but functionally authentic, data.
- Full User Control: Because the data is fully user controlled, synthetic data supports extra quality control checks. Even as it compounds the volume of data, every generated record can be validated for integrity, allowing higher standards of accuracy.
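The privacy point above can be sketched in miniature: fit simple per-field statistics to a small real sample, then draw fictitious records from those distributions. This is a deliberately toy illustration, not a production method (real systems use far more sophisticated generative models), and the age values are hypothetical.

```python
import random
import statistics

# Hypothetical sensitive field from a small real sample.
real_ages = [34, 41, 29, 52, 47, 38, 45, 31]

# Fit simple statistics to the real sample...
mean = statistics.mean(real_ages)
stdev = statistics.stdev(real_ages)

# ...then sample entirely fictitious records from that distribution.
random.seed(0)  # fixed seed so the example is reproducible
synthetic_ages = [round(random.gauss(mean, stdev)) for _ in range(1000)]

# The synthetic records preserve aggregate properties of the real data
# without exposing any individual's actual value.
print(len(synthetic_ages))
```

The generated ages track the real sample's overall distribution, yet no synthetic record corresponds to a real person, which is the core of the privacy argument.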
What are Its Challenges?
Synthetic data is not a perfect solution. As it is an ever-expanding field, there are limitations to this technology.
- It is not standalone: You cannot rely solely on synthetic data. An ML model still needs some real data to replicate. Synthetic data improves accuracy by widening the breadth of scenarios a model can learn from, but it needs real-world instances to verify accuracy and to serve as the basis from which new data is generated.
- It is still synthetic: While it can mimic many properties of real data, it does not copy the original exactly. Depending on your situation and the amount of authentic data available, its success could be limited. Models look for common trends in the original data to create synthetic data, so the less data you have, the less the model can draw on to generate its own scenarios, especially edge cases. That can severely limit the capabilities and accuracy of the output.
- High enough quality model: The quality of the output depends on the quality of the model. Synthetic data is especially susceptible to "statistical noise", a random irregularity in real-life data that does not fit the pattern. If synthetic data learns from a dataset that contains noise, it assumes the noise is part of the pattern rather than identifying it as an outlier, causing the model to classify data incorrectly and produce incorrect outputs.
- Output control: Pipelines need an extra step, a "verification server", an intermediary computer that independently performs the same analysis on the synthetic data's output. This continuously confirms the initial model is a "good model" and is not learning any bad habits.
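The statistical-noise challenge above has a simple classical mitigation worth sketching: screen the source data for obvious outliers before feeding it to a generator, for example with a z-score filter. This is a minimal illustrative sketch, not the method any particular synthetic data product uses; the `readings` values, the threshold, and the `filter_noise` helper are all assumptions for the example, and the low threshold reflects that a single extreme spike in a small sample inflates the standard deviation.

```python
import statistics

def filter_noise(values, z_threshold=2.0):
    """Drop points more than z_threshold standard deviations from the mean,
    so obvious statistical noise is not baked into the synthetic generator."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    if stdev == 0:
        return list(values)
    return [v for v in values if abs(v - mean) / stdev <= z_threshold]

# Hypothetical sensor readings; 97.0 is a noise spike, not part of the pattern.
readings = [10.1, 9.8, 10.3, 9.9, 10.0, 97.0]
clean = filter_noise(readings)
print(clean)  # the spike is removed before any synthetic data is generated
```

A verification step like the "verification server" described above could rerun exactly this kind of check on the generator's output as well, flagging synthetic samples that drift outside the expected distribution.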
Find Out If Synthetic Data is Right for You
As the technology used to create synthetic data continually improves, breakthroughs and new use cases are constantly emerging. At Figure Eight Federal, we can walk through your scenario and data availability to determine if and how synthetic data can support your initiative. Then we will work with you through the entire process, from developing the synthetic data through continuous model management and retraining. To explore your possibilities with synthetic data, contact Figure Eight Federal.