The first step of the attack is to record keystrokes on the target's keyboard, as that data is required for training the prediction algorithm. This can be achieved via a nearby microphone or the target's phone that might have been infected by malware that has access to its microphone. Alternatively, keystrokes can be recorded through a Zoom call where a rogue meeting participant makes correlations between messages typed by the target and their sound recording.
Then, they produced waveforms and spectrograms from the recordings that visualize identifiable differences for each key and performed specific data processing steps to augment the signals that can be used for identifying keystrokes.
The spectrogram images were used to train 'CoAtNet,' which is an image classifier, while the process required some experimentation with epoch, learning rate, and data splitting parameters until the best prediction accuracy results could be achieved.