Hacking or Not – Where is the Line?
As you may have heard, a critical flaw was found in a web application provided by the state of Missouri to search for educators’ credentials. The flaw was discovered by a reporter, who immediately notified the state after confirming the vulnerability. The reporter also agreed not to publish details of the flaw until the state could take corrective action by disabling the site. Governor Mike Parson (in)famously said that he had directed state law enforcement to investigate the incident, stating his belief that the reporter had broken the law.
This situation has led to an outpouring of criticism on social media and in the press. BreachQuest has been quoted on the story repeatedly, but in this post, we want to break down what is and isn’t “hacking” (at least in the criminal sense). Note that I am not a lawyer, and nothing in this post should be construed (misconstrued?) as legal advice.
Facts of the Case
The reporter searched an online repository that is publicly accessible. Upon viewing search results, the reporter viewed the source of the webpage. This activity is so common that most browsers provide more than one way to view a page’s HTML source. In Chrome, a user can press CTRL-U (Command-Option-U on a Mac) to view the source code of the webpage. The source can also be viewed through Chrome’s Developer Tools, which open with the F12 key on Windows and Linux.
Governor Parson was quick to note that more than just viewing the HTML source code was required in this case, noting on Twitter, “An individual accessed source code and then went a step further to convert and decode that data in order to obtain Missouri teachers’ personal information.” He then added, “This data was not freely available, and by the actors (sic) own admission, the data had to be taken through eight separate steps in order to generate a SSN.” Governor Parson included screenshots of the relevant Missouri statutes he believed the reporter violated in his research.
Because the reporter gave the state time to fix the vulnerability before disclosing it, we can’t independently validate the specific steps required to decode the data and render SSNs. Some reporting hypothesized that the SSNs were stored using base64 encoding. This seems extremely likely, mostly because we see it regularly.
Base64 encoding produces what is known as a 7-bit ASCII representation of arbitrary binary data (though technically, each character of base64 output only represents 6 bits of data), typically using the characters a-z, A-Z, 0-9, +, and /. Effectively, this means that any data can be encoded such that it can be printed. Base64 is the standard way to encode email attachments and is also widely used to carry binary data in web requests, among many other applications.
Base64 is easy for the trained eye to identify. First, the output length depends only on the input length, so inputs of the same length always produce outputs of the same length. Assuming the SSNs were encoded individually (as was likely the case), every base64 payload would be the same length. Next, because each base64 character represents six bits of input data, every three bytes of input (characters, in this case) are represented as four characters of base64 output.
To see this more clearly:
- Each byte is eight bits.
- Three bytes of input data is 24 bits (3 x 8 = 24).
- Each base64 character can encode six bits of input data.
- Four base64 characters are required to represent this (24 / 6 = 4).
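The arithmetic above can be checked directly. The following is a minimal sketch using Python's standard `base64` module; the helper name `encoded_length` is ours, introduced only for illustration, and it assumes standard padded base64.

```python
import base64


def encoded_length(n_bytes: int) -> int:
    """Length of standard (padded) base64 output for n_bytes of input.

    Every group of up to three input bytes becomes four output characters.
    """
    return 4 * ((n_bytes + 2) // 3)


# Compare the formula with what the standard library actually produces.
for n in (3, 6, 9, 10, 11):
    actual = len(base64.b64encode(b"x" * n))
    print(f"{n} input bytes -> {encoded_length(n)} predicted, {actual} actual")
```

Because the output length depends only on the input length, a column of values that were all encoded from same-length inputs will line up at exactly the same width.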
Base64 should never be used as an encryption or obfuscation method. But this is an especially serious problem in the case of social security numbers. Until 2011, the first three digits of a social security number were assigned as what was known as an area number. In the case of Missouri, those area numbers were 486-500. Among the 100,000 educators in the affected dataset, it is extremely likely that the majority have SSNs whose first three digits fall in this range. Because three input characters always map to four base64 characters, every SSN with the same area number encodes to the same first four characters of output. For example, see the figure below. Note that the first four characters remain static (as does the total length of the base64 output).
Another common giveaway that a payload is base64 data is one or more equal signs at the end. Every base64-encoded payload must end on a four-character boundary. When the input length is not a multiple of three bytes, the output is padded with one or two equal signs. These are not part of the standard base64 alphabet and are understood to be padding. The amount of padding depends on exactly what was encoded: a bare nine-digit SSN is a multiple of three bytes and needs no padding, the 11-character hyphenated form ends in one equal sign, and nine digits followed by a newline (10 bytes) end in two.
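To make the shared prefix concrete, here is a small sketch. The nine-digit values are made-up placeholders chosen only to start with Missouri area numbers; none is a real SSN, and we are assuming (as the reporting hypothesized) that each value was encoded on its own as plain ASCII digits.

```python
import base64

# Made-up placeholder values for illustration only; not real SSNs.
samples = ["486000001", "486000002", "500000001", "500000002"]


def encode(value: str) -> str:
    """Base64-encode an ASCII string and return the result as text."""
    return base64.b64encode(value.encode("ascii")).decode("ascii")


encoded = {s: encode(s) for s in samples}
for plain, b64 in encoded.items():
    # The first four output characters depend only on the first three digits.
    print(f"{plain} -> {b64} (prefix {b64[:4]})")
```

Values that begin with 486 all start with `NDg2`, and values that begin with 500 all start with `NTAw`. With only fifteen Missouri area numbers, a large dataset would show a small set of prefixes repeated over and over.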
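The amount of padding depends on the exact bytes that were encoded, so it is worth checking rather than assuming. The sketch below counts trailing equal signs for a few plausible ways an SSN might have been stored; the values are made-up placeholders, and we do not know which form (if any) the application actually used.

```python
import base64


def padding_count(data: bytes) -> int:
    """Number of '=' padding characters in the base64 encoding of data."""
    encoded = base64.b64encode(data)
    return len(encoded) - len(encoded.rstrip(b"="))


# Hypothetical storage formats for the same made-up placeholder value.
cases = {
    "nine digits": b"486000001",
    "with hyphens": b"486-00-0001",
    "nine digits plus newline": b"486000001\n",
}

results = {label: padding_count(data) for label, data in cases.items()}
for label, count in results.items():
    print(f"{label}: {len(cases[label])} bytes, {count} padding character(s)")
```

Nine bytes divide evenly into groups of three and need no padding, eleven bytes leave one padding character, and ten bytes leave two.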
This is the equivalent of saying, “we have everything locked in a safe. Here is the safe for you to hold. Here is the combination and instructions for opening said safe. But you definitely cannot see what is inside the safe.” Nobody would confuse this real-world analogy for real data security, and we shouldn’t confuse the digital equivalent either.
Why Were the SSNs In the Browser at All?
HTTP is a stateless protocol. Every time the browser makes a new request, the server responds as if it has never seen the browser before. Because this doesn’t make for a very rich web experience, web developers need a way to store state in the browser. Thanks to GDPR, CCPA, CPRA, and various other privacy regulations, you’re probably familiar with at least the existence of cookies by this point. A cookie is one method used to store data in the browser, preserving state. The server sets the cookie in the response. The browser passes the cookie data back to the server with each subsequent request.
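The round trip can be sketched with Python's standard `http.cookies` module. The cookie name `session_id` and its value are hypothetical, and the browser side is only simulated here by hand-writing the header a browser would send back.

```python
from http.cookies import SimpleCookie

# Server side: build the Set-Cookie header for a response.
response_cookie = SimpleCookie()
response_cookie["session_id"] = "abc123"  # hypothetical opaque identifier
response_cookie["session_id"]["httponly"] = True
set_cookie_header = response_cookie.output(header="Set-Cookie:")
print(set_cookie_header)

# Browser side (simulated): the stored cookie is sent back on the next request.
request_header_value = "session_id=abc123"

# Server side again: parse the Cookie header to recover the state.
request_cookie = SimpleCookie()
request_cookie.load(request_header_value)
print(request_cookie["session_id"].value)
```

Note that the cookie here carries only an identifier. Anything placed in a cookie is stored in the browser, so it can be read and changed by the user.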
Another method frequently used by web developers to maintain state is the hidden form field. These form fields are part of the Document Object Model (DOM) of the webpage – effectively the page source. Hidden form fields can be trivially viewed by examining page source and are prime candidates for inspection.
The fact that a hidden form field exists indicates the developer intended to store some state in the variable. Many developers don’t consider that even though the form field is hidden, users can still interact with the data, both to view and change it. As a result, hidden form fields are a very common source of vulnerabilities in web applications. Developers know they must validate user input, but many fail to view hidden form field data as user input, rationalizing that it was created by the server and sent back to the server.
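To show how little a hidden field hides, here is a minimal sketch using Python's standard `html.parser`. The page markup, the field name `record_ref`, and the encoded value are all invented for illustration; we have not seen the actual application's source.

```python
import base64
from html.parser import HTMLParser

# Made-up placeholder value, base64-encoded as a server might have done.
ENCODED = base64.b64encode(b"486000001").decode("ascii")

# Hypothetical page source; not taken from any real application.
PAGE = f"""
<form method="post" action="/search">
  <input type="text" name="last_name">
  <input type="hidden" name="record_ref" value="{ENCODED}">
</form>
"""


class HiddenFieldParser(HTMLParser):
    """Collects the name and value of every hidden input on a page."""

    def __init__(self):
        super().__init__()
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "input" and attr.get("type") == "hidden":
            self.hidden[attr.get("name")] = attr.get("value")


parser = HiddenFieldParser()
parser.feed(PAGE)
decoded = base64.b64decode(parser.hidden["record_ref"]).decode("ascii")
print(parser.hidden, "->", decoded)
```

The same value is visible to anyone who views the page source, and it comes back to the server in the next request exactly as the user chooses to send it. That is why it has to be validated like any other user input.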
To analogize this with a real-world example, consider this the equivalent of giving a user a box to hold that contains a tee shirt. They will take the box into another room, out of your view. Later, they will hand the box back to you, at which point you will carry the box through a TSA checkpoint. Would you be willing to trust that nothing has changed in the box and risk a body cavity search? Of course not. Likewise, a web server should not blindly trust that nothing dangerous has been placed in the box.
Similarly, the box might contain extremely sensitive data, like social security numbers. To align with the probable case that SSNs were base64 encoded, in this example, the SSNs are present in the box but have been printed onto 9 jigsaw puzzle pieces. They are there for anyone who wishes to assemble the puzzle, which is trivial for anyone to do. Would you hand this box containing your personal data to any stranger that requested it, even if they needed to assemble puzzle pieces to understand it fully? Again, of course not.
The issue at hand is not the number of steps required to view the sensitive data. The problem is that the sensitive data should never have been in the browser in the first place.
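One common alternative, sketched below, is to keep the sensitive record on the server and give the browser only a random, meaningless reference to it. This is an illustration of the general idea, not a description of how the Missouri application was fixed; the function names and the in-memory dictionary are hypothetical stand-ins for a real session store or database.

```python
import secrets

# Hypothetical server-side store; a real application would use a
# database or session backend rather than a module-level dictionary.
_records = {}


def issue_reference(record: dict) -> str:
    """Store the record server-side and return an opaque random token."""
    token = secrets.token_urlsafe(16)
    _records[token] = record
    return token


def lookup(token: str):
    """Return the record for a token, or None if the token is unknown."""
    return _records.get(token)


# Made-up placeholder data for illustration only.
record = {"name": "Example Educator", "ssn": "486000001"}
token = issue_reference(record)

# Only the token would be sent to the browser, never the record itself.
print("sent to browser:", token)
print("resolved on server:", lookup(token)["name"])
```

The token carries no information about the record, so there is nothing to decode no matter how many steps a user is willing to take.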
Does Intent Matter When Vulnerabilities Are Disclosed?
The impact of the flaw is significant. Based on reporting from the St. Louis Post-Dispatch, there were likely more than 100,000 SSNs available through the application. Governor Parson stated to the press that the reporter “took the records of at least three educators.”
While we can never truly measure intent, the number of records accessed is a strong indicator. If the reporter had accessed all 100,000 SSNs to “confirm the vulnerability”, this might be viewed with suspicion. If the reporter subsequently released the SSNs to the public, this would be ethically questionable. But would even that be breaking the law? That seems unclear.
At the federal level, there is precedent suggesting that downloading publicly available information and releasing it may be a crime (but maybe not). For those not familiar with Andrew Auernheimer (aka “Weev”), in 2010, he and his friend downloaded publicly accessible data for 120,000 AT&T customers and provided it to Gawker. Auernheimer was convicted under the controversial Computer Fraud and Abuse Act in 2012. In 2013, he filed an appeal based partly on the fact that the information he accessed was publicly available (even if AT&T didn’t intend it to be), and in 2014 his conviction was vacated due to an improper trial venue. If you’ve ever seen a laptop sticker that says “wget is not a crime” (probably at a security conference), this is what it refers to. After the conviction was vacated, the government did not retry the case. This leaves open the question of whether accessing publicly available (but sensitive) data in bulk and providing it to a third party is a federal crime. One court said yes, but lacked venue, and the merits of the case were never tested on appeal.
While “is accessing publicly available data a crime” is still an open legal question, that’s not what happened here. The reporter discovered the vulnerability, confirmed it by accessing three records, did not provide the data to external parties, notified the system owner, and didn’t publish information about the vulnerability before it could be mitigated. Had the reporter accessed thousands of records (or enumerated all records), there would be more reason for concern. But here, it’s fairly obvious the reporter did not intend to steal personal data. Intent matters elsewhere in the US legal system; it likely should here as well.
Threatening a reporter with legal action is almost always a bad idea and usually creates an unintended Streisand Effect. But more generally, organizations should be careful not to shoot the messenger when security vulnerabilities are disclosed. The question of whether this was a crime might be more black and white if the reporter had enumerated all records before reporting the issue. That Governor Parson said only three records were taken seems to weigh against any malicious intent.
Instead of focusing on this so-called “hacking,” Governor Parson should be worried about the security of the state’s applications, particularly those that are available for public use.