To Password or Not To Password
The Anatomy of Authentication
Password vs. Passwordless, Centralized vs. Decentralized
By Linda Lu
authentication | ɔːθɛntɪˈkeɪʃ(ə)n | noun
The process or action of proving or showing something to be true, genuine, or valid; In computing, authentication is a process or action of verifying the identity of a user or application.
Businesses deal with large volumes of private, sensitive data (e.g. employees’ or customers’ personal data) on a daily basis. At the same time, they are fighting malicious and deliberate attempts to breach their information systems. No enterprise is immune to hackers, and how a company addresses this growing threat will have a significant impact on its business. Former Cisco CEO John Chambers once said, “There are two types of companies: those that have been hacked, and those who don’t yet know they have been hacked.”
Security continues to be a growing concern for businesses across industries. Given the huge amount of private information at stake, businesses need to implement layered cybersecurity architecture to defend their operations.
Outside Threat Protection
As shown in the diagram, there are 5 layers of security: Perimeter, Network, Endpoint, Application, and Data. Enterprises typically apply best of breed security products to help them protect these 5 primary layers. Notable players at each layer include:
- Perimeter: Honeywell, Bosch Security Systems
- Network: Palo Alto Networks, Fortinet
- Endpoint: McAfee, Symantec, Cylance, CrowdStrike
- Application: Veracode, Checkmarx, Qualys
- Data/IAM: Okta, Duo, Auth0, BeyondTrust
Because of limited budgets and shortage of in-house security professionals, SMBs generally outsource cyber defense to MSSPs (managed security service providers), who supply a mix of security consulting services, ongoing monthly maintenance services, and product sales (often sourced from best-of-breed security vendors for each layer) to their client base.
In this article, we focus on authentication, which has become the most critical part of IAM (identity and access management).
Data security is the fundamental layer of the cyber defense architecture. This is because data breaches are very costly to enterprises. The average cost of a data breach for a company in the U.S. is almost $8 million, according to PC Magazine. Because of the potential impact to their enterprise, companies are continuing to increase their spending on IAM software. The IAM market alone is projected to reach $13 billion by 2021, according to Forrester Research.
Data is what malicious hackers are after. Selling stolen data on the Dark Web, a marketplace for stolen data and personal information, is one of the ways to monetize those well-planned attacks. The infographic below shows the 10 most common pieces of information sold on the Dark Web and their associated price range.
The general login info is sold at $1 per user and payment related login info is sold at $20-$200 per user. Here are the common steps involved to steal a user’s login credentials via email phishing:
Beyond customers’ login info, imagine a data analyst who works at a healthcare organization and has access to patients’ medical records. If the analyst’s privileged account is compromised, hackers gain access to both login info and medical records, and the latter is much more damaging. With stolen login credentials, organizations can always ask employees and customers to change their login info quickly, before hackers do too much damage to the system. For sensitive, immutable, append-only data such as medical records, the privacy violation is irreversible. The bells have rung, they cannot be unrung.
Authentication determines who/what can access a system, how they can access a system, and which data they have access to, by verifying information provided based on stored credentials.
There are 3 common authentication factors, each a form of data that helps us authenticate users:
The Username/Password approach is the most common way to authenticate a user, but it has also proven to be very insecure. According to the Verizon Data Breach Investigation Report (DBIR), “81% of hacking-related breaches leveraged either stolen and/or weak passwords.” It’s very common for individuals to use the same password for both personal and work-related applications. According to PixelPrivacy.com, “Millennials aged 18–31 lead the lame password category parade, with 87% admitting they frequently reuse passwords despite knowing better”. More than 60 million Dropbox user credentials were stolen and made available for sale on the Dark Web because of an employee’s password reuse.
We will explain why password-based authentication is so vulnerable, why common user practices have made the problem even worse, and why a better solution needs to exist.
In the early days of computing, passwords were stored in a database as plaintext. One of the best-known credential directories is Active Directory (AD), a directory service that Microsoft developed for Windows domain networks. When a user logs into a computer that is part of a Windows domain, AD checks the submitted password and determines whether the user is legitimate. AD covers a broad range of directory-based identity-related services, including the management and storage of information as well as authentication and authorization.
There are multiple problems with a centralized directory such as Active Directory. The first is that it creates a “honeypot”: if hackers are able to break into the system, they can access many passwords at the same time. The second is that the passwords themselves are easily discoverable and replicable when they are simple and stored as plaintext.
Remember our friend, Big Head from the HBO show Silicon Valley (Season 4, Ep 10):
Big Head: My username is “password” and my password is “password.”
Richard: Your username is “password”?
Big Head: It was just easier. 😀
In the early days, AD would store Big Head’s credentials as:
To solve the glaring vulnerabilities of systems such as Active Directory, engineers invented password hashing.
Hashing is a type of algorithm that takes data of any size and converts it into a fixed-length output. Theoretically speaking, hashing algorithms have the following properties.
- Collision resistant: it should be infeasible to find two distinct messages m and m’ such that hash(m) = hash(m’)
- Difficult to reverse engineer: given a hash h, it should be difficult to find a message m such that hash(m) = h
The most common hashing algorithms in use today are:
- SHA-256: convert into a 256-bit hash, 64 hexadecimal characters
- SHA-512: convert into a 512-bit hash, 128 hexadecimal characters
When a user tries to log on to the system, after submitting the username and password, the authentication service will convert the password into a hash and check if the hash is matched with the one on file. If it is a match, the user’s access right would be granted.
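To make the login check concrete, here is a minimal sketch in Python. It uses the standard library's `hashlib` for SHA-256. The `credential_store` dictionary and the `login` function are hypothetical names we made up for illustration; they do not represent how Active Directory or any real product works.

```python
import hashlib
import hmac

def sha256_hex(password: str) -> str:
    """Return the SHA-256 digest of a password as 64 hexadecimal characters."""
    return hashlib.sha256(password.encode("utf-8")).hexdigest()

# Hypothetical credential store: username -> stored hash (never the plaintext).
credential_store = {"bighead": sha256_hex("password")}

def login(username: str, submitted_password: str) -> bool:
    """Hash the submitted password and compare it with the hash on file."""
    stored_hash = credential_store.get(username)
    if stored_hash is None:
        return False
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(stored_hash, sha256_hex(submitted_password))
```

Calling `login("bighead", "password")` returns `True`, while any other password or an unknown username returns `False`. The same input always produces the same hash, which is exactly what makes the check possible.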
Hashing is more secure than plaintext storage, as it makes it harder for adversaries to obtain users’ passwords. With hashing, businesses store a hash of each user’s password rather than the plaintext in the AD. If only hashed passwords are stored in the AD, then when AD is compromised, hackers get a list of hashes instead of plaintext passwords. We explain in the next section how hashes can be reverse engineered and why doing so takes time and computing resources.
Hashing can also be cracked!
Reverse engineering hashes is not easy but is doable via brute forcing, i.e. trying all the possible inputs until the right one is found. Proof-of-work in bitcoin mining adopts this concept. Mining a block requires the miner to find a value (aka nonce) such that the hash of the block is less than or equal to the current target set by the Bitcoin network. The target is a 256-bit number, and miners have to use hard-core GPU/FPGA/ASIC hardware to stay competitive in this brute force competition.
Besides brute forcing, hackers can create a very large library of common passwords (common words, name plus birthday, etc.), generate hashes for those common passwords, and add those hashes to the library. With a compromised password database in hand, hackers can cross reference the hashes in the library with the hashes in the stolen database. The ones that match are the ones hackers have successfully cracked. This is why many services strongly discourage users from using common words or predictable patterns as passwords: they are much easier to reverse engineer.
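The library lookup described above can be sketched in a few lines of Python. Everything here is made up for illustration: the tiny `common_passwords` list, the `stolen_db` contents, and the usernames are hypothetical, and a real library would hold millions of entries.

```python
import hashlib

def sha256_hex(text: str) -> str:
    """Unsalted SHA-256 hash, as hexadecimal."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Hypothetical library: hashes of common passwords, computed once and reused.
common_passwords = ["123456", "password", "qwerty", "letmein"]
hash_library = {sha256_hex(p): p for p in common_passwords}

# Hypothetical stolen database of unsalted hashes.
stolen_db = {
    "bighead": sha256_hex("password"),
    "richard": sha256_hex("Pi3d-P1per!tabs>spaces"),
}

# Cross reference: every stolen hash found in the library is a cracked password.
cracked = {
    user: hash_library[stored_hash]
    for user, stored_hash in stolen_db.items()
    if stored_hash in hash_library
}
```

In this sketch only Big Head's account ends up in `cracked`, because his password is in the library. The long, unpredictable password is not found and would have to be attacked by brute force instead.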
Hashing with Salts
Salt is another layer added to the password protection. Let’s use Big Head’s credential again.
With just hashing, no salting:
Big Head’s Password: password
Hash stored in AD (assuming SHA-256): a 256-bit value, jfu8ysxnngiruyshmxur93xn901imkl…….
With hashing and salting:
Big Head’s Password: password
Add Salt: passwordtgujr (tgujr is the Salt)
Salted Hash stored in AD (assuming SHA-256): a 256-bit value, ursiiej202hhncspouejj3nx98fnalkj…….
It’s worth noting that each password has its own randomly assigned Salt. A user’s password is stored as hash(password+Salt), and the Salt is stored as plaintext in AD alongside it. Even though the Salt is not secret, it makes the stored hashes relatively more difficult to crack. When hackers get access to the password database, they cannot simply rely on the pre-generated library of common passwords and hashes to find matches, as every password is combined with a Salt before hashing.
For a given salt, hackers need to regenerate the hashes for every common password in the library just to identify one password. To cover the whole directory, they have to do it repeatedly until they run through all the salts. Though very computing intensive and time consuming, it is doable!
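Here is a minimal sketch of salted hashing in Python, following the hash(password+Salt) scheme described above. The function names `enroll` and `verify` are our own, and the 16-byte salt length is an assumption. Production systems usually go one step further and use a deliberately slow function such as `hashlib.pbkdf2_hmac` or `hashlib.scrypt`, which this sketch leaves out.

```python
import hashlib
import hmac
import os
from typing import Tuple

def hash_with_salt(password: str, salt: bytes) -> str:
    """SHA-256 over password + salt, as described in the text."""
    return hashlib.sha256(password.encode("utf-8") + salt).hexdigest()

def enroll(password: str) -> Tuple[bytes, str]:
    """Pick a fresh random salt and return (salt, salted_hash) for storage."""
    salt = os.urandom(16)  # assumption: 16 random bytes per password
    return salt, hash_with_salt(password, salt)

def verify(password: str, salt: bytes, stored_hash: str) -> bool:
    """Recompute the salted hash with the stored salt and compare."""
    return hmac.compare_digest(stored_hash, hash_with_salt(password, salt))

# Two users who pick the same password end up with different stored hashes.
salt_a, hash_a = enroll("password")
salt_b, hash_b = enroll("password")
```

Because `hash_a` and `hash_b` differ, a single pre-generated library entry for "password" no longer matches either of them. The attacker has to redo the hashing work per salt, which is the extra cost described above.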
We discussed at length how password-based authentication works. Basically, all the credentials are stored in AD, a centralized database. Once accessed improperly by hackers, all the credentials could be stolen. If stored as plaintext, the stolen credentials could be available for sale on the Dark Web. If stored with hashing, hackers need a bit of computing power to crack those hashes, and the ones with common/predictable passwords are the most vulnerable. If stored with hashing and salting, hackers need a lot more computing power to reverse engineer the salted hashes. The quick takeaway is having credentials stored in one centralized database is very risky. It’s a gold mine for every hacker, and is susceptible to a variety of other attacks (e.g. phishing, malware), and it’s hackers’ holy grail to crack those hashes.
Password-based Authentication: ZKP comes to the rescue!
The centralized password database seems to be the problem with password-based authentication, because it enables many passwords to be accessed in one fell swoop if a system is compromised. Is there a way to eliminate that centralized database but still use passwords as the primary way to authenticate users? There is: Zero-Knowledge Proof (ZKP).
ZKP is a cryptographic approach to prove the validity of a statement without revealing any information of the statement. One party (prover) can prove to another party (verifier) that they know the value X without conveying any information of X apart from the fact that they know the value X.
Below is how password-based authentication works using Secure Remote Password (SRP) protocol, which conveys a zero knowledge proof from User to Server.
If Client has Password and Salt and Server has Verifier, they should both have the right inputs to come up with the same session key used to encrypt and decrypt messages. If messages can be verified, it not only proves that Client has Password, but also proves Server has the Verifier. Both are authenticated by their counterpart.
Via this approach, User’s Password (p) and Hashed password (x) never leave Client; they are never passed through the network or stored by Server. Server doesn’t need to know them to be able to authenticate User. In an adversarial scenario, an eavesdropper obtains Server’s Verifier (v), Salt (s) and g. Theoretically, they could reverse engineer x. With enough computing power, an adversary could also reverse engineer the password, given that the salt is already known. But this brute force approach is almost mission-impossible given that 1) brute force is very time consuming; 2) systems will set up a short time limit for each login session. Users will be asked to re-login, a new salt will be randomly generated, and hackers will have to start the brute force over before they get shut out by the system again.
This password-based approach is more secure than the traditional approach that is dependent on a centralized password database. One caveat with this approach is that it requires more communication between Client and Server. In the traditional password approach, Client sends username and password to Server, Server authenticates, and only one round trip is needed. In ZKP, 3 round trips (exchange v, s; exchange A, B; exchange encrypted messages) are required to perform one authentication, which takes longer than one round trip.
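For readers who want to see the arithmetic behind the symbols p, x, s, v, g, A and B, below is a toy sketch of an SRP-6a style exchange in Python. It is only an illustration under loud assumptions: the group parameters `N` and `g` are far too small for real use (real deployments use large standardized groups), the helper `H` and the function names are our own, and the safety checks and proof messages a real implementation needs are omitted.

```python
import hashlib
import secrets

# Toy group parameters (assumption): far too small to be secure.
N = 2**127 - 1  # a Mersenne prime
g = 3

def H(*parts) -> int:
    """Hash a mix of ints, bytes and strings into one integer."""
    h = hashlib.sha256()
    for part in parts:
        if isinstance(part, int):
            part = part.to_bytes((part.bit_length() + 7) // 8 or 1, "big")
        elif isinstance(part, str):
            part = part.encode("utf-8")
        h.update(len(part).to_bytes(4, "big") + part)
    return int.from_bytes(h.digest(), "big")

k = H(N, g)  # SRP-6a multiplier

def register(password: str):
    """Enrollment: Client derives x locally and hands only (s, v) to Server."""
    s = secrets.token_bytes(16)
    x = H(s, password)  # hashed password, never leaves Client
    v = pow(g, x, N)    # verifier stored by Server
    return s, v

def client_session_key(password, s, a, A, B) -> int:
    x = H(s, password)
    u = H(A, B)
    S = pow((B - k * pow(g, x, N)) % N, a + u * x, N)
    return H(S)

def server_session_key(v, b, A, B) -> int:
    u = H(A, B)
    S = pow((A * pow(v, u, N)) % N, b, N)
    return H(S)

def run_login(password_at_enrollment: str, password_at_login: str):
    """Simulate one login and return (client_key, server_key)."""
    s, v = register(password_at_enrollment)  # Server stores s and v only
    a = secrets.randbelow(N - 2) + 1         # Client's ephemeral secret
    A = pow(g, a, N)                         # Client -> Server
    b = secrets.randbelow(N - 2) + 1         # Server's ephemeral secret
    B = (k * v + pow(g, b, N)) % N           # Server -> Client
    return (client_session_key(password_at_login, s, a, A, B),
            server_session_key(v, b, A, B))
```

When the login password matches the enrolled one, both sides arrive at the same session key even though neither p nor x was ever sent. With a wrong password the two keys differ, so the encrypted messages exchanged in the last step cannot be verified.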
The ZKP approach is not dependent on the Server knowing exactly what User’s password is, but this is by no means to encourage users to use simple common words, like abcd, 1234, and password (Big Head 😜). Users still need to follow password best practices, like using numbers, letters and special characters, and changing the password periodically. Employees should still separate their work-related passwords from their personal passwords, as we don’t know which application is using ZKP, which stores our hashed passwords and which still stores passwords in plaintext.
Passwordless Management: Distributed Authentication
We understand the caveats of the centralized password-based authentication approach. Developers have come up with an approach to make authentication passwordless by incorporating a user’s mobile phone and biometric authentication. This approach is passwordless by default, which is different from the traditional 2FA or MFA approach where password and centralized password management are still needed.
How does passwordless authentication work?
Part A: User Enrollment
Charlie is enrolled via the Passwordless App on his phone. Charlie’s phone scans Charlie’s face or fingerprints. The mathematical representation of a face/fingerprint scan, not the actual scanned images, is stored on Charlie’s phone. The information is held locally in the secure enclave, a smartphone chip that is dedicated specifically to securing sensitive data.
Part B: User Authentication
This approach leverages widely available, mature technologies such as APIs, mobile SDKs and the biometric authentication embedded in smartphones. It conducts user authentication at the edge (i.e. the user’s mobile phone) and eliminates centralized password management, so neither Passwordless nor Dropbox has knowledge of Charlie’s credentials. Neither Passwordless nor Dropbox needs to collect or store users’ biometric data on their own servers. Charlie no longer needs to memorize different username/password combos for different apps; he only needs to memorize the PIN for the Passwordless App. If Charlie’s phone is stolen or hacked, only Charlie’s credentials would be compromised and the rest of the user base would not be impacted, which is a huge improvement in containing risk vs. the password-based approach.
The caveat of this approach is that it’s device dependent. From Charlie’s perspective, if he switches his phone or his phone is stolen, he’d need to register his new device and re-onboard himself, as his new device needs to authenticate and store his credentials, which might cause friction in the user experience.
Albeit at an early stage, this approach is theoretically more advanced than the traditional password-based approach. We have startups such as Woven, HYPR, Groove.ID and Secret Double Octopus evangelizing this approach to enterprises as well as SMBs. It might take a while to educate the market, especially the enterprises, as this approach will introduce a bit of change to the customers’ existing security design. Users might also need some education, as many users don’t want to take out their phones every time they log on to a service.
Passwordless Management: Decentralized Blockchain-based Authentication
Decentralized blockchain-based (or onchain) authentication distributes the authentication work to nodes/verifiers in the blockchain network, which compare fresh biometric samples submitted by users from any device with the users’ biometric templates stored onchain. Let’s start with how it works from a user’s perspective.
Part A: User Authentication
Decentralized onchain authentication, if scalable, highly available and secure, is theoretically superior to both centralized password-based authentication and passwordless authentication at the edge, as onchain authentication eliminates the centralized password database and distributes the authentication work to the nodes/verifiers without making users device dependent. However, we understand it’s a big IF, as the technology is still in its infancy. We’ll decompose decentralized authentication to identify the enabling technologies required to move this approach forward.
Part B: Decompose Decentralized Authentication (Step 5 and 6)
Enabler A: Randomness
Verifiers should be randomly assigned for each transaction to ensure decentralization and security. In our case, if this is a permissioned network and we have a few nodes dedicated to serving Dropbox’s authentication requests, this protocol is not necessarily far superior to the centralized one, where authentication is performed by one centralized node. From a security perspective, it’s actually worse than the device-dependent passwordless approach, where the authentication is fully distributed to the edge. The designated nodes could easily become targets for adversaries.
Bitcoin solved the randomness issue by adopting Proof-of-Work (PoW). We mentioned this earlier in the password approach. This consensus algorithm asks miners to come up with a value (nonce) such that the hash of the block is less than or equal to the current target set by the network. Whoever comes up with such a value first gets to propose the next block. The randomness is assured because the only way to do this is through brute force. The security of the chain is also assured, as the only way to attack the chain is by owning over 50% of the network’s computing power, which would cost $1.4B as of Nov’18. The major weaknesses of the Bitcoin network are 1) very high energy consumption, and 2) low throughput, which make it impossible to handle mission-critical transactions.
High-throughput, secure and decentralized protocols are still in their infancy. A few Layer 1 and Layer 2 protocol teams are working on this blockchain trilemma.
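To give a feel for why Proof-of-Work boils down to brute force, here is a toy mining loop in Python. It is a simplified sketch and not Bitcoin's actual algorithm: real mining hashes a structured block header with double SHA-256 and uses a network-adjusted target, while the `difficulty_bits` parameter, the 8-byte nonce encoding and the function names below are assumptions we chose to keep the example small.

```python
import hashlib

def block_hash(block_data: bytes, nonce: int) -> int:
    """Hash the block data together with a nonce and read it as a 256-bit number."""
    digest = hashlib.sha256(block_data + nonce.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big")

def target_for(difficulty_bits: int) -> int:
    """Largest acceptable hash value: the top `difficulty_bits` bits must be zero."""
    return 2 ** (256 - difficulty_bits) - 1

def mine(block_data: bytes, difficulty_bits: int) -> int:
    """Try nonces one by one until the hash is at or below the target."""
    target = target_for(difficulty_bits)
    nonce = 0
    while block_hash(block_data, nonce) > target:
        nonce += 1
    return nonce

DIFFICULTY_BITS = 16  # toy difficulty: roughly 65,000 attempts on average
winning_nonce = mine(b"toy block", DIFFICULTY_BITS)
```

Finding `winning_nonce` takes many attempts, but anyone can verify it with a single call to `block_hash`. Each extra difficulty bit doubles the expected work, which is where the energy cost comes from.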
The discussion on how project teams solve the trilemma (like sharding, plasma, different consensus algorithms, etc.) is beyond the scope of this article. We’d like to focus on onchain technologies that are more directly related to the key topic being discussed today, authentication. Assuming we have a proven high-performance, secure and decentralized protocol that ensures randomness, what are the enabling technologies required for us to have a functioning onchain authentication?
Enabler B: Privacy-First Protocol
Mainstream protocols like Bitcoin and Ethereum are not built for privacy-preserving smart contracts. People would argue Bitcoin/Ethereum is anonymous because users can transact bitcoins/ethers on chain with no personal information required. However, it’s not anonymous. It’s pseudonymous, as every transaction is linked to two specific addresses, a sender’s address and a receiver’s address. Transaction details including amount and time are recorded and revealed onchain. Given how sensitive users’ biometric data is, an open decentralized blockchain protocol is clearly not a viable choice for authentication. The mathematical representations of users’ face and fingerprints should not be made publicly available on the blockchain.
Authentication requires a privacy-preserving computing protocol. Let’s take a granular look at the approaches protocol teams have tried.
B1: Fully Homomorphic Encryption
Fully Homomorphic Encryption is an encryption technology that enables computation on encrypted input to generate output that, once decrypted, matches the output of the same computation on the original input. In the case of authentication, the protocol would encrypt the mathematical representation of Charlie’s biometric sample as well as the pre-stored biometric template. The protocol then conducts a one-to-one comparison of the encrypted captured sample with the encrypted template, and generates a matching score. Charlie’s access would be granted if the score is above the threshold.
Homomorphic encryption is still very much theoretical at this point. It can be used to perform simple arithmetic like addition/subtraction, but it is too early to be adopted for complicated computations, like authentication.
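The core idea of computing on ciphertexts can be seen in a classroom-sized example. The sketch below uses textbook RSA, which happens to be homomorphic for multiplication only, so it is a partially homomorphic scheme and not Fully Homomorphic Encryption. The tiny primes and key values are assumptions chosen so the numbers stay readable; they offer no security.

```python
# Textbook RSA with tiny, insecure parameters (assumption, illustration only).
p, q = 61, 53
n = p * q        # public modulus, 3233
e, d = 17, 2753  # public and private exponents; e * d = 1 mod (p-1)*(q-1)

def encrypt(m: int) -> int:
    return pow(m, e, n)

def decrypt(c: int) -> int:
    return pow(c, d, n)

a, b = 7, 6

# Whoever computes this line sees only ciphertexts, never a or b.
encrypted_product = (encrypt(a) * encrypt(b)) % n

# Only the key holder can decrypt, and the result equals a * b.
result = decrypt(encrypted_product)
```

Here `result` equals `a * b`, even though the multiplication was performed on encrypted values. A fully homomorphic scheme would support both addition and multiplication on ciphertexts, which is what arbitrary computations such as biometric matching would require.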
B2: Trusted Execution Environment (TEE)
TEE is a secure area of a main processor. It guarantees that code and data loaded inside secure enclaves are protected with respect to confidentiality and integrity. Unlike other privacy approaches, TEE has already been widely adopted and accepted in our day-to-day life. For example, iPhone’s Touch ID and Face ID are enabled by a secure enclave inside the phone, which stores critical authentication information. The encrypted biometric data goes into the enclave, which decrypts and processes the data, and encrypts the output before it leaves the enclave. The mainstream hardware chips that offer TEE are Intel’s Software Guard Extensions (Intel SGX), followed by AMD’s Secure Execution Environment and ARM’s TrustZone.
In the case of authentication, Charlie’s encrypted biometric sample and stored template are forwarded to verifiers in the blockchain network. Verifiers perform authentication in their own TEE without Charlie’s data being exposed outside of the secure enclaves. This approach is scalable, as computing is offloaded offchain and is not subject to onchain scalability issues. The approach is also easier to adopt, as most computers are equipped with one of the three mainstream chips mentioned above.
Despite its high-throughput nature, a major risk of this approach is its dependence on Intel/AMD/ARM. We need to trust that the TEEs the chip makers created are actually secure. Last year, vulnerabilities like Meltdown, Spectre and Foreshadow, uncovered by researchers, gave people pause on the implicit security guarantee. Although Intel issued patches for the vulnerabilities discovered, given that secure enclaves have been involved and will continue to be involved in most mission-critical tasks, it won’t be surprising if they continue to be adversaries’ targets.
We have initiatives from Oasis Labs and Keystone teams to build an open-source secure enclave. The open-source design would 1) allow the community to openly evaluate and improve on the system; 2) allow chips to be manufactured by any chip maker, which reduces the concentration risk on Intel/AMD/ARM.
B3: Zero-Knowledge Proof (ZKP)
We discussed ZKP at length earlier in the article, as we argue ZKP is a lot more secure than a password-based approach that stores hashed passwords in a centralized database. ZKP could be more easily adopted by the password-based approach than by passwordless, as it’s much easier to provide a ZKP of passwords than of biometrics. For biometrics, researchers proposed to generate a biometric key based off the biometric sample by 1) extracting features; 2) vectorizing the characteristics; 3) weighing the characteristics; 4) finding the key vector that is closest to the series of weights; 5) generating a cryptographic key based on the key vector. Performing ZKP on the cryptographic keys should be much easier than ZKP on pixel-based biometric images. This technology is still very early in terms of implementation, as generating a proof is still a slow process that is far from enterprise-grade.
B4: Secure Multiparty Computing (sMPC)
sMPC has multiple parties jointly perform a computation based on the inputs they hold. Every party only sees a portion of the encrypted data. Unless all the parties collude, no one has access to the full input.
We want to compute Z = X * Y, where X = 30 and Y = 70; but we don’t want to reveal the value of X and Y
We split X into X1, X2 and X3, where X1= 2, X2= 3, X3= 5
We split Y into Y1, Y2 and Y3, where Y1= 5, Y2= 2, Y3= 7
We distribute (X1 , Y1), (X2 , Y2), (X3 , Y3) across 3 nodes. Each node comes up with their own multiplication result, Z1= 10, Z2= 6, Z3= 35
Z = Z1 * Z2 * Z3 = 2100
No node has the value of X or Y, unless all of them decide to collude
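The worked example above can be replayed in a few lines of Python. The first half reproduces the article's numbers exactly. The second half is an addition of ours and not part of the example: it shows additive secret sharing modulo a prime, a common way to split a value into random-looking shares, with the modulus `P` and the function names chosen purely for illustration.

```python
import secrets
from math import prod

# The example from the text: X = 30 and Y = 70 split into multiplicative shares.
x_shares = [2, 3, 5]
y_shares = [5, 2, 7]

# Each of the 3 nodes multiplies only the pair of shares it holds.
partial_products = [xi * yi for xi, yi in zip(x_shares, y_shares)]

# Combining the partial results yields Z without any node seeing X or Y.
z = prod(partial_products)

# Assumption (not from the text): additive sharing modulo a prime, for sums.
P = 2**61 - 1  # a Mersenne prime used as the modulus

def additive_shares(secret: int, parties: int) -> list:
    """Split a secret into random shares that add up to it modulo P."""
    shares = [secrets.randbelow(P) for _ in range(parties - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

def secure_sum(a: int, b: int, parties: int = 3) -> int:
    """Each party adds its own two shares; only the combined total is revealed."""
    a_shares = additive_shares(a, parties)
    b_shares = additive_shares(b, parties)
    local_sums = [(ai + bi) % P for ai, bi in zip(a_shares, b_shares)]
    return sum(local_sums) % P
```

With random shares, an individual share reveals nothing about the secret on its own, whereas small factors like 2, 3 and 5 do leak some information. That is one reason practical protocols tend to favor randomized sharing.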
Besides low-throughput, one caveat of sMPC is that the output depends on each party to perform the computation honestly. If one party has gone rogue, the output will be incorrect.
In the case of authentication, which involves much more complicated computation than simple addition and multiplication, the sMPC-based protocol would need to split the binary data of Charlie’s biometric sample and stored template. Each party only gets a portion of both the sample and the template, and compares them to see if there is a pattern underneath. The challenge then is how to split the sample and template so that each piece is small enough to be meaningless on its own, but big enough to allow for pattern recognition.
Enabler C: Decentralized Storage
Beyond privacy-preserving computation, the blockchain network needs storage to hold users’ biometric templates in order to perform verification. How does storage work on blockchain? A mainstream network such as Bitcoin requires every full Bitcoin Core node to download and store the entire ledger of onchain transactions. Bitcoin transaction data is pseudonymous by nature, so it is reasonable for every full node to have the full copy.
Users’ login credentials such as biometric templates are private data and should not be publicly stored where every node has a full copy. Hence it’s important to introduce decentralized storage, where data is encrypted, split into pieces and shared across nodes that have idle storage capacity.
We’ll use IPFS as a decentralized storage protocol for illustrative purposes. To store users’ biometric templates, the full template and each block within it (a block is a single unit of data) will be assigned a unique cryptographic hash. Blocks are stored across multiple random nodes. Every node is equipped with a storage index to know which node is storing what. During retrieval, the network will find the nodes storing the data based on the unique cryptographic hash.
This enabling technology is still early, as we need to solve technical issues such as scalability, proof of space and proof of replication, etc.
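The idea of addressing data by its hash can be sketched with a toy in-memory network in Python. This is not the IPFS API or its data format: the `ToyNetwork` class, the 4-byte `BLOCK_SIZE`, the newline-separated manifest and the hash-based node placement are all assumptions made for illustration, and the encryption step mentioned earlier is left out.

```python
import hashlib

BLOCK_SIZE = 4  # bytes per block; tiny on purpose so the example splits data

def content_id(data: bytes) -> str:
    """The address of a piece of data is the hash of its content."""
    return hashlib.sha256(data).hexdigest()

class ToyNetwork:
    def __init__(self, node_count: int):
        self.nodes = [dict() for _ in range(node_count)]  # per node: hash -> block
        self.index = {}                                    # hash -> node number

    def put_block(self, block: bytes) -> str:
        key = content_id(block)
        node_id = int(key, 16) % len(self.nodes)  # placement derived from the hash
        self.nodes[node_id][key] = block
        self.index[key] = node_id
        return key

    def put(self, data: bytes) -> str:
        """Split data into blocks, store each, and return one root hash."""
        keys = [self.put_block(data[i:i + BLOCK_SIZE])
                for i in range(0, len(data), BLOCK_SIZE)]
        manifest = "\n".join(keys).encode("ascii")
        return self.put_block(manifest)

    def get_block(self, key: str) -> bytes:
        block = self.nodes[self.index[key]][key]
        if content_id(block) != key:  # the hash doubles as an integrity check
            raise ValueError("block does not match its hash")
        return block

    def get(self, root: str) -> bytes:
        manifest = self.get_block(root).decode("ascii")
        keys = manifest.split("\n") if manifest else []
        return b"".join(self.get_block(k) for k in keys)

network = ToyNetwork(node_count=3)
root_hash = network.put(b"encrypted-template-bytes")
restored = network.get(root_hash)
```

The caller only needs to remember `root_hash`. The network uses the index to find each block, and because every block is looked up by its own hash, a node that tampers with a block is detected at retrieval time.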
We live in a world dominated by password-based authentication, and we have seen risks and incidents associated with that approach. We discussed the dominant approach and a couple of its alternatives. A quick rundown on pros and cons:
Authentication is critical to users’ day-to-day life as well as businesses’ day-to-day operations. We believe the three new approaches we discussed could coexist in the future. There is no clear winner at this point that could perfectly balance security, technology maturity and user experience.
For most companies out there that are using password-based authentication, we encourage them to:
- Make sure customer/employee passwords are stored in hashes with salts
- Implement 2-step user authentication: Once hackers gain access to the active directory, it’s just a matter of time before they crack some user accounts and passwords, especially the ones using common words and phrases. 2-step verification adds another level of security. Big Head’s account could be easily hacked. If he receives a text code to access his account, and he wasn’t the one asking for it, he’d know someone is trying to access his account without his permission. He could immediately log in and change his password, hopefully not password as password this time.😉
- Keep personal and professional passwords separate: Dropbox’s incident where 60M+ user accounts were breached happened due to an employee’s password reuse. Hackers gained access to that employee’s password from the previous LinkedIn breach. As we cannot be sure every product/service is implementing a strong authentication protocol, we should at least separate work-related passwords from personal passwords to contain the damage.
- Stay open-minded about the ZKP and passwordless based authentication approaches as the fundamental technology continues to evolve and mature
For companies who are working on device-dependent passwordless authentication
Approach go-to-market in the Enterprise space with patience
- The current market might be hard to crack at this point, as passwordless is quite disruptive to the existing centralized password management; enterprises also have alternatives, such as shoring up their network, endpoint and cloud security to protect their active directory from being hacked.
- Some enterprises might be willing to do pilots for specific use cases with limited deployment; use them as your design partners to iterate on the product and to build credibility.
- Build partnerships with existing IAM vendors. Secret Double Octopus, the password-free enterprise authentication vendor, recently announced a partnership with Okta. As part of the joint solution, Secret Double Octopus complements Okta’s solution by providing passwordless access to all enterprise’s services and applications.
- SMBs are more likely to become early adopters as passwordless management is theoretically secure and simple to manage (the actual authentication is distributed to users’ devices at the edge).
- Need to come up with a solution or back-up plan for edge cases like phone stolen/lost/broken
- This approach requires a lot of evangelism to educate the market at the early stage. Make sure you have a talented, creative product marketer. 😄
For builders and project teams that are working on the privacy methodologies such as fully homomorphic encryption, sMPC, TEE and ZKP and Web 3.0 stacks e.g. protocol layer, network layer, infrastructure layer
- Hats off to all of you that are involved in this Web 3.0 initiative that is to bring the Internet back to what it was supposed to be — to assure the open development, evolution and use of the internet for the benefit of all people throughout the world — and to return privacy and control to users.
- Authentication is very likely to be tackled via Decentralized Identity (DID), a much broader initiative that enables users to own their own identity, control and protect their information, and share their information with entities that request it without being dependent on centralized third parties such as governments and banks.
- The enabling technologies such as privacy methodologies, consensus, decentralized network and decentralized storage are critical to the development of Web 3.0 (aka decentralized web), and the front end of authentication or DID would sit on the application layer among many other decentralized applications.
Authentication is such a fascinating topic. This article is by no means comprehensive coverage of everything you need to know about authentication or how it will evolve in the future. If anything, we just scratched the surface. Many enabling technologies mentioned above need to happen before we can have a secure, high-performance authentication approach with great user experience. We look forward to seeing multidisciplinary collaboration among computer scientists, mathematicians, cryptographers, economists, etc, to advance these developments. 🤓
We feel extremely honored to get to know a lot of amazing entrepreneurs and builders in this community. We consider knowledge sharing as a way for us to pay it forward as we have learned so much from the community. Feel free to email me at firstname.lastname@example.org to exchange thoughts. (Fun fact about Linda: When she’s not at work, she’s either browsing cute corgis and shibas on Instagram or lost in her own thoughts on decentralization vs. centralization 😜🤓🐶)