Important: Research and details described in this post are meant to help defenders/users protect their passwords from cracking. The purpose of this data is to demonstrate the resiliency of effective password creation. **Do not** request access to the raw data, including the final cracked passwords. These requests will be ignored.
What makes a password hash survive for years after a leak and remain uncracked? Despite some website breaches happening nearly 4 years ago, and scores of cracking hobbyists working the Hashes.org dataset, 1,708,665 MD5 hashes across many sources remained uncracked. I thought there must be something new to discover that could yield additional knowledge of what makes a strong user-generated password. So I fired up the "budget cracking rig" and "pentester rig" for a little over two months to find out. Both ran continuously for this period, iterating over different attacks, sifting the found passwords, and beginning the process again. All told, 75,971 of the original 1,708,665 MD5 hashes hosted on Hashes.org were cracked (roughly 4.4%). It's worth noting that some analysis of the results may be skewed by the regional language of the dataset and the sites from which the hashes originated. FYI, I didn't find the equivalent of the lost city of Atlantis in the newly cracked passwords ;)
The cracking methodology for this experiment was built on a modified version of Larry Spohn's Hate Crack (pretty awesome, check it out), a Hashcat wrapper script. The modifications allowed Hashcat to perform a series of attacks, each series lasting 24 hours.
ATTACK #1 - Purple Rain Attack
In an attempt to find extremely unique character combinations, the Purple Rain Attack was implemented with 500,000 random rules generated for each series. Generating fresh rules every time bolstered the "randomness" of the rule creation. Since we are searching for things possibly never seen in other cracked datasets, it helps to be unpredictable in our exploration.
- Upon completion of this series, sort out the newly cracked passwords and add them into a custom wordlist 'cracked_list.txt'.
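The "sort out the newly cracked passwords" step can be sketched in a few lines of Python. This is a minimal illustration, not the script used in the experiment: it assumes the potfile holds one `hash:plain` entry per line (as is the case for raw MD5) and that some plains may be stored in Hashcat's `$HEX[...]` notation. The function names `potfile_plains` and `merge_into_wordlist` are hypothetical.

```python
def potfile_plains(lines):
    """Yield the plaintext half of each 'hash:plain' potfile line."""
    for line in lines:
        line = line.rstrip("\r\n")
        if ":" not in line:
            continue  # not a hash:plain entry, skip it
        # Split on the FIRST colon only: an MD5 hash has no colon,
        # but the plaintext itself may contain one.
        _, plain = line.split(":", 1)
        # Assumption: plains wrapped in $HEX[...] are hex-encoded bytes.
        if plain.startswith("$HEX[") and plain.endswith("]"):
            plain = bytes.fromhex(plain[5:-1]).decode("utf-8", errors="replace")
        if plain:
            yield plain


def merge_into_wordlist(existing, new_plains):
    """Append plains not already in the wordlist, keeping first-seen order."""
    seen = set(existing)
    merged = list(existing)
    for plain in new_plains:
        if plain not in seen:
            seen.add(plain)
            merged.append(plain)
    return merged
```

In practice the two inputs would be read from `hashcat.potfile` and `cracked_list.txt`, and the merged list written back out before the next series starts.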
ATTACK #2 - Prince w/ MASTERBLASTER
A traditional PRINCE Attack was performed combined with ALL the rules that Hashcat provides. I removed duplicates and combined these rules into one rule file called MASTERBLASTER.rule. I also ensured the “:” (Do Nothing) was the first rule in the list to take advantage of the PRINCE output immediately.
- Example command of what ATTACK #2 looked like:
cat cracked_list.txt dict1.txt dict2.txt dict3.txt | shuf | pp64.bin --pw-min=8 | hashcat -a 0 -m 0 -w 4 hashes.txt -r MASTERBLASTER.rule
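The rule-merging step behind MASTERBLASTER.rule (remove duplicates, put ":" first) could look roughly like the sketch below. It is an assumption about how such a file might be built, not the exact procedure used here. `merge_rules` is a hypothetical helper, and duplicates are detected by exact text only, so two differently written but equivalent rules both survive. Lines are deliberately not stripped of spaces, because a space can be a meaningful rule argument.

```python
def merge_rules(rule_texts):
    """Merge the contents of several rule files into one de-duplicated list.

    ':' (do nothing) is always emitted first so the unmodified
    candidates are tried before any mangling rule.
    """
    merged = [":"]
    seen = {":"}
    for text in rule_texts:
        for line in text.splitlines():
            # Skip blank lines and '#' comment lines.
            if not line or line.startswith("#"):
                continue
            if line not in seen:
                seen.add(line)
                merged.append(line)
    return merged


if __name__ == "__main__":
    # Two tiny in-memory "rule files" standing in for real ones.
    file_a = "l\n:\n$1\n"
    file_b = "# comment\n$1\nu\n"
    print("\n".join(merge_rules([file_a, file_b])))
```

Keeping first-seen order means the ordering of the source rule files is preserved after the leading ":".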
ATTACK #3 - Mask Attack of recently cracked
When Attacks #1 and #2 completed, I extracted the passwords cracked in the last 48 hours from the hashcat.potfile. These passwords were run through PACK’s maskgen to identify any unique character class combinations. This allowed me to dive deep into these newly cracked passwords to see how common their compositions were among the user base. Lastly, it ensured that I’d “shake out” the simplest of the passwords before we started the series over.
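To make the mask step concrete, here is a minimal sketch of turning passwords into Hashcat-style masks and counting them. It is written in the spirit of PACK but is not PACK's code. `to_mask` and `mask_counts` are hypothetical names, and as a simplification every character that is not an ASCII letter or digit is lumped into `?s`.

```python
from collections import Counter
import string


def to_mask(password):
    """Map each character to a Hashcat mask placeholder."""
    parts = []
    for ch in password:
        if ch in string.ascii_lowercase:
            parts.append("?l")
        elif ch in string.ascii_uppercase:
            parts.append("?u")
        elif ch in string.digits:
            parts.append("?d")
        else:
            parts.append("?s")  # simplification: everything else is "special"
    return "".join(parts)


def mask_counts(passwords):
    """Return (mask, count) pairs, most common mask first."""
    return Counter(to_mask(p) for p in passwords).most_common()
```

Masks that turn up often among recent cracks are cheap to run as a mask attack before the next series begins.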
Repeat this process nearly uninterrupted for 65 days.
You might be thinking, “I can’t wait to get my hands on that debug output from Hashcat so I can view the successful rules.” Unfortunately, at the conclusion of this experiment I realized I had deleted the file by accident :( This is, again, a cautionary tale about moving and deleting files: rookie mistakes are still made by the best of us. I did look over the rules for many days before this calamity, and the reality is that the uniqueness of the rules wasn’t really a factor in the cracking process. PRINCE's ability to generate new candidates was the main factor in the modest success. If a rule was successful once, it was typically either such an outlier that its usefulness in the future was nearly zero, or a common rule combination already in the base Hashcat ruleset. Do not fear, I do have the relevant mask output from PACK. You can find that dataset <HERE> on GitHub. If I ever attempt to improve on this process again, I’ll be sure to protect those debug rules with my life :) If you need a reference for raking and generated rules, check out Evil Mog's generated2.rule writeup.
To perform analysis on the basic structure and statistics of the passwords I used PACK's functionality. It truly is a quick, indispensable little utility for compiling statistics of passwords. The most basic of statistics to evaluate is the overall length of the passwords in the dataset.
LENGTH: PERCENTAGE of 75,971 (COUNT)
13: 21% (16508)
11: 21% (16262)
12: 20% (15358)
10: 14% (11280)
14: 11% (8994)
16: 02% (2254)
9: 02% (2202)
15: 02% (1651)
17: 00% (476)
18: 00% (297)
19: 00% (163)
20: 00% (116)
7: 00% (85)
6: 00% (78)
21: 00% (77)
8: 00% (59)
22: 00% (45)
23: 00% (25)
24: 00% (13)
25: 00% (11)
26: 00% (7)
27: 00% (2)
28: 00% (2)
2: 00% (1)
4: 00% (1)
5: 00% (1)
32: 00% (1)
35: 00% (1)
38: 00% (1)
Not surprisingly, since this dataset has been around for years, anything under 8 characters in length is very rare.
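A length table in this layout is easy to reproduce. The sketch below is an illustration rather than PACK's implementation. `length_table` and `format_row` are hypothetical names, and the percentage is truncated to a whole number on the assumption that this is how the figures above were produced (16508 of 75,971 is about 21.7%, shown as 21%).

```python
from collections import Counter


def length_table(passwords):
    """Return (length, truncated_percent, count) rows, most common first."""
    counts = Counter(len(p) for p in passwords)
    total = sum(counts.values())
    return [
        (length, count * 100 // total, count)  # integer division truncates
        for length, count in counts.most_common()
    ]


def format_row(length, percent, count):
    """Render a row in the 'LENGTH: PERCENTAGE (COUNT)' layout used above."""
    return f"{length}: {percent:02d}% ({count})"
```

Truncation is also why the rare lengths all show as 00% even though their counts are not zero.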
Character classes used throughout the recovered password set:
loweralphanum: 40% (30422)
loweralphaspecialnum: 16% (12839)
mixedalphanum: 15% (12050)
upperalphanum: 05% (4351)
loweralpha: 04% (3677)
mixedalpha: 04% (3229)
all: 03% (2781)
loweralphaspecial: 03% (2297)
specialnum: 01% (1071)
mixedalphaspecial: 01% (1008)
numeric: 01% (946)
upperalphaspecialnum: 01% (810)
upperalphaspecial: 00% (234)
upperalpha: 00% (187)
special: 00% (69)
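The character class labels in this table can be derived mechanically from which classes a password contains. The sketch below is my reading of the naming scheme as it appears in the table, not PACK's source. `charset_name` is a hypothetical helper, and any character that is not an ASCII letter or digit is treated as special.

```python
import string


def charset_name(password):
    """Name the character-class combination used by a password."""
    lower = any(c in string.ascii_lowercase for c in password)
    upper = any(c in string.ascii_uppercase for c in password)
    digit = any(c in string.digits for c in password)
    special = any(
        c not in string.ascii_letters and c not in string.digits
        for c in password
    )

    if lower and upper:
        alpha = "mixedalpha"
    elif lower:
        alpha = "loweralpha"
    elif upper:
        alpha = "upperalpha"
    else:
        alpha = ""

    # Mixed case plus digits plus specials is labeled "all" in the table.
    if alpha == "mixedalpha" and digit and special:
        return "all"

    name = alpha + ("special" if special else "") + ("num" if digit else "")
    # Digits alone are labeled "numeric" rather than "num".
    return "numeric" if name == "num" else name
```

For example, a password made only of lowercase letters and digits comes out as `loweralphanum`, the largest class in the table.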
Simple masks give a high-level view of character class combinations. The largest patterns clearly converge on digits: the two most common simple masks both pair a string with digits.
stringdigit: 29% (22389)
digitstring: 13% (10539)
othermask: 13% (10029)
stringdigitstring: 12% (9659)
string: 09% (7093)
stringspecialdigit: 05% (4150)
stringdigitspecial: 03% (2838)
digitspecialstring: 02% (1903)
stringspecialstring: 01% (1248)
digitstringspecial: 01% (1221)
digitstringdigit: 01% (1004)
digit: 01% (946)
stringspecial: 01% (942)
specialstringdigit: 00% (332)
digitspecial: 00% (318)
specialstringspecial: 00% (296)
specialdigitspecial: 00% (240)
specialdigit: 00% (222)
digitspecialdigit: 00% (209)
specialdigitstring: 00% (165)
specialstring: 00% (159)
special: 00% (69)
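A simple mask collapses each run of same-class characters into a single token. The sketch below shows the idea. `simple_mask` is a hypothetical name, and the rule that anything with more than three runs becomes `othermask` is an assumption inferred from the table, where no named category has more than three parts.

```python
import string


def simple_mask(password):
    """Collapse runs of letters, digits and specials into one token each."""
    groups = []
    for ch in password:
        if ch in string.ascii_letters:
            kind = "string"
        elif ch in string.digits:
            kind = "digit"
        else:
            kind = "special"
        # Start a new group only when the character class changes.
        if not groups or groups[-1] != kind:
            groups.append(kind)
    # Assumption: more than three runs is reported as "othermask".
    return "".join(groups) if len(groups) <= 3 else "othermask"
```

Under this scheme a word followed by a number, the most common shape in the table, maps to `stringdigit`.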
From a quick summary of the most prevalent mask character compositions, it can be seen that these passwords most likely have an Asian country influence. I mention this tendency in my manual Hash Crack: users in Eastern countries show a preference for digits in their passwords. This has two simple explanations in my opinion:
1) Asian cultures tend to value numbers for personal, religious, national, or cultural significance.
2) Due to encodings, character creation, and universal use of Latin characters for most websites, it makes more sense to steer towards the use of numbers for compatibility and recall.
?l?l?l?l?d?d?d?d?d?d?d?d?d?d: 03% (2429)
?l?l?l?l?l?l?l?l?d?d?d?d: 02% (2189)
?l?l?l?l?l?l?l?l?l?l?l: 02% (2127)
?d?d?d?d?d?d?d?d?d?d?l?l?l?l: 02% (1906)
?u?u?u?u?d?d?d?l?l?l: 01% (1509)
?d?d?d?d?d?d?d?d?d?l?l?l?l?l: 01% (1419)
?d?d?d?d?d?d?d?d?d?d?l?l?l: 01% (1255)
?l?l?l?d?d?d?d?d?d?d?d?d?d: 01% (1218)
?l?l?l?l?d?d?d?d?d?d?d?d?d: 01% (1178)
?l?l?l?l?l?l?d?d?d?d?d?d?d: 01% (1137)
?d?d?d?d?d?d?d?d?d?d?d?d?d?d?d?d: 01% (925)
?l?l?d?d?d?d?d?d?d?d?d?d?d: 01% (884)
?d?d?d?d?d?d?d?d?l?l?l?l?l: 01% (866)
?l?l?l?d?d?d?d?d?d?l?l?l: 01% (842)
?l?l?l?l?l?l?l?l?l?d?d: 01% (760)
I used Fnord Pattern Extractor, a tool meant for extracting patterns from obfuscated code, to measure entropy, hex sequences, and Levenshtein distance. I find a little pleasure in being able to repurpose tools for an unforeseen use case. The output of this tool was troublesome to get into text format for display purposes, so I've taken a screenshot of the results. It can be tuned with several different parameters; this example uses the default settings.
Reviewing the top patterns shows they are challenging, but nothing seems too surprising. Looking at a random sample of strings beside their masks, it remains difficult to decipher the underlying password composition. As stated earlier, the Eastern country influence is apparent in these passwords, but pulling out the underlying linguistic pattern proved a difficult task.
Looking over the above examples, some are just unpredictable letter combinations picked for an unknown reason by the composer. Other strings point to a translation between foreign keyboard mappings and how those mappings manifest themselves in Latin ASCII (pinyin). Take, for instance, the password Vacjz.5201314 contained within the dataset, and its composition. The digits 5201314, when spoken out loud in Chinese, phonetically translate into "I Love You Forever and Ever". The Vacjz translates into what I can only assume is pinyin for the individual who is loved.
For more information on Chinese passwords and their composition, read these excellent posts:
Sunnia Ye "A Study of Chinese Passwords"
Comparative Analysis of Three Language Spheres: Are Linguistic and Cultural Differences Reflected in Password Selection Habits?
Length and unique character combinations contributed greatly to these hashes' survival. The lost city of untapped password knowledge was not found, and thus the same recommendations still remain: use a password manager, create longer unique passwords, and follow the NIST standards (although I disagree with the 8 character minimum). I really hoped to find something more impressive than the same old recommendations. Maybe there could be more to the data than what the eyes see... I'm open to feedback on Twitter @netmux