Hacker Email Address Parsing
It’s pretty common nowadays to see email addresses that are publicly posted on the web formatted so that they are hard to spot for spambots that rip through pages, parsing out addresses and adding them to various nefarious mailing lists. Such addresses often look like “myaddress @ mysite . com” or some variation of it.
One thing that’s always struck me as funny is that a halfway decent programmer can pretty easily use regular expressions to parse through text and still grab these ‘hidden’ addresses. Even though the addresses are not traditionally formatted, they must still follow a pattern to be readable by humans, and if there’s a pattern, you should be able to model it to some degree of accuracy with a regular expression; that’s exactly what regex is for. To that end, I decided to see how much effort it would take to write those regular expressions, and how accurate I could make them. It’s important to note that I didn’t spend a lot of time optimizing my regex or getting the kinks out; this was just a toy project. All the regex that follows is C# flavored, so if you use Perl, JavaScript, or some other flavor, you may have to adapt it to suit your needs.
Here’s a good regex string that validates email addresses, which I used as a base for my modifications:
\w+(?:[-+.]\w+)*@\w+(?:[-.]\w+)*\.\w+(?:[-.]\w+)*
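To see what the base pattern does and doesn’t catch, here is a minimal sketch in Python rather than C# (an assumption made purely for easy experimenting; this particular pattern should behave the same in both flavors). The sample text and the example.com address are made up for illustration.

```python
import re

# Base pattern from the post: matches conventionally formatted addresses.
BASE = re.compile(r"\w+(?:[-+.]\w+)*@\w+(?:[-.]\w+)*\.\w+(?:[-.]\w+)*")

# Hypothetical sample text: one plain address and one obfuscated with spaces.
text = "Mail first.last@example.com or webaddress @ somesite . com for details."

found = BASE.findall(text)
print(found)  # ['first.last@example.com'] - the spaced-out address is missed
```

The spaced-out address slips through, which is exactly the gap the modifications that follow try to close.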
Here was my first address to validate:
webaddress @ somesite . com
I modified my regex as such to allow spaces in an email address:
\w+\s*(?:[-+.]\s*\w+)*\s*@\s*\w+\s*(?:[-.]\s*\w+)*\.\s*\w+(?:\s*[-.]\s*\w+)*
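Here is a rough Python sketch of how a harvester might use this pattern (Python instead of C# is my assumption for ease of testing, and `find_spaced` is just a hypothetical helper name). It finds each match and squeezes the whitespace back out to recover a usable address.

```python
import re

# Whitespace-tolerant pattern from the post, unchanged.
SPACED = re.compile(
    r"\w+\s*(?:[-+.]\s*\w+)*\s*@\s*\w+\s*(?:[-.]\s*\w+)*\.\s*\w+(?:\s*[-.]\s*\w+)*"
)

def find_spaced(text):
    """Return every match with all whitespace removed."""
    return [re.sub(r"\s+", "", m.group()) for m in SPACED.finditer(text)]

addresses = find_spaced("Contact: webaddress @ somesite . com")
print(addresses)  # ['webaddress@somesite.com']
```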
Sweet, that worked! The above regex just allows whitespace in the address, and it works remarkably well. Next task:
webaddress dot somethingelse @ blah dot something . com
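This address substitutes ‘.’ with the word ‘dot’, but it should still be pretty easy to overcome. The one catch is that ‘dot’ has to go into an alternation such as (?:[-.]|dot); inside square brackets, the letters d, o and t would each be treated as separate single characters:
\w+(?:\s*(?:[-+.]|dot)\s*\w+)*\s*@\s*\w+(?:\s*(?:[-.]|dot)\s*\w+)*\s*(?:\.|dot)\s*\w+(?:\s*(?:[-.]|dot)\s*\w+)*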
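A small Python sketch of this step follows; it uses an explicit alternation for the word ‘dot’, and both the choice of Python and the helper name `find_dotted` are my own assumptions for illustration. After matching, it turns the standalone word ‘dot’ back into a period and removes the remaining whitespace.

```python
import re

# Variant of the pattern that accepts the word 'dot' wherever '.' may appear.
DOTTED = re.compile(
    r"\w+(?:\s*(?:[-+.]|dot)\s*\w+)*"   # local part
    r"\s*@\s*"
    r"\w+(?:\s*(?:[-.]|dot)\s*\w+)*"    # domain labels
    r"\s*(?:\.|dot)\s*\w+"              # final separator and label
    r"(?:\s*(?:[-.]|dot)\s*\w+)*"       # any further labels
)

def find_dotted(text):
    """Return matches with ' dot ' rewritten as '.' and whitespace removed."""
    results = []
    for m in DOTTED.finditer(text):
        s = re.sub(r"\s+dot\s+", ".", m.group())
        results.append(re.sub(r"\s+", "", s))
    return results

addresses = find_dotted(
    "Write to webaddress dot somethingelse @ blah dot something . com"
)
print(addresses)  # ['webaddress.somethingelse@blah.something.com']
```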
Again, easy as pie. All you have to do is look for ‘dot’. It is trivial to extend this to include things like ‘dash’ and ‘plus’ and ‘period’. Now, let’s throw some illegal characters in there too:
<webaddress>@somesite.com dot au
Here’s the regex I created to solve this one, with \W* allowed wherever stray characters might show up:
\w+(?:\W*(?:[-+.]|dot)\W*\w+)*\W*@\W*\w+(?:\W*(?:[-.]|dot)\W*\w+)*\W*(?:\.|dot)\W*\w+(?:\W*(?:[-.]|dot)\W*\w+)*
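The match now drags along whatever junk sits between the pieces, so a harvester has to clean it up afterwards. Below is a hedged Python sketch of that idea (Python and the helper name `find_noisy` are assumptions of mine, and the pattern writes ‘dot’ as an alternation). It keeps only the characters that could plausibly belong to an address.

```python
import re

# Variant that tolerates any run of non-word characters between the pieces.
NOISY = re.compile(
    r"\w+(?:\W*(?:[-+.]|dot)\W*\w+)*"   # local part
    r"\W*@\W*"
    r"\w+(?:\W*(?:[-.]|dot)\W*\w+)*"    # domain labels
    r"\W*(?:\.|dot)\W*\w+"              # final separator and label
    r"(?:\W*(?:[-.]|dot)\W*\w+)*"       # any further labels
)

def find_noisy(text):
    """Return matches with 'dot' rewritten and stray characters stripped."""
    results = []
    for m in NOISY.finditer(text):
        s = re.sub(r"\s+dot\s+", ".", m.group())
        # Drop everything except word characters and @ . + -
        results.append(re.sub(r"[^\w@.+-]", "", s))
    return results

addresses = find_noisy("<webaddress>@somesite.com dot au")
print(addresses)  # ['webaddress@somesite.com.au']
```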
Hmmm… that regex is getting kind of ugly, but it does what I asked of it (for now). I assume that since I used \W, which matches any non-word character, I’m opening myself up for some false positives. Next one up:
webaddress dot blah at findme dotcom
Here’s the regex for that one:
\w+(?:\W*(?:[-+.]|dot)\W*\w+)*\W*(?:@|at)\W*\w+(?:\W*(?:[-.]|dot)\W*\w+)*\W*(?:\.|dot)\W*\w+(?:\W*(?:[-.]|dot)\W*\w+)*
There’s only one minor, seemingly trivial difference between this one and the last: it just adds the word ‘at’, much like when I added ‘dot’ earlier. No big deal, right? Well, if I thought I had some false positives before, let me tell you, it’s really starting to break down now! The word ‘at’ is so ubiquitous in the English language that, with this regex run over random text, I ended up with so many false positives that the results became almost meaningless. An example of text that now produces a false positive is ‘I met my brother at nine. He was…’, which yields the email address ‘brother@nine.He’. At this point I called it a day; that was a lot of regex to work through!
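The false positive is easy to reproduce. The sketch below is in Python rather than C#, with hypothetical helper names `normalize` and `harvest`, and a pattern that writes ‘dot’ and ‘at’ as alternations. It runs the same pattern over the obfuscated address and over an ordinary sentence.

```python
import re

# Variant that also accepts the word 'at' in place of '@'.
AT_PATTERN = re.compile(
    r"\w+(?:\W*(?:[-+.]|dot)\W*\w+)*"   # local part
    r"\W*(?:@|at)\W*"                   # '@' or the word 'at'
    r"\w+(?:\W*(?:[-.]|dot)\W*\w+)*"    # domain labels
    r"\W*(?:\.|dot)\W*\w+"              # final separator and label
    r"(?:\W*(?:[-.]|dot)\W*\w+)*"       # any further labels
)

def normalize(match_text):
    """Rewrite ' at ' and ' dot' as symbols, then strip stray characters."""
    s = re.sub(r"\s+at\s+", "@", match_text)
    s = re.sub(r"\s+dot\s*", ".", s)
    return re.sub(r"[^\w@.+-]", "", s)

def harvest(text):
    return [normalize(m.group()) for m in AT_PATTERN.finditer(text)]

hit = harvest("webaddress dot blah at findme dotcom")
false_positive = harvest("I met my brother at nine. He was tired.")

print(hit)             # ['webaddress.blah@findme.com']
print(false_positive)  # ['brother@nine.He']
```

The pattern has no way to tell the two inputs apart, because both have the same shape: a word, ‘at’, a word, a period, and another word.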
So, I know that my regex could be written much better, and that would help with false positives. If I were getting paid to write these, they would definitely be more polished. However, one thing I did find is that if you’re really into using obfuscation to hide your email address, using ‘at’ instead of ‘@’ is probably one of the best techniques. Take note so you can protect yourself from the evil spambots. Here is a copy of the text that I was working with, if you want to try these techniques on your own.
