A.I. & Optimization

Advanced Machine Learning, Data Mining, and Online Advertising Services

Mining Email Addresses



One interesting problem that we worked in the past was understanding how people generate their email addresses. This problem has applications in different areas like fraud detection and generating emails for cold email reachout.

Email Language

Let's asume you have a person's first name (fn), last name (ln) and email address. One way to think about this problem is to formalate a function f(.) taking "fn" and "ln" and generate email as output: email = f(fn, ln). Function f() is a language function transforming "fn" and "ln" to an email address.

If we express f(.) programmatically, we can approach the email language modeling as a pattern mining problem in which we generate an f(.) for each tuple (fn, ln, email) as email pattern. Next, we merge similar patterns and report the overall count as the pattern frequency.

How to Express f(.)?

Let's take my gmail address for example. My tuple is (fn, ln, em)=(kazem, jahanbakhsh, k.jahanbakhsh) where "k.jahanbakhsh" is the local part of my gmail. So, my email can be expressed as follows:


      em = f(fn, ln) = fn[1].ln
      

In words, my email is composed of the first character of "fn" appended by "dot" and "ln".

Predefined Regex Rules

One simple (this is not exhaustive) way to approach this problem is to rely on our intuition. People are likely to pick some part of their "fn" and "ln" and sometimes a number (e.g. birthday) and compose those pieces to create their email address. So, one way to express f(.)'s is by pre-defining likely patterns using the regular expression. Say we guess many people would take their "fn" and "ln" and append them to create their email. So, we can express such an f(.) using the following regex:


      re.match(value["fn"] + value["ln"] + '$', value["local_part"])
      

We came up with 12 predefined regex rules to mine one of our datasets and count frequent patterns. The list below shows some of the most frequent patterns we found in the dataset we mined:


      fn.ln --> 68
      fnln --> 52
      fn --> 47
      fnln\d+ --> 34
      fn[0]ln --> 30
      

"fn.ln" and i"fnln" appear as the most frequent patterns for emails. The pros for above approach is its simpilicity. However, the issue is its limitation of not being able to automatically generate all possible patterns.

Mining Emails Programmatically

In our second approach, we want to automate generating patterns. People are likely to generate their email addresses by picking some characters from their "fn" and "ln" and append them with some especial characters such as: ".", "_" and digits. Also, due to sequence in human memory people will pick characters from their "fn"/"ln" from left to right.

Based on above observations, we can come up with a simple greedy search strategy scanning "fn" and "ln" from left to right and try to generate the email address. Below you see a sample Python code for searching and generating a pattern by analyzing fn, ln, and email:


#generate pattern
def gen_pattern(fn, ln, em):
    #first drop all especial characters and numbers
    pattern = re.compile("[^a-z]")
    clean_em = pattern.sub('', em)
    clean_fn = pattern.sub('', fn)
    clean_ln = pattern.sub('', ln)
    flag = False
    for fn_len in xrange(len(clean_fn) + 1):
        for ln_len in xrange(len(clean_ln) + 1):
            if clean_fn[:fn_len] + clean_ln[:ln_len] == clean_em:
                extract "p1", "inter", "p2", "end" parts from (fn, ln, em)
                flag = True
            elif clean_ln[:ln_len] + clean_fn[:fn_len] == clean_em:
                extract "p2", "inter", "p1", "end" parts from (fn, ln, em)
                flag = True
            if flag:
                em_pattern = p1 + inter + p2 + end
                return em_pattern
      

After generating patterns, we combined and counted the frequency of each pattern. Below, see the list of most frequent patterns generated by a greedy search technique:


      ('fnln\d{4}', 4)
      ('fn.ln.', 4)
      ('fn\d{4}', 4)
      ('fnln\d{1}', 5)
      ('ln.fn', 5)
      ('ln', 6)
      ('fn[1]ln\d{3}', 6)
      ('fn[3]ln', 6)
      ('fn[1]ln\d{4}', 6)
      ('fn_ln', 7)
      ('fn.ln.\d{1}', 7)
      ('fn[1]ln\d{2}', 14)
      ('fnln\d{2}', 21)
      ('fn[1]ln', 31)
      ('fn', 50)
      ('fnln', 52)
      ('fn.ln', 70)
      

Above results show extracted patterns with their frequencies. This approach is more exhaustive than the first one. The most frequent pattern is 'fn.ln'. However, it's interesting to see fnln\d{2}' and 'fn[1]ln\d{4}' patterns when people probably attach the last two digits or their birthdate to their "fn" and "ln".

Last words

In this post, we described the interesting problem of exploring different ways that people employ to generate their email assresses by formulating the problem as a simple data mining problem. You can also apply the same technique to mine people social media usernames such as Facebook or Twitter.

Feel free to leave a comment below if have any suggestions/questions. You can also reach us at info@AIOptify.com .