linux peeps - how to compare md5 files?

Discussion in 'Mail Chat' started by debase, Sep 9, 2011.

  1. debase

    debase VIP

    Joined:
    Mar 31, 2011
    Messages:
    74
    Likes Received:
    1
    Trophy Points:
    0
    I've got a huge md5 suppression file (3gb) and I want to compare it to my mail list and only pull out entries that are the same in each file, so I don't have to try inserting the entire md5 suppression file into my database.

    The suppression file is just each md5 on a line. The mail file is email,md5. I'm pretty sure this is a simple thing to do with grep, but not sure where to start.

    Any input is much appreciated....

    Thanks!
     
  2. DKPMO

    DKPMO VIP

    Joined:
    Mar 31, 2011
    Messages:
    1,452
    Likes Received:
    68
    Trophy Points:
    48
    Location:
    Elaborate Underground Base
    I'd do it in the database and 3Gb file is not something I would consider "huge". 300 Gb sure. 30 Gb probably.

    Of course even with 3Gb you cannot be casual with your database design - any wrong move and you are screwed.
     
  3. kidbrando

    kidbrando VIP

    Joined:
    Apr 5, 2011
    Messages:
    87
    Likes Received:
    6
    Trophy Points:
    0
    Here is one trick:

    File A: Suppression file consisting of either email addresses or MD5s
    File B: Your email list consisting of either email addresses of MD5s depending on A.

    Combine File A and File B into one File.

    Dedupe the combined file using linux commands. Any duplicates could be considered removable.

    sort "combined.txt" | uniq -d > "combined_duped.txt"
     
  4. nickphx

    nickphx VIP

    Joined:
    Apr 2, 2011
    Messages:
    1,139
    Likes Received:
    363
    Trophy Points:
    83
    Gender:
    Male
    Location:
    guadalajara, chiuhuahua
    a quick and dirty perl script could get that done pretty easily.
     
  5. debase

    debase VIP

    Joined:
    Mar 31, 2011
    Messages:
    74
    Likes Received:
    1
    Trophy Points:
    0
    That absolutely works - thank you!

     
  6. debase

    debase VIP

    Joined:
    Mar 31, 2011
    Messages:
    74
    Likes Received:
    1
    Trophy Points:
    0
    I ended up using the findstr command in windows, which seemed to work but took some time:

    FINDSTR /G:mail_list.TXT suppression.TXT > my_suppression.TXT
     
  7. holyearth

    holyearth VIP

    Joined:
    Apr 16, 2011
    Messages:
    85
    Likes Received:
    2
    Trophy Points:
    0
    That does NOT work and I've seen/heard of this being done....if you combine the suppression list with your mailing list then addresses on the suppression list NOT in your mailing list WON'T be removed when you dedupe causing you to MAIL records from your SUPPRESSION file !!
     
  8. brasom

    brasom VIP

    Joined:
    Apr 12, 2011
    Messages:
    34
    Likes Received:
    6
    Trophy Points:
    8
    These are md5'd emails, meaning he'd only be able to decode his own emails anyways.
     
  9. dynamik

    dynamik Member

    Joined:
    May 5, 2011
    Messages:
    46
    Likes Received:
    2
    Trophy Points:
    8
    FYI, md5 is case sensitive. Make sure your list is in all lowercase before generating your hashes to get more ideal results (assuming the person who generated the md5 hashes lowered the case of the existing list pre-md5)
     
  10. debase

    debase VIP

    Joined:
    Mar 31, 2011
    Messages:
    74
    Likes Received:
    1
    Trophy Points:
    0
    Yeah, I thought about that and made sure my list was all lowercase. As for their data, that's out of my hands.

    Actually, so how does this work: if I get an MD5 suppression file, and if the advertiser did not convert it to all lowercase, I upload it to robomail. And if my list is lowercase, their list was a mix of upper and lower, then potentially any advertiser addresses that were/contained uppercase characters may NOT be suppressed?

    Seems like a faulty system to me.
     
  11. mx10

    mx10 VIP

    Joined:
    Mar 30, 2011
    Messages:
    535
    Likes Received:
    18
    Trophy Points:
    18
    yeah, md5 suppression is definitely flawed and no doubt misses a lot of addresses.

    using the raw emails is much more reliable, what everyone has done since the beginning of time, and i don't see it causing a global crisis, i hate this trend towards md5.
     
  12. kidbrando

    kidbrando VIP

    Joined:
    Apr 5, 2011
    Messages:
    87
    Likes Received:
    6
    Trophy Points:
    0
    Yes, I can see how this would be an issue with md5. But this would be an issue using a db and md5 too. Stick with the raw emails....
     
  13. afc_pmc

    afc_pmc New Member

    Joined:
    May 30, 2011
    Messages:
    192
    Likes Received:
    17
    Trophy Points:
    0
    Just a note about md5 emails, I am not giving an opinion of whether I like it or not, and many of you already know this ...

    But for those that do not, companies md5 their suppression lists so that a spammer can't get a hold of the suppression list, and try to mail it. There are programs that try to match emails to a list, but it basically requires a program to keep trying email addresses, converting them to md5, and seeing if that matches anything in the list making it not worth the effort.

    MD5 is a one way encryption, meaning it can't be decrypted, this is most popular with passwords in a database. Someone tries to log in, the system doesn't know the actual password, but will take the password the user enters in the form, md5 it, and match that against the md5 hash in the database. So if a hacker stole all the passwords in the database, they wouldn't be able to log in.

    Same goes for md5 suppression lists. The email program (mta) takes every email address it is about to mail, makes an md5 hash of it, and then looks to see if that hash is found in the suppression list. If a match is found, it knows to suppress it. So the suppression list doesn't directly have to list the email addresses in plain text, but still is able to tell the mta which emails to not mail to.
     
  14. DKPMO

    DKPMO VIP

    Joined:
    Mar 31, 2011
    Messages:
    1,452
    Likes Received:
    68
    Trophy Points:
    48
    Location:
    Elaborate Underground Base
    Things like that is exactly why I prefer to do processing in the database. There are just too many things you can screw up with flat file operations and quality control of your results is usually not obvious.
     
  15. afc_pmc

    afc_pmc New Member

    Joined:
    May 30, 2011
    Messages:
    192
    Likes Received:
    17
    Trophy Points:
    0
    I agree with DK above, database queries will be far more reliable, forgiving and easier to work with.

    If you still choose to use flat files though, take from my earlier post above and use this logic in whatever language you are writing in (I write in php, I don't code in linux to give you a code example), but the logic in plain english is:

    Take your mailing list file (or pull down all records you are mailing from the database if you do it that way) and load it up in to an array. If your file is big or your dont have enough ram to do this, then only load 1 email address in to memory at a time.

    Loop thru each record you are mailing, before you push the email to the mta, md5 it and look for a match in the suppression file (you can grab the entire suppression file in an array if its isnt huge, 3g is way too big to do that however), so instead loop thru each line in the suppression file (only loading 1 line in to memory at a time), and check if it matches. If you find a match, exit the suppression file loop and skip that email address since it should be suppressed. If the supp loop ends with no match, the email can be sent.

    Also, what holyhearth said is completely true, brasom's reply is also true that since those extra emails are in md5 they won't be mailable, but think about the load those messed up addresses would put on your mta. Hopefully the mta will try to validate them as real addresses before mailing them (not safe to assume tho) and if they are invalidated by the mta, still puts a bigger load on the system than needs to be.

    Only send the deliverable emails to your mta, be kind to your server and your server will be kind back. And even tho its a
    md5 list, if you had a script doing that without you noticing, and then try a plain text supp file on another offer, bye bye ips.
     

Share This Page