I've got a huge md5 suppression file (3gb) and I want to compare it to my mail list and only pull out entries that are the same in each file, so I don't have to try inserting the entire md5 suppression file into my database. The suppression file is just each md5 on a line. The mail file is email,md5. I'm pretty sure this is a simple thing to do with grep, but not sure where to start. Any input is much appreciated.... Thanks!
I'd do it in the database and 3Gb file is not something I would consider "huge". 300 Gb sure. 30 Gb probably. Of course even with 3Gb you cannot be casual with your database design - any wrong move and you are screwed.
Here is one trick: File A: Suppression file consisting of either email addresses or MD5s File B: Your email list consisting of either email addresses of MD5s depending on A. Combine File A and File B into one File. Dedupe the combined file using linux commands. Any duplicates could be considered removable. sort "combined.txt" | uniq -d > "combined_duped.txt"
I ended up using the findstr command in windows, which seemed to work but took some time: FINDSTR /G:mail_list.TXT suppression.TXT > my_suppression.TXT
That does NOT work and I've seen/heard of this being done....if you combine the suppression list with your mailing list then addresses on the suppression list NOT in your mailing list WON'T be removed when you dedupe causing you to MAIL records from your SUPPRESSION file !!
FYI, md5 is case sensitive. Make sure your list is in all lowercase before generating your hashes to get more ideal results (assuming the person who generated the md5 hashes lowered the case of the existing list pre-md5)
Yeah, I thought about that and made sure my list was all lowercase. As for their data, that's out of my hands. Actually, so how does this work: if I get an MD5 suppression file, and if the advertiser did not convert it to all lowercase, I upload it to robomail. And if my list is lowercase, their list was a mix of upper and lower, then potentially any advertiser addresses that were/contained uppercase characters may NOT be suppressed? Seems like a faulty system to me.
yeah, md5 suppression is definitely flawed and no doubt misses a lot of addresses. using the raw emails is much more reliable, what everyone has done since the beginning of time, and i don't see it causing a global crisis, i hate this trend towards md5.
Yes, I can see how this would be an issue with md5. But this would be an issue using a db and md5 too. Stick with the raw emails....
Just a note about md5 emails, I am not giving an opinion of whether I like it or not, and many of you already know this ... But for those that do not, companies md5 their suppression lists so that a spammer can't get a hold of the suppression list, and try to mail it. There are programs that try to match emails to a list, but it basically requires a program to keep trying email addresses, converting them to md5, and seeing if that matches anything in the list making it not worth the effort. MD5 is a one way encryption, meaning it can't be decrypted, this is most popular with passwords in a database. Someone tries to log in, the system doesn't know the actual password, but will take the password the user enters in the form, md5 it, and match that against the md5 hash in the database. So if a hacker stole all the passwords in the database, they wouldn't be able to log in. Same goes for md5 suppression lists. The email program (mta) takes every email address it is about to mail, makes an md5 hash of it, and then looks to see if that hash is found in the suppression list. If a match is found, it knows to suppress it. So the suppression list doesn't directly have to list the email addresses in plain text, but still is able to tell the mta which emails to not mail to.
Things like that is exactly why I prefer to do processing in the database. There are just too many things you can screw up with flat file operations and quality control of your results is usually not obvious.
I agree with DK above, database queries will be far more reliable, forgiving and easier to work with. If you still choose to use flat files though, take from my earlier post above and use this logic in whatever language you are writing in (I write in php, I don't code in linux to give you a code example), but the logic in plain english is: Take your mailing list file (or pull down all records you are mailing from the database if you do it that way) and load it up in to an array. If your file is big or your dont have enough ram to do this, then only load 1 email address in to memory at a time. Loop thru each record you are mailing, before you push the email to the mta, md5 it and look for a match in the suppression file (you can grab the entire suppression file in an array if its isnt huge, 3g is way too big to do that however), so instead loop thru each line in the suppression file (only loading 1 line in to memory at a time), and check if it matches. If you find a match, exit the suppression file loop and skip that email address since it should be suppressed. If the supp loop ends with no match, the email can be sent. Also, what holyhearth said is completely true, brasom's reply is also true that since those extra emails are in md5 they won't be mailable, but think about the load those messed up addresses would put on your mta. Hopefully the mta will try to validate them as real addresses before mailing them (not safe to assume tho) and if they are invalidated by the mta, still puts a bigger load on the system than needs to be. Only send the deliverable emails to your mta, be kind to your server and your server will be kind back. And even tho its a md5 list, if you had a script doing that without you noticing, and then try a plain text supp file on another offer, bye bye ips.