Just thought I'd share a few useful commands for dealing with data from the Linux command prompt for you Linux noobs. With a few commands and some basic scripting, you can do a lot with your lists right from the command prompt.

Combine and dedupe two files:

#cat file1.txt file2.txt | sort | uniq > deduppedfile.txt

Separate GI from the majors in a file:

Create a file with a list of the major domains you want to separate out, leaving in the @ symbol. For example:

@yahoo.com
@comcast.net
@gmail.com
@aol.com
@verizon.net

We'll call that file domains.txt. We have a list called list.txt and want to separate that into two files, list-majors.txt and list-GI.txt. Run this command:

#grep -F -f domains.txt list.txt > list-majors.txt && cat list.txt list-majors.txt | sort | uniq -u > list-GI.txt

The -F makes grep treat each line of domains.txt as a literal string, so the dots aren't read as regex wildcards. The uniq -u step keeps only lines that show up exactly once in the combined output, so list.txt needs to be deduped first or any duplicated GI records will be dropped.

You'll now have two separate files, one with GI data and one with the majors (or whatever other domains you may want to pull out). If you have a really large list of domains, this may not work, or may take way too long. In that case, I've found it useful to use another method, using a loop to read through the list.

Another useful command is cut. It's very handy for dealing with data files or parsing logfiles. For example, if you want to extract only the email from a full record list, do:

#cut -d "," -f1 file.txt

This assumes that a comma is your delimiter and the email is the first field.

Or, if you want to get a count of records by domain from that file, do:

#cut -d "," -f1 file.txt | cut -d "@" -f2 | sort | uniq -c
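The post mentions a loop-based method for large domain lists but doesn't show it. Here is one possible sketch, not necessarily what the author used: awk loads domains.txt into a lookup table, then reads the list one record at a time and writes each record to the majors or GI file. The function name split_majors is made up for illustration. It assumes domains.txt is non-empty, the email is the first comma-separated field, and matching should ignore case.

```shell
#!/bin/sh
# split_majors DOMAINS LIST MAJORS_OUT GI_OUT
# (hypothetical helper name; not from the original post)
split_majors() {
    # Start with empty output files so both exist even if one gets no records.
    : > "$3"
    : > "$4"
    awk -F, -v majors="$3" -v gi="$4" '
        # First file (domains.txt): store each "@domain" line, lowercased, as a lookup key.
        NR == FNR { if ($0 != "") wanted[tolower($0)] = 1; next }
        # Second file (the list): look up everything from the last "@" of field 1.
        {
            key = match($1, /@[^@]*$/) ? tolower(substr($1, RSTART)) : ""
            print > ((key in wanted) ? majors : gi)
        }
    ' "$1" "$2"
}

# Example call, using the file names from the post:
#   split_majors domains.txt list.txt list-majors.txt list-GI.txt
```

Unlike the grep version, this matches the whole domain exactly, so @yahoo.com will not also catch @yahoo.com.br, and it keeps duplicate records instead of dropping them.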
This will remove junk characters from emails, make a separate file for each domain, and remove dupes from a list. Pass the list file as the first argument; output goes into ./split_by_domain/.

Code:
#!/usr/bin/perl
use strict;
use warnings;

my %seen;
my $dir = "./split_by_domain";
die "usage: $0 listfile\n" unless @ARGV;
-d $dir or mkdir $dir or die "Can't create $dir: $!";   # output dir must exist
open(my $fh, '<', $ARGV[0]) or die "Can't open $ARGV[0]: $!";
while (my $email = <$fh>) {
    $email =~ s/[\r\n]//g;                        # strip line endings
    $email = lc($email);                          # lc() returns its result, so assign it
    $email =~ s/www\.//g;
    $email =~ tr/\\+`~\/',:;{}[]&!#$%^*()|//d;    # delete junk characters
    next unless $email =~ /^[a-z]/;               # must start with a letter
    my ($user, $domain) = split(/\@/, $email);
    next unless defined $domain && length $domain;   # skip lines with no domain
    next if $seen{$email}++;                      # skip dupes
    open(my $out, '>>', "$dir/$domain")
        or do { warn "Can't open $dir/$domain: $!\n"; next };
    print $out "$email\n";
    close($out);
}
close($fh);
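For comparison, the split-and-dedupe part of that script can be roughly sketched with the same command-line tools as the first post. This is only an approximation under stated assumptions: it lowercases and dedupes with tr and sort -u, then lets awk append each address to a file named after its domain. It does not strip the junk characters the Perl script removes, and it skips any line that doesn't have exactly one @. The function name split_by_domain and its two arguments (list file, output directory) are made up for illustration.

```shell
#!/bin/sh
# split_by_domain LIST OUTDIR
# (hypothetical helper name; a rough shell counterpart, not the author's script)
split_by_domain() {
    mkdir -p "$2"
    # Drop carriage returns, lowercase everything, then sort and dedupe.
    tr -d '\r' < "$1" | tr '[:upper:]' '[:lower:]' | sort -u |
    awk -F@ -v dir="$2" '
        # Keep only lines with exactly one "@" and a domain that has no slash.
        NF == 2 && $2 != "" && $2 !~ /\// {
            out = dir "/" $2
            print >> out
            close(out)   # close each time so many domains do not exhaust open files
        }
    '
}

# Example call:
#   split_by_domain list.txt split_by_domain
```

Like the Perl script, this appends to the per-domain files, so running it twice on the same list will write every address twice.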