Data Management using the Linux command prompt

Discussion in 'Noob Central' started by Jers81, Aug 29, 2011.

  1. Jers81

    Jers81 VIP

    Just thought I'd share a few useful commands for dealing with data from the Linux command prompt, for you Linux noobs. With a few commands and some basic scripting, you can do a lot with your lists right from the command line.

    Combine and Dedupe two files:
    #cat file1.txt file2.txt | sort | uniq > deduppedfile.txt
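
    The same combine-and-dedupe can be done in one step with sort's -u flag. A quick demo with two made-up sample files (the contents here are just for illustration):

    ```shell
    # Two small sample lists (hypothetical contents for the demo)
    printf 'a@example.com\nb@example.com\n' > file1.txt
    printf 'b@example.com\nc@example.com\n' > file2.txt

    # sort -u sorts and drops duplicate lines in one step,
    # equivalent to cat file1.txt file2.txt | sort | uniq
    sort -u file1.txt file2.txt > deduppedfile.txt
    ```

    One less process in the pipeline, same result.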

    Separate GI from the Majors in a file:
    Create a file with a list of the major domains you want to separate out, leaving in the @ symbol.
    For example:
    @yahoo.com
    @comcast.net
    @gmail.com
    @aol.com
    @verizon.net

    We'll call that file domains.txt. We have a list called list.txt and want to separate it into two files, list-majors.txt and list-GI.txt.
    Run this command (the -F makes grep treat each domain as a literal string rather than a regex, so the dots don't act as wildcards):
    #grep -F -f domains.txt list.txt > list-majors.txt && cat list.txt list-majors.txt | sort | uniq -u > list-GI.txt

    You'll now have two separate files, one with GI data and one with the majors (or whatever other domains you want to pull out). If you have a really large list of domains, this may not work or may take way too long. In that case, I've found it useful to use another method, looping through the list instead.
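
    One way such a loop might look (a sketch with made-up sample data, not necessarily the exact script Jers81 uses):

    ```shell
    # Sample list and domain file (hypothetical data for the demo)
    printf 'a@yahoo.com\nb@gmail.com\nc@smallisp.net\n' > list.txt
    printf '@yahoo.com\n@gmail.com\n' > domains.txt

    # Read the domain file one line at a time; grep -F treats each
    # domain as a literal string instead of a regex, so dots are safe
    > list-majors.txt
    while read -r dom; do
        grep -F "$dom" list.txt >> list-majors.txt
    done < domains.txt

    # Everything that didn't get pulled out is the GI file
    cat list.txt list-majors.txt | sort | uniq -u > list-GI.txt
    ```

    Slower per domain than one big grep -f, but it keeps memory use flat no matter how long domains.txt gets.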

    Another useful command is cut. It's great for pulling fields out of data files or parsing logfiles.

    For example, if you want to extract the email only from a full record list, do:
    #cut -d "," -f1 file.txt
    This assumes that a comma is your delimiter and the email is the first field.

    OR

    If you want to get a count of records by domain from that file, do:
    #cut -d "," -f1 file.txt | cut -d "@" -f2 | sort | uniq -c
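
    With a small made-up file.txt, those two cut pipelines behave like this (field layout here is a hypothetical email,name,zip):

    ```shell
    # Hypothetical three-field records: email,name,zip
    printf 'a@gmail.com,Al,90210\nb@gmail.com,Bo,10001\nc@aol.com,Cy,60601\n' > file.txt

    # Field 1 of each comma-separated record is the email
    cut -d "," -f1 file.txt

    # Take the email, keep the part after the @, then sort so that
    # uniq -c can count the adjacent duplicates per domain
    cut -d "," -f1 file.txt | cut -d "@" -f2 | sort | uniq -c
    ```

    The sort before uniq -c matters: uniq only counts adjacent duplicate lines.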
     
    Last edited: Aug 29, 2011
  2. chillbrah

    chillbrah VIP

    I do the first thing a lot, but I've never done the others. I rely on apps for that.
     
  3. nickphx

    nickphx VIP

    This will remove crap from emails, make separate files for each domain, and remove dupes from a list.

    Code:
    #!/usr/bin/perl
    # Usage: ./clean_split.pl list.txt
    # Strips junk characters, lowercases, splits into one file per domain,
    # and skips duplicate emails.
    my %seen;
    mkdir("./split_by_domain") unless -d "./split_by_domain";
    open(FH, "<", $ARGV[0]) or die "can't open $ARGV[0]: $!";
    while (<FH>) {
        s/[\r\n]//g;
        my $email = lc($_);    # lc() returns the lowercased copy; assign it
        $email =~ s/www\.//g;
        # strip the junk characters in one pass
        $email =~ s/[\\+`~\/',:;{}\[\]&!#\$%^*()|]//g;
    
        if ($email =~ /^[a-zA-Z]/) {
            my ($user, $dom) = split(/\@/, $email);
            open(OUT, ">>", "./split_by_domain/$dom");
            print OUT "$email\n" unless ($seen{$email}++);
            close(OUT);
        }
    }
    close(FH);
    
     
  4. bak3r

    bak3r New Member

    terminal/cli is a mailer's best friend
     
