Matching elements in a file to a nested dictionary python


March 2019




Before I get into my question, I would like to provide you guys with what I have thus far.

First, I generated a nested dictionary from a file that I would like to use for comparison. An example of what my dictionary looks like is shown below (the real one is just larger):

Negdic = {'ADA': {'NM_000022': ['43248162', '43248939',
                                '43249658', '43251228',
                                '43251469', '43251647',
                                '43252842', '43254209',
                                '43255096', '43257687',
                                '43264867', '43280215', '']},
          'ALDOB': {'NM_000035': ['104182841', '104187124',
                                  '104187734', '104188836',
                                  '104189763', '104190750',
                                  '104192036', '104193057',
                                  '104197990', '']}}

Now this is where I'm struggling, since I'm unfamiliar with Python and new to programming. I would like to use a second file to search through my dictionary for matches. My file looks like this:

chrom   exon_start  exon_end    strand  isoform exon_numer  gene    coding_length   total_mutations_reported    total_exonic_mutations  exonic_splicing_mutations   total_splice_site_mutations 3_ss_mutations  5_ss_mutations
chr20   43255096    43255240    -   NM_000022   4   ADA 144 12  9   0   3   3   0
chr9    104187734   104187909   -   NM_000035   7   ALDOB   175 7   4   0   3   2   1

What I want to do is search through my dictionary for the gene name, then match the isoform name, and then lastly search through the corresponding isoform list for the exon_start and print the position in the list where the exon_start was found.

Here is some example code that I've been trying to work with, but I'm not sure if I'm on the right track.

for line in open("NegativeHotspot.txt"):
    columns = line.split('\t')
    if len(columns) >= 2:
        Hotspotgenes = columns[6]
        Hotspotgenes2 = Hotspotgenes.split()
        print Hotspotgenes2

#print Hotspotgenes2
#x = type(Hotspotgenes)
#print x
#for k in Hotspotgenes:
#    if k in Negdic:
#        print k, Negdic[k] 

The first part is something I've been trying to mess with to create a list of the genes in the file to search the dictionary for my results, but I'm struggling to even create a list from my output of columns[6]. Plus, I'm not even sure if I'm tackling my code in the best possible way. The last part of that coding section was something I was just messing with in an attempt to find a match in my dictionary.

Help would be greatly appreciated. I'm so lost :(

2 answers


You have a tab-separated value file, so you should use the module dedicated to delimited file formats, csv.

import csv

Your file also has a header row with meaningful names. row[header_name] is much easier to read than row[col_number], so let's use csv.DictReader:

with open("NegativeHotspot.txt") as f:
    reader = csv.DictReader(f, delimiter="\t")

Now we can iterate through each row of reader (still inside the with block, so the file stays open) and find the position you need using the list.index method:

    for row in reader:
        gene, isoform = row['gene'], row['isoform']
        count = Negdic[gene][isoform].index(row['exon_start'])

You don't say what you ultimately want to do with the count variable, but count is now the zero-based index at which exon_start occurs in the Negdic[gene][isoform] list.
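
Putting the pieces together, here is a minimal, self-contained sketch. It is written with Python 3 print() syntax, and it reads from an in-memory io.StringIO sample (with only the first seven columns) instead of NegativeHotspot.txt so that it runs on its own. The helper name find_exon_positions is hypothetical. It also catches the KeyError and ValueError raised when a gene, isoform, or exon_start is not found, which the snippet above does not handle.

```python
import csv
import io

# Copy of the example dictionary from the question.
Negdic = {
    'ADA': {'NM_000022': ['43248162', '43248939', '43249658', '43251228',
                          '43251469', '43251647', '43252842', '43254209',
                          '43255096', '43257687', '43264867', '43280215', '']},
    'ALDOB': {'NM_000035': ['104182841', '104187124', '104187734',
                            '104188836', '104189763', '104190750',
                            '104192036', '104193057', '104197990', '']},
}

# Stand-in for NegativeHotspot.txt (assumption: only the needed columns are kept).
SAMPLE = (
    "chrom\texon_start\texon_end\tstrand\tisoform\texon_numer\tgene\n"
    "chr20\t43255096\t43255240\t-\tNM_000022\t4\tADA\n"
    "chr9\t104187734\t104187909\t-\tNM_000035\t7\tALDOB\n"
)


def find_exon_positions(handle, lookup):
    """Return (gene, isoform, exon_start, index) tuples; index is None if not found."""
    results = []
    for row in csv.DictReader(handle, delimiter="\t"):
        gene, isoform, start = row['gene'], row['isoform'], row['exon_start']
        try:
            position = lookup[gene][isoform].index(start)
        except (KeyError, ValueError):
            # Unknown gene/isoform (KeyError) or exon_start not in the list (ValueError).
            position = None
        results.append((gene, isoform, start, position))
    return results


positions = find_exon_positions(io.StringIO(SAMPLE), Negdic)
for gene, isoform, start, position in positions:
    print(gene, isoform, start, position)
```

With the real file, you would pass the handle from open("NegativeHotspot.txt") in place of the io.StringIO object.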


I will try to put you on the right track and point out a couple of things that could be useful to you in the future. When opening a file, you are better off using the "with" statement, as it closes the file for you when you're done. So do something like this (Python 2, as in your code):

import csv

with open('eggs.csv', 'rb') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=' ', quotechar='|')
    for row in spamreader:
        print ', '.join(row)

What csv.reader gives you there is an iterator, which behaves much like a Python generator. Without going into details, bear in mind that you can iterate through such an object only once, unlike a list or a dictionary. So if you want to search for something else, you would need to read the whole file again. To avoid this, you could save your data in a more convenient object, such as a list of lists, where each row is a list and the whole file is a list of those lists.
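
The one-pass behaviour is easy to see in a short sketch. This is only an illustration, written in Python 3, using a made-up in-memory CSV (the second data row is invented) rather than a real file:

```python
import csv
import io

# Hypothetical sample data held in memory so the sketch runs on its own.
SAMPLE = "fruits,vegetables,cars\nbanana,cucumber,audi\napple,carrot,bmw\n"

reader = csv.reader(io.StringIO(SAMPLE))
first_pass = [row for row in reader]
second_pass = [row for row in reader]  # the reader is already exhausted here

# Materialise the rows once as a list of lists; a list can be re-read freely.
rows = list(csv.reader(io.StringIO(SAMPLE)))
header, data = rows[0], rows[1:]

print(len(first_pass), len(second_pass), len(data))
```

The second loop over reader yields nothing, while rows can be searched as many times as you like.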

You could then use the header row as keys and turn each row into a dictionary that you can index by column name. For example, given a CSV like this:

fruits, vegetables, cars
banana, cucumber, audi

One option is a list of dictionaries, where each row looks like {'fruits': 'banana', 'vegetables': 'cucumber', ...}. This is easier to index by name, though perhaps not as compact as the list of lists. Finally, I would recommend keeping in mind how each data structure performs in Big O terms, because it will make a difference if your data set is large.
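
As a rough sketch of the list-of-dictionaries idea (Python 3, with the sample CSV held in memory as an assumption), csv.DictReader can build those row dictionaries directly. The skipinitialspace option is used here because the sample has a space after each comma:

```python
import csv
import io

# The small CSV from above, kept in memory so the sketch is self-contained.
SAMPLE = "fruits, vegetables, cars\nbanana, cucumber, audi\n"

# skipinitialspace=True drops the blank that follows each comma,
# so the keys are 'vegetables' and 'cars' rather than ' vegetables' and ' cars'.
records = list(csv.DictReader(io.StringIO(SAMPLE), skipinitialspace=True))

row = dict(records[0])  # plain dict, whatever mapping type the reader returned
print(row['fruits'], row['cars'])
```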

The catch with dictionaries is that lookups by key are fast, but searching by value is not. If you want to find banana in the example above, you have to iterate through all the rows looking for the dict that has banana as a value.
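
One possible way around that scan, sketched below under the assumption that the values in the searched column are unique, is to build a second dictionary keyed by that column once and reuse it. The second record and the names find_by_fruit and by_fruit are made up for illustration:

```python
records = [
    {'fruits': 'banana', 'vegetables': 'cucumber', 'cars': 'audi'},
    {'fruits': 'apple', 'vegetables': 'carrot', 'cars': 'bmw'},  # invented row
]


def find_by_fruit(rows, fruit):
    """Linear scan: checks rows one by one until a match is found."""
    for row in rows:
        if row['fruits'] == fruit:
            return row
    return None


# One-off index keyed by the searched column (assumes each fruit appears once).
# After this, each lookup is an ordinary dictionary access by key.
by_fruit = {row['fruits']: row for row in records}

print(find_by_fruit(records, 'banana')['cars'])
print(by_fruit['banana']['cars'])
```

The same idea would apply to the exon lists in the question: a dictionary mapping each exon_start to its position could replace repeated list.index calls if the lists are long.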