The biologists were using a website to look at data for a particular gene, and comparing patterns of other genes to find interactions and inhibitions. They brought me in to try and automate this manual process.
This fits a pretty common pattern:
- Biologists find useful source of data.
- Biologists look at known gene and interactions, and describe a pattern.
- Bioinformaticians take that pattern, apply it genome wide (or across a big data set), and return a list of new candidates which seem to fit that pattern.
What do I need to do?
The graphs were generated in javascript, which means the data to generate them must have been avaialbe to the browser at some stage (but where?). My first guess was the page would make a JSON call to obtain the data, so I used chrome inspector (network tab) to view the requests. I didn't find anything, so I started looking at javascript and HTML files for Javascript arrays of data.I found the data in the main HTML page:
profile41949 = [
[0,1.31243,null],[1,1.26855,null],[2,2.08449,null],
[3,-1.09122,null],[4,0.408744,null],[5,0.159893,null],
[6,0.178368,null],[7,1.60982,null],[8,2.04564,null],
[9,6.64344,null],[10,8.78397,null],[11,5.50173,null],
[12,6.464,null],[13,7.20338,null],[14,5.66458,null],
[15,-1.02927,null],[16,-3.41265,null],[17,-4.97169,null],
[18,1.19087,null],
[19,-4.17216,null],[20,0.910848,null],[21,3.07987,null],
[22,-0.149686,null],[23,8.81042,null],[24,1.51925,null],
[25,-1.61159,null],[26,-1.73165,null],[27,1.70029,null],
[28,3.45068,null],[29,4.87374,null],[30,3.04062,null],
[31,5.19333,null],[32,5.38115,null],[33,2.54202,null],
[34,-6.46899,null],[35,-6.74378,null],[36,-5.09973,null],
[37,2.53315,null],
[38,-1.19128,null],[39,3.62427,null]
];
I tried entering multiple genes, but they were returned named and ordered according to some internal values, so figured it would be much easier to make 1 call per gene, and extract the only entry.Get the data
There are plenty of ways to do this, but one of the simplest is to use "curl" or "wget". I put my gene list in a text file and ran a bash script:
BASE_URL="http://endosomics.mpi-cbg.de/gws/gene/details/?aliases="
while read p; do
wget ${BASE_URL}${p} --output-document=${p}.html
sleep 1;
done <${GENE_LIST}
I used a sleep to not DDOS the site, went to lunch and came back to find the html files on my disk.
While you could write a single script to download the files, extract the data and run the analysis, I think it's much better to break each of those steps into separate tasks, complete each and move on.
Firstly it's simpler that way, and secondly once you have completed each step, there is no need to re-run the entire pipeline and re-download things etc as you iterate towards a solution at the tail end of a pipeline.
Extraction
I like working with CSV files, so I need to loop through each html file, extract the Javascript data and save it under the gene name:
for html_filename in args.html:
json_data = extract_json_graph_data(html_filename)
data = json.loads(json_data)
cols = [y for (_, y, _) in data]
gene_data[gene_name] = cols
df = pd.DataFrame.from_dict(gene_data)
df.to_csv("endosomics_data.csv")
I need to read through each HTML file, find the graph data, and return it as a string:
def extract_json_graph_data(html_filename):
start_pattern = re.compile("\s+profile\d+ = \[")
end_pattern = re.compile(".*\];")
in_data = False
data = []
with open(html_filename) as f:
for line in f:
if not in_data:
m = start_pattern.match(line)
if m:
data.append("[") # Forget variable name
in_data = True
else:
m = end_pattern.match(line)
if m:
data.append("]")
break
data.append(line)
return ''.join(data)
I return it as a string, so I can load it into Python code with:
data = json.loads(json_data)
Analysis
From this point on all I needed to do was read the CSV, and produce a ranked list.To experiment with different ranking methods, I drew the top 3 (co-activators) and bottom 3 (inhibitors) as green and red line graphs, so I could verify them by eye. I sent my initial results immediately, then using feedback and known positives, adjusted methods to find somethign that works.
I ended up using pearson and euclidean distance for genes vs gene of interest:
for gene_name, data in df.iteritems():
(r, _) = pearsonr(gene, data)
pearson_df.loc[gene_name, column] = r
diff = gene - data
euclidean_df.loc[gene_name, column] = sum(diff * diff)
Victory conditions
Nothing here is particularly hard, and the lab-head could have pulled a PhD student aside and have them spend a day or two to do this manually. Bringing in a bioinformatician would ONLY be worth it if I was much faster and better than doing it by hand.This gives straight forward winning conditions, of basically being fast.
The process isn't pretty: Bash scripts, pulling apart HTML with regular expressions, loading a string into Python data structures, eyeballing graphs to tune algorithms... but it was fast, so it was a success: A little bit of bioinformatician time saves a lot of biologist time, and gets them a bit closer to their discoveries. And isn't that what it's all about?
The next time I needed to rip a webpage, I used the python library Beautiful Soup, which allowed me to pull the document rapidly apart using JQuery-like selects. It was really easy - sudo pip install beautifulsoup4
ReplyDelete