Friday, 24 October 2014

Web crawler tutorial

I had an interesting side-mission today, and thought it would be useful as both a web-ripping tutorial, and as a glimpse into day-to-day bioinformatics.

The biologists were using a website to look at data for a particular gene, and comparing patterns of other genes to find interactions and inhibitions. They brought me in to try and automate this manual process.

This fits a pretty common pattern:
  • Biologists find useful source of data.
  • Biologists look at known gene and interactions, and describe a pattern.
  • Bioinformaticians take that pattern, apply it genome wide (or across a big data set), and return a list of new candidates which seem to fit that pattern.
The first step is always to get the data, and in a form which supports further analysis. In this case there was no way to download the entire data set, so we needed to rip the webpages.

What do I need to do?

The graphs were generated in javascript, which means the data to generate them must have been avaialbe to the browser at some stage (but where?). My first guess was the page would make a JSON call to obtain the data, so I used chrome inspector (network tab) to view the requests. I didn't find anything, so I started looking at javascript and HTML files for Javascript arrays of data.

I found the data in the main HTML page:

 profile41949 = [
I tried entering multiple genes, but they were returned named and ordered according to some internal values, so figured it would be much easier to make 1 call per gene, and extract the only entry.

Get the data

There are plenty of ways to do this, but one of the simplest is to use "curl" or "wget". I put my gene list in a text file and ran a bash script:


while read p; do
 wget ${BASE_URL}${p} --output-document=${p}.html
  sleep 1;
done <${GENE_LIST}

I used a sleep to not DDOS the site, went to lunch and came back to find the html files on my disk.

While you could write a single script to download the files, extract the data and run the analysis, I think it's much better to break each of those steps into separate tasks, complete each and move on.

Firstly it's simpler that way, and secondly once you have completed each step, there is no need to re-run the entire pipeline and re-download things etc as you iterate towards a solution at the tail end of a pipeline.


I like working with CSV files, so I need to loop through each html file, extract the Javascript data and save it under the gene name:

for html_filename in args.html:
    json_data = extract_json_graph_data(html_filename)
    data = json.loads(json_data)
    cols = [y for (_, y, _) in data]
    gene_data[gene_name] = cols

df = pd.DataFrame.from_dict(gene_data)

I need to read through each HTML file, find the graph data, and return it as a string:

def extract_json_graph_data(html_filename):
    start_pattern = re.compile("\s+profile\d+ = \[")
    end_pattern = re.compile(".*\];")
    in_data = False

    data = []
    with open(html_filename) as f:
        for line in f:
            if not in_data:
                m = start_pattern.match(line)
                if m:
                    data.append("[") # Forget variable name
                    in_data = True
                m = end_pattern.match(line)
                if m:
    return ''.join(data)

I return it as a string, so I can load it into Python code with:

    data = json.loads(json_data)


From this point on all I needed to do was read the CSV, and produce a ranked list.

To experiment with different ranking methods, I drew the top 3 (co-activators) and bottom 3 (inhibitors) as green and red line graphs, so I could verify them by eye. I sent my initial results immediately, then using feedback and known positives, adjusted methods to find somethign that works.

I ended up using pearson and euclidean distance for genes vs gene of interest:

    for gene_name, data in df.iteritems():
        (r, _) = pearsonr(gene, data)
        pearson_df.loc[gene_name, column] = r 

        diff = gene - data
        euclidean_df.loc[gene_name, column] = sum(diff * diff)

Then write them out to two separate CSV files. I also wrote the sorted rank orders to a file, and their sum, and sorted by that to get the combination of the two. The methods put known genes in the right place, and produced some promising candidates for further investigation.

Victory conditions

Nothing here is particularly hard, and the lab-head could have pulled a PhD student aside and have them spend a day or two to do this manually. Bringing in a bioinformatician would ONLY be worth it if I was much faster and better than doing it by hand.

This gives straight forward winning conditions, of basically being fast.

The process isn't pretty: Bash scripts, pulling apart HTML with regular expressions, loading a string into Python data structures, eyeballing graphs to tune algorithms... but it was fast, so it was a success: A little bit of bioinformatician time saves a lot of biologist time, and gets them a bit closer to their discoveries. And isn't that what it's all about?

1 comment:

  1. The next time I needed to rip a webpage, I used the python library Beautiful Soup, which allowed me to pull the document rapidly apart using JQuery-like selects. It was really easy - sudo pip install beautifulsoup4