Downloading PubChem Bioassays made easy

A simple script to download bioactivity data for small molecules

Before jumping into the code, it is important to understand how the PubChem data types relate to each other. Bioassays, labelled with an Assay ID (AID), contain a general description of the experiment (Description), the description of the results columns (Assay Result Field Type ID, or TID), and the bioactivity results themselves. Each data point is linked to a unique Substance, identified by the Substance ID (SID). SIDs are subsequently standardized to unique chemical structures, the Compounds, identified by the Compound ID (CID). Multiple SIDs can therefore share the same CID, but a single SID cannot be assigned to multiple CIDs. Importantly, we need the CID to obtain the SMILES string of each molecule (Wang et al., 2009; Kim et al., 2016).
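To make the many-to-one relation between SIDs and CIDs concrete, here is a minimal sketch. The identifiers are made up for illustration and the helper name group_sids_by_cid is hypothetical; neither comes from PubChem or from the script described in this post.

```python
from collections import defaultdict

# Made-up identifiers, for illustration only: two substances (SIDs)
# standardize to the same compound (CID 42), a third maps to CID 57.
sid_to_cid = {1001: 42, 1002: 42, 1003: 57}


def group_sids_by_cid(mapping):
    """Invert a SID -> CID mapping into CID -> list of SIDs."""
    grouped = defaultdict(list)
    for sid, cid in mapping.items():
        grouped[cid].append(sid)
    return dict(grouped)


sids_per_cid = group_sids_by_cid(sid_to_cid)
```

Each SID appears under exactly one CID, while a CID may collect several SIDs, which is why the results are keyed by SID and the structure is looked up through the CID.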

In practice, we first download the list of all SIDs tested in the bioassay using the PUG-REST API. On the one hand, we obtain the corresponding CID and canonical SMILES for each SID and, on the other hand, we retrieve all results associated with each SID. This has to be done in batches to avoid hitting the retrieval limit of a single request.
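The batching logic described above can be sketched roughly as follows. This is not the script from this post, only an illustration under assumptions: the helper names and the batch size in the usage note are hypothetical, and the URL patterns follow the PUG-REST conventions as I understand them, so check them against the current PubChem documentation before relying on them (the SMILES property name in particular may differ).

```python
import json
from urllib.request import urlopen

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def chunked(items, size):
    """Yield successive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def _join(ids):
    """Format a batch of numeric IDs as a comma-separated string."""
    return ",".join(str(i) for i in ids)


def sid_list_url(aid):
    """URL listing every SID tested in the assay `aid`."""
    return f"{PUG_REST}/assay/aid/{aid}/sids/JSON"


def sid_to_cid_url(sids):
    """URL mapping a batch of SIDs to their standardized CIDs."""
    return f"{PUG_REST}/substance/sid/{_join(sids)}/cids/JSON"


def smiles_url(cids):
    """URL returning SMILES for a batch of CIDs (property name is an assumption)."""
    return f"{PUG_REST}/compound/cid/{_join(cids)}/property/CanonicalSMILES/JSON"


def assay_data_url(aid, sids):
    """URL returning the assay results restricted to a batch of SIDs."""
    return f"{PUG_REST}/assay/aid/{aid}/JSON?sid={_join(sids)}"


def fetch_json(url):
    """Perform one GET request and decode the JSON body (needs network access)."""
    with urlopen(url) as response:
        return json.load(response)
```

A download loop would call fetch_json(sid_list_url(aid)) once, then iterate over chunked(sids, batch_size) and request sid_to_cid_url, smiles_url and assay_data_url for each batch. The right batch_size depends on PubChem's request limits, so treat any value you pick as something to verify rather than a given.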

Caption of the code snippet necessary to download a bioassay in JSON format

This will download a JSON file in the specified folder, under the name: “PUBCHEM1851.json”. This file is organized in three main objects:

Not all SIDs have necessarily been tested against all experimental conditions, so some TID fields may be empty.

JSON files are ideal for subsequent processing, but not very easy for humans to read. To convert the downloaded JSON file into a CSV table, follow the instructions:

In practice, we convert the dict-like format of the Data object in the JSON file to a CSV file where each row corresponds to one SID/CID/SMILES, and each TID corresponds to a column with the bioactivity results.
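A minimal sketch of this kind of conversion is shown below. The exact layout of the Data object is not reproduced in this post, so the record structure here (one dict per SID with a nested "results" dict keyed by TID) is an assumption, the function name records_to_csv is hypothetical, and the example values are made up.

```python
import csv
import io


def records_to_csv(records, out):
    """Write one row per SID/CID/SMILES and one column per TID."""
    # Collect every TID seen in any record so all rows share the same columns.
    tids = sorted({tid for rec in records for tid in rec["results"]})
    writer = csv.DictWriter(
        out,
        fieldnames=["SID", "CID", "SMILES", *tids],
        restval="",  # TIDs missing for a given SID become empty cells
    )
    writer.writeheader()
    for rec in records:
        row = {"SID": rec["SID"], "CID": rec["CID"], "SMILES": rec["SMILES"]}
        row.update(rec["results"])
        writer.writerow(row)


# Made-up records, for illustration only (assumed structure).
example_records = [
    {"SID": 101, "CID": 11, "SMILES": "CCO", "results": {"1": 0.5, "2": "active"}},
    {"SID": 102, "CID": 12, "SMILES": "c1ccccc1", "results": {"1": 1.2}},
]

buffer = io.StringIO()
records_to_csv(example_records, buffer)
csv_text = buffer.getvalue()
```

The second record has no value for TID 2, so its last cell is left empty, which mirrors the case where a SID was not tested under every experimental condition. To write to disk instead of a string, pass a file opened with open(path, "w", newline="") as the out argument.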

Caption of the code snippet necessary to convert a bioassay from JSON to CSV

The CSV file we have created has the following columns:

So, finally, with two simple commands you can download any desired bioassay from PubChem as an easy-to-read CSV file!

I hope you found this useful. Leave any questions in the comments, or open a GitHub issue if you encounter any problems. We will keep posting small hacks that make our lives easier when working with biomedical data, so stay tuned to our channel for more!
