Thursday, May 25, 2017

Elegant overview of the proteome universe!


As you can tell from a blog post or two, I'm not a fantastic writer. I'd like to think I'm okay when I'm not in a hurry, but these authors can write!


If the title of this awesome Open paper didn't draw you in, I'm gonna guess you didn't grow up wanting to be this guy...


Once you get past a title that great -- this isn't fluff. This is one of the single best reviews of why biological complexity and the proteome are so intrinsically intertwined that I have ever read. It is easy to forget the expectations we once had for what doors the Human Genome Project would open into our understanding. What it opened was a door that showed how little we currently understand.

Look, even if I wasn't being lazy with the blog today, I can't read this paper over my coffee and do any aspect of it justice. If you want to read a great perspective paper on how far we've come -- and how amazingly far we still have to go, I can't recommend anything more.

Wednesday, May 24, 2017

Is de novo sequencing already a viable alternative to database searches?


If you look you might see some ravings on this blog regarding the DeNovoGUI, another incredible free resource out of the CompOmics group.

If you're interested, the original paper is here.

Since that paper came out the DeNovoGUI has expanded and incorporated more algorithms. When I booted my copy today it told me there were new updates as well! (Downloading now)

We all know de novo searching algorithms are out there. I know more and more labs that are using PEAKS as their primary software -- meaning PEAKS has come a long way! I think the consensus maybe 5-6 years ago was -- yeah, it was a nice tool, but it was your fallback plan if you didn't find what you were looking for with database tools.

As a sign -- and thorough measurement of this possible shift, check out this new paper from Thilo Muth and Bernhard Renard (the latter is a fun name to say! try it 3 times fast!)


The question they set out to answer -- are the de novo algorithms, right now, a good alternative to the database tools?

To test it they get 4 publicly deposited datasets. All were generated on Orbitraps. 3 are high/low (Orbitrap for MS1 and ion trap for MS/MS) and one is high/high. Yeast, human, mouse, and some weird thing -- oh, it's an extremophile! cool! Pyrococcus furiosus. My last Latin class was {redacted} years ago, but I'm pretty sure its name means something like -- "we found this tube shaped thing growing in an active volcano" I may need to check these RAW data out later!

For comparison they use PEAKS, Novor and PepNovo -- 2 of which can just be run in the DeNovoGUI (but they may have run them some other way, I didn't check).

To establish their working base, all the data was searched with MS-GF+ and X!Tandem. I'm a little fuzzy on the details (honestly, I skimmed a little...big day ahead!), but I think they took the peptide spectral matches that both engines agreed upon.
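If you want to picture that "both engines agreed" step, here is a minimal Python sketch. It assumes each engine's output has already been boiled down to a dict of spectrum ID -> peptide sequence; the function name and the toy data are made up for illustration, and the paper's actual filtering may well differ.

```python
def consensus_psms(engine_a, engine_b):
    """Keep only spectra where both engines report the same peptide.

    Each argument is a dict mapping a spectrum ID to a peptide sequence
    (a hypothetical, simplified stand-in for real search engine output).
    """
    return {
        spectrum: peptide
        for spectrum, peptide in engine_a.items()
        if engine_b.get(spectrum) == peptide
    }


# Toy results, invented for illustration only
msgf = {"scan1": "PEPTIDEK", "scan2": "SAMPLER", "scan3": "ELVISK"}
tandem = {"scan1": "PEPTIDEK", "scan2": "SIMPLER", "scan4": "LIVESK"}

agreed = consensus_psms(msgf, tandem)
print(agreed)
```

Only the spectrum that both engines assign to the same sequence survives; spectra seen by just one engine, or assigned differently, are dropped.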

There is a TON to be learned from this paper -- including some really interesting info on what peptide sequences modern de novo engines have the most trouble with, which ones scale the best (more processors meaning much more performance), etc., etc.

But check this out. Oxford Bioinformatics -- I love your Journal and only recently discovered what a treasure trove it is (thanks @PastelBio!). If it is a problem I used one of the images, please email me (orsburn@vt.edu) to receive my apology and instant removal, but this is an awesome chart!


Again, I'm not 100% on the metrics here -- but this looks pretty darned good, right? I found PepNovo really surprising. I've used it a lot over the years and it was the main reason I started using the DeNovoGUI (cause I did just have an old PC that only ran this program!), but I use PepNovo+ and I don't think that these authors did....

Ignoring this, PEAKS and Novor did REALLY well!  Even SequestHT and Mascot disagree on correct matches by 10%-15% or so (crude numbers from long ago when I still had access to both -- don't hold me to them). 60-70% sounds pretty darned good -- given NO database!
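Since I admitted I'm not 100% on the metrics, here is one simple way a number like that could be computed, as a rough Python sketch. The assumptions are mine, not the paper's: a de novo call counts as correct only if the full sequence matches the reference PSM, and isoleucine and leucine are treated as the same residue because they have the same mass. All names and data are invented.

```python
def normalize(seq):
    """Treat isoleucine and leucine as the same residue (same mass)."""
    return seq.upper().replace("I", "L")


def agreement_rate(de_novo, reference):
    """Fraction of reference spectra whose de novo call matches exactly.

    Both arguments are dicts of spectrum ID -> peptide sequence.
    Spectra with no de novo call count as misses.
    """
    if not reference:
        return 0.0
    hits = sum(
        1
        for spectrum, peptide in reference.items()
        if spectrum in de_novo
        and normalize(de_novo[spectrum]) == normalize(peptide)
    )
    return hits / len(reference)


# Toy data, invented for illustration only
reference = {"s1": "PEPTIDEK", "s2": "SAMPLER", "s3": "ELVISK", "s4": "LIVESK"}
de_novo = {"s1": "PEPTLDEK", "s2": "SAMPLER", "s3": "EVLISK"}

rate = agreement_rate(de_novo, reference)
print(f"{rate:.0%} of reference PSMs reproduced")
```

A stricter or looser definition (partial sequence matches, for example) would move the number, which is worth remembering when comparing percentages across papers.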

The best data -- I won't show here -- is the crazy Volcano creature -- in an example where we don't have a good database to use the classical engines with (I am imagining them trying to kill this creature to get its DNA out. After years of failure by every international team, President Michelle reveals the truth she's known all along -- that a partially completed deathray had been abandoned in a secret facility in Siberia at the end of the Cold War. An international treaty is established and 2 scientists from each country are selected to work on the team and to complete work on the project. By diverting all the electrical consumption of Europe for 2 weeks (almost 4 minutes worth for NYC or Vegas) they can accumulate enough power to fire the completed deathray one time. In doing so this will also destroy the device and all the resources required to ever build another, but they know it is the only chance they will ever have -- and fire the ray, finally breaking the Pyrococcus cell wall.  Their hopes are shattered, however, when they find the process shreds the DNA completely, leaving only the proteins(??) intact...leaving us right where we started...) and our only choice is to use de novo tools, the comparison between engines is really interesting -- and maybe the most pertinent.

In summary -- maybe the current generation de novo algorithms aren't 100% ready to replace our current database-driven tools, but WOW have they ever gotten good!

Tuesday, May 23, 2017

Two new studies reveal some of the inner workings of toxoplasmosis!


Toxoplasmosis is an absolutely fascinating and terrifying disease. It has been a big topic in the popular science realm -- due to how weird it is.  Here is an NPR article on the potential link between this disease and mental illness. Most interesting to the mainstream media has been odd (and controversial?) links between subtle human behaviors of people infected with this weird thing. Like this from Scientific American, which isn't the weirdest one I've heard about.

Toxoplasma gondii (Tg) is kind of a mystery, though. It has a 69Mb genome and a ridiculously complex life cycle....

Two new studies took completely different approaches to try and figure out how this weird little thing can do all the stuff that it can.

I took the picture at the top of this post from the newest one, which is in this month's Elsevier JOP and which you can find here.


This is an interesting paper for a couple of reasons. The first being that -- Tg can infect just about any cell with a nucleus. So...if you've got a bunch of gerbils around, might as well see what it does to that one, right? Maybe seeing the effects that infection has on different organisms will help reveal some new information?

Very minor note -- the authors state a 5600 TripleTOF was used for this work and mention features in terms that the 5600 has. The resolution settings and the fact that the output was .RAW indicate, however, that this was performed on a Q Exactive. Just a little mixup in the methods section that I could have figured out a little quicker if the data had been publicly posted.

They find a fascinating number of differential proteins. Apparently Tg goes crazy in a gerbil brain with big changes in proteins linked to oxidative stress response and others! They check these with RT-PCR and Western and it looks pretty convincing -- and all sorts of terrifying!

The second study -- nuts, I'm running out of time this morning -- is this one in MCP.


I don't really have time to do this one justice at all. In a nutshell, they track a methylation on Arginine that contributes to how this thing regulates itself! They make some mutations to verify that this is involved. Interestingly, there appears to be a lot of crosstalk between this methylation event and phosphorylation. The methylations were found in a previous study with phospho-enrichment.

The downstream analysis is really convincing that this is a major component of parasite control. You don't have to do too much classic genetics + high resolution proteomics to convince me you're on the right track!  If you are interested in this (or related) parasites, this one is a gold mine!  What else controls itself with this mechanism!?!?

 These were both really interesting reads inspired by my fear of anything making alterations to my poor glitchy brain and by the tiny natural reservoir of the parasite I found lost and dehydrated in the woods this weekend....


...after thorough review Isaiah Tomcat appears to have been accepted into the pack...with the final requirement being that he had to promise that he wouldn't infect anyone with brain parasites!

Monday, May 22, 2017

New York GC Hackathon for proteomics!! Applications due today by 5pm!


Are you a bioinformatician -- or training to become one? NCBI and New York Genome Center are having an awesome hackathon June 19-June 21.

I just heard about this this morning -- and applications are due by 5pm TODAY!

You can check out the NCBI posting here! 

Interact with great teams, finally get that great idea you have out of your head, and start making an impact!

neXtprot -- Fast new peptide uniqueness checker!


Hey! I just found this cool peptide that is upregulated 11-fold in all these patients with this condition. It has the full y ion spread and looks great!

Want to instantly check out whether it is unique to your organism (or something that shouldn't be there)?  Mathieu Schaeffer et al., just provided you the easiest and fastest way I've seen yet!  You can read about it in this open paper here (it's 2 pages!)

Or you can go to neXtprot and type your peptide sequence into the box!

It has to be at least 6 amino acids long -- and don't put spaces in it, because it will think that whatever follows a space is the next peptide (which means it can take a bunch of peptides at once -- up to 1,000!)

If you do it right, your output looks something like this:
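If you are about to paste in a big batch (de novo results, say), a quick pre-check can save a failed submission. Below is a small Python sketch of the rules above. The 6-residue minimum and the 1,000-peptide cap come from this post; restricting input to the 20 standard one-letter codes is my own assumption, and the function name is made up.

```python
MIN_LENGTH = 6       # shortest peptide the checker accepts, per the post
MAX_PEPTIDES = 1000  # most peptides per submission, per the post
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # assumption: 20 standard residues only


def clean_peptides(raw_peptides):
    """Return peptides that are ready to paste into the checker.

    Strips internal whitespace, uppercases, and drops anything that is
    too short or contains letters outside the standard 20 residues.
    """
    cleaned = []
    for raw in raw_peptides:
        peptide = "".join(raw.split()).upper()
        if len(peptide) >= MIN_LENGTH and set(peptide) <= AMINO_ACIDS:
            cleaned.append(peptide)
    if len(cleaned) > MAX_PEPTIDES:
        raise ValueError(
            f"{len(cleaned)} peptides; split into batches of {MAX_PEPTIDES}"
        )
    return cleaned


# Invented examples: one with a stray space, one too short, one with an X
peptides = clean_peptides(["pep tidek", "ACDK", "SAMPLER", "ELVISKX"])
print(" ".join(peptides))  # one space between peptides, none inside them
```

Silently dropping bad entries is a design choice for the sketch; for real work you would probably want to log what was rejected and why.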


My sequence is completely unique to just one protein from one species.

You can download the output as a .csv file. Probably not that useful if you've only entered 1 peptide sequence, but if you are dumping in your de novo results....this could be invaluable!

Sunday, May 21, 2017

ProteoSign -- Powerful, easy statistics for Proteome Discoverer and MaxQuant!


Have you looked at your output report from Proteome Discoverer -- or even MaxQuant and said something like "Wow, it would be awesome if I could easily get some super advanced statistics on all this quan without having to work very hard?"

If so -- I've got GREAT news for you -- and it's called ProteoSign! You can check it out in this nice open paper here.

These authors want to supply you with great differential statistics in a fast, simple and free web interface. They set up a nice little server online somewhere that you can access directly here.

At this point, ProteoSign appears set up for supporting PD 1.4 and a couple versions of MaxQuant only -- but -- since it is taking the text file output from PD -- I think that it would be able to take data from the new versions as well -- heck, I think if you matched the formatting ProteoSign requires you could put in data from any proteomic software with quantification (but I haven't tried yet).

If you are someone like me who loves to just start dumping data into a program before you read any instructions at all -- you'll be very impressed with the speed of the server interface -- you can make a lot of very large and embarrassingly uninformed mistakes about how the whole thing works. If you are persistent with this strategy (there is some value to pressure testing an online resource, right? Please don't take this as my encouragement to not read the instructions. I was just really excited to try it out!) you can accidentally hover over very nice instructions that will tell you what you should be doing that will make you feel dumb and teach you how to use the software at the same time!



Some statistical tools like ANOVA and PCA and volcano plots are coming to PD 2.2, but if you are using PD 1.4 and want to keep using it -- here are those tools. There are features in ProteoSign that don't have analogues in the upcoming PD version, such as the (really cool) replicate scatterplot as well.

I really appreciate the work the authors put in here. They looked at two pieces of software with thousands of users around the world and thought -- "hey, let's add a bunch of tools to them and make it really easy for users to get and use those tools!" Stuff like this explains: 1) Why this is the coolest field in all of science 2) Why this field continues to advance at the amazing pace that it is.

Disclaimer: Yes, I know there are several downstream statistical tools for MaxQuant. The emphasis of this post was -- great statistics without having to really learn anything!

Saturday, May 20, 2017

How to view or open Proteome Discoverer results or files

This question pops up a lot. This post is really here to help people who might ask Google the question in the title: How do you open or view Proteome Discoverer results or files?  (Does it help the crawler if the words are in the text as well?)

The common scenario:

Your core lab or collaborator ran some samples for you on that mass spec thing and provided you with a spreadsheet of results. After evaluating this you find some really interesting results and you need to explore this protein or pathway further. Your pathway of interest is upregulated. Is the key post translational modification present? Is the shortened proteoform there as well? General proteomics relies on upfront information. Searching every post-translation mod or sequence variation can explode the computational time and that data might very well be there, but more focused searching or filtering will probably be necessary to find it.

You can communicate all this back to the lab and get back in the data processing queue OR you can do it yourself. If they processed the data in Proteome Discoverer you can hike over with a thumb drive, get the file and look more into the data yourself with the free Proteome Discoverer Viewer.

You can get the Proteome Discoverer Viewer from two different places.

1) The Thermo-Omics Portal (https://portal.thermo-brims.com/)
2) The Thermo Flex Net (https://thermo.flexnetoperations.com/control/thmo/login)

You'll need to register at one of these, wait for your approval and then download Proteome Discoverer (PD)

Once inside you have some choices. You can follow 2 different strategies:

1) Get the version that the mass spec nerds processed your data in.
2) Get the newest version -- because new PD will always open old PD results.

My suggestion is #2. As of the writing of this post, this will be PD 2.1.  PD 2.2 should be out in the middle of 2017.

The newer versions of PD provide a lot more power to you, the operator of the viewer. There are specific PD viewer keys, but the best thing to do is to just go ahead and install the 60 day demo key.  For 60 days you have pure, unstoppable power and can do anything with your data that you want. At the end of the 60 days most of the searching features will expire, leaving you with the Viewer functions -- basically the "Consensus" workflow nodes and the ability to open files and search through them.

I recommend the demo key version for another important reason. During these 60 days you can add free nodes to PD created by amazing external groups like OpenMS and IMP. IMP has a whole suite of nodes that will continue to function after the demo key expires. It provides a ton of capabilities to the software -- the ability to search with MSAmanda, label free quan with PeakJuggler, differential statistics and a really advanced node for reporter ion (TMT/iTRAQ) experiments.


The IMP PD-nodes (I sometimes call this the PD free version) are really incredible and can be found and installed at pd-nodes.org (under the IMP nodes collection).



The free OpenMS community nodes are also fantastic and can be found at http://www.openms.de/. While this software is often associated with label free quan only, this is not the case. There are some functions here that are truly unique, including a node for Protein-RNA crosslink detection. Go around any big genomics department and I bet at least someone there is looking at this problem using DNA based tools. With the right sample prep and this software you can approach this from an alternative and complementary direction.

If you think you might ever want to search proteomics data on your own -- I highly recommend you take the ten minutes to install at least the IMP PD nodes and it won't hurt to add OpenMS to your free user interface!

Other software exists as well -- and more is coming -- I don't mean to slight anyone I didn't mention here. The point of this post is for people outside of proteomics --If you paid for proteomics samples or are collaborating with a mass spectrometrist there are sometimes communication gaps. This isn't due to any shortcomings on either side. It is just that 14 years of training in analytical chemistry for the mass spectrometrist didn't leave time to become a specialist in your biological field -- and you probably don't realize what level of expertise and jargon complexity you are using. If this gap is hindering your progress you have very nice free tools to open, view, filter and even re-interrogate those beautiful, dense, and probably foreign (and crazy looking) data files yourself.

If you did download the newest PD and your data was originally processed in PD 1.4, I made a short walkthrough here that shows you how to open these results.

To you computational nerds, I would also like to mention that PD result files are simple SQLite files with a swapped suffix. You can open them with any database tools you are comfortable with that support this file type.
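For the curious, here is a minimal Python sketch of what that looks like using only the standard library. It just lists the tables in a SQLite file regardless of its suffix. The file path in the usage line is a placeholder, and I am not assuming anything about which tables a PD result file actually contains.

```python
import sqlite3


def list_tables(path):
    """Return the table names in a SQLite file, whatever its suffix."""
    connection = sqlite3.connect(path)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        connection.close()
    return [name for (name,) in rows]


# Usage (placeholder path; point it at a copy of your own result file):
# print(list_tables("path/to/your_pd_result_file"))
```

Two cautions: work on a copy rather than the original, and double check the path, because `sqlite3.connect` will quietly create a new empty file if the one you named does not exist.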

Friday, May 19, 2017

Global proteome-scale crosslinking! Thousands of protein-protein interactions in one go!


Ever tried to crosslink some proteins, digest them and figure out what proteins were interacting with what (whom..?)

If you have, you should probably understand why I've been so excited about these new reagents, instrument methods and data processing software that make this way easier!! And the first paper I know about that uses this is now out!


I've talked about the MS-cleavable reagent they use on the blog previously as well as the data processing workflows. We saw some applications at ASMS, but this paper does true global scale work!

They crosslink an entire E.coli proteome as well as a human cell line and pull out information on thousands of protein interactions! They do reduce the complexity by SCX fractionation -- but thousands of protein-protein interactions in one experiment?!?!  Come on -- if you've pulled this off before, you are way better at this than me!

Have you been down this road before and are skeptical? Don't worry, I don't blame you.

Maybe the awesome people at the Heck lab made XlinkX 2.0 a publicly available web Application that you can just go and use here.

Maybe they also put the example files from the E.coli (converted to MGF) available at this awesome site as well so you can check it out!

This team has been actively collaborating to bring the XlinkX 2.0 code into Proteome Discoverer as additional nodes if you want to search data like this within a framework you already know!

Thursday, May 18, 2017

Cross-sectional analysis of the salivary proteome and patient associations!


This is really cool and I think it might be foreshadowing of what part of the proteomics field may look like in the future!  You can check it out ASAP at JPR here. 


What is it? They took saliva from a bunch of people (almost 200, I think). They digested the proteins and did single shot proteomics on the peptides with an Orbitrap Velos with label free quantification. All normal stuff.

Where it gets really interesting is in the downstream analysis. They took all the info that they had about these participants they got the saliva proteomes from -- including the data that you can see in the title above -- and did fancy statistics based on the proteins identified and their relative quantification.

The genomics people are doing TONS of stuff like this. I bet you've heard of the GWAS stuff (Wikipedia article here). In these studies they take a snapshot of the genes, typically via SNP arrays or low read genome sequencing, of a bunch of people. In the simplest example, they do this on a group of people without a disease and a group with a disease and they try to figure out what areas of the genome associate with the disease. In the bigger and more ambitious studies, they just collect lots of info about people so they can separate them into classifications and then get genomic information on every participant they can afford.

This cool study is a page out of this, but instead of getting a picture of the area in the genome that there might be more copies of (which...might be transcribed...and that part might be translated...) they cut out the middle steps and go right to the proteins!

What do they find? Protein expression that strongly associates with some of these characteristics mentioned in the paper title! For example, 30 proteins can be associated with the saliva donor's BMI!
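To make "associates with" a bit more concrete, here is a toy Python sketch that correlates each protein's intensity with donor BMI. To be clear, this is a stand-in: I don't know which statistics the authors used, plain Pearson correlation is my own pick for illustration, and every number below is invented.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant list")
    return cov / math.sqrt(var_x * var_y)


# Invented numbers: one BMI per donor, one intensity per donor per protein
bmi = [19.0, 22.0, 25.0, 28.0, 31.0]
intensities = {
    "protein_A": [1.0, 2.0, 3.0, 4.0, 5.0],  # rises steadily with BMI
    "protein_B": [5.0, 3.0, 4.0, 3.0, 5.0],  # no linear trend
}

for protein, values in intensities.items():
    print(protein, round(pearson_r(bmi, values), 2))
```

With thousands of proteins tested at once, a real analysis would also need multiple-testing correction before calling anything an association.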

This is a nice method paper and proof-of-principle for this kind of study. The exciting part to me -- It doesn't take much imagination to come up with a way to apply it in a clinical sense, right? Collection of the sample couldn't be easier. We already know how to do the sample prep and analysis. Association with different diseases could be used to point us to individual proteins or patterns of proteins that could be early disease predictors.  And maybe patterns are the key point here, and we can easily steal the tools the genomics teams are using for GWAS and divert them to finding patterns in the protein data.

This is a great forward-thinking study that I couldn't be more excited about!

Wednesday, May 17, 2017

TMT-11plex kits are live!


Slow week for the blog as I get ready for that thing in Indianapolis.

Short bit of good news, though! The complete TMT-11plex kits (TMT is a trademark of Proteome Sciences) are live now! You can order them here!   I tried to update the workflow image and I think it turned out okay...



Monday, May 15, 2017

Going to ASMS? Have you downloaded the App yet?


Going to beautiful Indiana in a few weeks? Here is my first public service announcement.

The App is live and it is the best iteration I've seen so far!

You can build your calendar from the App on your phone (if you can remember your password) or you can build it on your PC with the online planner and then the App just pings you reminders when it's time for that talk or poster you're dying to see!

Direct Online planner link! 

Quantifying proteins in dried blood spots after decades of storage!


...and now for something completely different!


I just find this one all-around interesting. First of all -- how did they find blood spots from newborns that were 40 years old?!? Is this commonly acquired and stored in normal clinical practice? If so, what a potential resource!

The paper is an analysis of the feasibility of large scale sample biobanks using blood spots dried on paper. If you were going to set up a biobank, this would be a cheap and easy way to do it. Finger prick, drop of blood on paper, store it.

To study the stability of the proteins they use an immunoassay technology called Proximity Extension (PEA) and assess 92 proteins across samples going back as far as 50 years (I think I remember it saying 50 years in the paper somewhere) and look at different storage and acquisition techniques. A large number of the proteins appear to be genuinely unaffected by degradation over time. Another population of proteins, however, appears to decline with almost predictable half-lives. I don't have time today to read up on this PEA thing, but it appears to be an established technique despite the lack of a Wikipedia page on it.
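A predictable half-life is handy because it means you could, in principle, back-calculate what a protein level was when the spot was fresh. Here is a small Python sketch of that arithmetic. It assumes simple first-order (exponential) decay, which is my assumption and not something I checked against the paper, and the 20-year half-life is a made-up example.

```python
def fraction_remaining(years_stored, half_life_years):
    """Fraction of the original signal left under first-order decay."""
    if half_life_years <= 0:
        raise ValueError("half-life must be positive")
    return 0.5 ** (years_stored / half_life_years)


def correct_for_storage(measured, years_stored, half_life_years):
    """Estimate the original level from a measurement on an aged sample."""
    return measured / fraction_remaining(years_stored, half_life_years)


# Hypothetical protein with a 20-year half-life, measured after 40 years
left = fraction_remaining(40, 20)
print(left)
print(correct_for_storage(10.0, 40, 20))
```

Two half-lives leaves a quarter of the signal, so the corrected value is four times the measured one. The correction is only as good as the half-life estimate, and it does nothing for proteins that fall below the detection limit.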

1) Impressed these resources might exist commonly out there in the world for analysis
2) Ultra-impressed that you can quantify proteins 30/40/50 years down the road from them!

Sunday, May 14, 2017

An example of what you can do with the OmicsDI!



I was really excited about the OmicsDI paper, but I realize at this point in time we're kinda being inundated with terms like "Big Data" and everyone's got a "Database" and an "API" and... it does seem like these terms mean different things depending on who you are talking to...and this explosion of new terms and ideas makes it a little hard to separate the really good from the blustery jargon.

First off -- OmicsDI can help us with one of these problems. It does so by bringing a bunch of databases together. Here are some examples of what you can do with it! (this is OmicsDI.org, btw!)

I'm going to pick a random cell line. Let's go Colo205. Just typing the cell line into the little search bar gives me access to a ton of information: 



There are 3 proteomics studies and 5 transcriptomics ones that we can directly access via OmicsDI that feature this cell line! From just looking at the studies (T -- transcriptome, P- proteome, etc.,) you can see an overview of the study and learn some stuff about them.

For example, 5 studies by ArrayExpress. Microarrays! Right from the start you see there are 7 studies here that won't provide any meaningful data whatsoever and you can move right along (kidding, of course!)

Clicking on the provided link will take you to ArrayExpress (which, btw, I'd never heard of before) where you can see a summary of the study and direct links to download the processed and RAW data from the study.


If someone had said to me -- cool proteomics, has anyone done transcript analysis on this model before? I would have started like this:


Which, btw, doesn't lead you to anything about science on the front page at all. OmicsDI has already made me more efficient in this hypothetical.

Okay...so...I was a little bummed that there were only 7 studies on this cell line in the repository. Guess what? There is some disagreement regarding the nomenclature of the cell line. Is it Colo-205 or Colo205. If I type the search "Colo205 or Colo-205" I get 10 more studies.
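If you search a lot of cell lines, building both spellings by hand gets old. Here is a tiny Python sketch that generates the joined and hyphenated forms and glues them together with "or", the way I typed the query above. It only handles names shaped like letters followed by digits, and the helper names are made up.

```python
import re


def name_variants(cell_line):
    """Return joined and hyphenated spellings, e.g. Colo205 and Colo-205.

    Names that are not simply letters followed by digits are returned
    unchanged, as a single-item list.
    """
    name = cell_line.strip()
    match = re.fullmatch(r"([A-Za-z]+)[-\s]?(\d+)", name)
    if match is None:
        return [name]
    letters, digits = match.groups()
    return [f"{letters}{digits}", f"{letters}-{digits}"]


def build_query(cell_line):
    """Join the spelling variants into one search string."""
    return " or ".join(name_variants(cell_line))


print(build_query("Colo-205"))
```

Whichever spelling you start from, you end up with the same query, so you don't have to remember which form a given repository prefers.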



Including another database I didn't know about (Expression Atlas). Let's follow that one!



It leads me to a table that I can search in the web interface or download in its entirety. It is the expression levels of 24,000 transcripts across a ton of cell lines with a heat map indicating relative up/down regulation stuff.

Remember that theoretical question I mentioned above? Did anyone find this in the transcriptomics? Take those proteins you found that were up- or down- regulated and search them here. And the data is at your finger tips!  No looking for a database you didn't know existed!

Saturday, May 13, 2017

The Cell Atlas launches officially today! A subcellular map of the human proteome!!



The Cell Atlas launches today!!  12,000 proteins mapped to 30 organelles!

I've gotta run. Check out this blog post at the Human Protein Atlas, it is better than anything I'd write anyway!! (and don't skip their cool video if you have time). 

Worried about the quality of a protein localization score because they use antibodies to help with this localization? Don't worry, they've integrated a metric into the quality of their localization score. Antibodies aren't perfect, but having a relative scale of the quality of performance sure won't hurt!!

Paper in Science today!


Was this the best week for proteomics in history, or have I just had too much espresso?

Friday, May 12, 2017

MSStatsQC -- Longitudinal monitoring of targeted proteomic experiments!



Yeah! More quality control! This one is different than what I'm normally rambling about. Check this out!


Not very colorful....it's still in press...may need to do something about this.... You can check it out here if you're in a hurry (and if you are doing targeted quan, you should be, this is AWESOME.)

MSStatsQC -- as you might get from the paper title -- is a method for determining your targeted peptide quan system performance over time. It looks seriously powerful and meets some standards set out by the United States Pharmacopeia (I walk by this all the time and have never been inside! Feel like I should go in after seeing they are also QC nerds!). The USP has data quality broken into these 4 components

1) Analytical instrument qualification
2) Analytical method validation
3) System suitability testing
4) Quality control checks

MSStatsQC is a freely available software package meant to address these USP requirements -- and does so with advanced statistics and graphical models. It takes a minute or two to figure out what the plots are, but once you figure out what you're looking at -- POW! -- instant information on how your instrument has been doing!
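To give a feel for what "monitoring performance over time" means, here is a bare-bones Python sketch of a generic control-limit check. This is NOT the MSStatsQC method (I already admitted I can't follow their math); it is just the simplest textbook version of the idea, flagging runs more than 3 standard deviations from a baseline mean. The retention times are invented.

```python
import statistics


def control_limits(baseline, n_sigma=3.0):
    """Lower and upper limits from a set of known-good baseline runs."""
    center = statistics.mean(baseline)
    spread = statistics.stdev(baseline)
    return center - n_sigma * spread, center + n_sigma * spread


def flag_runs(baseline, new_runs, n_sigma=3.0):
    """Return the indices of new runs that fall outside the limits."""
    low, high = control_limits(baseline, n_sigma)
    return [i for i, value in enumerate(new_runs) if not low <= value <= high]


# Invented retention times (minutes) for one QC peptide
baseline = [20.0, 20.2, 19.8, 20.1, 19.9]
new_runs = [20.1, 19.9, 21.5, 20.0]

print(flag_runs(baseline, new_runs))
```

A check like this only catches big jumps in a single metric. Catching slow drift across many peptides and metrics at once is the harder problem, and that is where purpose-built software earns its keep.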

To be perfectly honest, as soon as the experimental section of the paper starts -- I have to stop. Wow, this paper is ridiculously math heavy. I'd blame it on 6am on a Friday...but I'd be lying. I don't have the background to follow whatever they're talking about. Fortunately -- I don't have to! I can just download MSStatsQC here. (MSstats.org)



Even better, however, is the fact that they've set up a Shiny interface page so you can just run your data through their website! 

Before you put some time into it -- does it work? Albert Heck yeah it does! These authors download a ton of CPTAC 9.1 (the SRM stuff) data and show that they can automatically monitor the data as well as other studies have using software that requires manual validation as the final step in the QC process! We may never get to the point where manual QC validation is unnecessary, but if software can help us to do less of it and more analyzing the cool stuff it is a win for everybody!

Percolator has done this for global discovery proteomics -- is it perfect? Nope, but even the most stringent labs I know (including groups that manually checked every PSM before submitting a study) have enough data now that they only randomly monitor the quality of their peptide IDs that pass Percolator FDR (or monitor small batches of low metric ones).  MSStatsQC is a step in this direction for SRM labs. Less manual monitoring of your controls and QC metrics because it can look at tons of them at once -- through all of your data!

Are you running SRMs in your lab? You owe it to yourself to check out this software! One more thing -- you don't have to use their Shiny interface. You can download the software and integrate it into your own processing pipeline. That's how cool these authors are -- they just want you to have a better way of having higher confidence in the data you are producing -- any way you want to do it! As good as this is, I expect to see common targeted quan software directly integrating at least components of this so the rest of us don't even have to think about it.

And I found a fix for the lack of color in this post!