Thursday, December 14, 2017

A simple workflow to diagnose some sample/instrument quality issues using PD!

A couple people have asked me to look into this over the years and I thought I'd finally give it a try.

Here it goes!

Sometimes you just need a quick snapshot that will tell you if the samples that will be running on your instrument for the next 2 weeks are worth your time. Could we build a quick Proteome Discoverer template that would let you snapshot that first fraction and give you confidence that everything is okay?

To keep it simple for the first one I'm going to say these are the requirements:

1) A histogram of the relative mass discrepancy at the MS1 level
2) A measurement of the relative number of missed cleavages for determining your enzymatic digestion efficiency
3) A measurement of your alkylation efficiency
4) Complete data analysis in under 10 minutes on a normal desktop.
5) Must use either Proteome Discoverer normal nodes or IMP-PD (the free Proteome Discoverer version)

#1 is super easy. #2 requires some serious computational power to do correctly on a modern complete RAW file and will require a bit of data analysis reduction.

If you are working on human samples, I'll walk through it. I'll try to post the FASTA and templates somewhere here a bit later (out of time).

If you are working on something else -- tough luck. You'll have to do this yourself.

Step 1: Generate yourself a good limited FASTA. Something small enough to allow you to perform very large data permutations rapidly, but large enough to get a good picture of your data.

To get this, do a normal search of a representative data file. Feel free to use the default Proteome Discoverer templates. The only thing we're doing here is finding the most abundant proteins in your data. Fractionation may complicate this, but I ain't never seen a human offline fraction that didn't have an albumin or Titin peptide in it. I'd also use the cRAP database -- it isn't super important at this step, as long as you use it somewhere in this workflow.

I threw in Minora, but don't feel as though you have to here. If you are using IMP-PD, use MS Amanda, Elutator, and PeakJuggler. Use normal tolerances for your search (10 ppm/0.02 Da for FT/FT & 10 ppm/0.6 Da for FT/IT).

Same thing for the consensus -- something normalish. I'd throw in the post-processing nodes as well as the ProteinMarker node so that you can clearly distinguish your contaminants from your matches.

Step 2: Run this full search.

Let's find the most abundant proteins and make a FASTA. You can do this a couple of different ways. I recommend using Minora and/or PeakJuggler so that you can sort your proteins by XIC abundance.

Interestingly, the most abundant protein is a cRAP entry. I'm starting to remember why I was asked to troubleshoot this file a few years ago and why I marked it "keep for example purposes."

Step 3: Make a small FASTA! What I'm going to do is filter out the contaminants and make a FASTA of the 150 most abundant proteins. Hover with your mouse and use the down arrow on your keyboard to scroll. Once you have the area covered, right click "check all selected in this table", then File > Export > FASTA > Checked only.
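If you'd rather script this step than scroll and click, here's a minimal sketch of the same idea on an exported results table. The column names (`Abundance`, `Accession`, `Protein Marker`) are hypothetical stand-ins -- match them to whatever your own PD export actually uses:

```python
import csv

def top_n_proteins(results_tsv, n=150, abundance_col="Abundance",
                   accession_col="Accession", marker_col="Protein Marker"):
    """Pick the n most abundant non-contaminant proteins from an exported
    tab-delimited results table. Column names are assumptions, not the
    exact PD export headers."""
    with open(results_tsv, newline="") as fh:
        rows = list(csv.DictReader(fh, delimiter="\t"))
    # drop cRAP/contaminant entries flagged by the ProteinMarker node
    rows = [r for r in rows if "cRAP" not in r.get(marker_col, "")]
    # sort by XIC abundance, highest first, and keep the top n accessions
    rows.sort(key=lambda r: float(r[abundance_col] or 0), reverse=True)
    return [r[accession_col] for r in rows[:n]]
```

From the accession list you'd then pull the matching sequences out of your full FASTA -- or just use the built-in File > Export > FASTA route described above.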

Step 4: Process with this crazy FASTA! Now you have a FASTA to work with! Import it into PD through the Administration tab. Once it's in, make a crazy method.

I'm allowing up to 10 missed cleavages, 100 ppm MS1 tolerance, and 0.6 Da MS/MS tolerance for FT/FT (for ion trap MS/MS, maybe try 2 Da?). Please note -- this database is likely too small for Percolator to work well. I've turned it off here and am relying on Xcorr alone (Fixed Value PSM Validator).

Even with 10 missed cleavages, my old 8 core Proteome Destroyer completed the file in 4 minutes.

Step 5: Evaluate the data quality: Let's check the deltaM. This is the pic at the very top of this post and it looks kinda bad. However, this is mostly a histogram binning issue. Change the number of bins to 100 and it looks much better:
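If you want to see the binning effect for yourself outside of PD, here's a quick sketch. The mass errors below are randomly generated stand-ins for a real DeltaM [ppm] column, not data from this file -- the point is just that 10 bins can make a perfectly tight distribution look ugly while 100 bins resolves its shape:

```python
import random

def histogram(values, bins, lo, hi):
    """Simple fixed-range histogram: returns a list of bin counts."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        if lo <= v < hi:
            counts[min(int((v - lo) / width), bins - 1)] += 1
    return counts

random.seed(0)
# simulated MS1 mass errors in ppm (mean 0.5, sd 2.0) -- illustrative only
delta_m = [random.gauss(0.5, 2.0) for _ in range(5000)]
coarse = histogram(delta_m, bins=10, lo=-10, hi=10)   # blocky, hides the shape
fine = histogram(delta_m, bins=100, lo=-10, hi=10)    # resolves the distribution
```

Same data, same totals -- only the bin count changes, which is exactly what's going on in the PD histogram view.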

What about missed cleavages?

A few -- but it looks like you'd capture well over 90% if you used 2 missed cleavages on this data. I'd say the digestion was okay.
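If you want to check this on an exported peptide list yourself, the counting is simple -- the standard tryptic rule is an internal K or R not followed by P. A sketch:

```python
def missed_cleavages(peptide):
    """Count internal K/R residues not followed by P (tryptic missed cleavages).
    The C-terminal residue is excluded -- that's the cleavage site itself."""
    return sum(1 for i, aa in enumerate(peptide[:-1])
               if aa in "KR" and peptide[i + 1] != "P")

def fraction_within(peptides, max_mc=2):
    """Fraction of peptides with at most max_mc missed cleavages."""
    ok = sum(1 for p in peptides if missed_cleavages(p) <= max_mc)
    return ok / len(peptides)
```

Run `fraction_within` over your PSM sequences with `max_mc=2` and you get the "would 2 missed cleavages capture >90%?" number directly.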

Alkylation output:

In order to see your relative alkylation efficiency (please keep in mind I'm assuming iodoacetamide), you will need to make carbamidomethylation a dynamic modification.

In your output report you can see your relative alkylation efficiency by applying this filter:

Then plot this data:

In this output we're looking at around 73% alkylation efficiency. A quick look shows me that about 49 of these peptides are from cRAP -- even if you take those out of consideration (which only makes sense for contaminants introduced late in the process), this is still pretty low. I'd check the expiration date on this iodoacetamide, or see if it has spent a lot of time exposed to direct sunlight.
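For the arithmetic itself, here's a sketch of the efficiency calculation on a simplified PSM list. The input format (sequence plus a count of carbamidomethylated Cys) is made up for illustration -- adapt it to however your PD export reports the dynamic modification:

```python
def alkylation_efficiency(psms):
    """psms: (peptide sequence, number of carbamidomethylated Cys) pairs --
    a simplified stand-in for a PSM export with carbamidomethyl set as a
    dynamic modification. Returns the fraction of Cys-containing PSMs in
    which every Cys carries the +57.021 label, or None if no Cys PSMs."""
    with_cys = [(seq, n) for seq, n in psms if "C" in seq]
    if not with_cys:
        return None
    fully_alkylated = sum(1 for seq, n in with_cys if n == seq.count("C"))
    return fully_alkylated / len(with_cys)
```

Filter out the cRAP matches from the input list first if you want the number with contaminants excluded, as discussed above.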

This is an evolving project (there is a lot more we can do here) but I'm going to stop here for now.

Tuesday, December 12, 2017

Trypsin may be a limiting factor in alternative splicing event detection in proteomics!

Thanks WikiPedia Alternative Splicing page!

Alternative splicing is a big deal in eukaryotic systems. One of the first big revelations in "next gen" transcript sequencing is that these aren't rare events.

However, there hasn't been a huge amount of data from the proteomics side to back up the numbers that the transcriptomics people have been finding.

This new study in press at MCP suggests it might be our fault -- primarily our reliance on good 'ol trypsin.

Earlier this year we had this bummer study on cysteine alkylation reagents -- and now another weakness of trypsin...?  Isn't anything perfect?

Really -- this is only going to be a problem if you're specifically focused on alternative splicing. It turns out that lysines and arginines are involved more often than other amino acids in these junctions, and trypsin either cuts them into pieces far too small to sequence -- or cuts them at just the right point that no useful information about the alternative splicing survives.

These authors in silico digest some human proteomes simulating different enzyme activities. Trypsin doesn't lose all the information, but it appears that chymotrypsin and AspN provide better coverage of these sites -- however, as in all things, it looks like using all 3 (separately, of course) will provide the greatest amount of coverage.

Monday, December 11, 2017

The Synaptosome Proteome!

At the very top of my "favorite new (to me) field to say 3 times fast list" -- I present this awesome new study on the synaptosome proteome!

A little looking around and I had to add the "to me" part. There are dozens of studies on the proteomes of synaptic junctions going back to before I ever learned how to use a mass spec, but having not read any of the others -- this is, by far, my favorite one!


1) Real human brains were employed.
2) Even more impressive? This is tissue from a bunch of human brains with really interesting phenotypes.
3) Multi-plex iTRAQ was performed (2 4-plexes) -- and performed expertly. 8 "normal" brain controls were combined in equal ratios and these were used as channel 117 in both 4-plex groups. That's really smart, right? They could look at 6 patients from their disease state (the other 3 channels times 2) and compare it to 8 of their control groups. The control mixture in 117 can be used to normalize between the two 4-plex sets AND all interesting observations in the patient samples can be obtained by just using the 117 channel as the denominator. Simple -- and I'm totally using this later.
4) Other samples were also studied using label free quan. I have the RAW data files on my desktop, but I'm not 100% sure how the data was processed. The findings from the LFQ and the iTRAQ analyses were compared and combined.
5) PD 2.1 was used for the data analysis, and InfernoRDN (which I'd forgotten about somehow!) was used for statistical analysis. If you don't know about this -- I highly recommend you check it out!
6) PRM was used to validate the interesting findings!
7) This data is also integrated into the C-HPP's search for missing proteins!
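The pooled-reference trick in point 3 is easy to sketch -- dividing every channel by the shared 117 pooled-control channel puts both 4-plex sets on the same scale. A minimal illustration (the channel labels are just the iTRAQ 4-plex reporter masses; the intensities are made up):

```python
def normalize_to_reference(reporter, ref="117"):
    """reporter: dict of channel -> reporter ion intensity for one
    PSM/protein in one 4-plex. Returns each sample channel as a ratio
    to the shared pooled-control channel."""
    return {ch: inten / reporter[ref]
            for ch, inten in reporter.items() if ch != ref}

# one 4-plex: three patient channels plus the pooled control in 117
plex_a = {"114": 200.0, "115": 100.0, "116": 50.0, "117": 100.0}
ratios_a = normalize_to_reference(plex_a)   # patient / pooled-control ratios
```

Because both 4-plex sets carry the same control mixture in 117, ratios from the two sets are directly comparable after this step.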

I can't gauge the value of their biological findings, but the samples are really cool, and the proteomics is some top-notch stuff. They point out some pathways that seem to make sense with the theme of the paper and that's good enough for me!

All the RAW data is up on ProteomeXchange (where I actually found out about this study) here. 

Sunday, December 10, 2017

Awesome matrix detailing the free PD nodes from IMP!

Without fail, every time a new version of PD comes out, someone discovers their favorite free tool from IMP hasn't been ported over.

IMP provides these nodes and does so out of their own good will for the community, and it's a lot of work.

This super handy new matrix shows you what tools are compatible with which PD versions. Keep in mind that many of these work within the awesome free IMP-Proteome Discoverer version as well!

Saturday, December 9, 2017

What are the best open Linux Proteomics tools?

I run into this question a lot and I'm surprised I haven't thought to work on a post like this before.

Linux is a family of alternative (and mostly free) operating systems. I tried really hard during grad school to live my life with these -- primarily because I was broke -- but also because I had this strong anti-corporate thing (I appear to have misplaced that along the way...).  This experiment failed dramatically. The versions I tried just couldn't support my crappy hardware and I wasn't smart enough to install/alter my own drivers to make them work.

These operating systems have come a LOOONG way since then -- are still mostly free -- and when universities build supercomputing complexes this is what they're going to use (or UNIX, but we'll ignore that).

Quick note on these, however: I've gotten to mess with one quite a lot recently. These huge "cloud cluster" computers have unbelievable numbers of processing cores and threads available. However, the architecture of these processors can be very different from the desktop ones we're more familiar with. I was first allotted 8 cores for Proteome Discoverer 1.4 and 2.1 running within a simulated Windows environment. It was SO slow. I complained and they gave me 16, then 32, and it still wasn't as fast as my old 8 core desktop. I got bored and did something else. Really, it might be that software needs to be designed specifically for these machines to run optimally on them....

And that's a lot of words leading up to  -- what proteomics programs can I install on Linux!?!
In no particular order:

#1 SearchGUI!  

I <3 SearchGUI. You know all those command line search engines everyone talks about in the literature? SearchGUI makes them all work in one simple, easy interface. You can then bring all the results together in PeptideShaker. I would make #2 DeNovoGUI -- but it has now been integrated into SearchGUI, so it only gets 1 entry. You can get it here.

#2 OpenMS!

I've only installed the great OpenMS package in Windows. It is a fully integrated environment for mass spectrometry with quantification and identification for proteomics and metabolomics. There may be some extra steps to get it installed in your Linux environment. Fortunately, those instructions are here.

Now, it is probably worth noting that you can really make any Windows program work in a Linux environment by creating a simulated Windows environment within the Linux system. This is what I was doing with Proteome Discoverer, but the performance was too much of a problem. This really might have been that I didn't know what I was doing...

This page is a resource that will help you set up MaxQuant working in the same way!

Heck, this guide will help you set up Proteome Discoverer (1.4) on Linux as well.  Please keep the limitations in mind.

There are other tools that will work in these environments as well, but I don't have hands-on with them personally:


The Trans-Proteomic Pipeline!

And...of course you can install R and use R for Proteomics.

Many of the IMP tools (including MS Amanda) can be operated stand-alone in Linux.

Please let me know what I've forgotten -- I'm sure there is a ton!

Thursday, December 7, 2017

Q Exactive HF-X paper is out!

This "just accepted" paper at JPR is a gold mine! 

Head to head comparisons -- QE HF with HF-X.

How fast is the HF-X in real proteomics samples? Really fast!
How sensitive? Same gradient conditions, 100ng of peptides on the HF-X gets the same coverage as the HF at 1000ng!
What is the overhead of the HF-X? Somehow -- it again drops in this model. It's somewhere between 3-4ms!

The paper goes through TMT, phosphoproteomics, single shot runs and it's almost an afterthought -- that these authors take the 46 offline high pH reversed phase fractions from the Rapid Comprehensive Proteomes Cell paper -- and cut the nanoLC runs to 19 min with the same degree of coverage.

At 19 minutes, it is realistically possible to obtain 2 (TWO!) complete human proteomes in about one 24-hour day of run time! (My crude math.)
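For the record, here's that crude math written out -- gradient time only, with loading and equilibration excluded, which is exactly why the next point matters:

```python
# back-of-the-envelope check of the run-time claim (gradient time only)
fractions = 46          # offline high pH reversed phase fractions per proteome
minutes_per_run = 19    # shortened nanoLC gradient per fraction
hours_per_proteome = fractions * minutes_per_run / 60   # ~14.6 h of gradient
proteomes_per_day = 24 / hours_per_proteome             # ~1.6 on gradient alone
```

So "about a day" for two proteomes is within shouting distance on pure gradient time -- whether you actually hit it comes down to the per-run overhead.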

The thing that slows you down the most? The nanoLC column loading and equilibration time.

Tuesday, December 5, 2017

Proteoform Suite!

I don't know if this picture is related, but Google thinks it is and I like it!

I can't yet read this paper. It's ASAP at JPR and my library doesn't list those for a few days typically.

You can access it at JPR directly here: 

Check this out, though! Before paper launch these authors have already put an instructional video on their cool new software (quantify and visualize proteoforms!!) on YouTube.

I'm having trouble embedding the video here. I took a screenshot at this point so you can see some of the awesome output capabilities of this new tool. Direct link to the video is here.

Since we have to admit that proteoforms exist and complicate our work, any tool that will help you group them into families -- quantify your changes -- and help you make sense of all that stuff you found is going to be seriously useful.

Monday, December 4, 2017

pSITE -- A computational bulldozer approach to de novo global PTM analysis

Sometimes elegant solutions are in order.

Other times you've got to put on your cowboy hat and check every possible AMINO ACID where a PTM could co-localize in a gigantic peptide matrix to determine where the heck that modification has ended up. And...maybe that's how you have to do it to accurately determine your FDR.

I don't mean to insult the authors of this cool new paper -- at all -- quite the opposite.

I have loads of respect for the fancy pants statistics stuff that I don't fully understand. However -- pSITE cuts most of that out of the way by just checking everything, and in their hands it looks superior to our classic methods like Ascore and phosphoRS for global de novo identification of modifications and their localizations.

You'll have to check out the math for yourself if you're interested. One of the big surprises for me in this paper was in the supplemental info. With all the amino acid specific calculation, I'd expected the search space for pSITE to blow up exponentially with peptide length -- for example, a 12 amino acid peptide having a search space something like the 12th power larger than Novor or PEAKS.

I don't know how this is possible, but pSITE somehow (on the same server configuration) isn't slower than these other algorithms. It is somehow faster than some of them....

pSITE is free to download and you can get it here. 

Sunday, December 3, 2017

MORPHEUS (not that one!) at the Broad for downstream analysis!

Have you been trying all weekend to successfully do some sweet clustering on a collaborator's data, but keep getting this when you try to run ClustVis? 

Have you already tried the popular American strategy of calling the person(s) responsible a name on Twitter to see if that helps?

..and it totally didn't help at all?

Are you also too lazy to do it in R yourself, despite the fact that Brett Phinney tipped you off to an awesome and super easy looking package that would do it?

Well -- do I ever have the solution for you!

Check out the MORPHEUS web interface hosted by the Broad Institute here!

1) Wow, is this thing ever easy! It was designed for genomics stuff -- wait, some of the example data is quantitative proteomics! You have a format to follow!  Just cut your data out into a Tab delimited text file and go. It looks like you can load your entire file and then determine what is important, but I found it simpler to cut it myself down to my protein accession numbers and my normalized protein abundances from each RAW file.
2) It has a lot of fancy stats things in it that you can use. Do they work? Sure probably!  My results seem to make a lot of sense....
3) It doesn't like enormously huge names in the Row and column titles. If you have, for example, clustered your RAW files 7 different ways in Proteome Discoverer 2.x and the title of your row has over 128 characters, it will compress the space for visualizing the clustering distance. Shorten the name and it's fine!
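If you'd rather generate the upload file with a script than cut it down by hand, here's a minimal sketch. Both `write_morpheus_matrix` and the 64 character cutoff are my own inventions for illustration -- MORPHEUS doesn't document a hard limit, it just renders long titles badly:

```python
import csv

def write_morpheus_matrix(path, accessions, sample_names, matrix, max_name=64):
    """Write a tab-delimited abundance matrix (rows = protein accessions,
    columns = samples/RAW files) suitable for uploading to MORPHEUS.
    Long row/column titles are truncated because they squeeze the space
    available for drawing the clustering dendrogram."""
    with open(path, "w", newline="") as fh:
        w = csv.writer(fh, delimiter="\t")
        w.writerow(["Accession"] + [s[:max_name] for s in sample_names])
        for acc, row in zip(accessions, matrix):
            w.writerow([acc[:max_name]] + list(row))
```

Feed it your accession numbers and the normalized protein abundances from each RAW file, and you have the tab-delimited text file the tool expects.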

And it's got a bunch of cool stuff to choose from once your tables are loaded!

(This isn't patient confidential data or anything like that -- but unless I've gotten explicit permission to share someone's stuff I figure you can never be too careful. See what a good collaborator I am? Just don't expect analysis in a hurry...)

You have loads of power to take a first pass at your data here. All the normal clustering algorithms (that I have to go straight to WikiPedia to remember what is what) are here. Surprisingly, it appears to do all the analysis within your local browser! When I really loaded it with data (clustering proteins in one tab and all quantified peptides in another tab) it used up a good 10% of the processing power available on my PC for upwards of 10 minutes before wrapping it up. (This suggests to me it is limited by the amount of math my web browser can handle). Come on, Chrome! You can do it!

Are there a ton of ways to get your data clustered? Sure! But it never hurts to have another nice (and easy) tool to get there.

Is it a little confusing that it shares a name with another great tool? 

Saturday, December 2, 2017

PoGo! Overlay your identified peptides onto your gene assembly!

You should be able to click on the picture I stole above to expand it!

I'm kinda leaving this awesome tool here so that I can find it later to check it out.

You can access PoGo directly here. 

Why would you want to? Proteogenomics, Kyle. Don't know which frame is the correct one? Overlay them (this all goes back to the codon uncertainty thing)

Big shoutout to @clairmcwhite for the image and tipping our multi-omics Twitter community off to this awesome resource --and to the team at something called the...

...for putting this up. Can't wait to get ahead a little so I can check it out!

Friday, December 1, 2017

Mitochondrial dysfunction in heart tissue revealed by RNA-Seq + Proteomics!

The biology in this new study is intense. I know approximately nothing about cardiovascular proteomics (honestly -- as far as I can tell, not many people do -- definitely an underdeveloped field at this point, but there are great people working hard on it!)

What I'm interested in here is how these authors approached the experimental design in terms of combining transcriptomic analysis with their quantitative proteomics.

Don't get me wrong -- there is some solid proteomics work in this. If you are looking for a recent method to enrich mitochondria from cells -- this (and the associated study referenced in the method section) is/are for you. These authors start with an impressively pure mitochondrial fraction before the peptides from it go onto a Q Exactive HF. In just a mass spec nerd note -- this group sacrifices some scan speed and MS1 resolution in order to get higher resolution MS/MS. They use 60,000 resolution for MS1, top10 selected for 30,000 resolution MS/MS. We've been seeing this more often recently.

My inclination is always going to be to get more MS/MS scans -- at the sacrifice of relative quality/scan. In simpler mixtures of peptides, it seems like more researchers prefer to get fewer MS/MS scans if the scans can have higher injection times -- and if you're using longer fill times, you might as well get better resolution MS/MS right? It will be free in terms of cycle time on the lower abundance peptides and will only cost you cycle time on the higher abundance peptides. I should check later to see if anyone has done a comprehensive comparison at different complexities....

Back to the paper -- all the label free proteomics is processed in MaxQuant, and the transcriptomics is performed using a commercial heart-focused kit for the RT-PCR and a mouse kit for the HiSeq analysis.

The impression you get from this paper at first is that the combination of this staggering amount of data from all of these mutant mice -- is that it was easy. There's no way it was. No way. But they make it look that way till you dig deep into this massive body of work.

And that's why I recommend this paper -- they did this work and laid it out end to end here so I don't have to. If someone hasn't dropped by your lab with samples they've done transcriptomics on and want to combine it with quantitative proteomics -- you'll see them soon. And -- it's tough. We don't have unified pipelines --yet. You need to use some of their genomics tools and our tools and find things like IPA, CytoScape and the right R packages (if they've done the transcriptomics, they probably have an R specialist around) and this study is a great framework for how to pull this off!

Thursday, November 30, 2017

The MacCoss lab needs your help to keep Skyline Free!

Hey!!  You're a mass spectrometrist (or a weirdo) if you're on this blog. And I bet you use Skyline. If you don't I'd almost guarantee you will some day in the very near future -- and the best part about Skyline is that the developers support the heck out of it. It is always getting better and you can always get help on the forums if you have issues.

Skyline has always been free for everyone, but without some more letters of support from our community it may not be able to stay that way.

If you've got 5 spare minutes, please go to this page and sign a premade support letter or write one of your own.

My country is going to lose access to a free and neutral internet in about 2 weeks.  Let's not lose one of the best pieces of mass spec software we've ever seen on top of it!

I might go crazy, for real....(that picture gets better the more I look at it...)

Wednesday, November 29, 2017

Quantitative proteomic profiling of malaria antigens!!!!

I was having an awful day. Fo' real, yo. Microsoft forced an update to my favorite PC (my Destroyer Gen1 -- I got the new "Max Destroyer", but I haven't migrated anything but PD to it yet) and knocked out all of my software that had a dependency on the .NET framework (a lot of it -- watch out, Windows people -- 4.7 might not be the culprit, but the timing is suspicious), but then this paper turned my whole day around!

If you've had the misfortune of reading any of this blog you might know that I've got a hangup for the malaria parasite. Malaria is a protein problem. If you throw all of the world's genomics resources at it, you'll never solve the malaria puzzle. Proteomics, glycoproteomics, and (my firm belief) PTMs we don't yet know about are what is going to finally help us understand P. falciparum and why it kills human beings with such relative ease.

During some stages the malaria parasite lives in your red blood cells (RBCs). It looks super gross under a microscope. There are a bunch of bugs in your red blood cells. Our immune systems are nearly perfect. Organisms generally have to evolve against our immune systems for millions of years to have a shot. Yet these huge parasites live in our red blood cells and our immune system passes right by. We know there are proteins that the parasite pushes through the RBCs and that these help protect them from targeted destruction.

Sandra Bark et al. decided to see what the surface of the infected RBC looks like to proteomics by shaving off the proteins and doing proteomics on them. Big deal. Both Laurence Florens and Michal Fried did this independently years ago. The devil is in the details.

This paper is Bad. Michael Jackson (1987) Bad.

They biotinylate the surface proteins.
They iTRAQ label everything (RBCs with great big parasites living inside them and eating everything tend to be leaky and populations of them are untrustworthy in terms of proteins leaked).

Check out the figure at the top. The number of negative controls in the study is what gives it so much power. Leaky proteins from damaged cells can be eliminated -- they're all leaky. Comparing shaved to unshaved membrane protein preps allows you to tell what is really important.

Deep proteomes were obtained by normal means -- 30 high pH reversed phase fractions were run on a Q Exactive classic using 110 minute gradients and a Top12 method. An NCE of 28 was used (+1 since it is iTRAQ 4-plex).

Did I type this already? The power is in what can be eliminated from consideration! So much of the background noise can be tossed because of how good the controls are!

What did we get? Well -- to date -- the most comprehensive picture of what is going on at the RBC surface when the malaria parasite is hanging out and eating all the hemoglobin. And new surface proteins(?) -- definitely peptides that might make great new vaccine candidates!!

All the RAW data has been deposited at MASSIVE, but the details in the paper for the user name and password appear to be incorrect. I'm sure it's only password locked for the review process and will be available once the study is officially accepted.

EDIT: Updated MASSIVE link to RAW files is here and will be corrected in the final version.

To these authors, I say:

(pug people are so weird. who makes this stuff?)

Seriously -- AWESOME STUDY. I'm sending this around to friends now!

Saturday, November 25, 2017

TAFT -- Fast, reproducible phosphopeptide enrichment AND fractionation on a stage tip!

If TAFT works as well as these researchers claim in this new study, this could be a game changer!

They directly enrich phosphopeptides and then do high pH reversed phase fractionation -- all in the same tube/tip and it's ready to go!

Did the number of phosphopeptides they identify break records set by techniques like SCX or high pH RP fraction collection followed by TiO2/IMAC enrichment? No. It's not bad, but we have seen better.

However -- 4 hours from start to finish! Compare that to most of the deep phosphoproteomics studies we've seen -- they point out several, and you're looking at a full day to prep the samples and then up to 48 hours of instrument run time. 4 hours vs. a long week?!?

If you use TAFT you could use the other 4.5 days of your work week to try and make sense of the 14,000 phosphopeptides you identified.

On the method details -- they do use an Orbitrap Fusion for their analysis. It appears to be all HCD high resolution MS/MS on the mass spec side for this study and the data processing is in MaxQuant / Perseus.

The reproducibility of the TAFT protocol, run to run, looks pretty spectacular as well. One of the benefits of keeping everything simple and all in one tube!

Friday, November 24, 2017

Longitudinal quality assessment of iTRAQ labeled samples for clinical proteomics!

Have we underestimated the power of reporter ion based quantification (iTRAQ/TMT) in terms of long term clinical studies? This new study at JPR suggests we might need to reconsider it!

Over a 7 month period, this team collected xenograft samples. In all experiments they used a single reference sample and used that reference channel for normalization and in a lot of the other downstream work.

They demonstrate a surprising amount of power and reproducibility over this time. This is despite the use of 2D offline fractionation (via high pH reversed phase) into 24 fractions and keeping the fractions stored at -80°C. (96 fractions were collected, but they were concatenated into 24, using the same concatenation pattern for every fractionated sample -- for example, fractions 1, 25, 49 & 73 were combined for every set of peptides.)
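That concatenation pattern is easy to generate programmatically -- a small sketch of the scheme as I read it from the paper (96 collected fractions pooled into 24, every 24th fraction combined):

```python
def concatenation_scheme(total=96, pooled=24):
    """Return which collected fractions (1-based) go into each final pooled
    fraction, using the fixed stride pattern: 1, 25, 49, 73 -> pool 1, etc."""
    step = total // pooled  # source fractions combined per pooled fraction
    return [[i + 1 + k * pooled for k in range(step)] for i in range(pooled)]
```

Using the same scheme for every fractionated sample is what keeps the sets comparable across the whole 7 month study.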

All the work for this study is performed on the same Orbitrap Velos instrument using a 2 Da MS/MS isolation window in a top10 experiment with HCD fragmentation set at 35.

An interesting touch here is a confidence metric that takes into account the hydrophobicity of the peptides and the reporter ion intensity.

The real star of this huge amount of work (at least for me) is the 22 separate metrics that were tested to determine the most reproducible way to roll the reporter ion and PSM data up to the protein level. As hard as this Orbitrap was working, it looks like the statisticians may have put in more time.  In the end, they conclude that if you put the work in and do it right, there is no reason you can't utilize reporter ion quan in large scale clinical studies!