How difficult is it for biomarkers to be validated?

Biomarkers have been around for a long time now, and the field is moving rapidly. In addition to genetic and protein markers, we now also have microRNAs, epigenetic markers, lipids, metabolites, and many imaging markers. Some are extremely useful as a (companion) diagnostic; others may serve as mere indicators. However, there are problems. The biggest obstacle by far is proper biomarker validation, for any disease and whatever the role of the biomarker is going to be. Such validation depends on independent confirmation at different locations (different labs). One problem is consistency in the preparation of the biological material used in the different studies. When we limit ourselves to protein biomarkers detected by antibodies, another problem is consistency in the choice of antibody used. It should also be noted that in quantitative IHC one needs a standard in the quantification method. A recent opinion paper reveals yet another layer of complexity: the statistical analysis is prone to wrong conclusions, down to coding errors in the software. It is time to take stock and to address the different levels of disturbance complicating the process of validation.


Biological material

Especially when biomarkers are unstable, the integrity of the sample specimens determines the quality of the biomarker measurements. Post-mortem samples in particular will never represent samples from living individuals, because of post-mortem delay. As the post-mortem delay differs from individual to individual, the level of decay will vary dramatically per sample. For this reason, post-mortem samples are best suited for qualitative analysis; quantification of any biomarker in post-mortem samples should be interpreted with extra care.

Plasma samples can be prepared in different ways: there is EDTA-plasma, citrate-plasma and heparin-plasma. In addition, biomarkers can be tested in serum and in whole blood. Clearly, biomarker levels should only be compared between identically treated samples, in order to avoid variation in noise arising from the different ways the samples were prepared. This principle is universal, so it holds for any other type of sample as well.
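The rule of comparing only identically prepared samples can be sketched as a small bookkeeping step before any analysis. This is a minimal illustration with entirely hypothetical measurements; the preparation labels and values are assumptions, not data from any real study:

```python
from statistics import mean

# Hypothetical biomarker measurements (ng/mL), keyed by preparation type.
# Pooling across preparation types would mix method noise with biology.
samples = [
    ("EDTA-plasma", 4.1), ("EDTA-plasma", 3.9),
    ("citrate-plasma", 3.2), ("citrate-plasma", 3.4),
    ("serum", 5.0), ("serum", 5.2),
]

def group_means(records):
    """Average biomarker levels per preparation type, never across types."""
    groups = {}
    for prep, value in records:
        groups.setdefault(prep, []).append(value)
    return {prep: mean(values) for prep, values in groups.items()}

print(group_means(samples))
```

The point of the sketch is only that comparisons stay within a preparation type; the apparent difference between serum and citrate-plasma here would be meaningless as a biological finding.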

For microscopy, tissue slides have to be prepared in line with the required assay before they can be investigated. Fixatives (alcohols, aldehydes), embedding materials (paraffin, LR White, etc.) and temperatures (frozen vs heated) have profound effects on the integrity of the tissues and cells, and they determine whether the assay succeeds or fails. Again, consistency in the preparation of the samples to be analyzed is paramount. Large-scale data analysis may become skewed when data are collated from samples treated in different ways.

A systematic approach to recording and keeping biospecimens has been proposed and is aimed to become the new standard: the Biospecimen Reporting for Improved Study Quality (BRISQ) guidelines provide a tool to improve consistency and to standardize information on the biological samples.

Antibody choice

Mass-spec and RT-PCR quantifications are robust thanks to the consistency of the assay material. The robustness of immunoassays, however, depends highly on the choice of antibodies used in the assay. Once an antibody has been successfully validated in one assay, the assay is defined by that antibody: changing the antibody will potentially change the outcome altogether. When an antibody needs changing, the assay is no longer validated and the validation procedure has to be repeated with the new antibody. For this reason the preference goes to monoclonal antibodies. The rationale behind this preference is that the clone number of the antibody would define its characteristics: the expectation is that as long as an antibody with the same clone number is used, no matter which vendor it is from, the assay will remain validated because the antibody remains identical. Unfortunately, this is a myth. Depending on the vendor (and sometimes on the catalogue number), the formulations of the different products, all with the same clone number, will differ: the antibody may be purified from ascitic fluid, purified from culture media, or not purified at all (crude ascitic fluid). These different formulations affect the way the antibody needs to be diluted to avoid non-specific background. Therefore, the monoclonal antibody needs to be revalidated in the same assay when the original formulation is no longer available. With this in mind, a peptide-generated polyclonal antibody from an animal larger than a rabbit (for large batch sizes) may serve as an alternative, because the batch-to-batch variation of such an antibody is limited by the size of the immunizing peptide.

Assay development

As mentioned above, a monoclonal antibody may not always be readily available when a new assay is being developed. A peptide-generated polyclonal antibody may then serve as a good and cost-effective alternative. However, peptide polyclonal antibodies need a new round of revalidation when a new batch (from a different animal) arrives, just like differently formulated monoclonal antibodies.

During assay development it is essential to dilute the antibody far enough to avoid non-specific background, yet keep it concentrated enough to allow measuring a dynamic range, especially when the assay is quantitative. When the assay depends on a secondary antibody, that antibody needs validation as well (with and without primary) so as to assess its non-specific signals (noise).
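The trade-off between background and dynamic range during titration can be sketched as a simple selection rule. The dilution factors, signal values and thresholds below are all hypothetical, chosen only to illustrate the shape of the decision:

```python
# Hypothetical titration data: (dilution factor, specific signal, background).
titration = [
    (1000, 2.10, 0.90),   # too concentrated: background unacceptably high
    (5000, 1.80, 0.30),
    (20000, 1.20, 0.08),
    (80000, 0.40, 0.05),  # too dilute: dynamic range is lost
]

def best_dilution(points, max_background=0.35, min_signal=0.8):
    """Pick the dilution with the highest signal-to-background ratio
    among those that keep background low and signal still usable."""
    usable = [(d, s / b) for d, s, b in points
              if b <= max_background and s >= min_signal]
    return max(usable, key=lambda t: t[1])[0] if usable else None

print(best_dilution(titration))
```

In practice the acceptance thresholds would come from the assay's own validation criteria rather than the fixed numbers assumed here.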

Specificity needs to be addressed by comparing specimens spiked and un-spiked with the intended protein of interest (the analyte) at various quantities; the signals need to be proportionate to the quantities spiked. In addition, specimens known not to contain any of the analyte need to be compared with specimens known to contain the analyte at natural levels.
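The spike-and-compare check described above is commonly summarized as a percent recovery: the extra signal in the spiked aliquot, divided by the amount added. A minimal sketch with hypothetical readings (units and values are assumptions for illustration):

```python
# Spike-recovery check: signal in spiked vs un-spiked aliquots should be
# proportional to the amount of analyte added.
def percent_recovery(measured_spiked, measured_unspiked, amount_spiked):
    """Recovered fraction of the spiked analyte, as a percentage."""
    return 100.0 * (measured_spiked - measured_unspiked) / amount_spiked

# (amount spiked, measured in spiked aliquot, measured in un-spiked) in ng/mL.
spikes = [(10.0, 13.8, 4.0), (25.0, 28.5, 4.0), (50.0, 52.0, 4.0)]
for amount, spiked, unspiked in spikes:
    r = percent_recovery(spiked, unspiked, amount)
    print(f"spike {amount} ng/mL -> {r:.0f}% recovery")
```

Recoveries that stay roughly constant across the spiked range support proportionality; a commonly used acceptance window is on the order of 80-120%, though the exact criterion belongs to the assay's validation plan.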

Detection and cut-off values

Sensitivity is commonly attributed to the antibody used in an assay, but this is a misunderstanding. Sensitivity is determined by the detection method, of which the antibody (or the primary and secondary antibodies) forms a part. If levels of the analyte are low, a higher sensitivity is required. This increased sensitivity is usually not accomplished by increasing the antibody concentration, although using an antibody with higher affinity will help; in general, changing the detection method (fluorophore, isotope, etc.) is the appropriate step to take. With the increase in sensitivity, noise and background will increase as well. When a change to a higher sensitivity is required, the validation should therefore focus on a more stringent regime for keeping noise and background at bay.
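One widely used convention for separating true signal from the noise floor of a detection method is to set the limit of detection at the mean blank reading plus a multiple of the blank standard deviation. A minimal sketch with hypothetical blank readings (the values and the choice of k = 3 are illustrative assumptions, not from the text):

```python
from statistics import mean, stdev

# Hypothetical readings from blanks (no analyte) on the detection system.
blanks = [0.051, 0.048, 0.055, 0.050, 0.046]

def detection_limit(blank_readings, k=3.0):
    """Common convention: limit of detection = mean blank + k * SD of the
    blanks (k = 3 is typical). Readings below this are indistinguishable
    from noise."""
    return mean(blank_readings) + k * stdev(blank_readings)

lod = detection_limit(blanks)
print(f"LOD = {lod:.3f}")
```

A more sensitive detection method lowers the achievable signal, but if it also widens the spread of the blanks, the detection limit may not improve; this is the noise-versus-sensitivity trade-off the paragraph describes.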

When quantification is a requirement, cut-off values need to be put in place. Both the Lowest Level Of Quantification (LLOQ) and the Highest Level Of Quantification (HLOQ) must be determined. Often the detection limits are determined as well, but these are only relevant for qualitative work. In immunohistochemistry (IHC) these values become tricky, because the intensity of the signal is not just a number generated by a detector: the density of the signal is combined with its location in the tissue. In addition, the surface area of quantification needs well-defined boundaries. And even when all these measures are in place, the quality of the tissue and of the slides can potentially jeopardize these measures and skew the results. Diagnosis by IHC is therefore prone to misinterpretation when, for one specific test, consistency at all levels (same antibody at the same dilution, identically prepared tissue samples, identical surface area analyzed, identical staining, etc.) is not maintained in all laboratories in the world.
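The role of the LLOQ/HLOQ window can be sketched as a gate that every reading passes through before it may be reported as a quantity. The window limits below are hypothetical placeholders, not values from any validated assay:

```python
# Hypothetical quantification window for an assay (same units as readings).
LLOQ, HLOQ = 0.5, 40.0   # lowest / highest level of quantification

def flag_reading(value, lloq=LLOQ, hloq=HLOQ):
    """Only values inside [LLOQ, HLOQ] may be reported as quantities."""
    if value < lloq:
        return "below LLOQ: report as '< LLOQ', not a number"
    if value > hloq:
        return "above HLOQ: dilute and re-assay"
    return "quantifiable"

for v in (0.2, 12.5, 55.0):
    print(v, "->", flag_reading(v))
```

For IHC the gate is harder to express this simply, because, as noted above, the "value" entangles staining density with tissue location and the analyzed surface area.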

Statistics and jumping to conclusions

Statistical analysis is notoriously used to provide evidence convenient to the author(s). No matter which statistical method is used, when the input data have been selected, any outcome is flawed by default. Only analysis of ALL data (non-selected) would yield proper results, but those might be inconclusive or inconvenient. The pressure to publish in peer-reviewed journals pushes authors to present statistics in the most incomprehensible way possible, knowing that their peers will not admit their confusion and will likely take their word for it. Even when the statistical results are sound, they may be over-interpreted. Thus claims are made based on prejudice and weak statistics (for example, cholesterol levels linked to cardiovascular disease, CRP claimed as an inflammation-specific marker, mutations causing cancer, etc.). Such claims can be (and have been) driven by conflicts of interest. The reputation of biomarkers has suffered dramatically from this lack of scientific integrity, and as a result many scientists have lost faith in the usefulness of a biomarker database. New guidelines have been introduced by publishers in order to set a new standard for how statistics are presented.
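The effect of analyzing selected rather than all data can be made concrete with a toy example. The numbers below are invented for illustration; they stand in for, say, changes in a biomarker after treatment:

```python
from statistics import mean

# Toy illustration of data selection: cherry-picking "responders" conjures
# an effect that the full data set does not support.
all_changes = [-1.0, 0.5, -0.2, 1.1, 0.1, -0.6, 0.8, -0.7]

selected = [x for x in all_changes if x > 0.4]  # the "convenient" subset

print(f"all data: mean change = {mean(all_changes):+.2f}")
print(f"selected: mean change = {mean(selected):+.2f}")
```

On the full data the mean change is zero; on the selected subset it looks like a substantial effect. No statistical method applied downstream can repair this, which is the point made above.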

There are several statistical packages on the market for scientists and clinicians to use. However, these packages are quite advanced and need expert handling, just as one needs a driver's licence to use a vehicle safely on the public road. Vendors of such packages admit that their products are not always properly used (personal communications). The chosen algorithms need to be appropriate for the type of data to be analyzed. In addition, the same data entered into the same system may yield different outcomes on different occasions when the wrong type of result is asked for. Finally, subtle coding errors in the software may escape small-scale tests of script integrity, only to skew results when large-scale data are processed.


Project design and personalized medical care/stratified approaches

When all the above hurdles have been successfully cleared, we are still not quite there. Each individual differs from the next, and therefore each individual has a different tolerance or sensitivity to toxins and medicines. This makes biomarkers difficult to assess for following the progress of a disease, or the efficacy of a therapy: when a group of patients have all been treated in the same way but the individuals in the group are highly diverse, the data can be all over the place. Only when a group is defined by a certain genetic or environmental background is there sufficient homogeneity to assess a biomarker for that particular defined group.
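The gain from defining a group by a shared background can be sketched numerically: the pooled cohort shows a huge spread, while each stratum is tight. The genotype labels and measurements below are entirely hypothetical:

```python
from statistics import mean, stdev

# Toy cohort: biomarker response, with a hypothetical genetic stratifier.
patients = [
    ("variant-A", 2.1), ("variant-A", 2.3), ("variant-A", 1.9),
    ("variant-B", 7.8), ("variant-B", 8.2), ("variant-B", 7.5),
]

def by_stratum(cohort):
    """Summarize the biomarker per genetic stratum instead of pooling."""
    strata = {}
    for genotype, value in cohort:
        strata.setdefault(genotype, []).append(value)
    return {g: (mean(v), stdev(v)) for g, v in strata.items()}

pooled = [v for _, v in patients]
print(f"pooled: mean {mean(pooled):.1f}, sd {stdev(pooled):.1f}")
for g, (m, s) in by_stratum(patients).items():
    print(f"{g}: mean {m:.1f}, sd {s:.1f}")
```

The pooled standard deviation dwarfs the within-stratum spread, so a biomarker cut-off defined on the pooled group would fit neither subgroup, which is the argument for stratified assessment.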

It is like the chicken-and-egg (catch-22) paradigm: one has to start clinical trials in order to identify the non-responsive patients, and only then can one leave them out for proper validation of a new biomarker. However, proper validation demands positive and negative controls AND does not allow selecting only the convenient data. It is therefore no surprise that the search for proper clinical biomarkers is very challenging indeed.


Jan Voskuil, 2015