Manual for the Amberbio app
The Amberbio app allows scientists to analyze and visualize data sets, in particular biological data, on mobile devices. All calculations are performed directly on the device and all data is stored locally. The functionality should be evident from the user interface, and it is not necessary to read the manual to use the app. The manual gives a high-level overview of the app and explains some important details, such as mathematical definitions, that are not clear from the user interface.
This manual is available in the app and on the web at www.amberbio.com/manual.
Table of contents
This manual describes version 4 of the Amberbio app.
All data is stored locally on the device. All calculations are performed locally. Amber Biosciences does not have any access to user data.
Projects and data sets
The app handles projects and data sets. A data set is a set of values organized in a table with rows and columns. A project is a collection of one or more data sets. A project is always created with a data set called "The original data set". Other data sets within a project are created by the app using various methods such as normalization or sample removal.
The columns of a data set are called samples and the rows are called molecules. The values in the table represent an intensity, or some other quantity, for a sample-molecule pair. Data sets could represent biological information such as gene expression levels, protein abundances, microRNA expression levels, peptide abundances, or metabolite concentrations. Data sets do not need to be of a biological nature even though the app uses terminology from biology.
A project also contains factor, or grouping, information about the samples. A factor has one or more levels. An example of a factor is "Gender" with levels "male" and "female". Naturally, factors are shared for all data sets within a project.
A project can also contain extra information about the molecules. This information is called molecule annotations. An example could be the molecule annotation "Chromosome" where each molecule has a value such as "chromosome 21".
The active data set
Analysis is performed on the active data set. Selection of the active data set is done on the page Data Set Selection. The active project is the project to which the active data set belongs. Editing and adding factors and molecule annotations are done in the active project.
There is almost always an active data set and project. The only exceptions are when the app starts for the first time and when the active project is deleted. When a new data set is created, the app automatically makes the new data set active and jumps to the page Data Set Selection.
Import of data
Data import is used to create new projects and to add factor and molecule annotation information to projects. Import of data is a two step process. First, a file is imported into the app. Second, the file is read and parsed and the data is imported into a database kept by the app.
Import of data is handled by the page Import data. All imported files can be seen on this page. The app keeps all imported files until they are deleted by the user.
Import of files
The app can import files with extension "txt" or "sqlite". The "txt" files represent tables of data supplied by the user. A "sqlite" file is a database file that has been exported from the app at an earlier time. The "sqlite" files are used for backup and for transfer of projects from one device to another.
Files can be imported in two ways: either by "opening" the file in another app or by importing it from a cloud storage service.
Importing a file by "opening" it in another app is done by tapping the file in the other app and selecting the Amberbio app. A typical use case is to open an email attachment by tapping the attachment in the Mail app.
Files can be imported from a cloud storage service such as iCloud Drive, Dropbox, Box, or Google Drive. The import is done by tapping "Import new file" on the page Import data. The corresponding cloud storage app must itself be installed on the device.
Parsing import files
Imported files are shown on the page Import data. By tapping a file, the file is parsed and read into the database of the app. Files with extension "sqlite" are read automatically into projects. Files with extension "txt" can be used to create a new project, to import factors to the active project, or to import molecule annotations to the active project.
File format of txt files
The "txt" files contain tables of values. The rows, or lines, are separated by "\n", "\r", or "\r\n", and the cells are separated by "\t" or ",". The file extension should be "txt" in any case. The app tries to find the separator by first searching for tabs. If there are tabs, the app tries to read the file as a tab separated file. If there are no tabs, the file is read as a comma separated file. Tabs are never allowed within the cells. If there are commas within the cells, the file must be tab separated. The three types of "txt" files are described below.
The first file type contains the measurement values. This file is used to create a new project. It contains samples in the columns and molecules in the rows. The first row contains the sample names. The first column contains the molecule names. The values are written in the cells.
The decimal separator can be either point as in "12.98" or comma as in "12,98". If the decimal separator is comma, the file must be tab separated. Missing values must be written as "" (the empty string), NA, na, NaN, NAN, or nan. The upper left cell can contain anything. An example value table with two missing values is shown below.
|Anything|Sample 1|Sample 2|
|Molecule 1|12.98|NA|
|Molecule 2|nan|5.6|
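The rules above for decimal separators and missing values can be sketched as follows (illustrative only; the app's parser is not written in Python):

```python
# Tokens that the app accepts as missing values.
MISSING = {"", "NA", "na", "NaN", "NAN", "nan"}

def parse_cell(cell):
    # Returns None for a missing value, otherwise a float.
    # The decimal separator may be a point ("12.98") or a comma
    # ("12,98"); a decimal comma is only possible in tab separated files.
    cell = cell.strip()
    if cell in MISSING:
        return None
    return float(cell.replace(",", "."))

values = [parse_cell(c) for c in ["12.98", "12,98", "NA", ""]]
```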
The second file type contains factors for the samples. Factors can be imported into a project at any time and are not part of the initial creation of the project. The first row of the factor file contains sample names. The first column contains factor names. The values are the levels for the factor and sample. The upper left cell can be anything. The factor names must be distinct and the sample names can come in any order. All samples of the project must be present. Extra samples will be ignored. An example factor table is shown below.
|Anything|Sample 1|Sample 2|
|Gender|male|female|
|Time|morning|evening|
The third file type contains molecule annotations. It can be imported at any time and is not part of the initial creation of the project. The first row contains the name of the molecule annotation. The first column must contain all the molecule names. The order of the rows is arbitrary and extra molecules are ignored. An example molecule annotation table is shown below.
|Anything|Chromosome|
|Molecule 1|chromosome 21|
|Molecule 2|chromosome 13|
Result files
Result files are created by the app. The result files are figures and tables and have the file extension "txt", "pdf", or "png". The result files are kept on the page Result files. The result files can be sent by email, opened in another app, or exported to a cloud storage service.
Backup and sharing of projects
Projects can be exported to a database file on the page Export projects. The database file has the extension "sqlite". The file can be used for backup and transfer of projects to other devices. The file can be sent by email or exported to a cloud storage.
Name and emails
On the page User, a name and a list of emails can be typed. There are no user accounts in the app. The name is only used for comments in the result files and the project notes. The emails are used as suggested emails when a file is sent by email from the app. The actual destination emails can always be changed before sending the email.
The Anova test is performed on the selected levels of a factor. The selected levels are highlighted, and the Anova test is performed by tapping the blue factor name. For each molecule, a standard Anova test is performed by removing samples with missing values and calculating the F-statistic as the ratio of the between-groups variation to the within-groups variation. The p-value is the upper tail probability of the F-distribution.
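The F-statistic described above can be computed as in the following sketch (an illustration with example numbers, not the app's actual code; the p-value lookup in the F distribution is omitted):

```python
def anova_f(groups):
    # groups: one list of values per selected level, after removal
    # of samples with missing values.
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    # Between-groups variation (mean square), k - 1 degrees of freedom.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ms_between = ss_between / (k - 1)
    # Within-groups variation (mean square), n - k degrees of freedom.
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    ms_within = ss_within / (n - k)
    # The p-value is the upper tail of the F(k - 1, n - k) distribution.
    return ms_between / ms_within

f = anova_f([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])  # 13.5
```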
After selecting a factor, any number of pairwise tests between two levels can be selected in the table. The pairwise tests will be performed on all selected pairs and presented in a table. The t-test is a standard Student t-test with equal variances. The p-value from the Wilcoxon test, which is the same as the Mann-Whitney test, is calculated exactly for small sample sizes and by the Normal approximation for large sample sizes. The p-values are two-sided.
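The pooled-variance t-statistic used in the pairwise t-test can be sketched as follows (illustrative only; the two-sided p-value from the t distribution is omitted):

```python
import math

def t_statistic(a, b):
    # Standard Student t-test with equal (pooled) variance.
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ssa = sum((x - ma) ** 2 for x in a)
    ssb = sum((x - mb) ** 2 for x in b)
    pooled = (ssa + ssb) / (na + nb - 2)       # pooled variance
    se = math.sqrt(pooled * (1 / na + 1 / nb)) # standard error
    # The two-sided p-value comes from the t distribution
    # with na + nb - 2 degrees of freedom.
    return (ma - mb) / se

t = t_statistic([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```

For two levels, the square of this t-statistic equals the Anova F-statistic for the same two groups.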
A pairing factor must be selected. Samples with the same level for the pairing factor can be paired. An example of a pairing factor is "Patient" and the levels are the patient ids or names. For each patient, two or more samples are measured.
The comparison factor is the factor for which the levels will be compared. When two levels of the comparison factor are compared, the pairs of samples are defined as follows. For each pairing factor level, there must be exactly one sample for each of the two comparison levels with that particular pairing factor level. An example of a comparison factor is "Time" with the levels "morning" and "evening". In this case, a pairwise test will be performed between the evening and morning samples for each patient.
The t-test is performed by subtracting the values of the paired samples in the two selected comparison levels. The p-value is two-sided and tests the null hypothesis that the difference between the comparison levels is zero.
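The paired test above amounts to a one-sample t-test on the within-pair differences. A minimal sketch (illustrative only; the p-value lookup is omitted):

```python
import math

def paired_t(a, b):
    # a[i] and b[i] are the paired samples, e.g. the evening and
    # morning measurements for patient i. The test is a one-sample
    # t-test of the differences against a mean of zero.
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)
    # The two-sided p-value comes from the t distribution
    # with n - 1 degrees of freedom.
    return mean / math.sqrt(var / n)

t = paired_t([5.0, 6.0, 7.0], [4.0, 4.5, 5.0])
</imports>```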
For the selected factor, the levels with a numeric interpretation are chosen. A level can be converted to a number if it starts with, or is, a numeric value, such as "5 days" or "12.3". For each molecule, the samples with a numeric level and a non-missing value are used for the linear regression. The intercept and slope are calculated by least squares regression.
The p-value is an Anova p-value and tests the null hypothesis that the slope is zero, i.e. low p-values imply that there is a trend with a non-zero slope. The Anova test is performed by calculating an F-statistic as a ratio where the numerator is the variation between the fitted line and a simple mean value, and the denominator is the residual variation from the fitted line. A large F-statistic, and correspondingly low p-value, implies that the line is a much better fit than a single mean value.
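The numeric interpretation of levels and the regression F-statistic can be sketched as follows (an illustrative implementation with hypothetical helper names, not the app's actual code):

```python
import re

def numeric_level(level):
    # A level converts to a number if it starts with, or is, a
    # numeric value, such as "5 days" or "12.3"; otherwise None.
    m = re.match(r"[-+]?\d+(\.\d+)?", level.strip())
    return float(m.group()) if m else None

def regression(points):
    # Least squares fit of y = intercept + slope * x, plus the
    # F-statistic testing the null hypothesis slope = 0.
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    intercept = my - slope * mx
    # Numerator: variation of the fitted line around the simple mean;
    # denominator: residual variation from the fitted line.
    ss_model = sum((intercept + slope * x - my) ** 2 for x, _ in points)
    ss_resid = sum((y - intercept - slope * x) ** 2 for x, y in points)
    f = ss_model / (ss_resid / (n - 2)) if ss_resid > 0 else float("inf")
    return slope, intercept, f

slope, intercept, f = regression([(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 6.0)])
```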
Multiple hypothesis testing
Multiple hypothesis testing is performed using the Benjamini-Hochberg false discovery rate method. The columns named false discovery rate contain the q-values of the molecules.
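The Benjamini-Hochberg q-values can be computed as in the following sketch (an illustrative implementation; the app's internal code may differ):

```python
def bh_qvalues(pvalues):
    # Benjamini-Hochberg: with sorted p-values p(1) <= ... <= p(m),
    # the q-value for rank r is min over ranks j >= r of p(j) * m / j.
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running = min(running, pvalues[i] * m / rank)
        q[i] = running
    return q

q = bh_qvalues([0.01, 0.04, 0.03, 0.20])
```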
A histogram of the p-values can be seen by tapping "histogram". High frequencies for the low p-values imply a significant difference between the groups.
The supervised classifiers separate samples into two or more levels. A classifier is trained on a training set and afterwards tested on a test set.
There are three modes in which the classifiers in the app can be used: a fixed training set, leave-one-out cross validation, and k-fold cross validation.
When a fixed training set is chosen, the classifier is tested on all remaining samples, both those with actual levels matching the levels of the classifier and those with actual levels different from the levels known to the classifier. The latter type of samples cannot be used to estimate the predictive power of the classifier. However, it is often still useful to know the predicted levels of these samples. For instance, if the classifier has the levels "sick" and "healthy", it might be useful to classify "borderline" samples into either "healthy" or "sick".
For leave-one-out cross validation, each sample is left out and tested with the remaining samples as the training set. In that way, each sample obtains one classified level which can be compared to the real level. Leave-one-out classification only uses the samples with levels known to the classifier.
For k-fold cross validation, the samples are divided randomly into k subsets of almost equal size; if the number of samples is not divisible by k, some subsets will be one sample larger than other subsets. Each subset is tested on a classifier trained by the remaining samples. In that way, each sample obtains one classified level which can be compared to the real level. Cross validation only uses the samples with levels that are known to the classifier.
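The random division into k subsets of almost equal size can be sketched as follows (illustrative; the app's partitioning code may differ):

```python
import random

def k_fold_subsets(samples, k, seed=None):
    # Randomly divide the samples into k subsets of almost equal
    # size; if the number of samples is not divisible by k, some
    # subsets are one sample larger than others.
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

folds = k_fold_subsets(range(10), 3, seed=1)  # sizes 4, 3, 3
```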
In the special case where k is equal to the number of samples, k-fold cross validation is identical to leave-one-out cross validation. Leave-one-out cross validation has its own selection in the app because of its importance.
k nearest neighbor classification
Given a training set and a test sample, the k training samples with the smallest distances to the test sample are found. The levels of these k nearest samples are considered, and if there is a single level that constitutes a majority of the k samples, that level is chosen as the predicted level for the test sample. If no level obtains a majority alone, the test sample is considered unclassified. If the classifier has only two levels and k is odd, there is always one level with a majority. In other cases, samples might be unclassified.
The distance measure is Euclidean distance. Only molecules without missing values in the combined training set and test set are included in the distance calculation.
In case of ambiguity in the k nearest neighbors due to ties in the distances, the classifier makes an arbitrary, but non-random, choice. This situation practically never occurs in real biological data.
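The majority rule, including the unclassified case, can be sketched as follows (an illustration, not the app's actual code):

```python
import math
from collections import Counter

def knn_predict(training, test_point, k):
    # training: list of (point, level) pairs; the points contain only
    # the molecules without missing values in training and test data.
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    # Ties in the distances are broken by the stable sort order,
    # an arbitrary but non-random choice.
    nearest = sorted(training, key=lambda t: distance(t[0], test_point))[:k]
    level, count = Counter(l for _, l in nearest).most_common(1)[0]
    # A level must hold a strict majority of the k neighbors;
    # otherwise the test sample is unclassified (None).
    return level if count > k / 2 else None

pred = knn_predict([([0.0], "A"), ([1.0], "A"), ([10.0], "B")], [0.5], 3)  # "A"
```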
Support vector machine classification
The support vector machine (SVM) classification uses the software library LIBSVM: "Chih-Chung Chang and Chih-Jen Lin, LIBSVM : a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1--27:27, 2011". The LIBSVM software is available at http://www.csie.ntu.edu.tw/~cjlin/libsvm. The Amberbio app uses the LIBSVM library to train an SVM classifier and calculate the decision values and predicted classes for test samples.
Copyright for LIBSVM
Copyright (c) 2000-2014 Chih-Chung Chang and Chih-Jen Lin. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
3. Neither name of copyright holders nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
The SVM classifier in the app can compare two or more levels. A standard SVM classifier can directly classify two levels. Comparisons between K levels are performed by the LIBSVM library by performing all K(K - 1)/2 pairwise comparisons and selecting the level that wins the most pairwise comparisons.
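The pairwise voting scheme can be illustrated as follows; `toy_binary` is a hypothetical stand-in for the K(K - 1)/2 binary SVM classifiers trained by LIBSVM:

```python
from collections import Counter
from itertools import combinations

def one_vs_one_predict(levels, binary_predict, sample):
    # binary_predict(a, b, sample) returns the winning level of the
    # pairwise classifier for levels a and b.
    votes = Counter(binary_predict(a, b, sample)
                    for a, b in combinations(levels, 2))
    # The level that wins the most pairwise comparisons is chosen.
    return votes.most_common(1)[0][0]

def toy_binary(a, b, sample):
    # Hypothetical pairwise rule: the true level wins its comparisons.
    return a if sample == a else b

pred = one_vs_one_predict(["x", "y", "z"], toy_binary, "y")  # "y"
```

For K = 3 levels there are 3 pairwise comparisons, matching K(K - 1)/2.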
It is recommended to use logarithmic values for the SVM classifier. No scaling or other pre-processing is performed by the app.
For the binary classifier (two levels), a decision value is calculated. Samples with positive decision values are classified in one level and those with negative decision values in the other level. Using a variable threshold instead of a fixed zero threshold, a whole curve of classifiers is obtained. Plotting the true positive rate versus the false positive rate for variable thresholds leads to the receiver operating characteristic (ROC) curve. The area under the curve is a measure of the success of the classifier. Good classifiers have areas close to 1, whereas bad classifiers have areas close to 0.5.
The ROC curve only applies to the binary classifier. Call the two levels A and B. The true positive rate is defined as the fraction of samples with an actual level of A that are predicted to have level A. The false positive rate is defined as the fraction of samples of level B that are classified as level A. The true positive rate is also called the sensitivity. The false positive rate is equal to 1 - specificity.
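The ROC construction can be sketched as follows (an illustrative computation over example decision values; the app's actual plotting code differs):

```python
def roc_points(decision_values, actual_levels, positive="A"):
    # Sweep the threshold over the decision values; at each threshold
    # a sample is predicted positive if its decision value is at least
    # the threshold. Returns (false positive rate, true positive rate)
    # points of the ROC curve.
    pos = sum(1 for l in actual_levels if l == positive)
    neg = len(actual_levels) - pos
    points = []
    for t in [float("inf")] + sorted(set(decision_values), reverse=True):
        tp = sum(1 for d, l in zip(decision_values, actual_levels)
                 if d >= t and l == positive)
        fp = sum(1 for d, l in zip(decision_values, actual_levels)
                 if d >= t and l != positive)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    # Area under the ROC curve by the trapezoidal rule.
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

pts = roc_points([2.0, 1.0, -1.0, -2.0], ["A", "A", "B", "B"])
```

In this example the decision values separate the two levels perfectly, so the area under the curve is 1.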
The available kernels are the linear kernel and the radial basis function (RBF) kernel of Gaussian functions. The parameter C is the coefficient of the error term in the objective function. The parameter gamma is the scale parameter in the exponent of the Gaussian function. A proper explanation can be found in any article or book about support vector machines. The app uses the same terminology as LIBSVM.
The linear kernel with default parameter is usually a good choice for biological data with many molecules and relatively few samples.
Only molecules without missing values in the training and test sets are used for the SVM.
k means clustering
K means clustering is an unsupervised clustering algorithm that divides the samples into k groups. The number of clusters, k, is chosen by the user. The algorithm employed by the app is probabilistic and might give different results in different runs on the same data set. However, the results will be similar, and the differences are explained by movement of samples whose group membership is ambiguous. Since there is no unique biologically correct clustering in any case, the probabilistic nature of the algorithm is acceptable.
The algorithm is almost identical to the standard Lloyd algorithm. The steps are described below.
- Iterate over the entire clustering algorithm below, and take the clustering with the smallest sum of squared deviations as the final result. The number of iterations is dynamic and depends on the duration of one iteration. The app should never become unresponsive by being stuck in a long computation.
- Assign the samples to random clusters.
- Resolve empty clusters by moving samples from the largest cluster to empty clusters.
- Iterate the following steps until the clusters are constant or a maximum limit is reached. In practice, the maximum limit is almost never needed.
- Calculate centroids as the average point in each cluster.
- Reassign samples to clusters. A sample belongs to the cluster whose centroid is closest.
- Resolve empty clusters by moving samples from the largest cluster to the empty clusters.
- Calculate the sum of squared deviations of samples from their cluster centroids.
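The inner steps above can be sketched as follows (a simplified illustration that omits the outer repetition keeping the clustering with the smallest sum of squared deviations; not the app's actual code):

```python
import random

def k_means(points, k, iterations=100, seed=0):
    # points: list of sample vectors (lists of floats).
    rng = random.Random(seed)
    # Assign the samples to random clusters.
    labels = [rng.randrange(k) for _ in points]
    for _ in range(iterations):
        # Resolve empty clusters by moving samples from the largest cluster.
        for c in range(k):
            if c not in labels:
                largest = max(range(k), key=labels.count)
                labels[labels.index(largest)] = c
        # Calculate centroids as the average point in each cluster.
        dim = len(points[0])
        centroids = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            centroids.append([sum(p[d] for p in members) / len(members)
                              for d in range(dim)])
        # Reassign each sample to the cluster whose centroid is closest.
        new_labels = [min(range(k),
                          key=lambda c: sum((x - y) ** 2
                                            for x, y in zip(p, centroids[c])))
                      for p in points]
        if new_labels == labels:
            break  # clusters are constant
        labels = new_labels
    return labels

labels = k_means([[0.0], [0.1], [10.0], [10.1]], 2)
```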
Sammon map
The Sammon map is a projection of the high dimensional samples to a low dimensional space. The map was invented by Sammon in 1969 (J. W. Sammon Jr. A Nonlinear Mapping for Data Structure Analysis. IEEE Transactions on Computers, vol. C-18, no. 5, pp. 401-409, May 1969). In this app the low dimensional space is two or three dimensional such that the samples can be visualized. The Sammon map attempts to preserve the pairwise distances between samples as much as possible.
The distances in the high dimensional space are Euclidean distances using only the molecules without missing values for the selected samples. The algorithm is described by Kohonen (T. Kohonen. Self-Organizing Maps. Springer, 3rd edition, 2001) and is iterative. The number of iterations depends on the time of one iteration; the app will never go into a long computation.
At each iteration, the algorithm loops over all pairs of samples and updates the two samples. The update moves the two points directly away from each other if their distance is too small, and moves them towards each other if it is too large. The size of the movement depends on the current distances and a variable multiplier. The multiplier starts at 0.4 and is reduced at each iteration, ending at a small value.
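One plausible form of the pairwise update is sketched below; the exact formula used by the app follows Kohonen's description and may differ in details:

```python
import math

def sammon_pair_update(yi, yj, d_high, multiplier):
    # Moves the two projected points yi and yj directly away from each
    # other if their current distance is smaller than the original high
    # dimensional distance d_high, and towards each other if it is larger.
    d_low = math.sqrt(sum((a - b) ** 2 for a, b in zip(yi, yj)))
    factor = multiplier * (d_high - d_low) / d_low
    # Split the correction symmetrically between the two points.
    delta = [factor * (a - b) / 2 for a, b in zip(yi, yj)]
    new_yi = [a + d for a, d in zip(yi, delta)]
    new_yj = [b - d for b, d in zip(yj, delta)]
    return new_yi, new_yj

# The points are 1 apart but should be 2 apart, so they move apart.
yi2, yj2 = sammon_pair_update([0.0, 0.0], [1.0, 0.0], 2.0, 0.4)
```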
The purpose of the Sammon map is to obtain a visualization of the samples and hopefully gain some biological insight. The Sammon map is an alternative to the PCA map. The Sammon map has the property that samples close to each other remain close after the projection, which is not guaranteed by the PCA map.