Documentation
The COREMIC tool is a web tool hosted on Google App Engine and written in
Python. For details about the algorithm used in the tool, please refer to and
cite:
Rodrigues RR, Rodgers NC, Wu X, Williams MA. (2018) COREMIC: a web-tool to
search for a niche associated CORE MICrobiome. PeerJ 6:e4395
https://doi.org/10.7717/peerj.4395
A video tutorial is available at https://oregonstate.box.com/v/coremic-tutorial
Input files
The tool needs exactly two input files.
Datafile
This file contains abundance information for OTUs across samples. Please refer here for details of the BIOM format and how to convert a traditional tab-delimited OTU table to a BIOM format file. Please make sure the header “taxonomy” is present. If the OTUs specified under the taxonomy header are not unique, rows with the same OTU label will be combined, with their values summed.
Most often the datafile will be an output from the OTU picking step from QIIME or other tools. A QIIME 1.8 compatible sample BIOM 1.0 format file looks like:
{
  "id": "None",
  "format": "Biological Observation Matrix 1.0.0",
  "format_url": "http://biom-format.org",
  "type": "OTU table",
  "generated_by": "BIOM-Format 1.3.1",
  "date": "2016-01-14T19:49:54.900111",
  "matrix_type": "sparse",
  "matrix_element_type": "float",
  "shape": [678, 59],
  "data": [[0,0,4.0], [0,1,2.0], [0,6,1.0], [0,7,10.0],
           ...
           [677,55,18.0], [677,56,33.0], [677,57,8.0], [677,58,14.0]],
  "rows": [{"id": "1", "metadata": {"taxonomy": "k__Bacteria"}},
           {"id": "2", "metadata": {"taxonomy": "k__Bacteria;p__Acido...;f__;g__;s__"}},
           {"id": "3", "metadata": {"taxonomy": "k__Bacteria;p__Basido...;f__;g__;s__"}},
           {"id": "4", "metadata": {"taxonomy": "k__Bacteria;p__Proteo...;f__;g__;s__"}},
           ...
           {"id": "675", "metadata": {"taxonomy": "k__Bacteria;p__WS3...;f__XYZ;g__;s__"}},
           {"id": "676", "metadata": {"taxonomy": "k__Bacteria;p__WS3...;f__PRR;g__;s__"}},
           {"id": "677", "metadata": {"taxonomy": "k__Bacteria;p__WS3...;f__;g__;s__"}},
           {"id": "678", "metadata": {"taxonomy": "Unassigned"}}],
  "columns": [{"id": "SAMP1", "metadata": null},
              {"id": "SAMP2", "metadata": null},
              {"id": "SAMP3", "metadata": null},
              ...
              {"id": "SAMP58", "metadata": null},
              {"id": "SAMP59", "metadata": null}]
}
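To make the sparse layout above concrete, here is a minimal Python sketch that expands a BIOM 1.0 JSON document into a dense OTU-by-sample table. The function name biom_to_dense and the tiny example table are hypothetical and are not part of COREMIC. The sketch assumes "matrix_type" is "sparse" and that "taxonomy" is stored as a plain string, as in the sample above.

```python
import json

def biom_to_dense(biom_text):
    """Expand a sparse BIOM 1.0 JSON document into a dense OTU-by-sample table."""
    table = json.loads(biom_text)
    n_rows, n_cols = table["shape"]
    dense = [[0.0] * n_cols for _ in range(n_rows)]
    # Each sparse entry is [row_index, column_index, value]; missing entries stay 0.
    for row, col, value in table["data"]:
        dense[row][col] = value
    taxonomy = [r["metadata"]["taxonomy"] for r in table["rows"]]
    samples = [c["id"] for c in table["columns"]]
    return taxonomy, samples, dense

# Hypothetical miniature table with 2 OTUs and 3 samples.
example = json.dumps({
    "matrix_type": "sparse",
    "shape": [2, 3],
    "data": [[0, 0, 4.0], [0, 2, 1.0], [1, 1, 7.0]],
    "rows": [
        {"id": "1", "metadata": {"taxonomy": "k__Bacteria"}},
        {"id": "2", "metadata": {"taxonomy": "Unassigned"}},
    ],
    "columns": [
        {"id": "SAMP1", "metadata": None},
        {"id": "SAMP2", "metadata": None},
        {"id": "SAMP3", "metadata": None},
    ],
})
taxonomy, samples, dense = biom_to_dense(example)
```

Each row of dense lines up with the entry at the same position in taxonomy, and each column with the entry in samples.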
Multiple datafiles can be selected for processing; they will automatically be combined, removing any OTUs that are not present in all the datafiles. This feature requires a browser that supports HTML5 multiple file upload.
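The two combination rules described in this section (summing rows that share an OTU label, and keeping only OTUs found in every datafile) could be sketched in Python roughly as follows. This is an illustration under assumptions, not the tool's actual code: the function names are hypothetical, and "present in all the datafiles" is taken to mean that the label appears as a row in each file.

```python
def sum_duplicate_otus(taxonomy, rows):
    """Combine rows sharing a taxonomy label by summing their values per sample."""
    combined = {}
    for label, values in zip(taxonomy, rows):
        if label in combined:
            combined[label] = [a + b for a, b in zip(combined[label], values)]
        else:
            combined[label] = list(values)
    return combined

def combine_datafiles(tables):
    """Join tables side by side, keeping only OTU labels that occur in every table."""
    shared = set(tables[0])
    for table in tables[1:]:
        shared &= set(table)
    # Concatenate each shared OTU's sample values in datafile order.
    return {otu: [v for table in tables for v in table[otu]] for otu in sorted(shared)}

# Hypothetical data: the first file has two samples, the second has one.
first = sum_duplicate_otus(
    ["k__Bacteria", "Unassigned", "k__Bacteria"],
    [[1.0, 2.0], [0.0, 5.0], [3.0, 0.0]],
)
second = sum_duplicate_otus(["k__Bacteria", "k__Archaea"], [[6.0], [9.0]])
merged = combine_datafiles([first, second])
```

Here "Unassigned" and "k__Archaea" are dropped from merged because each appears in only one of the two files.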
Group file
This tab-delimited file contains information about the samples and the groups to which they belong. The "#SampleID" column and a column for the group of interest are required. Other columns will be ignored. Most often this file will be an input to the OTU picking step from QIIME or other tools. A QIIME 1.8 compatible sample group file looks like:
#SampleID	Person
SAMP1	Good
SAMP2	Bad
SAMP3	Neutral
...
SAMP58	Neutral
SAMP59	Bad
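As a rough illustration of how a group file like this separates samples into the interest group and the out group, the Python sketch below reads tab-delimited text and splits the sample IDs on a chosen column. The function name split_samples is hypothetical. The comma-separated value handling follows the description of the Interest Group option, but the exact matching rules used by the tool are an assumption.

```python
import csv
import io

def split_samples(group_text, factor, interest_values):
    """Return (interest_group, out_group) sample ID lists from a tab-delimited group file."""
    # Values may be given as "Good, Neutral": split on commas, drop surrounding whitespace.
    wanted = {value.strip() for value in interest_values.split(",")}
    reader = csv.DictReader(io.StringIO(group_text), delimiter="\t")
    interest, out = [], []
    for row in reader:
        # Samples whose factor value is not listed fall into the out group.
        target = interest if row[factor] in wanted else out
        target.append(row["#SampleID"])
    return interest, out

group_text = "#SampleID\tPerson\nSAMP1\tGood\nSAMP2\tBad\nSAMP3\tNeutral\n"
interest, out = split_samples(group_text, "Person", "Good, Neutral")
```

With these inputs, SAMP1 and SAMP3 form the interest group and SAMP2 forms the out group.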
Options
These are some of the options that need to be provided:
- Name for output/label
- This helps you keep track of what this run represents. It will be included in the name of all attached result files and in the subject of all emails about your run.
- Factor
- Enter the exact string of the factor with which your interest group is identified. For example, to calculate the core microbiome of “Good” people under the “Person” column, enter “Person”. This should be one of the column headers in your group file.
- Interest Group
- Enter the exact string of the value, under the column specified as the factor, that identifies the interest group. For example, to calculate the core microbiome of “Good” people under the “Person” column, enter “Good”. All samples not matching this value in the specified column will be considered part of the out group. Multiple values may be specified, separated by a comma and optional whitespace; for example, to use all people who are either “Good” or “Neutral” as the interest group, enter “Good, Neutral”.
- Interest Group Name
- A plain-English name for the interest group. This will be used in generating the output files.
- Out Group Name
- The same as the interest group name, but for the out group.
- Include Out Group Results
- Selecting this will cause the analysis to be automatically re-run with the interest group and out group definitions switched, and will attach the results from both runs to the results email. For example, if the interest group was specified as “Good” under the “Person” factor and this was checked, the results would also include the results of running the analysis with “Neutral, Bad” as the interest group.
- Maximum Adjusted p-Value
- The maximum p-value to keep after adjustment for multiple testing, i.e. the highest allowable probability of a false positive. The exact interpretation of this value will depend on which p-value adjustment method is used.
- Minimum Fractional Presence
- The minimum portion of the interest group samples that an OTU must be present in to be considered a part of the core. This value should be between 0 and 1. A value of 0.9 would mean that an OTU must be present in at least 90% of the interest group samples to be considered for the core.
- Maximum Out-Group Fractional Presence
- The maximum portion of the out group samples that an OTU can be present in and still be considered a part of the core for the interest group. This value should be between 0 and 1. A value of 0.75 would mean that an OTU must be present in less than 75% of the out group samples to be considered for the core.
- Make Abundance Relative
- If this is checked, all values in the input datafile will be converted from absolute abundance values to relative abundance values. The values will be adjusted such that the abundances for each sample sum to 1. Note that if this is checked, the Minimum Abundance should be specified as a relative (0 to 1) abundance as well.
- Quantile Normalize
- If this is checked, the columns of the input data table will be quantile normalized before processing. This will occur after the conversion to relative abundance if both are selected. Ties are broken in an arbitrary but deterministic fashion.
- Minimum Abundance
- The minimum abundance a measurement must have to be considered present. This can be a relative or absolute abundance, depending on which the datafile uses. The default is to use a minimum abundance of zero, i.e. an OTU is considered to be present in a sample if it has a non-zero abundance. This option can be helpful in cases where noise can result in small, non-zero readings for measurements that should, ideally, be zero.
- p-Value Adjustment Method
- The method used to adjust the results to compensate for multiple testing. Bonferroni and Bonferroni-Holm control the probability that the results contain one or more false positives, keeping it within the specified threshold. Benjamini-Hochberg controls the expected proportion of false discoveries among the results.
- The email address where you want your results emailed. Your results are NOT stored; you will only get these results via email.
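To illustrate how the three p-value adjustment methods named above differ, here is a short Python sketch of their standard textbook definitions. It is not taken from the COREMIC source, and the tool's own implementation may differ in details; the function names and example p-values are hypothetical.

```python
def bonferroni(pvals):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def holm(pvals):
    """Bonferroni-Holm step-down adjustment (controls the family-wise error rate)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # ascending p-values
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The multiplier shrinks from m down to 1; the running max keeps results monotone.
        running_max = max(running_max, min(1.0, (m - rank) * pvals[i]))
        adjusted[i] = running_max
    return adjusted

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg step-up adjustment (controls the false discovery rate)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)  # descending
    adjusted = [0.0] * m
    running_min = 1.0
    for k, i in enumerate(order):
        rank = m - k  # 1-based rank of this p-value in ascending order
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values for four OTUs.
pvals = [0.01, 0.04, 0.03, 0.005]
bonf = bonferroni(pvals)
holm_adj = holm(pvals)
bh = benjamini_hochberg(pvals)
```

An OTU would then be kept when its adjusted p-value does not exceed the Maximum Adjusted p-Value. Benjamini-Hochberg is generally the least conservative of the three.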
Output Files
When the tool has finished running, the results will be emailed to the address that was given. The results of the analysis are attached to the email as a .tsv file, which can be opened in a number of programs including Excel. This file will contain the OTUs that were found to be in the core, the adjusted and non-adjusted p-value for each OTU, and the presence threshold it was found at.
Additionally, for each significance threshold a graph is generated showing the OTUs found to be significant at that threshold and their average abundance across the samples in the interest group and the out group. Because OTU names are often too long to fit on the graph, the bars are labeled with integers, and a .tsv file is included for each graph relating the integer labels to the OTU names. This .tsv file also includes the presence fraction in the interest and out groups for each OTU. The graphs are provided as .svg files to allow enlargement and editing as necessary.
Method Details
Currently, the tool uses presence/absence data to identify core microbiome OTUs. A one-tailed Fisher's exact test is used to filter the data to OTUs that are significantly more present in the interest group than in the out group. After this, the results are filtered again based on fractional presence in the interest group and out group, yielding the final results. Future versions of this tool will include more complicated algorithms.
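The two steps described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the tool's implementation. The function names are hypothetical, raw p-values are compared to the threshold (multiple-testing adjustment is left out for brevity), and the comparison operators follow the wording of the option descriptions ("at least" for the interest group, "less than" for the out group). It needs Python 3.8 or later for math.comb.

```python
from math import comb

def fisher_greater(in_present, in_absent, out_present, out_absent):
    """One-tailed Fisher's exact test: probability of seeing at least
    in_present presences in the interest group by chance."""
    n_in = in_present + in_absent
    n_out = out_present + out_absent
    total_present = in_present + out_present
    denom = comb(n_in + n_out, total_present)
    upper = min(n_in, total_present)
    # Sum the hypergeometric probabilities of the observed table and more extreme ones.
    return sum(
        comb(n_in, k) * comb(n_out, total_present - k)
        for k in range(in_present, upper + 1)
    ) / denom

def find_core(abundance, interest, out, max_p, min_frac, max_out_frac, min_abundance=0.0):
    """abundance maps OTU -> {sample: value}. Returns (OTU, p-value) pairs for the core."""
    core = []
    for otu, by_sample in abundance.items():
        # An OTU counts as present when its abundance exceeds the minimum abundance.
        a = sum(by_sample[s] > min_abundance for s in interest)
        c = sum(by_sample[s] > min_abundance for s in out)
        p = fisher_greater(a, len(interest) - a, c, len(out) - c)
        if p <= max_p and a / len(interest) >= min_frac and c / len(out) < max_out_frac:
            core.append((otu, p))
    return core

# Hypothetical abundances: otuA occurs only in the interest group, otuB everywhere.
abundance = {
    "otuA": {"S1": 5, "S2": 3, "S3": 4, "S4": 0, "S5": 0, "S6": 0},
    "otuB": {"S1": 2, "S2": 2, "S3": 1, "S4": 7, "S5": 1, "S6": 3},
}
core = find_core(abundance, ["S1", "S2", "S3"], ["S4", "S5", "S6"],
                 max_p=0.1, min_frac=0.9, max_out_frac=0.75)
```

In this toy example only otuA is returned: it is present in all three interest group samples and in none of the out group samples, while otuB is present everywhere and so shows no enrichment.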
Sample Data
Two sample datafile-groupfile pairs are provided:
- Switchgrass data that has been assembled from Jesus et al. 2016 and from Rodrigues et al. 2017. Datafile and groupfile. Additionally, the datafile is available as a tsv file here.
- Invasive fungi data from Rodrigues et al. 2015. Datafile and groupfile.
Post Processing
The results are given in TSV form so that they can be easily imported into any other processing you may wish to do. This format can be imported into Excel and is easy to parse from code.
Additionally, we have created a script to visualize the core OTUs taxonomically by generating a tree of them. This tree shows the core of both the interest and the out group with the p-value and OTU name for each OTU. The script can be downloaded here. It requires that ete3 be installed. Run "visualize_core.py -h" for help.
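As a small illustration of parsing TSV results from code, the Python sketch below reads tab-separated text into a list of dictionaries keyed by the header row. The function name read_tsv, the column names, and the values are hypothetical placeholders; the real column names come from the header of your own results file.

```python
import csv
import io

def read_tsv(text):
    """Parse TSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))

# Hypothetical content; substitute the text of an actual results file.
sample = "OTU\tp-value\nk__Bacteria;p__Example\t0.003\n"
rows = read_tsv(sample)
```

All values are read as strings, so numeric columns need an explicit conversion such as float(row["p-value"]).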
FAQs
How long does it take to run?
When run on the provided sample data, which has 59 samples and 678 OTUs, it takes about a minute to run on the default settings, looking at the interest group "Sw" under the "Plant" column. It may take longer for your data to run in some cases, such as when the dataset is significantly larger or when we suddenly receive a large number of analysis requests.
What browsers are supported?
All of the processing is done on our server, so all browsers should be able to use this tool. The form does use some HTML5 features to help you make sure the information you enter is correct. These are not supported in some older browsers such as Internet Explorer 8 and are only partially supported in Safari, but they are used only to help validate user-entered data and are not necessary for the tool to function.
Why didn't I get my results?
If you have waited a long time (over thirty minutes) and still haven't received your results, check your spam folder to make sure they didn't end up there. If your results are not in your spam folder either, send us an email with your input parameters and files, as well as when you submitted your task, and we will try to figure out what went wrong.