uvhunter Help

Overview of the whole pipeline for Human Respiratory viruses classification

Dataset: 321,552 records in total. In each round of 10-fold cross-validation, 90% of the records are used as the training set and 10% for validation.

The model is constructed in the following steps (Fig.1):

  1. Use homology search to select the sequences that belong to Human Respiratory viruses, under the criteria identity > 80% and E-value < 1.0E-5.
  2. Apply word2vec on 3-grams to convert each sequence, for each reading frame, into 100 vectors that serve as inputs.
  3. After convolution, max-pooling, flattening and a fully connected layer, the input is converted into one of 46 label types (Fig.2).
Fig.1 - The whole pipeline for Respiratory classification (flowchart of this system)

Fig.2 - The framework of the CNN model on Respiratory classification
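
The 3-gram step can be illustrated with a minimal sketch. It only shows how a sequence might be split into 3-letter words per reading frame before the words are passed to word2vec; it does not train any embedding. The function name `kmers_per_frame`, the use of non-overlapping words, and the restriction to the three forward frames (offsets 0, 1, 2) are assumptions made for illustration, not details confirmed by this page.

```python
def kmers_per_frame(sequence, k=3, frames=(0, 1, 2)):
    """Split a sequence into non-overlapping k-letter words for each reading frame.

    Assumption: frames are the forward offsets 0, 1 and 2; the real pipeline
    may define reading frames differently.
    """
    seq = sequence.upper()
    words = {}
    for offset in frames:
        remaining = len(seq) - offset
        # Ignore the incomplete word at the end of the frame, if any.
        usable = remaining - remaining % k
        words[offset] = [seq[i:i + k] for i in range(offset, offset + usable, k)]
    return words


# Hypothetical toy sequence, not taken from the dataset.
frames = kmers_per_frame("ATGGCGTACC")
for offset, tokens in frames.items():
    print(offset, tokens)
```

Each list of words could then be treated as a "sentence" for a word2vec implementation, which maps every 3-gram to a numeric vector.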

We hold out 5% of the samples for validation during training. If the validation curve (orange line) deviates strongly from the training curve (blue line), the model may be over-fitted (Fig.3).

Fig.3 - The performance of loss and accuracy on Respiratory classification (panels: in-train loss vs. epoch, in-train accuracy vs. epoch)

Finally, we run a 10-fold cross-validation test to check whether the model is robust enough (Table 1).


Table 1 - Estimation of loss and accuracy in each CV test
No. Loss Accuracy
1 0.1619 0.9002
2 0.2766 0.8935
3 0.2109 0.9115
4 0.1880 0.8978
5 0.2261 0.8837
6 0.1748 0.8991
7 0.2177 0.8794
8 0.2225 0.9046
9 0.2314 0.9028
10 0.1807 0.8881

Avg. loss = 0.2091 ± 0.0320
Avg. accuracy = 90.02 ± 0.92%
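
As a small worked example, the summary of the loss column can be recomputed from the ten fold values in Table 1. The sketch below uses only the Python standard library. The helper name `mean_and_spread` is hypothetical, and the choice of the population standard deviation is an assumption; it appears to be the variant that reproduces the reported spread of the loss.

```python
import statistics

# Loss values of the ten folds, copied from Table 1.
losses = [0.1619, 0.2766, 0.2109, 0.1880, 0.2261,
          0.1748, 0.2177, 0.2225, 0.2314, 0.1807]


def mean_and_spread(values):
    """Return the mean and the population standard deviation of the values."""
    return statistics.fmean(values), statistics.pstdev(values)


avg_loss, sd_loss = mean_and_spread(losses)
print(f"Avg. loss = {avg_loss:.4f} ± {sd_loss:.4f}")
```

The same helper can be applied to the accuracy column.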

Evaluation - Micro Average and Macro Average Performance

We have 46 label types, but the data are highly unbalanced. To estimate the accuracy of the model, we count how often the model gives the right answer among the samples of one type, which yields the "reliability" of that type. The average of these per-type "reliability" values is the macro average accuracy of the model (Table 2). The performance for each of the 46 types is listed in Table 3.

Table 2 - The evaluation of Average Performance
Evaluation method Precision Recall F1-score
Macro_Average 0.8419 0.7689 0.8037
Micro_Average 0.8629 0.8629 0.8629
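
The difference between the two averages can be sketched in a few lines of Python. The example below uses the TP, FP and FN counts of three categories from Table 3 only as illustrative input, so its averages will not match Table 2, which covers all 46 types. The function names are hypothetical. The macro F1 is computed here as the harmonic mean of macro precision and macro recall, which is an assumption that appears consistent with the values in Table 2; averaging the per-type F1 scores is another common definition.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_average(counts):
    """Average the per-type precision and recall, giving every type equal weight."""
    scores = [precision_recall_f1(tp, fp, fn) for tp, fp, fn in counts.values()]
    precision = sum(s[0] for s in scores) / len(scores)
    recall = sum(s[1] for s in scores) / len(scores)
    # Assumption: macro F1 as the harmonic mean of macro precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def micro_average(counts):
    """Pool the counts of all types first, so large types dominate the result."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return precision_recall_f1(tp, fp, fn)


# (TP, FP, FN) for three categories copied from Table 3.
counts = {
    "SARS-CoV-2": (816, 5, 1),
    "Rhinovirus.B": (23, 19, 140),
    "MERS-CoV": (97, 0, 6),
}

print("macro:", macro_average(counts))
print("micro:", micro_average(counts))
```

When every sample has exactly one true label and one predicted label, each misclassification counts once as a false positive and once as a false negative, so over all types the micro-averaged precision, recall and F1 coincide, as they do in Table 2.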

Table 3 - The evaluation of 46-type Performance
Category TP TN FP FN Precision Recall F1
SARS-CoV-2 816 31336 5 1 0.9939 0.9988 0.9963
Influenza.A.virus_HA 245 31784 63 66 0.7955 0.7878 0.7916
Influenza.A.virus_NA 203 31740 118 97 0.6324 0.6767 0.6538
Influenza.A.virus_MP 35 32079 15 29 0.7 0.5469 0.614
Influenza.A.virus_NS 91 32021 25 21 0.7845 0.8125 0.7982
Influenza.A.virus_NP 103 32025 11 19 0.9035 0.8443 0.8729
Influenza.A.virus_PA 105 31993 28 32 0.7895 0.7664 0.7778
Influenza.A.virus_PB2 134 31972 35 17 0.7929 0.8874 0.8375
Influenza.A.virus_PB1 109 31979 40 30 0.7315 0.7842 0.7569
Influenza.B.virus_HA 2205 29901 20 32 0.991 0.9857 0.9883
Influenza.B.virus_NA 1651 30376 26 105 0.9845 0.9402 0.9618
Influenza.B.virus_NS 1313 30818 10 17 0.9924 0.9872 0.9898
Influenza.B.virus_MP 1204 30947 3 4 0.9975 0.9967 0.9971
Influenza.B.virus_NP 1177 30957 12 12 0.9899 0.9899 0.9899
Influenza.B.virus_PA 1169 30973 10 6 0.9915 0.9949 0.9932
Influenza.B.virus_PB2 1130 29561 730 737 0.6075 0.6052 0.6064
Influenza.B.virus_PB1 1140 29555 737 726 0.6074 0.6109 0.6091
Human.orthopneumoviru 2668 29397 44 49 0.9838 0.982 0.9829
Enterovirus.A 2553 29173 161 271 0.9407 0.904 0.922
Enterovirus.C 720 31112 140 186 0.8372 0.7947 0.8154
Enterovirus.B 2648 28256 1187 67 0.6905 0.9753 0.8085
Enterovirus.D 301 31670 122 65 0.7116 0.8224 0.763
MERS-CoV 97 32055 0 6 1.0 0.9417 0.97
Human.mastadenovirus.B 253 31696 82 127 0.7552 0.6658 0.7077
Mumps.orthorubulaviru 1184 30846 40 88 0.9673 0.9308 0.9487
Rhinovirus.A 512 31239 85 322 0.8576 0.6139 0.7156
Human.respirovirus.3 170 31953 17 18 0.9091 0.9043 0.9067
Measles.morbilliviru 1950 30015 42 151 0.9789 0.9281 0.9528
Rhinovirus.C 391 31185 373 209 0.5118 0.6517 0.5733
Influenza.C.virus_HE 29 32123 0 6 1.0 0.8286 0.9062
Human.mastadenovirus.C 153 31839 24 142 0.8644 0.5186 0.6483
Human.mastadenovirus.D 105 31875 57 121 0.6481 0.4646 0.5412
Human.metapneumoviru 672 31219 87 180 0.8854 0.7887 0.8343
Influenza.C.virus_NS 20 32134 1 3 0.9524 0.8696 0.9091
Betacoronavirus.1 109 32040 3 6 0.9732 0.9478 0.9604
Rhinovirus.B 23 31976 19 140 0.5476 0.1411 0.2244
Influenza.C.virus_MP 23 32130 1 4 0.9583 0.8519 0.902
Human.mastadenovirus.E 6 32080 6 66 0.5 0.0833 0.1429
Influenza.C.virus_PB1 15 32143 0 0 1.0 1.0 1.0
Influenza.C.virus_PB2 13 32140 3 2 0.8125 0.8667 0.8387
Influenza.C.virus_P3 12 32143 0 3 1.0 0.8 0.8889
Human.mastadenovirus.F 48 31943 9 158 0.8421 0.233 0.365
Influenza.C.virus_NP 11 32143 0 4 1.0 0.7333 0.8462
Enterovirus.G 8 32092 9 49 0.4706 0.1404 0.2162
Human.respirovirus.1 50 32084 9 15 0.8475 0.7692 0.8065
Monkeypox.viru 175 31983 0 0 1.0 1.0 1.0

Meet the user interface

There are two ways to input the data you want to classify: either paste FASTA-format text into the "Input FASTA" area, or upload a FASTA file from your disk via the button marked with a red line.

Fig.1 shows the e-mail text box, the checkbox for the "Terms of Use" agreement, the Submit button and the other controls.

Fig.1 - GUI of Fasta file input

The text format you should input

We ONLY accept FASTA-format text, which you can paste into the text area (Fig.2).

Every sample starts with a right angle bracket (>) followed by its ID or other description, and the sequence itself follows on the next lines. You can input many samples; just remember to start each one with a right angle bracket.
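
To make the expected format concrete, here is a minimal sketch of how FASTA text of this shape can be split into records. It is not the parser used by this service; the function name `parse_fasta` and the sample records are hypothetical.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-format text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record begins; store the previous one first.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("FASTA text must start with a '>' header line")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical input with two samples; a sequence may span several lines.
example = """>sample_1 first hypothetical record
ATGGCGTACC
GGTTAA
>sample_2
TTGACC
"""

records = parse_fasta(example)
for header, sequence in records:
    print(header, len(sequence))
```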

Fig.2 - key-in FASTA text manually

Alternatively, you can upload a FASTA file by pressing the "Fileupload" button. An "Open file" window will pop up; choose a FASTA file from your disk and click "Open" to upload it (Fig.3).

Fig.3 - upload FASTA file

Leave an e-mail address for system notification

  1. The computation normally does not take long, so you can wait in front of your screen until the result is available. Alternatively, you can leave a valid email address, and the system will send you a mail when the result is ready.
  2. Enter your email address and tick the checkbox. The 'Terms of Use' explain that your email address is used only for notification and is not stored in our backend.
  3. Press the Submit button, and you are done (Fig.4).
Fig.4 - key-in your e-mail and read the "Terms of Use"

The meaning of the results

If you left an email address before pressing 'Submit', you will find a system notification in your inbox. Clicking the hyperlink takes you to the result page (Fig.5).

Fig.5 - Notification mail

If you did not provide an email address, your web browser will take you to the result page after a short wait.

The result table has 8 columns: ID, Length, Species(1st-hit), eval-score of 1st-hit, Species(2nd-hit), eval-score of 2nd-hit, Strain and BlastBestHit (Fig.6).

  1. 'ID' is the full text to the right of ">" in the FASTA header.
  2. 'Length' is the length of the original sequence.
  3. 'Species(1st-hit)' is the most probable category.
  4. 'eval-score of 1st-hit' is the largest value (between 0 and 1) from the softmax output layer.
  5. 'Species(2nd-hit)' is the category with the second highest probability.
  6. 'eval-score of 2nd-hit' is the second largest value (between 0 and 1) from the softmax output layer.
  7. 'Strain' is the genotype of the sequence.
  8. 'BlastBestHit' is the best hit returned by BLAST.
Fig.6 - The results page
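
The relation between the softmax output and the 1st-hit and 2nd-hit columns can be sketched as follows. This is an illustration only, not the service's code: the function names `softmax` and `top_two` are hypothetical, only three of the 46 labels are used, and the raw scores are made up.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores)
    # Subtracting the maximum keeps exp() numerically stable.
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_two(labels, scores):
    """Return (label, probability) for the most and second most probable labels."""
    probabilities = softmax(scores)
    ranked = sorted(zip(labels, probabilities), key=lambda pair: pair[1], reverse=True)
    return ranked[0], ranked[1]


# Three labels from Table 3 with made-up raw scores.
labels = ["SARS-CoV-2", "MERS-CoV", "Betacoronavirus.1"]
scores = [4.0, 1.0, 2.0]

first, second = top_two(labels, scores)
print("Species(1st-hit):", first[0], "eval-score:", round(first[1], 4))
print("Species(2nd-hit):", second[0], "eval-score:", round(second[1], 4))
```

Because the softmax values of all categories sum to 1, a 2nd-hit score close to the 1st-hit score suggests that the model is undecided between the two categories.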

Other downloadable files on the result page

You can also download 3 files from the result page: the result, the submission and the log file.

  1. 'Result' is the result table in CSV format.
  2. 'Submission in fasta file' is the original FASTA you input.
  3. 'Log file' is the text output produced while the program runs. The messages shown below are harmless; they only indicate that the TensorFlow package was not compiled with optimizations for the machine's CPU (Fig.7).
Fig.7 - Warning messages

Copyright © 2020 Institute of Information Science, Academia Sinica, TAIWAN.

All Rights reserved.