Overview of the pipeline for human respiratory virus classification

Dataset: 321,552 records in total. In each round of 10-fold cross-validation, 90% of the records are used for training and 10% for validation.
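
As a rough illustration of the 90%/10% split, the sketch below generates index sets for 10-fold cross-validation in plain Python. The function name `k_fold_indices`, the shuffling seed and the lack of stratification by label are all assumptions made for the example; how the folds were actually drawn is not described here.

```python
import random

def k_fold_indices(n_records, k=10, seed=0):
    """Yield (train_idx, val_idx) index lists for k-fold cross-validation.

    Hypothetical helper: a plain shuffled split without stratification.
    """
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # Spread the remainder so that fold sizes differ by at most one record.
    fold_sizes = [n_records // k + (1 if i < n_records % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

# One fold per round: about 90% of the records for training, about 10% for validation.
for fold_no, (train_idx, val_idx) in enumerate(k_fold_indices(321552, k=10), start=1):
    print(fold_no, len(train_idx), len(val_idx))
```

Every record lands in exactly one validation fold, so each sample is used for validation once across the ten rounds.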

The steps for model construction are as follows (Fig.1):

Use homology search to select the sequences that belong to human respiratory viruses, under the criteria identity > 80% and E-value < 1.0E-5.
Apply word2vec to 3-grams to convert each sequence, for each reading frame, into 100 vectors that serve as inputs.
After convolution, max-pooling, flattening and fully connected layers, the network maps each input to one of 46 label types. (Fig.2)
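
To make the 3-gram step more concrete, here is a minimal sketch of how a sequence could be split into 3-letter tokens for each reading frame before embedding. It assumes three forward frames and non-overlapping 3-mers; the helper name `frame_kmers` is hypothetical, and the tokenization actually used in the pipeline may differ.

```python
def frame_kmers(sequence, k=3, n_frames=3):
    """Split a sequence into non-overlapping k-mers, once per reading frame.

    Assumption: forward frames only, starting at offsets 0 .. n_frames - 1.
    """
    seq = sequence.upper()
    frames = []
    for offset in range(n_frames):
        # Step through the sequence k letters at a time; a trailing
        # fragment shorter than k is dropped.
        tokens = [seq[i:i + k] for i in range(offset, len(seq) - k + 1, k)]
        frames.append(tokens)
    return frames

for offset, tokens in enumerate(frame_kmers("ATGGCGTACGTT")):
    print(offset, tokens)
```

Each token list plays the role of a "sentence" whose "words" are 3-grams, which is the kind of input a word2vec model is trained on.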

flowchart of this system — Fig.1 - The whole pipeline for Respiratory classification

the framework of CNN model — Fig.2 - The framework of CNN on Respiratory classification

We hold out 5% of the total samples for in-training validation. If the validation curve (orange line) deviates strongly from the training curve (blue line), the model may be over-fitted. (Fig.3)
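
The comparison of the two curves can also be done numerically. The sketch below flags the first epoch at which the validation loss has stayed above the training loss by more than a chosen margin for several epochs in a row. The function name, the `max_gap` and `patience` values and the loss curves are all made up for illustration; they are not taken from the model in Fig.3.

```python
def first_overfit_epoch(train_loss, val_loss, max_gap=0.1, patience=2):
    """Return the 1-based epoch at which (val - train) loss has exceeded
    max_gap for `patience` consecutive epochs, or None if it never does."""
    streak = 0
    for epoch, (t, v) in enumerate(zip(train_loss, val_loss), start=1):
        streak = streak + 1 if v - t > max_gap else 0
        if streak >= patience:
            return epoch
    return None

# Illustrative curves only (not measured values).
train = [0.90, 0.55, 0.38, 0.27, 0.20, 0.15]
val = [0.92, 0.60, 0.45, 0.41, 0.42, 0.44]
print(first_overfit_epoch(train, val))
```

A suitable margin depends on the scale of the loss, so the threshold should be tuned to the curves at hand.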

in-train loss vs epoch — Fig.3 - The Performance of loss and accuracy on Respiratory classification

in-train acc vs epoch — Fig.3 - The Performance of loss and accuracy on Respiratory classification

Finally, run a 10-fold cross-validation test to check whether the model is robust enough (Table 1).

Table 1 - Estimation of loss and accuracy in each CV test
No.	Loss	Accuracy
1	0.1619	0.9002
2	0.2766	0.8935
3	0.2109	0.9115
4	0.1880	0.8978
5	0.2261	0.8837
6	0.1748	0.8991
7	0.2177	0.8794
8	0.2225	0.9046
9	0.2314	0.9028
10	0.1807	0.8881

Avg. loss = 0.2091 +- 0.0320
Avg. accuracy = 89.61 +- 0.94%
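
The summary lines can be recomputed from the ten folds listed in Table 1. The sketch below assumes the "+-" value is the population standard deviation (dividing by the number of folds), which is the convention that reproduces the reported loss figures; the variable names are illustrative.

```python
from statistics import mean, pstdev

# Values copied from Table 1.
losses = [0.1619, 0.2766, 0.2109, 0.1880, 0.2261,
          0.1748, 0.2177, 0.2225, 0.2314, 0.1807]
accuracies = [0.9002, 0.8935, 0.9115, 0.8978, 0.8837,
              0.8991, 0.8794, 0.9046, 0.9028, 0.8881]

def summarize(values):
    """Return (mean, population standard deviation) of a list of fold scores."""
    return mean(values), pstdev(values)

loss_mean, loss_sd = summarize(losses)
acc_mean, acc_sd = summarize(accuracies)

print(f"Avg. loss = {loss_mean:.4f} +- {loss_sd:.4f}")
print(f"Avg. accuracy = {100 * acc_mean:.2f} +- {100 * acc_sd:.2f}%")
```

A small spread across folds, relative to the mean, is what indicates that the model does not depend strongly on any one split.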

Evaluation: micro-average and macro-average performance

There are 46 label types, but the data are highly unbalanced. To estimate the accuracy of the model, we count how often the model gives the right answer among the samples of one type, which gives the "reliability" of that type. The average of these per-type "reliability" values is the macro-average accuracy of the model (Table 2). The performance for each of the 46 types is listed in Table 3.

Table 2 - The evaluation of Average Performance
Evaluation method	Precision	Recall	F1-score
Macro_Average	0.8419	0.7689	0.8037
Micro_Average	0.8629	0.8629	0.8629

Table 3 - The evaluation of 46-type Performance
Category	TP	TN	FP	FN	Precision	Recall	F1
SARS-CoV-2	816	31336	5	1	0.9939	0.9988	0.9963
Influenza.A.virus_HA	245	31784	63	66	0.7955	0.7878	0.7916
Influenza.A.virus_NA	203	31740	118	97	0.6324	0.6767	0.6538
Influenza.A.virus_MP	35	32079	15	29	0.7	0.5469	0.614
Influenza.A.virus_NS	91	32021	25	21	0.7845	0.8125	0.7982
Influenza.A.virus_NP	103	32025	11	19	0.9035	0.8443	0.8729
Influenza.A.virus_PA	105	31993	28	32	0.7895	0.7664	0.7778
Influenza.A.virus_PB2	134	31972	35	17	0.7929	0.8874	0.8375
Influenza.A.virus_PB1	109	31979	40	30	0.7315	0.7842	0.7569
Influenza.B.virus_HA	2205	29901	20	32	0.991	0.9857	0.9883
Influenza.B.virus_NA	1651	30376	26	105	0.9845	0.9402	0.9618
Influenza.B.virus_NS	1313	30818	10	17	0.9924	0.9872	0.9898
Influenza.B.virus_MP	1204	30947	3	4	0.9975	0.9967	0.9971
Influenza.B.virus_NP	1177	30957	12	12	0.9899	0.9899	0.9899
Influenza.B.virus_PA	1169	30973	10	6	0.9915	0.9949	0.9932
Influenza.B.virus_PB2	1130	29561	730	737	0.6075	0.6052	0.6064
Influenza.B.virus_PB1	1140	29555	737	726	0.6074	0.6109	0.6091
Human.orthopneumoviru	2668	29397	44	49	0.9838	0.982	0.9829
Enterovirus.A	2553	29173	161	271	0.9407	0.904	0.922
Enterovirus.C	720	31112	140	186	0.8372	0.7947	0.8154
Enterovirus.B	2648	28256	1187	67	0.6905	0.9753	0.8085
Enterovirus.D	301	31670	122	65	0.7116	0.8224	0.763
MERS-CoV	97	32055	0	6	1.0	0.9417	0.97
Human.mastadenovirus.B	253	31696	82	127	0.7552	0.6658	0.7077
Mumps.orthorubulaviru	1184	30846	40	88	0.9673	0.9308	0.9487
Rhinovirus.A	512	31239	85	322	0.8576	0.6139	0.7156
Human.respirovirus.3	170	31953	17	18	0.9091	0.9043	0.9067
Measles.morbilliviru	1950	30015	42	151	0.9789	0.9281	0.9528
Rhinovirus.C	391	31185	373	209	0.5118	0.6517	0.5733
Influenza.C.virus_HE	29	32123	0	6	1.0	0.8286	0.9062
Human.mastadenovirus.C	153	31839	24	142	0.8644	0.5186	0.6483
Human.mastadenovirus.D	105	31875	57	121	0.6481	0.4646	0.5412
Human.metapneumoviru	672	31219	87	180	0.8854	0.7887	0.8343
Influenza.C.virus_NS	20	32134	1	3	0.9524	0.8696	0.9091
Betacoronavirus.1	109	32040	3	6	0.9732	0.9478	0.9604
Rhinovirus.B	23	31976	19	140	0.5476	0.1411	0.2244
Influenza.C.virus_MP	23	32130	1	4	0.9583	0.8519	0.902
Human.mastadenovirus.E	6	32080	6	66	0.5	0.0833	0.1429
Influenza.C.virus_PB1	15	32143	0	0	1.0	1.0	1.0
Influenza.C.virus_PB2	13	32140	3	2	0.8125	0.8667	0.8387
Influenza.C.virus_P3	12	32143	0	3	1.0	0.8	0.8889
Human.mastadenovirus.F	48	31943	9	158	0.8421	0.233	0.365
Influenza.C.virus_NP	11	32143	0	4	1.0	0.7333	0.8462
Enterovirus.G	8	32092	9	49	0.4706	0.1404	0.2162
Human.respirovirus.1	50	32084	9	15	0.8475	0.7692	0.8065
Monkeypox.viru	175	31983	0	0	1.0	1.0	1.0
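
The Precision, Recall and F1 columns of Table 3 follow directly from the TP, FP and FN counts. The sketch below recomputes them for three rows and shows the difference between macro and micro averaging. It uses only a subset of the 46 types, so its averages are not expected to match Table 2; the helper name `prf` is hypothetical.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive
    and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f1

# (TP, FP, FN) copied from three rows of Table 3.
counts = {
    "SARS-CoV-2": (816, 5, 1),
    "Enterovirus.B": (2648, 1187, 67),
    "Rhinovirus.B": (23, 19, 140),
}

per_class = {name: prf(*c) for name, c in counts.items()}
for name, (p, r, f1) in per_class.items():
    print(f"{name}\t{p:.4f}\t{r:.4f}\t{f1:.4f}")

# Macro average: every type counts equally, however few samples it has.
n = len(per_class)
macro_p = sum(p for p, _, _ in per_class.values()) / n
macro_r = sum(r for _, r, _ in per_class.values()) / n
macro_f1 = 2 * macro_p * macro_r / (macro_p + macro_r)

# Micro average: pool the counts first, so large types dominate.
tp_sum = sum(c[0] for c in counts.values())
fp_sum = sum(c[1] for c in counts.values())
fn_sum = sum(c[2] for c in counts.values())
micro_p, micro_r, micro_f1 = prf(tp_sum, fp_sum, fn_sum)

print(f"Macro\t{macro_p:.4f}\t{macro_r:.4f}\t{macro_f1:.4f}")
print(f"Micro\t{micro_p:.4f}\t{micro_r:.4f}\t{micro_f1:.4f}")
```

Two observations help when reading Table 2. The macro F1 there agrees with the harmonic mean of the macro precision and macro recall, which is how `macro_f1` is computed above. And when every sequence receives exactly one predicted label, each error is a false positive for one type and a false negative for another, so over all types the pooled FP and FN totals are equal and micro precision, recall and F1 coincide.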

Meet the user interface

There are two ways to input the data you want to predict: either paste FASTA-format text into the "Input FASTA" area, or upload a FASTA file from your disk via the button marked with a red line.

Fig.1 shows the e-mail textbox, the checkbox for the "Terms of Use" agreement, the Submit button, and so on.

choose the way you like to input FASTA file — Fig.1 - GUI of Fasta file input

The text format you should input

We accept ONLY FASTA-format text, which you can paste into the text area. (Fig.2)

Every sample starts with a right angle bracket (>) followed by its ID or other description; the original sequence follows on the next lines. You can input many samples; just remember to start each new one with a right angle bracket.
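
If you want to check your text before submitting it, the layout described above can be verified with a few lines of Python. This is a minimal sketch, not the parser used by the server; the function name `parse_fasta` and the sample records are hypothetical.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new ">" line closes the previous sample, if there is one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before the first '>' line")
        else:
            chunks.append(line)  # a sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

example = """>sample_1 hypothetical description
ATGGCGTACG
TTAGC
>sample_2
GGCATTA
"""
for header, seq in parse_fasta(example):
    print(header, len(seq))
```

The header returned here is everything to the right of ">", and the sequence length is counted after joining the sequence lines.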

input FASTA text by pasting them manually — Fig.2 - key-in FASTA text manually

Alternatively, upload a FASTA file by pressing the "Fileupload" button. An "Open file" window will pop up; choose a FASTA file from your disk and click "Open" to upload it. (Fig.3)

upload FASTA file from your disk — Fig.3 - upload FASTA file

Leave an e-mail address for system notifications

Normally the computation does not take long, so you can wait in front of your screen until the result is available. Alternatively, leave a valid e-mail address and the system will send a mail to inform you when the result is ready.
Enter your e-mail address and tick the checkbox. The 'Terms of Use' explain that we use your e-mail address only for notification; it is not stored in our backend.
Press the Submit button to start the prediction. (Fig.4)

email address and 'Terms of use' checkbox — Fig.4 - key-in your e-mail and read the "Terms of Use"

The meaning of the results

If you left an e-mail address before pressing 'Submit', you will find a system notification in your inbox. Clicking the hyperlink takes you to the result page. (Fig.5)

the notification mail — Fig.5 - Notification mail

If you did not provide an e-mail address, your web browser will take you to the result page after a short wait.

There are 8 columns in total: ID, Length, Species(1st-hit), eval-score of 1st-hit, Species(2nd-hit), eval-score of 2nd-hit, Strain and BlastBestHit. (Fig.6)

'ID' is the full text to the right of ">" in the FASTA input.
'Length' is the length of the original sequence.
'Species(1st-hit)' is the most probable category.
'eval-score of 1st-hit' is the largest value (between 0 and 1) from the softmax output layer.
'Species(2nd-hit)' is the category with the second highest probability.
'eval-score of 2nd-hit' is the second largest value (between 0 and 1) from the softmax output layer.
'Strain' is the genotype of the sequence.
'BlastBestHit' is the best hit returned by BLAST.
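
To show how the 1st-hit and 2nd-hit columns relate to the softmax output, the sketch below turns raw network outputs into probabilities and picks the two largest. The scores are made up and only three labels are used for brevity; the real model outputs one value for each of the 46 types.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_two(labels, logits):
    """Return ((label, prob), (label, prob)) for the two most probable labels."""
    probs = softmax(logits)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[0], ranked[1]

labels = ["SARS-CoV-2", "MERS-CoV", "Betacoronavirus.1"]
logits = [4.0, 1.0, 2.5]  # illustrative scores, not real model output

first, second = top_two(labels, logits)
print("Species(1st-hit):", first[0], f"{first[1]:.4f}")
print("Species(2nd-hit):", second[0], f"{second[1]:.4f}")
```

A 1st-hit score close to 1 means the model strongly prefers one category, while similar 1st-hit and 2nd-hit scores suggest the prediction is ambiguous.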

Other downloadable files on the result page

You can also download 3 files from the result page: the result, the submission and the log file.

'Result' is the result table in CSV format.
'Submission in fasta file' is the original FASTA you submitted.
'Log file' is the text output produced while the program runs. The messages shown below are harmless; they only report that the TensorFlow package was not compiled specifically for the machine's CPU. (Fig.7)