
dab pepe

MHC class II antigen

Pepe-DAB

Annotation score:1 out of 5

The annotation score provides a heuristic measure of the annotation content of a UniProtKB entry or proteome. This score cannot be used as a measure of the accuracy of the annotation as we cannot define the ‘correct annotation’ for any given protein.

– Protein predicted

This indicates the type of evidence that supports the existence of the protein. Note that the ‘protein existence’ evidence does not give information on the accuracy or correctness of the sequence(s) displayed.


This section provides any useful information about the protein, mostly biological knowledge.

The Gene Ontology (GO) project provides a set of hierarchical controlled vocabulary split into 3 categories:

GO – Biological process

  • antigen processing and presentation Source: InterPro
  • immune response Source: InterPro

This section provides information about the protein and gene name(s) and synonym(s) and about the organism that is the source of the protein sequence.

Names & Taxonomy

This subsection of the Names and taxonomy section provides an exhaustive list of all names of the protein, from commonly used to obsolete, to allow unambiguous identification of a protein.

Information which has been imported from another database using automatic procedures.

Automatic assertion inferred from database entries

This subsection of the Names and taxonomy section indicates the name(s) of the gene(s) that code for the protein sequence(s) described in the entry. Four distinct tokens exist: ‘Name’, ‘Synonyms’, ‘Ordered locus names’ and ‘ORF names’.


This subsection of the Names and taxonomy section provides information on the name(s) of the organism that is the source of the protein sequence.


This subsection of the Names and taxonomy section shows the unique identifier assigned by the NCBI to the source organism of the protein. This is known as the ‘taxonomic identifier’ or ‘taxid’.

This subsection of the Names and taxonomy section contains the taxonomic hierarchical classification lineage of the source organism. It lists the nodes as they appear top-down in the taxonomic tree, with the more general grouping listed first.

This section provides information on the location and the topology of the mature protein in the cell.

Subcellular location

Extracellular region or secreted

Automatic computational assertion

Graphics by Christian Stolte & Seán O’Donoghue; Source:

Plasma Membrane
  • MHC class II protein complex Source: InterPro

This section describes post-translational modifications (PTMs) and/or processing events.

PTM / Processing

UniProtKB Keywords constitute a controlled vocabulary with a hierarchical structure. Keywords summarise the content of a UniProtKB entry and facilitate the search for proteins of interest.

Keywords – PTM

Information which has been generated by the UniProtKB automatic annotation system, without manual validation.

Automatic assertion according to rules

This section provides information on sequence similarities with other proteins and the domain(s) present in a protein.

Family & Domains

Domains and Repeats

This subsection of the Family and Domains section describes the position and type of a domain, which is defined as a specific combination of secondary structures organized into a characteristic three-dimensional structure or fold.


Automatic assertion inferred from signature match

Family and domain databases

Gene3D Structural and Functional Annotation of Protein Families

Integrated resource of protein families, domains and functional sites

Pfam protein domain database

Simple Modular Architecture Research Tool; a protein domain database

Superfamily database of structural and functional annotation

This section displays by default the canonical protein sequence and upon request all isoforms described in the entry. It also includes information pertinent to the sequence(s), including length and molecular weight. The information is filed in different subsections. The current subsections and their content are listed below:

This subsection of the Sequence section indicates if the canonical sequence displayed by default in the entry is complete or not.

Sequence status: Fragment.

The checksum is a form of redundancy check that is calculated from the sequence. It is useful for tracking sequence updates.

It should be noted that while, in theory, two different sequences could have the same checksum value, the likelihood that this would happen is extremely low.

However UniProtKB may contain entries with identical sequences in case of multiple genes (paralogs).

The checksum is computed as the sequence 64-bit Cyclic Redundancy Check value (CRC64) using the generator polynomial x^64 + x^4 + x^3 + x + 1. The algorithm is described in the ISO 3309 standard.

Press W.H., Flannery B.P., Teukolsky S.A. and Vetterling W.T.
Cyclic redundancy and other checksums
Numerical Recipes in C, 2nd ed., pp. 896-902, Cambridge University Press (1993)

Checksum: CAF431D4BCD250F2

Experimental Info

This subsection of the ‘Sequence’ section is used for sequence fragments to indicate that the residue at the extremity of the sequence is not the actual terminal residue in the complete protein sequence.


Sequence databases

EMBL nucleotide sequence database

GenBank nucleotide sequence database

DNA Data Bank of Japan; a nucleotide sequence database

This section provides links to proteins that are similar to the protein sequence(s) described in this entry at different levels of sequence identity thresholds (100%, 90% and 50%) based on their membership in UniProt Reference Clusters (UniRef).

Similar proteins

This section is used to point to information related to entries and found in data collections other than UniProtKB.

Sequence databases
3D structure databases

Database of comparative protein structure models

SWISS-MODEL Interactive Workspace

Family and domain databases

ProtoNet; Automatic hierarchical classification of proteins

MobiDB: a database of protein disorder and mobility annotations

This section provides general information on the entry.

Entry information

This subsection of the ‘Entry information’ section provides a mnemonic identifier for a UniProtKB entry, but it is not a stable identifier. Each reviewed entry is assigned a unique entry name upon integration into UniProtKB/Swiss-Prot.

This subsection of the ‘Entry information’ section provides one or more accession number(s). These are stable identifiers and should be used to cite UniProtKB entries. Upon integration into UniProtKB, each entry is assigned a unique accession number, which is called ‘Primary (citable) accession number’.

This subsection of the ‘Entry information’ section shows the date of integration of the entry into UniProtKB, the date of the last sequence update and the date of the last annotation modification (‘Last modified’). The version number for both the entry and the canonical sequence are also displayed.

This subsection of the ‘Entry information’ section indicates whether the entry has been manually annotated and reviewed by UniProtKB curators or not, in other words, if the entry belongs to the Swiss-Prot section of UniProtKB (reviewed) or to the computer-annotated TrEMBL section (unreviewed).


Data Science Society

  • Ana Popova, @anie
  • Izabella Taskova, @ izabellataskova
  • Kamelia Kosekova, @kameliak
  • Kameliya Lokmadzhieva, @kameliyalokmadzhieva
  • Nikolay Bojurin, @nikolay

Mentors: @boryana @alex-efremov @pepe

Team name: DAB PANDA

NB. OUR NOTEBOOKS ARE AVAILABLE HERE: DAB PANDA Rmds

Data Understanding and Preparation

You may see our code with results and brief comments if you dab here

Cryptocurrencies…. Are they as cryptic as the name suggests? Perhaps we’ll know at the end of this journey. Let’s start dabbing!

As a start we need to take a look at what we have. And we have a loooot of files.

For level 1 we need to predict the prices of 20 cryptocurrencies, so we focus on the price series data. We can find that information either in the separate files for the different currencies or in price_data.csv. We opted for the latter. What we discovered was interesting, to say the least…

In the data preparation stage we discovered a discrepancy. Originally, we have 15 267 observations. However, we know that for each day we should have 288 observations (one every 5 minutes). The period under consideration covers ‘2018-01-17 11:25:00’ – ‘2018-03-23 14:00:00’, that is 66 calendar days: 64 full days and 2 incomplete ones.

Let’s figure out how many observations in total we should have by breaking down that period:

– for day 1 (2018-01-17): 151 observations

– for the 64 full days: 18 432 observations

– for day 66 (2018-03-23): 169 observations (00:00 to 14:00 inclusive)

Woooow! There is a big difference between 15 267 and 18 752. We decide to find out what we are missing by creating a sequence for all date times within the period with a step of 5 minutes (you may see this in code form in our code – dab).
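The expected count can be checked with a few lines of base R. This is a minimal sketch, not the notebook's code: it assumes the timestamps are in UTC (so there are no daylight-saving jumps), and the variable names are ours for illustration.

```r
# Build the full 5-minute grid for the period covered by the data.
# Assumption: timestamps are UTC, so no daylight-saving jumps occur.
start <- as.POSIXct("2018-01-17 11:25:00", tz = "UTC")
end   <- as.POSIXct("2018-03-23 14:00:00", tz = "UTC")
full_grid <- seq(from = start, to = end, by = "5 min")

expected_n <- length(full_grid)      # timestamps we should have
observed_n <- 15267                  # rows reported for price_data.csv
gap        <- expected_n - observed_n  # shortfall before any de-duplication

# Observations per calendar day, to spot the two incomplete days.
per_day <- table(as.Date(full_grid, tz = "UTC"))
```

Here `expected_n` comes out at 18 752, and `per_day` shows which days fall short of 288.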

Next, we merge the data on coins with the full list of dates. We find that we get 1 extra observation, which is weird. So, we check for duplicates and discover 1! Then we get rid of the imposter row!
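The merge-and-deduplicate step might look roughly like this. It is a toy sketch under our own assumptions: the column names (`time`, `btc`) and the six-row example are hypothetical stand-ins for price_data.csv.

```r
# Toy stand-in for price_data.csv; column names are hypothetical.
grid <- data.frame(
  time = seq(as.POSIXct("2018-01-17 11:25:00", tz = "UTC"),
             by = "5 min", length.out = 6)
)
prices <- data.frame(
  time = grid$time[c(1, 2, 2, 4, 6)],  # timestamp 2 repeated; 3 and 5 absent
  btc  = c(11000, 11010, 11010, 11050, 11070)
)

# Left join onto the full grid: absent timestamps become NA rows.
merged <- merge(grid, prices, by = "time", all.x = TRUE)

# A duplicated timestamp shows up as one row too many.
n_extra <- nrow(merged) - nrow(grid)
dupes   <- duplicated(merged$time)

# Drop the repeated timestamp and count what is still missing.
clean     <- merged[!dupes, ]
n_missing <- sum(is.na(clean$btc))
```

Comparing `nrow(merged)` with the length of the grid is what exposes the extra row; `duplicated()` then pinpoints it.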

We learn that for each coin we have 3 578 missing values.

To tackle the missing values, we decide to look at the log differenced prices. On that basis we interpolate the missing values by simulating 20 rows of white noise. You may see our pretty plots before and after the interpolation in the link we have provided for this stage (dab).

After that we need to retrieve the original data so we do the reverse of log and diff with a lovely loop that performs some reverse engineering feats!
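One way to sketch the fill-and-rebuild idea in base R is shown below. It is an illustration under stated assumptions rather than our exact loop: the helper name `fill_gaps_with_noise` is hypothetical, the noise is zero-mean Gaussian with the standard deviation of the observed log differences, and the first price is assumed to be present.

```r
# Hypothetical helper: fill gaps in a price series via its log differences.
fill_gaps_with_noise <- function(price, seed = 1) {
  stopifnot(!is.na(price[1]))        # assumption: the series starts with a value
  set.seed(seed)
  r    <- diff(log(price))           # NA wherever either neighbour is missing
  miss <- is.na(r)
  # Replace missing log differences with white noise whose spread matches
  # the observed ones (zero mean is our assumption).
  r[miss] <- rnorm(sum(miss), mean = 0, sd = sd(r, na.rm = TRUE))
  # Reverse the diff (cumulative sum) and the log (exp), anchored at price[1].
  exp(log(price[1]) + c(0, cumsum(r)))
}

gappy  <- c(100, 101, NA, NA, 104, 105, 103)
filled <- fill_gaps_with_noise(gappy)
```

A caveat of this particular sketch: prices before the first gap are reproduced exactly, but once a gap has been filled with noise the rebuilt level no longer has to coincide with the observed prices that follow it.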

We then make an empty dataframe which we feed all of the data to – this is our orig set!

We plot the price series – before and after the interpolation! (see our Rpubs to see our pretty graphs – by dabbing here)

We look at the autocorrelation to see how the different coins relate to one another.

We look at ACF and PACF curves for the complete log differenced data – for all 20 coins.
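For readers who want to reproduce this kind of check, here is a small sketch with simulated data (an MA(1) process standing in for one coin's log differenced prices; it is not our data). As a rule of thumb, an ACF that cuts off after lag q hints at an MA(q) term, and a PACF that cuts off after lag p hints at an AR(p) term.

```r
set.seed(42)
# Simulated stand-in for one coin's log-differenced prices: an MA(1) process.
log_ret <- arima.sim(model = list(ma = 0.5), n = 2000)

a <- acf(log_ret,  lag.max = 20, plot = FALSE)
p <- pacf(log_ret, lag.max = 20, plot = FALSE)

# Approximate 95% band for a white-noise series.
band <- qnorm(0.975) / sqrt(length(log_ret))

sig_acf_lags  <- which(abs(a$acf[-1]) > band)  # drop lag 0, which is always 1
sig_pacf_lags <- which(abs(p$acf) > band)      # pacf output starts at lag 1
```

Setting `plot = TRUE` (the default) draws the usual curves with the band overlaid.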

Finally we look at the histograms for the 20 coins!

Modeling

The orig dataframe is the one we use for modelling. It contains the missing observations we imputed from the initial dataset. We transform it from a dataframe to a time series object.

Next, we look for models that would be appropriate for the different coin prices. We look at combinations of (p, d, q), with p and q between 0 and 7. We perform this for all 20 coins and evaluate the models by considering the Ljung-Box p-value, sum of the squared residuals and Akaike criterion.
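The search over orders can be sketched as follows. This is a simplified illustration on simulated data: d is fixed at 1, the maximum order is cut from 7 to 2 so it runs quickly, and the Ljung-Box lag of 10 is our own choice, not something taken from the notebooks.

```r
set.seed(7)
# Simulated log-price series (an integrated MA(1)); stands in for one coin.
log_price <- as.numeric(cumsum(arima.sim(model = list(ma = 0.4), n = 500, sd = 0.01)))

max_order <- 2   # the post uses 7; kept small so the sketch runs quickly
grid <- expand.grid(p = 0:max_order, q = 0:max_order)
grid$aic     <- NA_real_
grid$ssr     <- NA_real_
grid$ljung_p <- NA_real_

for (i in seq_len(nrow(grid))) {
  p <- grid$p[i]
  q <- grid$q[i]
  fit <- tryCatch(arima(log_price, order = c(p, 1, q)),
                  error = function(e) NULL)
  if (is.null(fit)) next             # skip orders that fail to estimate
  res <- residuals(fit)
  grid$aic[i]     <- fit$aic         # Akaike criterion
  grid$ssr[i]     <- sum(res^2)      # sum of squared residuals
  grid$ljung_p[i] <- Box.test(res, lag = 10, type = "Ljung-Box",
                              fitdf = p + q)$p.value
}

ranked <- grid[order(grid$aic), ]    # lowest AIC first
```

A high Ljung-Box p-value means there is no evidence of leftover autocorrelation in the residuals, which is why models are ranked by high p-values alongside low AIC and low squared residuals.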

We discover that some models tend to perform well across multiple coins, such as ARIMA(0,1,6) and ARIMA(6,1,0).

We also look at the residuals for all 20 coins for ARIMA(0,1,6) on log data.

Next, we provide a list of models with the highest p-values in our second Rpubs link – to see it dab here!

We attempted to apply ARIMA with a rolling window by using a loop. We begin with a historical subset covering the first 7 days, i.e. 2016 observations (7 × 288).
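A stripped-down version of such a loop is sketched below. The assumptions are ours: the data are simulated, the window is a fixed-width sliding one shortened from 2016 to 100 observations for speed, and ARIMA(0,1,1) is used purely as an illustrative order. The names `x` and `ff` mirror the ones in the results further down.

```r
set.seed(11)
# Simulated log prices standing in for one coin (not the competition data).
x <- as.numeric(cumsum(arima.sim(model = list(ma = 0.4), n = 160, sd = 0.01)))

win <- 100                          # the post uses 2016; shortened for speed
ff  <- rep(NA_real_, length(x))     # one-step-ahead forecasts, aligned with x

for (i in (win + 1):length(x)) {
  train <- x[(i - win):(i - 1)]     # the most recent `win` observations
  fit <- tryCatch(arima(train, order = c(0, 1, 1)),
                  error = function(e) NULL)
  if (!is.null(fit)) {
    ff[i] <- as.numeric(predict(fit, n.ahead = 1)$pred[1])
  }
}

# Root mean squared error over the forecast period.
idx  <- (win + 1):length(x)
rmse <- sqrt(mean((x[idx] - ff[idx])^2, na.rm = TRUE))
```

Refitting at every step is what makes this expensive: with a 2016-observation window and 5-minute data, each coin needs thousands of fits.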

We managed to obtain results for several coins, including Dash, Bitcoin Gold, Dogecoin, Ripple and Litecoin.

Results for Dogecoin:

> sqrt(mean((x[2017:length(x)]-ff[2017:length(x)])^2))
[1] 0.006681334
> mean(ff[2017:length(ff)])
[1] -5.270246
> mean(x[2017:length(x)])
[1] -5.270284
> sd(ff[2017:length(ff)])
[1] 0.2668967
> sd(x[2017:length(x)])
[1] 0.2669037

> sqrt(mean((y[2017:length(y)]-gg[2017:length(y)])^2))
[1] 0.003165124
> mean(gg[2017:length(gg)])
[1] 6.960584
> mean(y[2017:length(y)])
[1] 6.96065
> sd(gg[2017:length(gg)])
[1] 0.0230234
> sd(y[2017:length(y)])
[1] 0.02289996

> sqrt(mean((y[2017:length(y)]-gg[2017:length(y)])^2))
[1] 0.002051064
> mean(gg[2017:length(gg)])
[1] 6.748966
> mean(y[2017:length(y)])
[1] 6.748907
> sd(gg[2017:length(gg)])
[1] 0.03380054
> sd(y[2017:length(y)])
[1] 0.03372321

Evaluation


7 thoughts on “ DAB PANDA: The A.I. Crypto Trader ”

I really like the way you’ve presented results. With RPubs everything is clearly outlined and the reader might follow easily the exhibition of major research steps backed up by the relevant code.
The data prep is conducted correctly. The applied methodology is appealing from a theoretical point of view. The sliding window approach is correctly implemented. Considering the issue with computational efficiency, I might say that applying the classical Box-Jenkins approach is a good choice.
Obviously, if you had two more hours, you would have accomplished in the same brilliant way (just as all previous sections) the last portion of your research including more comments on the accuracy and robustness of delivered forecasts.
Last, but not least, I would like to emphasize that the text of the article is written in a really nice manner, approaching the reader and dragging with the very first paragraph their attention.
In conclusion I might say that it is a great job, guys!

We specifically require that you upload everything on this website. Failing to do so is going to be fatal for you, no matter that you have done a decent job.

We had issues with Jupyter Notebook and decided to use RPubs as an alternative. We have now included a zip file with our R Notebooks – at the beginning of the paper. 🙂

The plots are not really readable. Good job with the missing data, but upload the code as a notebook file, or html, so it can be read here, otherwise it is not possible for anyone to give you feedback and recommendations.

@pepe You can find the zipped R notebooks right below the Panda logo, the link reads DAB Panda Rmds

@pepe: As far as I can see, please correct me if I am wrong, there is some problem for the participants working in R to upload in a nice format their work at this website. If this is the case, there should be an opportunity to incorporate links in the main text that allow participants to present their results in the best way.

@pepe You can find the zipped R notebooks right below the Panda logo, the link reads DAB Panda Rmds
