#############################################################
README for ftp://ftp.ncbi.nlm.nih.gov/refseq/

Last updated: February 26, 2024

#############################################################

_________________________________________________________________________

       
       National Center for Biotechnology Information (NCBI)
             National Library of Medicine
             National Institutes of Health
             8600 Rockville Pike
             Bethesda, MD 20894, USA
             tel: (301) 496-2475
             fax: (301) 480-9241
             e-mail: info@ncbi.nlm.nih.gov
             
_________________________________________________________________________

=========================================================================
UPDATES TO THIS FTP SITE:

  July 2, 2003	
  RefSeq Release 1 is available by anonymous FTP 
  ftp://ftp.ncbi.nih.gov/refseq/release/


  August 16, 2005
  Added documentation of the 'wgs' directory
  Removed 'cumulative' directory and related documentation 
  as it was replaced by the RefSeq release.
  Modified documentation of the LocusLink directory
  to reflect the obsolete/archival status.

  April 27, 2007
  Added documentation of 'special_requests' directory
  
  October 3, 2007
  Added documentation of the uniprotkb directory

  October 12, 2007
  Added documentation about the /removed/ directory

  June 11, 2010
  Modified format for new report files in the /removed/ directory
  to remove space padding for 'replaced-by' accession entries.
  Removed references to the historical resource LocusLink.  

  August 23, 2016
  Revised accession prefix list
  Updated directory descriptions, and reorganized content order
  Added documentation for TargetedLoci directory
  Added documentation for H_sapiens/RefSeqGene 
  Added documentation for H_sapiens/alignments

  September 19, 2017
  Added a note on the 4 different categories of suppressed/withdrawn
  records

  January 11, 2017
  Deleted column for 'gi' from description of 'removed' directory.

  October 4, 2018
  Added section 'supplemental' with documentation on the 'NP_YP_WP.txt'
  protein ID mapping file

  July 17, 2020
  Added documentation for the 'FunctionalElements' and 'MANE' directories

  February 26, 2024
  Removed references to H_sapiens/alignments

See the README file and the RefSeq Release notes for more information:
ftp://ftp.ncbi.nih.gov/refseq/release/README
ftp://ftp.ncbi.nih.gov/refseq/release/release-notes/
	
=========================================================================
The NCBI Reference Sequence project (RefSeq) provides reference sequence
standards for the naturally occurring molecules of the central dogma, from
chromosomes to mRNAs to proteins. RefSeq standards provide a foundation for the
functional annotation of the human genome. They provide a stable reference
point for mutation analysis, gene expression studies, and polymorphism
discovery.

Scope: Currently, RefSeq records are provided for the following molecule types:

  Molecule Type		Accession Prefix
  ----------------------------------------------
  protein		NP_; XP_; AP_; YP_; WP_
  rna			NM_; NR_; XM_; XR_
  genomic		NC_; AC_; NG_; NT_; NW_; NZ_
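
The prefix table above can be applied programmatically. The following is a
minimal Python sketch, not an NCBI tool; the names PREFIX_TO_MOLECULE and
molecule_type are hypothetical, and the example accession is for
illustration only.

```python
# Prefix table copied from this README; the helper name is illustrative.
PREFIX_TO_MOLECULE = {
    "NP_": "protein", "XP_": "protein", "AP_": "protein",
    "YP_": "protein", "WP_": "protein",
    "NM_": "rna", "NR_": "rna", "XM_": "rna", "XR_": "rna",
    "NC_": "genomic", "AC_": "genomic", "NG_": "genomic",
    "NT_": "genomic", "NW_": "genomic", "NZ_": "genomic",
}


def molecule_type(accession):
    """Return 'protein', 'rna', or 'genomic', or None for an unlisted prefix."""
    # Every listed prefix is two letters followed by an underscore.
    return PREFIX_TO_MOLECULE.get(accession[:3].upper())


print(molecule_type("NM_000546.6"))
```

Accessions whose prefix is not in the table return None, so callers can
decide how to treat non-RefSeq identifiers.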

Additional information is available from 
https://www.ncbi.nih.gov/RefSeq/ 

_________________________________________________________________________
The following directories are available from this RefSeq ftp site:

RefSeq FTP release and interim updates:
   daily
   release
   removed
   wgs

Organism-specific directories:
   B_taurus
   D_rerio
   H_sapiens   
   M_musculus
   R_norvegicus
   S_scrofa
   X_tropicalis

Additional content:
   FunctionalElements
   MANE
   special_requests
   supplemental
   TargetedLoci
   uniprotkb

==========================================
RefSeq FTP release and interim updates
==========================================
release
==========================================
Regular RefSeq releases are made available in this directory area.
The directory is organized into several sub-directories. Sequence content
is provided as ASN.1 (only in the complete directory), as nucleotide or protein 
FASTA, and as nucleotide GenBank format or protein GenPept format.

Directory		Contents
---------------------------------
archaea			sequence
bacteria		sequence
complete		sequence
fungi			sequence
invertebrate		sequence
mitochondrion		sequence
plant			sequence	
plasmid			sequence
plastid			sequence
protozoa		sequence
release-catalog		documentation; 
			  . accessions included in the release
			  . accession to GeneID correspondence
			  . files installed (sequence data) for the release
			  . accessions removed since last release
			  . organisms added and changed since the last release
			  . mapping of prokaryotic WP_ proteins to genome annotation
			  . report of multispecies prokaryotic WP_ proteins
release-notes		documentation; 
                            content, scope, organization, structure
release-statistics	documentation; 
                            statistics per sequence directory
			    global statistics
vertebrate_mammalian	sequence	
vertebrate_other	sequence
viral			sequence
               

==========================================
daily
==========================================
The daily directory contains daily updates of non-WGS RefSeq gi's
since the RefSeq release. This directory is not cumulative. The
contents of the directory are removed following the installation of 
a new RefSeq release. Release-related updates to this directory may 
result in a small number of retained files that represent sequences
released while the RefSeq release was being processed.

 
File name format:
 
        rsnc.[MonthDay.YEAR].bna.gz     Nucleotide sequences, in ASN.1 binary format
        rsnc.[MonthDay.YEAR].faa.gz     Protein sequences, in FASTA format
        rsnc.[MonthDay.YEAR].fna.gz     Nucleotide sequences, in FASTA format
        rsnc.[MonthDay.YEAR].gbff.gz    GenBank flatfile view (nucleotides)
        rsnc.[MonthDay.YEAR].gpff.gz    GenPept flatfile view (proteins)
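
As an illustration of the naming scheme, the Python sketch below maps an
update file name to the content described for its suffix. The helper name
format_of is hypothetical. The date portion is treated as an opaque token,
because its exact rendering (for example 0226.2024) is an assumption and is
not spelled out in this README. The same suffixes are used by the rswgs
files in the wgs directory.

```python
# Suffix descriptions copied from this README; format_of is a hypothetical helper.
SUFFIX_TO_FORMAT = {
    "bna": "nucleotide, ASN.1 binary",
    "faa": "protein, FASTA",
    "fna": "nucleotide, FASTA",
    "gbff": "nucleotide, GenBank flatfile",
    "gpff": "protein, GenPept flatfile",
}


def format_of(file_name):
    """Describe an update file by its suffix, or return None if unrecognized."""
    parts = file_name.split(".")
    # Expect at least: prefix, one date/project token, suffix, "gz".
    if len(parts) < 4 or parts[0] not in ("rsnc", "rswgs") or parts[-1] != "gz":
        return None
    return SUFFIX_TO_FORMAT.get(parts[-2])


# The date token below is an assumed rendering of [MonthDay.YEAR].
print(format_of("rsnc.0226.2024.faa.gz"))
```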


==========================================
removed
==========================================
The removed directory contains daily update reports of RefSeq
records that have been removed from the collection since the 
last RefSeq release. If no records have been removed, then no file
is supplied for that day.  The directory is not cumulative and
contents are removed following the installation of a new RefSeq
release.  

File name format:
     removed-records.mmdd.yyyy

Columns (tab delimited)
  accession
  version
  removal category --where category may be one of:
                     temporarily suppressed
                     permanently suppressed
                     temporarily withdrawn
                     permanently withdrawn
                     replaced by {accession} - see NOTE1.
  removal date     --yyyy.mm.dd

NOTE1: the report file format has been updated to remove padded spaces 
       from the accessions in replaced-by entries.

NOTE2: there is an additional category of removal, reported as 
       dead proteins in the RefSeq release report of removed records,
       that is not currently included in this daily report.

NOTE3: the distinction between the 4 different categories of suppressed/withdrawn
       records is largely internal; many accessions that are 'temporarily suppressed'
       will remain permanently in that state, and some accessions that are
       'permanently suppressed' may occasionally be revived at a later date.
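
A report line can be read with a few lines of Python. This is a sketch
under stated assumptions: the function name parse_removed_record is
hypothetical, the sample line is made up for illustration, and the
replaced-by category is assumed to appear literally as "replaced by"
followed by the accession.

```python
from datetime import datetime


def parse_removed_record(line):
    """Split one tab-delimited report line into a dict."""
    accession, version, category, removed_on = line.rstrip("\n").split("\t")
    replaced_by = None
    if category.startswith("replaced by"):
        # Assumed layout: "replaced by <accession>", without padding (NOTE1).
        replaced_by = category[len("replaced by"):].strip() or None
    return {
        "accession": accession,
        "version": version,
        "category": category,
        "replaced_by": replaced_by,
        # Removal date is documented as yyyy.mm.dd.
        "removal_date": datetime.strptime(removed_on, "%Y.%m.%d").date(),
    }


# Made-up sample line, not taken from a real report.
record = parse_removed_record("NM_000001\t1\treplaced by NM_000002\t2024.01.05")
print(record["replaced_by"], record["removal_date"])
```

Given NOTE3, downstream code should probably not attach strong meaning to
the difference between the four suppressed/withdrawn categories.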

See also: 
ftp://ftp.ncbi.nih.gov/refseq/release/release-catalog/README
documentation for: release#.removed-records

==========================================
wgs
==========================================
The wgs directory contains daily updates of WGS RefSeq gi's
since the RefSeq release. This directory is not cumulative.
The contents of this directory are removed following the installation 
of a new RefSeq release. Release-related updates to this directory may 
result in a small number of retained files that represent sequences
released while the RefSeq release was being processed.

 
File name format:
 
        rswgs.[WGS_project].bna.gz     Nucleotide sequences, in ASN.1 binary format
        rswgs.[WGS_project].faa.gz     Protein sequences, in FASTA format
        rswgs.[WGS_project].fna.gz     Nucleotide sequences, in FASTA format
        rswgs.[WGS_project].gbff.gz    GenBank flatfile view (nucleotides)
        rswgs.[WGS_project].gpff.gz    GenPept flatfile view (proteins)


==========================================
Organism-specific directories
==========================================
Select organism-specific files are also provided so that
previously provided service is not discontinued. We do 
not plan to add additional organism-specific directories
at this time. 

Data is updated weekly, in a cycle independent of RefSeq release
and daily update processing. Each update constitutes a
full release of transcript and protein data for the organism.

ftp://ftp.ncbi.nlm.nih.gov/refseq/B_taurus/mRNA_Prot/
ftp://ftp.ncbi.nlm.nih.gov/refseq/D_rerio/mRNA_Prot/
ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/
ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/RefSeqGene/
ftp://ftp.ncbi.nlm.nih.gov/refseq/M_musculus/mRNA_Prot/
ftp://ftp.ncbi.nlm.nih.gov/refseq/R_norvegicus/mRNA_Prot/
ftp://ftp.ncbi.nlm.nih.gov/refseq/S_scrofa/mRNA_Prot/
ftp://ftp.ncbi.nlm.nih.gov/refseq/X_tropicalis/mRNA_Prot/

mRNA_Prot: 
----------
Sequence data is available in the following formats.
Update frequency: weekly

  organism.#.protein.faa.gz  	 protein fasta format
  organism.#.protein.gpff.gz 	 protein GenPept format
  organism.#.rna.fna.gz   	 nucleotide fasta format
  organism.#.rna.gbff.gz  	 nucleotide GenBank format
  organism.files.installed	 list of file names 

Note: # is a numerical increment; files are split based on size thresholds
to support access by customers with different internet connections.
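
Because the files are split, a complete data set is obtained by reading
every numbered part. The Python sketch below orders the parts numerically
so that part 10 follows part 2. The names SPLIT_FILE and ordered_parts are
hypothetical, and "human" is used as an assumed value of the organism
token.

```python
import re

# organism.#.{protein,rna}.{faa,gpff,fna,gbff}.gz, as listed above.
SPLIT_FILE = re.compile(
    r"^(?P<organism>[^.]+)\.(?P<part>\d+)\.(?P<molecule>protein|rna)\."
    r"(?P<fmt>faa|gpff|fna|gbff)\.gz$"
)


def ordered_parts(file_names, molecule, fmt):
    """Return the matching split files sorted by their numeric increment."""
    hits = []
    for name in file_names:
        m = SPLIT_FILE.match(name)
        if m and m["molecule"] == molecule and m["fmt"] == fmt:
            hits.append((int(m["part"]), name))
    return [name for _, name in sorted(hits)]


# "human" is an assumed organism token used only for this example.
names = ["human.10.protein.faa.gz", "human.2.protein.faa.gz",
         "human.1.protein.faa.gz", "human.files.installed"]
print(ordered_parts(names, "protein", "faa"))
```

Sorting on the integer value rather than the file name avoids the
lexical ordering in which "10" sorts before "2".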

H_sapiens/RefSeqGene:
---------------------
See also: https://www.ncbi.nlm.nih.gov/refseq/rsg/
Update frequency: weekly (gbff and FASTA files), daily (non-sequence files)

  refseqgene.#.genomic.gbff.gz	nucleotide GenBank format; see # note above
  refseqgene.#.fna.gz		nucleotide fasta format; see # note above
  GCF_000001405.*_refseqgene_alignments.gff3	 RefSeqGene alignments to the primary 
  						 human reference assembly where '*' indicates
						 the specific assembly version
						 See: https://www.ncbi.nlm.nih.gov/assembly/	 

  Aligned2RefSeqGene		previous versions of transcript reference standards
  gene_RefSeqGene		reports GeneID to RefSeqGene accession
  LRG_RefSeqGene		reports data associations among GeneID, RefSeqGene, LRG
  presentations			public presentations 

Note: '*' indicates the specific assembly version.
      See: https://www.ncbi.nlm.nih.gov/assembly/	

==========================================
Additional Content
==========================================		
==========================================
FunctionalElements
==========================================
Data from the RefSeq Functional Elements project representing experimentally validated human and mouse
non-genic functional elements.
See https://www.ncbi.nlm.nih.gov/refseq/functionalelements/ for additional information.

Files provided (updated weekly):

    [human/mouse].biological_region.fna.gz  -- RefSeq accessions for genomic biological regions (NG_
                                               prefix) in FASTA format
    [human/mouse].biological_region.gbff.gz -- RefSeq accessions for genomic biological regions (NG_
                                               prefix) in GenBank flatfile format

Directory:

trackhub:
---------
Track hub for RefSeq Functional Element biological regions, features, regulatory interactions and 
recombination partners. The track hub can be viewed on a compatible genome browser, including the UCSC 
Genome Browser (all tracks), the NCBI Genome Data Viewer (select tracks only) or the Ensembl genome
browser (select tracks only), using the following URL: 
https://ftp.ncbi.nlm.nih.gov/refseq/FunctionalElements/trackhub/hub.txt 

See https://ftp.ncbi.nlm.nih.gov/refseq/FunctionalElements/trackhub/RefSeqFE_Hub.html for more details.

    data sub-directory -- Species-specific annotation release (AR##) sub-directories containing:
                               Genome-annotated biological region and feature files in bigBed format:
                                     FEbiolregions_AR##.bb          -- biological regions with metadata
                                     FEfeats_AR##.bb                -- functional features with metadata
                               Pairwise interaction data files in bigInteract format:
                                     FErecombpartners_AR##.inter.bb -- recombination interactions
                                     FEregintxns_AR##.inter.bb      -- regulatory interactions
                          Where ## represents the NCBI annotation release identifier

    other              -- Genome assembly sub-directories and other files necessary for track hub support.
                          See http://genome.ucsc.edu/goldenPath/help/hgTrackHubHelp.html for more details.

==========================================
MANE
==========================================
Data from the Matched Annotation from NCBI and EMBL-EBI (MANE) project.
See https://www.ncbi.nlm.nih.gov/refseq/MANE/ for additional information.

See the MANE/README.txt file for directory description. 

==========================================
special_requests
==========================================
Additional reports are provided upon request. The reports
may be limited in scope in that they report data for a 
sub-set of the RefSeq collection. 

See the special_requests/README file for additional information.

==========================================
supplemental
==========================================
In 2014 and 2015, NCBI re-annotated all prokaryotic genomes, except a small set of Reference Genomes,
using NCBI's Prokaryotic Genome Annotation Pipeline based on a new protein data model. This new RefSeq
non-redundant protein model is identified by a "WP_" accession prefix, which is different from the
traditional RefSeq prokaryotic protein "NP_" or "YP_" accession.

This re-annotation resulted in the removal of nearly 7 million NP_ and YP_ accessions as prokaryotic
genomes were updated to directly cross-reference the new non-redundant WP_ accessions. For conserved
proteins, the same WP accession may appear on thousands of genomes. However, we are aware that the
NP_ and YP_ accessions have been used in many publications and biomedical projects, which may refer
scientists to NCBI protein pages, which currently provide the new non-redundant proteins with WP_
accessions.

The file "NP_YP_WP.txt" is a protein ID mapping file that provides the association of traditional NP_
and YP_ proteins with new WP_ proteins of identical sequences. The ID mapping file consists of five
columns:
    IPG - the IPG ID (https://www.ncbi.nlm.nih.gov/ipg/)
    NP_YP_AccVer - the NP/YP accession and version
    WP_AccVer - the associated WP accession
    NP_YP_Taxid - Taxonomy ID
    NP_YP_Status - the status of the NP/YP protein
        live: the NP/YP protein is still annotated on Reference Genomes
        replaced: the NP/YP protein was replaced by a WP protein
        suppressed: the NP/YP protein was first replaced by a WP protein, which was subsequently
                    suppressed because it is no longer annotated on any genome
        withdrawn: the NP/YP protein is no longer annotated on any genome
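
The five columns can be loaded into a lookup table keyed by the
traditional accession. The Python sketch below assumes that the fields are
tab-delimited and appear in the order listed above; neither the delimiter
nor the presence of a header line is stated in this README, so a header
line, if present, would need to be skipped before calling it. The names
Mapping, parse_mapping_line and wp_lookup are hypothetical, and the sample
rows are made up.

```python
from collections import namedtuple

# Field names follow the column list above.
Mapping = namedtuple(
    "Mapping", "ipg np_yp_accver wp_accver np_yp_taxid np_yp_status"
)


def parse_mapping_line(line):
    """Parse one data line; assumes tab-delimited fields in the listed order."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 columns, got {len(fields)}")
    return Mapping(*fields)


def wp_lookup(lines):
    """Map each NP/YP accession.version to its WP accession and status."""
    return {
        m.np_yp_accver: (m.wp_accver, m.np_yp_status)
        for m in map(parse_mapping_line, lines)
    }


# Made-up rows for illustration only.
rows = ["123\tNP_000001.1\tWP_000000001.1\t562\treplaced\n"]
print(wp_lookup(rows)["NP_000001.1"])
```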

Additional information:
https://www.ncbi.nlm.nih.gov/refseq/about/prokaryotes/#reference_genomes
https://www.ncbi.nlm.nih.gov/genome/annotation_prok/
https://www.ncbi.nlm.nih.gov/refseq/about/nonredundantproteins/
https://www.ncbi.nlm.nih.gov/refseq/about/prokaryotes/reannotation/


==========================================
TargetedLoci
==========================================
Additional information: https://www.ncbi.nlm.nih.gov/refseq/targetedloci/

Directories:
  Archaea	5S rRNA, 16S rRNA, 23S rRNA from type material
  Bacteria	5S rRNA, 16S rRNA, 23S rRNA from type material
  Fungi		28S rRNA, internal transcribed spacer from type/validated material

Sequence data is provided in GenBank format and FASTA format.

==========================================
uniprotkb
==========================================

In collaboration with UniProtKB, corresponding RefSeq to UniProt
protein accession data are now reported. 

UniProt calculates corresponding accessions based on the following criteria:

a) Identical sequence and identical species (NCBI tax_id) 
--the majority of the corresponding pairs fall into this category

b) Common protein ID and identical tax_id where both RefSeq and UniProt records 
cite the same protein accession as one that was used to create the RefSeq or 
UniProt record.

c) Common protein ID and equivalent but non-identical tax_id where the common 
protein ID is as above, and tax_ids are converted from strain or sub-strain 
level to the species level (e.g., UniProt and RefSeq may differ in their 
decisions to represent a sequence as the species vs a specific strain but 
they are based on the same underlying GenBank data and are considered 
equivalent).

File: gene_refseq_uniprotkb_collab
----------------------------------
The first line in the file is the column header line.
Columns are tab-delimited, with one accession pair per line.
Accession values are not unique; a single accession from
one database may have multiple corresponding accessions from
the other database.

Column 1: RefSeq protein accession
Column 2: UniProt protein accession 
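
Since a single accession may have several counterparts, the pairs are best
held as a many-to-many mapping. The Python sketch below reads the file
layout described above from an open text handle; the function name
load_pairs is hypothetical and the sample accessions are made up.

```python
import csv
import io
from collections import defaultdict


def load_pairs(handle):
    """Return (refseq -> set of UniProt, UniProt -> set of refseq)."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # the first line is the column header
    by_refseq = defaultdict(set)
    by_uniprot = defaultdict(set)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        refseq, uniprot = row[0], row[1]
        by_refseq[refseq].add(uniprot)
        by_uniprot[uniprot].add(refseq)
    return by_refseq, by_uniprot


# Made-up sample content for illustration only.
sample = "refseq\tuniprot\nNP_000001.1\tP00001\nNP_000001.1\tQ00002\n"
by_refseq, by_uniprot = load_pairs(io.StringIO(sample))
print(sorted(by_refseq["NP_000001.1"]))
```

Using sets on both sides keeps repeated pairs from inflating the result
and makes lookups in either direction equally cheap.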

