A collection of datasets for machine learning, artificial intelligence, and data science. Maintained by the community: https://www.Neuromancer.kr/
- Pix2Pix
1995-05-02 to 2019-04-30 (24 years), 10 million records (CSV) https://github.com/FinanceData/marcap.git
Hyogo Prefecture prefecture-wide digital topographic map portal (fiscal years 2010 to 2018) https://www.geospatial.jp/ckan/dataset/2010-2018-hyogo-geo-potal
Referenced from https://github.com/rudvlf0413/Dataset.git
- Dog Breed Identification dataset
- The dataset is designed for a multiclass classification problem, as it covers 120 breeds of dogs.
- https://www.kaggle.com/c/dog-breed-identification/data
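For a 120-class problem like this one, breed names usually have to be mapped to integer class ids before training. The following is a minimal sketch of that step, assuming a labels file with `id` and `breed` columns; the column names and the three sample rows are illustrative assumptions, not taken from the dataset page.

```python
import csv
import io


def build_label_index(rows):
    """Map each distinct breed name to an integer class id (alphabetical order)."""
    breeds = sorted({row["breed"] for row in rows})
    return {breed: idx for idx, breed in enumerate(breeds)}


# Tiny in-memory stand-in for a labels file (hypothetical rows);
# a real file would cover all 120 breeds.
sample_csv = io.StringIO(
    "id,breed\n"
    "a1,beagle\n"
    "b2,pug\n"
    "c3,beagle\n"
)

rows = list(csv.DictReader(sample_csv))
label_index = build_label_index(rows)

# Integer targets in the same order as the rows, ready for a classifier.
targets = [label_index[row["breed"]] for row in rows]

print(label_index)
print(targets)
```

Sorting the breed names before numbering keeps the mapping stable across runs, so saved predictions can be decoded later.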
Dataset: http://www.openslr.org/60/
- https://research.google.com/youtube8m/index.html
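- A dataset from Google AI that extends part of the existing YouTube-8M with segment-level annotations
- The original YouTube-8M provided machine-generated labels at the video/frame level, whereas this release provides human-verified labels at the segment level
- Covers 1,000 classes
- 237K labels (manually verified by humans)
- An average of 5 segments per video
- Each segment is a 5-second interval sampled at random from the video
- The annotation format is similar to the original YouTube-8M (the start and end of each segment, plus the label information for each segment)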
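The segment-level annotations described above (a start time, an end time, and a label per segment) could be held in a small record type. This is a rough sketch only: the class name, field names, and the parallel-list input layout are assumptions for illustration and are not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List, Sequence

# Each annotated segment is described above as a 5-second interval.
SEGMENT_SECONDS = 5


@dataclass
class SegmentAnnotation:
    """One human-verified segment label (hypothetical field names)."""
    start_time: int  # seconds from the beginning of the video
    end_time: int    # seconds from the beginning of the video
    label: int       # class index among the 1,000 classes

    def has_expected_length(self) -> bool:
        """True if the segment spans exactly SEGMENT_SECONDS."""
        return self.end_time - self.start_time == SEGMENT_SECONDS


def parse_segments(
    start_times: Sequence[int],
    end_times: Sequence[int],
    labels: Sequence[int],
) -> List[SegmentAnnotation]:
    """Combine three parallel lists into one record per segment."""
    if not (len(start_times) == len(end_times) == len(labels)):
        raise ValueError("annotation lists must all have the same length")
    return [
        SegmentAnnotation(start, end, label)
        for start, end, label in zip(start_times, end_times, labels)
    ]


# Made-up values for one video with two annotated segments.
segments = parse_segments([10, 45], [15, 50], [3, 7])
for seg in segments:
    print(seg, seg.has_expected_length())
```

Keeping the 5-second check as a method rather than a hard error lets a loader report unexpected segments without discarding them.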
- tencent-ml-images
- https://github.com/NVlabs/ffhq-dataset
- Coil-20
- MS COCO
- NVIDIA food Image classification
- CIFAR-10, CIFAR-100
- Large-scale CelebFaces Attributes (CelebA) Dataset
- Street View House Numbers (SVHN)
- MNIST
- Facial Database
- Simple Vector Drawing Datasets
- Places2 (Space)
- Yelp dataset (restaurants)
- DeepFashion
- Image to LaTeX (a dataset for converting images of equations into LaTeX code)
- NIST Dataset (Fingerprint, Mugshot, OCR)
- Biometrics Ideal Test dataset (iris, fingerprint, face, palmprint, handwriting, etc.; login required!)
- PASCAL 2012 Dataset (Classification & Detection)
- Lung cancer dataset
- Brain tumor dataset
- Breast cancer dataset (kaggle)
- The cancer image archive
- Mammography dataset
- Bio Image Dataset @ IIIT Delhi
- CAMELYON16: metastasis detection in lymph nodes
- CAMELYON17 Dataset
- YouTube-BoundingBoxes Dataset
- YouTube-8M Dataset
- The Kinetics Human Action Video Dataset
- StatMT (Machine Translation, summarization)
- UN Parallel Corpus
- IWSLT Dataset (including TED Translation)
- The Stacks Project
- (A paired set of an algebraic geometry book's source text and its LaTeX code?)
- http://stacks.math.columbia.edu/
- Google sentence compression (sentence data standardized by Google)
- Annals of the Joseon Dynasty (Korean / Classical Chinese translation)
- 20 Newsgroups
- Reuters dataset
- Tweet data, a subset of TREC 2011 microblog track
- Title data, including news titles with class labels from some news websites
- bAbI dataset (Facebook Question Answering)
- Question/answer pairs (cloze-style fill-in-the-blank problems) built from CNN/Daily Mail articles
- Stanford Question Answering Dataset
- CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning
- WikiReading dataset
- Datasets used for Word2Vec (Wikipedia, WMT11, etc.) https://code.google.com/archive/p/word2vec/
- FastText pre-trained vector set
- Stanford Sentiment Treebank(SST)
- Nottingham music dataset
- A large-scale dataset of manually annotated audio events (Google research)
- Freebase
- WordNet
- Microsoft Concept Graph
- DBpedia Dataset
- The DBpedia data set uses a large multi-domain ontology which has been derived from Wikipedia as well as localized versions of DBpedia in more than 100 languages.
- http://wiki.dbpedia.org/services-resources/datasets/dbpedia-datasets
- YAGO
- YAGO3 is a huge semantic knowledge base, derived from Wikipedia, WordNet, and GeoNames.
- https://datahub.io/ko_KR/dataset/yago
- Google Knowledge Graph API
- AMiner - Datasets for social network Analysis
- Netflix Prize Data Set
- Paper bibliography dataset, Author Citation Networks
- Politics subreddit
- Amazon dataset
- Twitter Spammer network
- Twitter tweets
- Online reviews
- Word2Vec
- GloVe
- FastText
- SKT Bigdata hub
- Titanic survivors dataset
- Obama’s political speeches
- Yahoo Finance dataset
- Linux code
- NYC Taxi dataset
- US Census dataset
- Diamond.csv
- countries.csv
- exprs_GSE5859.csv
- movies.dat
- movie_lines.txt
- movie conversation
- mtcars.csv
- pollster_cleaned_2002_2008.csv
- pollster_cleaned_2010.csv
- pollster_cleaned_2012.csv
- kospi_kospi.csv