SPEECH-COCO

SpeechCoco

Introduction

Our corpus is an extension of the MS COCO image recognition and captioning dataset. MS COCO comprises images, each paired with a set of five captions, but it does not include any speech. We therefore used Voxygen's text-to-speech system to synthesise the available captions.

The addition of speech as a new modality enables MSCOCO to be used for research in language acquisition, unsupervised term discovery, keyword spotting, and semantic embedding using speech and vision.

Our corpus is licensed under a Creative Commons Attribution 4.0 License.

Data Set

This corpus contains 616,767 spoken captions from MSCOCO's val2014 and train2014 subsets (414,113 for train2014 and 202,654 for val2014).

We used 8 different voices: 4 have a British accent (Paul, Bronwen, Judith, and Elizabeth) and 4 have an American accent (Phil, Bruce, Amanda, and Jenny).

To make the captions sound more natural, we used the SoX tempo command, which changes the speed without changing the pitch. One third of the captions are 10% slower than the original pace, one third are 10% faster, and the last third was left untouched.

We also modified approximately 30% of the original captions by adding disfluencies such as "um", "uh", and "er" so that the captions would sound more natural.

Each WAV file is paired with a JSON file containing various information: the timecode of each word in the caption, the name of the speaker, the name of the WAV file, etc. The JSON files have the following data structure:

    {
        "duration": float,
        "speaker": string,
        "synthesisedCaption": string,
        "timecode": list,
        "speed": float,
        "wavFilename": string,
        "captionID": int,
        "imgID": int,
        "disfluency": list
    }

On average, each caption comprises 10.79 tokens, disfluencies included. The WAV files are on average 3.52 seconds long.
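As an illustration of the JSON structure above, the following minimal sketch loads one metadata file into a typed record and lists the words found in its timecode. It is not part of the speechcoco API: the names CaptionMetadata, parse_metadata, load_metadata, and words_in_timecode are hypothetical. It also assumes, based on the timecode sample shown in the script documentation, that each timecode entry is a [time, type, value] triple and that word entries carry the type "WORD".

```python
import json
from dataclasses import dataclass
from typing import Any, List


@dataclass
class CaptionMetadata:
    """Typed view of one JSON metadata file (hypothetical helper)."""
    duration: float
    speaker: str
    synthesisedCaption: str
    timecode: List[Any]
    speed: float
    wavFilename: str
    captionID: int
    imgID: int
    disfluency: List[Any]


def parse_metadata(raw: dict) -> CaptionMetadata:
    """Build a record from a decoded JSON object.

    Raises TypeError if a documented key is missing or an unknown key is present.
    """
    return CaptionMetadata(**raw)


def load_metadata(path: str) -> CaptionMetadata:
    """Read one .json file from disk and convert it to a CaptionMetadata record."""
    with open(path, encoding="utf-8") as handle:
        return parse_metadata(json.load(handle))


def words_in_timecode(meta: CaptionMetadata) -> List[str]:
    """Return the words of the caption in order of appearance.

    Assumption: entries look like [time, type, value] and words use type "WORD".
    """
    return [
        entry[2]
        for entry in meta.timecode
        if len(entry) == 3 and entry[1] == "WORD"
    ]
```

Calling load_metadata on a file from one of the json/ folders should return a record whose fields mirror the structure above, provided the file contains exactly the documented keys.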
Repository

The repository is organized as follows:

    CORPUS-MSCOCO (~75GB once decompressed)
        train2014/ : folder contains 413,915 captions
            json/
            wav/
            translations/
                train_en_ja.txt
                train_translate.sqlite3
            train_2014.sqlite3
        val2014/ : folder contains 202,520 captions
            json/
            wav/
            translations/
                train_en_ja.txt
                train_translate.sqlite3
            val_2014.sqlite3
        speechcoco_API/
            speechcoco/
                __init__.py
                speechcoco.py
            setup.py

Filenames

.wav files contain the spoken version of a caption

.json files contain all the metadata of a given WAV file

.sqlite3 files are SQLite databases containing all the information contained in the JSON files

We adopted the following naming convention for both the WAV and JSON files:

    imageID_captionID_Speaker_DisfluencyPosition_Speed[.wav/.json]

Script

We created a script called speechcoco.py to handle the metadata and let the user easily find captions according to specific filters. The script uses the *.sqlite3 database files.

Features:

Aggregate all the information in the JSON files into a single SQLite database

Find captions according to specific filters (name, gender and nationality of the speaker, disfluency position, speed, duration, and words in the caption). The script automatically builds the SQLite query. The user can also provide their own SQLite query.

The following Python code returns all the captions spoken by a male with an American accent for which the speed was slowed down by 10% and that contain "keys" at any position:

    # create SpeechCoco object
    db = SpeechCoco('train_2014.sqlite3', 'train_translate.sqlite3', verbose=True)

    # filter captions (returns Caption Objects)
    captions = db.filterCaptions(gender="Male", nationality="US", speed=0.9, text='%keys%')
    for caption in captions:
        print('\n{}\t{}\t{}\t{}\t{}\t{}\t\t{}'.format(caption.imageID,
                                                      caption.captionID,
                                                      caption.speaker.name,
                                                      caption.speaker.nationality,
                                                      caption.speed,
                                                      caption.filename,
                                                      caption.text))

    ...
    298817    26763     Phil     0.9    298817_26763_Phil_None_0-9.wav        A group of turkeys with bushes in the background.
    108505    147972    Phil     0.9    108505_147972_Phil_Middle_0-9.wav     Person using a, um, slider cell phone with blue backlit keys.
    258289    154380    Bruce    0.9    258289_154380_Bruce_None_0-9.wav      Some donkeys and sheep are in their green pens .
    545312    201303    Phil     0.9    545312_201303_Phil_None_0-9.wav       A man walking next to a couple of donkeys.
    ...

Find all the captions belonging to a specific image

    captions = db.getImgCaptions(298817)
    for caption in captions:
        print('\n{}'.format(caption.text))

    Birds wondering through grassy ground next to bushes.
    A flock of turkeys are making their way up a hill.
    Um, ah. Two wild turkeys in a field walking around.
    Four wild turkeys and some bushes trees and weeds.
    A group of turkeys with bushes in the background.

Parse the timecodes and have them structured

input:

    ...
    [1926.3068, "SYL", ""],
    [1926.3068, "SEPR", " "],
    [1926.3068, "WORD", "white"],
    [1926.3068, "PHO", "w"],
    [2050.7955, "PHO", "ai"],
    [2144.6591, "PHO", "t"],
    [2179.3182, "SYL", ""],
    [2179.3182, "SEPR", " "]
    ...

output:

    print(caption.timecode.parse())

    ...
    {
        'begin': 1926.3068,
        'end': 2179.3182,
        'syllable': [{'begin': 1926.3068,
                      'end': 2179.3182,
                      'phoneme': [{'begin': 1926.3068,
                                   'end': 2050.7955,
                                   'value': 'w'},
                                  {'begin': 2050.7955,
                                   'end': 2144.6591,
                                   'value': 'ai'},
                                  {'begin': 2144.6591,
                                   'end': 2179.3182,
                                   'value': 't'}],
                      'value': 'wait'}],
        'value': 'white'
    },
    ...

Convert the timecodes to Praat TextGrid files

    caption.timecode.toTextgrid(outputDir, level=3)

Get the words, syllables and phonemes between n seconds/milliseconds

The following Python code returns all the words between 0.2 and 0.6 seconds for which at least 50% of the word's total length is within the specified interval:

    pprint(caption.getWords(0.20, 0.60, seconds=True, level=1, olapthr=50))

    ...
    404537    827239    Bruce    US    0.9    404537_827239_Bruce_None_0-9.wav    Eyeglasses, a cellphone, some keys and other pocket items are all laid out on the cloth. .
    [
        {
            'begin': 0.0,
            'end': 0.7202778,
            'overlapPercentage': 55.53412863758955,
            'word': 'eyeglasses'
        }
    ]
    ...

Get the translations of the selected captions

For now, only Japanese translations are available. We also used KyTea to tokenize and tag the captions translated with Google Translate.

    captions = db.getImgCaptions(298817)
    for caption in captions:
        print('\n{}'.format(caption.text))

        # Get translations and POS
        print('\tja_google: {}'.format(db.getTranslation(caption.captionID, "ja_google")))
        print('\t\tja_google_tokens: {}'.format(db.getTokens(caption.captionID, "ja_google")))
        print('\t\tja_google_pos: {}'.format(db.getPOS(caption.captionID, "ja_google")))
        print('\tja_excite: {}'.format(db.getTranslation(caption.captionID, "ja_excite")))

    Birds wondering through grassy ground next to bushes.
        ja_google: 鳥は茂みの下に茂った地面を抱えています。
            ja_google_tokens: 鳥は茂みの下に茂った地面を抱えています。
            ja_google_pos: 鳥/名詞/とりは/助詞/は茂み/名詞/しげみの/助詞/の下/名詞/したに/助詞/に茂/動詞/しげっ/語尾/った/助動詞/た地面/名詞/じめんを/助詞/を抱え/動詞/かかえて/助詞/てい/動詞/いま/助動詞/ます/語尾/す。/補助記号/。
        ja_excite: 低木と隣接した草深いグラウンドを通って疑う鳥。

    A flock of turkeys are making their way up a hill.
        ja_google: 七面鳥の群れが丘を上っています。
            ja_google_tokens: 七面鳥の群れが丘を上っています。
            ja_google_pos: 七/名詞/なな面/名詞/めん鳥/名詞/とりの/助詞/の群れ/名詞/むれが/助詞/が丘/名詞/おかを/助詞/を上/動詞/のぼっ/語尾/って/助詞/てい/動詞/いま/助動詞/ます/語尾/す。/補助記号/。
        ja_excite: 七面鳥の群れは丘の上で進んでいる。

    Um, ah. Two wild turkeys in a field walking around.
        ja_google: 野生のシチメンチョウ、野生の七面鳥
            ja_google_tokens: 野生のシチメンチョウ、野生の七面鳥
            ja_google_pos: 野生/名詞/やせいの/助詞/のシチメンチョウ/名詞/しちめんちょう、/補助記号/、野生/名詞/やせいの/助詞/の七/名詞/なな面/名詞/めん鳥/名詞/ちょう
        ja_excite: まわりで移動しているフィールドの2羽の野生の七面鳥

    Four wild turkeys and some bushes trees and weeds.
        ja_google: 4本の野生のシチメンチョウといくつかの茂みの木と雑草
            ja_google_tokens: 4 本の野生のシチメンチョウといくつかの茂みの木と雑草
            ja_google_pos: 4/名詞/4 本/接尾辞/ほんの/助詞/の野生/名詞/やせいの/助詞/のシチメンチョウ/名詞/しちめんちょうと/助詞/といく/名詞/いくつ/接尾辞/つか/助詞/かの/助詞/の茂み/名詞/しげみの/助詞/の木/名詞/きと/助詞/と雑草/名詞/ざっそう
        ja_excite: 4羽の野生の七面鳥およびいくつかの低木木と雑草

    A group of turkeys with bushes in the background.
        ja_google: 背景に茂みを持つ七面鳥の群
            ja_google_tokens: 背景に茂みを持つ七面鳥の群
            ja_google_pos: 背景/名詞/はいけいに/助詞/に茂み/名詞/しげみを/助詞/を持/動詞/もつ/語尾/つ七/名詞/なな面/名詞/めん鳥/名詞/ちょうの/助詞/の群/名詞/むれ
        ja_excite: 背景の低木を持つ七面鳥のグループ
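Returning to the getWords example, the overlap threshold can be illustrated with a small standalone computation. The sketch below is not the speechcoco implementation. It assumes that the reported overlapPercentage is the share of a word's duration that falls inside the queried interval, an assumption that reproduces the value of about 55.53 shown for "eyeglasses" in the example output. The function names overlap_percentage and words_within are hypothetical.

```python
from typing import Dict, List


def overlap_percentage(begin: float, end: float, start: float, stop: float) -> float:
    """Percentage of the span [begin, end] that lies inside [start, stop]."""
    length = end - begin
    if length <= 0:
        return 0.0
    # Length of the intersection; negative when the two spans do not meet.
    shared = min(end, stop) - max(begin, start)
    return max(shared, 0.0) / length * 100.0


def words_within(words: List[Dict], start: float, stop: float,
                 olapthr: float = 50.0) -> List[Dict]:
    """Keep the words whose overlap with [start, stop] is at least olapthr percent."""
    selected = []
    for word in words:
        pct = overlap_percentage(word["begin"], word["end"], start, stop)
        if pct >= olapthr:
            selected.append({**word, "overlapPercentage": pct})
    return selected


words = [
    # Timings taken from the getWords example output.
    {"begin": 0.0, "end": 0.7202778, "word": "eyeglasses"},
    # Hypothetical entry lying entirely outside the queried interval.
    {"begin": 0.9, "end": 1.0, "word": "a"},
]

# Keeps only "eyeglasses": 0.4 s of its 0.72 s duration falls within 0.2-0.6 s.
print(words_within(words, 0.20, 0.60))
```

Under this reading, a long word that merely straddles the interval is excluded unless at least half of its duration lies inside it, which matches the "at least 50% of the word's total length" description.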

References

SPEECH-COCO: 600k Visually Grounded Spoken Captions Aligned to MSCOCO Data Set

Related Organizations

Grenoble Alpes University
France

Keywords

captions, audio, Speech, MSCOCO, VGS, Visually Grounded Speech

Related Research Products

SPEECH-COCO: 600k Visually Grounded Spoken Captions Aligned to MSCOCO Data Set (2017, IsDocumentedBy)
SPEECH-COCO (2017, IsNewVersionOf)
