Which clippings match 'Dataset' keyword pg.1 of 2
05 SEPTEMBER 2014

The Largest Vocabulary in Hip hop

"Literary elites love to rep Shakespeare's vocabulary: across his entire corpus, he uses 28,829 words, suggesting he knew over 100,000 words and arguably had the largest vocabulary, ever.

I decided to compare this data point against the most famous artists in hip hop. I used each artist's first 35,000 lyrics. That way, prolific artists, such as Jay-Z, could be compared to newer artists, such as Drake.

35,000 words covers 3–5 studio albums and EPs. I included mixtapes if the artist was just short of the 35,000 words. Quite a few rappers don't have enough official material to be included (e.g., Biggie, Kendrick Lamar). As a benchmark, I included data points for Shakespeare and Herman Melville, using the same approach (35,000 words across several plays for Shakespeare, first 35,000 of Moby Dick).

I used a research methodology called token analysis to determine each artist's vocabulary. Each word is counted once, so pimps, pimp, pimping, and pimpin are four unique words. To avoid issues with apostrophes (e.g., pimpin' vs. pimpin), they're removed from the dataset. It still isn't perfect. Hip hop is full of slang that is hard to transcribe (e.g., shorty vs. shawty), compound words (e.g., king shit), featured vocalists, and repetitive choruses.

It's still directionally interesting. Of the 85 artists in the dataset, let's take a look at who is on top."

(Matt Daniels, May 2014)
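
The counting rule described in the excerpt above can be sketched roughly as follows. This is only an illustration, not Daniels's actual code: the function name `unique_word_count`, the lowercasing step, and the regex tokeniser are assumptions. The 35,000-word cap and the removal of apostrophes come from the excerpt.

```python
import re


def unique_word_count(text, limit=35_000):
    """Count distinct tokens among the first `limit` words of `text`."""
    # Drop straight and curly apostrophes so "pimpin'" and "pimpin" match.
    # Lowercasing is an assumption; the excerpt does not mention case.
    cleaned = text.lower().replace("'", "").replace("\u2019", "")
    # Assumed tokeniser: runs of letters and digits. There is no stemming,
    # so "pimp", "pimps" and "pimping" remain separate words.
    tokens = re.findall(r"[^\W_]+", cleaned)
    # Only the first `limit` tokens count, so long and short
    # discographies are compared on equal footing.
    return len(set(tokens[:limit]))


sample = "Pimps pimp, pimping and pimpin' ... pimpin"
print(unique_word_count(sample))  # prints 5
```

Slang spellings (shorty vs. shawty) and compound words would still be counted separately under this sketch, which matches the limitations the excerpt itself points out.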

TAGS

benchmark • big vocabulary • choice of words • corpus • cultural expression • dataset • diction • digital humanities • English language • expressive repertoire • expressive vocabulary • extensive vocabulary • Herman Melville • hip-hop • lexicomane • lyrics • Matt Daniels • Moby Dick • music • naming • pimp • rap • rapper • research method • sesquipedalian • slang • speaking vocabulary • token analysis • use of words • vocabulary • William Shakespeare • word heap • words

CONTRIBUTOR

Simon Perkins
12 MAY 2013

With Enough Data, the Numbers Speak for Themselves...

"Not a chance. The promoters of big data would like us to believe that behind the lines of code and vast databases lie objective and universal insights into patterns of human behavior, be it consumer spending, criminal or terrorist acts, healthy habits, or employee productivity. But many big-data evangelists avoid taking a hard look at the weaknesses. Numbers can't speak for themselves, and data sets, no matter their scale, are still objects of human design. The tools of big-data science, such as the Apache Hadoop software framework, do not immunize us from skews, gaps, and faulty assumptions. Those factors are particularly significant when big data tries to reflect the social world we live in, yet we can often be fooled into thinking that the results are somehow more objective than human opinions. Biases and blind spots exist in big data as much as they do in individual perceptions and experiences. Yet there is a problematic belief that bigger data is always better data and that correlation is as good as causation."

(Kate Crawford, 12 May 2013, Foreign Policy)

TAGS

Apache Hadoop • bias • big data • big-data science • blind spot • causal relationships • causation • code • computer utopianism • consumer spending • criminal acts • cyberspace • data abstraction • data analysis • data collection and analysis • dataset • Foreign Policy (magazine) • globalisation • healthy habits • implicit information • implicit meaning • Internet • internet utopianism • looking at the numbers • network ecology • networked society • objects of human design • patterns of human behaviour • patterns of meaning • quantified measurement • reliability and validity • scientific ideas • security intelligence • social world • terrorist acts • Twitter • underlying order • universal insights • universal method • universal rationality

CONTRIBUTOR

Simon Perkins
28 JANUARY 2012

GeneaQuilts: visualisation tool for large genealogies

"GeneaQuilts is a new visualization technique for representing large genealogies of up to several thousand individuals. The visualization takes the form of a diagonally-filled matrix, where rows are individuals and columns are nuclear families. The GeneaQuilts system includes an overview, a timeline, search and filtering components, and a new interaction technique called Bring & Slide that allows fluid navigation in very large genealogies."

(Anastasia Bezerianos, Pierre Dragicevic, Jean-Daniel Fekete, Juhee Bae and Ben Watson)

3). A. Bezerianos, P. Dragicevic, J.-D. Fekete, J. Bae, B. Watson. GeneaQuilts: A System for Exploring Large Genealogies. In IEEE InfoVis '10: IEEE Transactions on Visualization and Computer Graphics, Oct 2010, Salt Lake City, USA.

TAGS

ancestors • ancestry • belonging • chart • data • dataset • design • diagonal • diagonally-filled matrix • diagram • family • family history • family research • family tree • genealogy • GeneaQuilts • graphic representation • information graphics • large datasets • large genealogies • matrix • nodes • nuclear family • relatedness • timeline • tool • tree • tree visualisation • visualisation • visualisation technique • visualization

CONTRIBUTOR

Simon Perkins
19 SEPTEMBER 2011

Opening up UCAS Data

"The 'Big Idea' behind my entry to the TSO competition was a simple one: make UCAS course data (course code, title and institution) available as data. By opening up the data we make it possible for third parties to construct services and applications based around complete data skeleton of all the courses offered for undergraduate entry through clearing in a particular year across UK higher education.

The data acts as scaffolding that can be used to develop consumer facing applications across HE (e.g. improved course choice applications) as well as support internal 'vertical' activities within HEIs that may also be transferable across HEIs.

Primary value is generated from taking the course code scaffolding and annotating it with related data. Access to this dataset may be sold on in a B2B context via data platform services. Consumer facing applications with their own revenue streams may also be built on top of the data platform.

This idea makes data available that can potentially disrupt the currently discovery model for course choice and selection (but in its current form, not in university application or enrolment), in Higher Education in the UK."

(Tony Hirst, 2011)

TAGS

annotation • applications • B2B • consumer facing applications • course choice • course choice applications • course code • course data • course selection • course title • courses • data • data integration • data platform • data platform services • dataset • discovery model • enterprise • entrepreneurship • entry through clearing • HE • HEI • higher education • information in context • innovation • integration • JISC • mash-up • organisations • revenue streams • services • third parties • TSO • TSO OpenUP Competition • UCAS • UK • undergraduate • university applications process • university enrolment • visualisation

CONTRIBUTOR

Simon Perkins
05 JUNE 2011

MOSAIC project: enabling UK HE library to take up Web2.0 opportunities

"MOSAIC is building on the findings and recommendations of the JISC TILE project, which investigated 'pain points' in UK HE library take up of Web2.0 opportunities, in particular relating to the 'context' of users (e.g. their course) and their related use of resources."

(JISC MOSAIC)

1). Mosaic data collection – A guide v01.pdf

2). Sharing Usage Data: Dave Pattern & Patrick Murray-John Talk with Talis

3). The JISC MOSAIC Project: Making Our Scholarly Activity Information Count, Final Report (January 2010)

TAGS

2009 • activity data • CC0 • Cloud of Data • common data schema • context of users • Creative Commons • Creative Commons CC0 • data • dataset • Dave Pattern • David Kay • e-Framework • ERM system • Helen Harrop • higher education • information resources • innovation • JISC • JISC TILE project • Joint Academic Coding System • Ken Chad • learning object download • library • Library 2.0 • Library Management Systems • library science • local activity data • Making Our Shared Activity Information Count • Mark van Harmelen • modify • MOSAIC project • Open Data Commons PDDL • Open Data licence • Paul Miller • PDDL • promoting experimentation • reading lists • recommender • Resolver journal article access • resource discovery • scalability • Sero Consulting • service models • share • Talis Aspire • UK • UK HE library • university • University of Huddersfield • user activity data • user activity driven services • VLE resource • Web 2.0

CONTRIBUTOR

Simon Perkins