Terrible Data from your wonderful job

Synopsis

The idea is to go through four phases, creating increasingly difficult CSV files to parse.

Examples

The first input will be Easy_CSV.csv, which only requires something like:

import csv

# each row comes back as a list of strings
with open('Easy_CSV.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    for row in csv_reader:
        print(row)

Running it will print the names and integers from the CSV file.
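
For illustration, if Easy_CSV.csv contained something like this (made-up contents):

name,score
Alice,42
Bob,7

the snippet above would print each row as a list of strings:

['name', 'score']
['Alice', '42']
['Bob', '7']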

The second input will be Intermediate.csv, which will require more effort, then hard.csv, then hellmode.csv.

0skhMH: I created a specific Dojo. I will have the last one collected and ready to post before Thursday.

I do need help thinking of a way to create something slightly harder or easier than the medium: something that involves placing a single tricky piece of data inside a regular CSV, or something that otherwise feels like a medium or hard CSV. I just don't have any idea what that would be.

Looking at the examples in the Dojo session so far, this is a great start. Thanks for your work!

I see that the "easy" file has a header line (with column names) while the "medium" one does not. I would suggest inverting that, so that the first exercise is a little easier (getting numbers from a list of lines is a great beginner-level problem) and the medium one a little more interesting (ask users to make sure the column names are preserved in the output). And I would not put those names in quotes just yet.
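
If the medium asks for the column names to be preserved, csv.DictReader is one natural fit; here is a minimal sketch (Medium.csv and its layout are assumptions, not the actual exercise file):

import csv

with open('Medium.csv', newline='') as f:
    reader = csv.DictReader(f)           # the first line becomes the column names
    print(','.join(reader.fieldnames))   # echo the header so it is preserved
    for row in reader:
        print(','.join(row.values()))    # re-emit each row's values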

For a "hard" puzzle, here are some ideas that could make a good difficulty curve:

  • Have incomplete lines, e.g. a line where the last field is blank but the trailing comma is missing: banana,3 instead of banana,3, if there were 3 fields.
  • Have quoted string values with commas in them (they would be in quotes, so that the data is still technically valid CSV), e.g. "Hello, World!"
  • To be extra evil, put in some "nice" numbers, e.g. "10,000" instead of 10000. (A sketch handling all three pitfalls follows below.)
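
Here is a minimal sketch of how a solution might cope with those three pitfalls; hard.csv and its three-column layout (name, quantity, note) are assumptions for illustration:

import csv

with open('hard.csv', newline='') as f:
    for row in csv.reader(f):             # csv.reader already unquotes "Hello, World!"
        row += [''] * (3 - len(row))      # pad incomplete lines like "banana,3"
        name, qty, note = row[:3]
        qty = int(qty.replace(',', '')) if qty else 0  # normalize "nice" numbers like "10,000"
        print(name, qty, note)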

So I found out that the CSV size I want to use will throw a site error: 800k is just too much for the system to handle, and even 100k is too large. There are a few options:

  • One, I can hand people the CSV file directly if they want to do the full challenge with all 11 rows and however many thousands of columns.
  • Two, I can go through the full tedium of removing some of the more difficult things within the file that also increase its size.
  • Three, I can split away the actually difficult parts of the file and just leave the easier pieces, which are still large-ish (a column-trimming sketch follows below).
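
If option three is the way to go, here is a quick sketch for shrinking the file by keeping only the first few columns of each row; the column count and the output file name are assumptions for illustration:

import csv

# keep only the first 50 columns of each row to get the file under the size limit
with open('Terrible_Csv_File.txt', newline='') as src, \
        open('trimmed.csv', 'w', newline='') as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        writer.writerow(row[:50])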

Here is some of the file so you can get an idea of what I'm working with: Terrible_Csv_File.txt

I see, yes there are limitations in cyber-dojo. I usually avoid files that are more than 1000 lines long.

I like the idea for the last exercise, that's the kind of web-scraping messy data you were talking about. Looking forward to that!

Nice. Also, I need someone to look over the tests for the hard and medium. I don't know how you usually do them, so I just wrote assertions that look for specific lines: the medium test looks for the copied list, and the hard test looks for the first product on each webpage.
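
For reference, here is a minimal sketch of the kind of line-based assertion described above; the file name and expected lines are invented for illustration:

# hypothetical test that looks for specific lines in the exercise output
def test_medium_keeps_header():
    with open('output.txt') as f:
        lines = [line.rstrip('\n') for line in f]
    assert 'name,score' in lines    # the header line must be preserved
    assert 'Alice,42' in lines      # a known data row must appear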

I think the tests for the medium and hard are looking for the correct output, but the test for the easy might be incorrect.

Do I even need to personally write the tests or is that for everyone else?