How to read a large json in pandas?



By : Dmitry Nadolin
Date : October 24 2020, 03:08 PM
The file you are reading probably contains multiple JSON objects, one per line, rather than a single JSON object or array. json.load(json_file) and pd.read_json('review.json') both expect a file that holds a single JSON document, which is why they fail here.
From what I have seen of the Yelp dataset, your file probably contains something like:
code :
{"review_id":"xxxxx","user_id":"xxxxx","business_id":"xxxx","stars":5,"date":"xxx-xx-xx","text":"xyxyxyxyxx","useful":0,"funny":0,"cool":0}
{"review_id":"yyyy","user_id":"yyyyy","business_id":"yyyyy","stars":3,"date":"yyyy-yy-yy","text":"ababababab","useful":0,"funny":0,"cool":0}
....    
....

and so on. Read the file line by line and parse each line separately:
code :
import json
import pandas as pd

with open('review.json') as json_file:
    # Parse every line as its own JSON object. For 4-5 million rows this
    # may take at least 8-10 minutes of processing.
    data = [json.loads(line) for line in json_file]

df = pd.DataFrame(data)
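As an alternative sketch, pandas can parse line-delimited JSON itself: pd.read_json accepts lines=True, and with chunksize it returns an iterator of DataFrames instead of loading everything in one call. The helper name read_json_lines and the default chunk size are illustrative assumptions, not part of the original answer.

```python
import pandas as pd


def read_json_lines(path, chunksize=100_000):
    """Read a line-delimited JSON file in chunks and return one DataFrame.

    read_json_lines is a hypothetical helper; the chunk size is an
    assumption to tune for your memory budget.
    """
    # lines=True treats every line as a separate JSON object;
    # chunksize makes read_json return an iterator of DataFrames.
    reader = pd.read_json(path, lines=True, chunksize=chunksize)
    chunks = [chunk for chunk in reader]
    return pd.concat(chunks, ignore_index=True)
```

Usage would look like df = read_json_lines('review.json'). Filtering or down-casting each chunk inside the loop, before concatenating, is where the memory savings would come from.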


Using python ijson to read a large json file with multiple json objects



By : Dhanesh
Date : March 29 2020, 07:55 AM
Since the provided chunk looks like a set of lines, each holding an independent JSON document, it should be parsed line by line:
code :
# Each JSON document is small, so there is no need for iterative parsing.
import json

with open(filename, 'r') as f:  # filename: path to your file
    for line in f:
        data = json.loads(line)
        # data['name'], data['engine_speed'] and data['timestamp'] now
        # contain the corresponding values
How to read large pandas dataframe efficiently?



By : BriGuy92
Date : March 29 2020, 07:55 AM
I have a 4300*4300 symmetric pandas DataFrame with 1 on the diagonal. It is a correlation matrix. I want to read half of this matrix, along with the two column names associated with each value.
Setup: produce a correlation matrix.
code :
import numpy as np
import pandas as pd

np.random.seed([3, 1415])
df = pd.DataFrame(np.random.rand(10, 5), columns=list('ABCDE')).corr()
df

          A         B         C         D         E
A  1.000000  0.309111  0.219242 -0.239779  0.253331
B  0.309111  1.000000  0.435033  0.475270  0.881688
C  0.219242  0.435033  1.000000  0.005637  0.394912
D -0.239779  0.475270  0.005637  1.000000  0.483238
E  0.253331  0.881688  0.394912  0.483238  1.000000
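Then take the indices of the upper triangle, excluding the diagonal (k=1), and build a long-format frame from them:
code :
i, j = np.triu_indices_from(df, k=1)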

d = pd.DataFrame(dict(
    ROW=df.index[i],
    COL=df.columns[j],
    VAL=df.values[i, j]
))
d

  COL ROW       VAL
0   B   A  0.309111
1   C   A  0.219242
2   D   A -0.239779
3   E   A  0.253331
4   C   B  0.435033
5   D   B  0.475270
6   E   B  0.881688
7   D   C  0.005637
8   E   C  0.394912
9   E   D  0.483238
Read only specific fields from large JSON and import into a Pandas Dataframe



By : user2277557
Date : March 29 2020, 07:55 AM
Will this help?
Step 1. Read your JSON file with pandas, using pandas.read_json().
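If the file is line-delimited JSON (an assumption, since the original question is not shown here), one way to keep memory low is to parse each line with json.loads and keep only the fields you need before building the DataFrame. The function name read_selected_fields is hypothetical, and this is a sketch rather than the answer's own method.

```python
import json

import pandas as pd


def read_selected_fields(path, fields):
    """Build a DataFrame holding only `fields` from a line-delimited JSON file.

    Hypothetical helper: assumes one JSON object per line.
    """
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # skip blank lines
                continue
            record = json.loads(line)
            # dict.get returns None when a key is missing from a record
            rows.append({name: record.get(name) for name in fields})
    return pd.DataFrame(rows, columns=list(fields))
```

For example, read_selected_fields('review.json', ['review_id', 'stars']) would return a two-column frame and never store the long text field of each record.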
How to read a large csv with pandas?



By : a_r_a_s
Date : March 29 2020, 07:55 AM
I am loading an rdx (CSV-like format) file of around 16 GB as a pandas DataFrame and then cutting it down by removing some lines.
Try the chunksize parameter: filter each chunk as it is read, then concatenate the filtered chunks.
code :
import pandas as pd

# path: location of your CSV-like file. input() is the Python 3 name for raw_input().
t_min, t_max, n_min, n_max, c_min, c_max = map(float, input('t_min, t_max, n_min, n_max, c_min, c_max: ').split())

num_of_rows = 1024
reader = pd.read_csv(path, header=None, chunksize=num_of_rows)

dfs = []
for chunk_df in reader:
    # keep only rows whose first three columns fall inside the given ranges
    # (Series.between is inclusive on both ends by default)
    mask = (
        chunk_df[0].between(t_min, t_max)
        & chunk_df[1].between(n_min, n_max)
        & chunk_df[2].between(c_min, c_max)
    )
    dfs.append(chunk_df.loc[mask])

df = pd.concat(dfs, sort=False)
How to read a small percentage of lines of a very large CSV. Pandas - time series - Large dataset



By : Eduardo Rada
Date : March 29 2020, 07:55 AM
Anytime I have to deal with a very large file, I ask "What would Dask do?".
Load the large file as a dask.DataFrame, convert the index to a column (workaround due to full index control not being available), and filter on that new column.
code :
import dask.dataframe as dd
import pandas as pd

nth_row = 100  # grab every nth row from the larger DataFrame
dask_df = dd.read_csv('super_size_file.log')  # assuming this file can be read by pd.read_csv
dask_df['df_index'] = dask_df.index
dask_df_smaller = dask_df[dask_df['df_index'] % nth_row == 0]

df_smaller = dask_df_smaller.compute()  # to execute the operations and return a pandas DataFrame
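If Dask is not available, a pandas-only sketch of the same idea is to pass a callable to the skiprows parameter of pd.read_csv, so unwanted lines are skipped while the file is parsed. This assumes the file has exactly one header line and no blank lines; read_every_nth is a hypothetical helper name.

```python
import pandas as pd


def read_every_nth(path, nth_row=100):
    """Read the header plus every nth data row of a CSV file.

    Hypothetical helper: assumes a single header line at the top.
    """
    # skiprows is called with the 0-based line number of the file and skips
    # the line when it returns True. Line 0 is the header, so data row k
    # sits on line k + 1. Keep the header and every data row whose 0-based
    # position is a multiple of nth_row.
    return pd.read_csv(
        path,
        skiprows=lambda line_no: line_no > 0 and (line_no - 1) % nth_row != 0,
    )
```

Like the Dask version, this keeps data rows 0, nth_row, 2 * nth_row and so on, but it still scans the whole file once on a single core.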